CDCgov / datasets-sars-cov-2

Benchmark datasets for WGS analysis of SARS-CoV-2. (https://peerj.com/articles/13821/)
Apache License 2.0

Internally curated consensus sequences for lineages #30

Open gcha31 opened 1 year ago

gcha31 commented 1 year ago

Hi TOAST team,

I am interested in curating representative genomes for VOCs/VBMs. According to your recent publication (Xiaoli, Lingzi, et al. "Benchmark datasets for SARS-CoV-2 surveillance bioinformatics." PeerJ 10 (2022): e13821.), datasets 4 and 5 were prepared based on alignments to the 'internally curated consensus sequences'. Could you share the details of how those sequences were curated internally? Thank you.

Best, Gyuhyon

lskatz commented 1 year ago

Hi there,
We contacted the SSEV team at CDC for the representative genomes. We were told that they were gathered in a two-step process:

  1. Pull all representatives listed in the curation notes of the pango-designation repository: https://github.com/cov-lineages/pango-designation/blob/master/curation_notes/curation_notes.tsv
  2. If a lineage has no representative sequence listed there, pull the longest/cleanest available sequence from that lineage (fewest mixed bases and least missing data) with the earliest collection date.
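
For anyone who wants to reproduce something similar, the two steps above could be sketched roughly as follows. This is not the SSEV team's code, only an illustration under assumptions: the `Candidate` record, the `designated` mapping (lineage to representative name, as you might load it from the curation notes), and the ranking rule are all hypothetical. In particular, "longest/cleanest" is interpreted here as "most unambiguous A/C/G/T bases", with ties broken by the earliest collection date; the original description does not say how the criteria were weighted.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, Optional

UNAMBIGUOUS = set("ACGT")  # anything else counts as a mixed base or missing data


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one available genome."""
    name: str
    lineage: str
    sequence: str
    collection_date: date


def resolved_bases(seq: str) -> int:
    """Count unambiguous bases; rewards length and penalizes N/mixed bases."""
    return sum(1 for base in seq.upper() if base in UNAMBIGUOUS)


def pick_representative(
    lineage: str,
    designated: Dict[str, str],
    candidates: Iterable[Candidate],
) -> Optional[str]:
    """Return the name of a representative genome for a lineage, or None."""
    # Step 1: prefer the representative already listed for the lineage.
    if lineage in designated:
        return designated[lineage]

    # Step 2: otherwise fall back to the best available sequence.
    pool = [c for c in candidates if c.lineage == lineage]
    if not pool:
        return None
    # Assumed ranking: most resolved bases first, then earliest collection date.
    best = min(pool, key=lambda c: (-resolved_bases(c.sequence), c.collection_date))
    return best.name
```

With real data you would probably also want to filter out sequences with incomplete collection dates before ranking, since the tie-break depends on them.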