CDCgov / datasets-sars-cov-2

Benchmark datasets for WGS analysis of SARS-CoV-2. (https://peerj.com/articles/13821/)
Apache License 2.0

VOI/VOC table is missing accessions for consensus genomes #5

Closed by proychou 3 years ago

proychou commented 3 years ago

As I mentioned on SPHERES Slack, the VOI/VOC table (https://github.com/CDCgov/datasets-sars-cov-2/blob/master/datasets/sars-cov-2-voivoc.tsv) is missing accessions for the consensus sequences.

It would be great to add these GenBank/GISAID accessions, and possibly also the PANGO lineage assigned at the time these were generated, along with the version. But accessions would be the most crucial to add.

lskatz commented 3 years ago

So, for example, in the VOI/VOC dataset we have a sample name hCoV-19_Wales_PHWC-4C8F5E_2021, which corresponds to the GISAID sample with the same name but with each _ replaced by /, i.e., hCoV-19/Wales/PHWC-4C8F5E/2021. Does that answer this issue?
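A minimal sketch of that substitution in Python, in case it helps. The function name `to_gisaid_name` is made up for illustration and is not part of this repository. It assumes every underscore in the table's sample name stands for a slash, which holds for the example above but may not hold for every name.

```python
def to_gisaid_name(sample_name: str) -> str:
    """Turn a sample name from the table into a GISAID-style virus name.

    Assumption: every "_" in the sample name replaces a "/" in the
    GISAID name, as in the example from this thread.
    """
    return sample_name.replace("_", "/")


# Example from the thread
print(to_gisaid_name("hCoV-19_Wales_PHWC-4C8F5E_2021"))
# prints hCoV-19/Wales/PHWC-4C8F5E/2021
```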

proychou commented 3 years ago

That doesn't really help, because we'd still need to look these up one by one in GISAID after doing those substitutions. Alternatively, anyone with access could query the larger GISAID metadata file. Those seem like unnecessary steps though, and not everyone has access to those datasets. However, if accessions were provided, they could be entered directly into the GISAID search tool, which all users have access to.

It's also inconsistent between the VOC and non-VOC sets. The latter does have the GenBank accessions. Seems like the best solution would just be to provide GenBank/GISAID accessions for both, no?
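For anyone who does have the GISAID metadata file, the lookup described above could be sketched roughly as follows. This is only an illustration: the helper `lookup_accessions` is hypothetical, the column names `Virus name` and `Accession ID` are assumptions about the metadata layout and may need adjusting, and the accession in the demo row is a placeholder, not a real ID.

```python
import csv
import io


def lookup_accessions(sample_names, metadata_file,
                      name_col="Virus name", acc_col="Accession ID"):
    """Map table sample names to accessions found in a tab-separated metadata file.

    Column names are assumptions; pass different ones if the file differs.
    """
    # GISAID-style name (underscores replaced by slashes) -> original sample name
    wanted = {name.replace("_", "/"): name for name in sample_names}
    found = {}
    for row in csv.DictReader(metadata_file, delimiter="\t"):
        original = wanted.get(row[name_col])
        if original is not None:
            found[original] = row[acc_col]
    return found


# Tiny in-memory stand-in for the metadata file; the accession is a placeholder.
demo_metadata = io.StringIO(
    "Virus name\tAccession ID\n"
    "hCoV-19/Wales/PHWC-4C8F5E/2021\tEPI_ISL_PLACEHOLDER\n"
)
result = lookup_accessions(["hCoV-19_Wales_PHWC-4C8F5E_2021"], demo_metadata)
print(result)
```

With a real file, an open file handle would be passed in place of `demo_metadata`. Sample names that are missing from the metadata are simply left out of the returned dictionary.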

lskatz commented 3 years ago

Hi @proychou, does @daisy0223's latest address the issue?

proychou commented 3 years ago

Yes, perfect! Thank you!!

lskatz commented 3 years ago

Okay, great. Thank you for your feedback and for helping us make these datasets better!