CoVaRR-NET / duotang

Scripts and data for the CoVaRR-Net Pillar 6 notebook
https://covarr-net.github.io/duotang/duotang.html
MIT License

duplicate GISAID id in VirusSeq #16

Closed RaphaelRaphael closed 1 year ago

RaphaelRaphael commented 2 years ago

This is how I extract the metadata in bash:

filefromVirusSeq=9c65ea71-30bb-49b1-bb60-85dbc31173b3
tar -axf source/$filefromVirusSeq -O files-archive-$filefromVirusSeq.tsv | tr ' ' '' | sed 's/\t\t/\tNA\t/g' | sed 's/\t\t/\tNA\t/g' | sed 's/\t$/\tNA/g' > datafromvirrusseq$shortname

From here, column 44 has 36 duplicates, like this one: EPI_ISL_2479805
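A minimal sketch of how such a duplicate count could be obtained from the extracted file. The function name list_duplicates is hypothetical, and the sketch assumes a tab-separated file whose first line is a header.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the duotang scripts): print every value
# that occurs more than once in one column of a tab-separated file.
# Assumes the first line of the file is a header and skips it.
# Usage: list_duplicates FILE COLUMN_NUMBER
list_duplicates() {
    local file=$1 col=$2
    tail -n +2 "$file" |   # drop the header row
        cut -f "$col" |    # keep only the requested column (tab-delimited)
        sort |             # uniq only compares adjacent lines
        uniq -d            # print one copy of each repeated value
}
```

Something like `list_duplicates datafromvirrusseq$shortname 44 | wc -l` would then give the number of duplicated values. A placeholder such as NA is reported like any other value if it occurs more than once.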

ArtPoon commented 2 years ago

I confirm two duplicates of EPI_ISL_2479805:

art@Wernstrom duotang % gunzip -c data_needed/virusseq.2022-03-16T15:17:45.metadata.tsv.gz | grep -c EPI_ISL_2479805
2

ArtPoon commented 2 years ago

Confirmed 35 cases of duplicates with R, and one really weird one:

> md <- read.csv("virusseq.2022-03-16T15:17:45.metadata.tsv", sep='\t')
> table(table(md$GISAID.accession))

     1      2   1809 
251826     35      1

ArtPoon commented 2 years ago

Okay, that 1809 is due to blanks:

> which.max(table(md$GISAID.accession))

1 
> table(md$GISAID.accession=="")

 FALSE   TRUE 
251896   1809 
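
For anyone without R at hand, the two checks above (how many values occur once, twice, and so on, and how many cells are blank) can be sketched in shell. The function names count_blank and occurrence_histogram are hypothetical, and the sketch assumes an uncompressed tab-separated file with a header row. The column number has to be supplied by the caller.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the duotang scripts.
# Both assume a tab-separated file whose first line is a header.

# Count the empty cells in one column.
# Usage: count_blank FILE COLUMN_NUMBER
count_blank() {
    local file=$1 col=$2
    awk -F'\t' -v c="$col" \
        'NR > 1 && $c == "" { n++ } END { print n + 0 }' "$file"
}

# Print "occurrences<TAB>number of distinct values", sorted by occurrences.
# This mirrors table(table(x)) in R, with blanks counted as one value.
# Usage: occurrence_histogram FILE COLUMN_NUMBER
occurrence_histogram() {
    local file=$1 col=$2
    awk -F'\t' -v c="$col" '
        NR > 1 { seen[$c]++ }                     # occurrences per value
        END {
            for (v in seen) hist[seen[v]]++       # values per occurrence count
            for (k in hist) print k "\t" hist[k]
        }' "$file" | sort -n
}
```

A row in the histogram whose count matches the output of count_blank points to blanks rather than to a genuinely repeated accession.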
RaphaelRaphael commented 2 years ago

Oh yes, my bash command line above fills the empty cells with NA, which helps a lot.

bfjia commented 1 year ago

@anwarMZ

bfjia commented 1 year ago

Extension of Issue #42

bfjia commented 1 year ago

@anwarMZ can this issue be closed now?

RaphaelRaphael commented 1 year ago

Before closing it, I would like to know how to interpret the output of the download script, virusseq.gisaid_duplicate_ids.txt. Could it be more explicit, and could it be documented in CONTRIBUTING.md?