Closed RaphaelRaphael closed 1 year ago
I confirm two duplicates of EPI_ISL_2479805:
art@Wernstrom duotang % gunzip -c data_needed/virusseq.2022-03-16T15:17:45.metadata.tsv.gz | grep -c EPI_ISL_2479805
2
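One caveat with `grep -c`: it counts every line that contains the string, so a longer accession such as EPI_ISL_24798051 would be counted too. A minimal sketch of a stricter check, assuming the file is a gzipped TSV; `count_exact` is a hypothetical helper name, not something in the repository:

```shell
# count_exact FILE.gz ID (hypothetical helper): count lines in which at
# least one tab-separated field is exactly equal to ID.
count_exact() {
  gunzip -c "$1" | awk -F'\t' -v id="$2" '
    { for (i = 1; i <= NF; i++) if ($i == id) { n++; break } }  # whole-field match only
    END { print n + 0 }                                         # print 0 when nothing matched
  '
}
```

It would be called as `count_exact data_needed/virusseq.2022-03-16T15:17:45.metadata.tsv.gz EPI_ISL_2479805`.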
Confirmed 35 cases of duplicates with R, and one really weird one:
> md <- read.csv("virusseq.2022-03-16T15:17:45.metadata.tsv", sep='\t')
> table(table(md$GISAID.accession))
1 2 1809
251826 35 1
Okay, that 1809 is due to blanks:
> which.max(table(md$GISAID.accession))
1
> table(md$GISAID.accession=="")
FALSE TRUE
251896 1809
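The same duplicate check can be sketched in the shell, skipping blank accessions so they do not show up as one large group. This is only a sketch, assuming a tab-separated file with a header row; the helper name `count_dup_ids` and its column-number argument are illustrative and not part of the repository:

```shell
# count_dup_ids FILE COL (hypothetical helper): list every non-empty value
# of column COL that occurs more than once, as "count<TAB>value".
count_dup_ids() {
  awk -F'\t' -v col="$2" '
    NR > 1 && $col != "" { n[$col]++ }   # skip the header row and blank cells
    END { for (id in n) if (n[id] > 1) print n[id] "\t" id }
  ' "$1" | sort -k2,2                    # sort by value for stable output
}
```

The column number has to be looked up in the header first; piping the result through `wc -l` gives the number of duplicated accessions.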
Oh yes, my bash command line above fills the empty cells with NA, which helps a lot.
@anwarMZ
Extension of Issue #42
@anwarMZ can this issue be closed now?
Before closing it, I would like to know how to interpret the download script's output file, virusseq.gisaid_duplicate_ids.txt.
Could it be more explicit, and can it be documented in CONTRIBUTING.md?
This is how I extract the metadata in bash:
filefromVirusSeq=9c65ea71-30bb-49b1-bb60-85dbc31173b3
tar -axf source/$filefromVirusSeq -O files-archive-$filefromVirusSeq.tsv | tr ' ' '' | sed 's/\t\t/\tNA\t/g' | sed 's/\t\t/\tNA\t/g' | sed 's/\t$/\tNA/g' > datafromvirrusseq$shortname
From here, column 44 has 36 duplicates, like this one: EPI_ISL_2479805
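A note on the NA filling: the `s/\t\t/\tNA\t/g` substitution is applied twice because matches cannot overlap, so one pass leaves every second cell in a run of empty cells unfilled. A minimal single-pass alternative, assuming tab-separated input on stdin; `fill_na` is a hypothetical name and not part of the repository:

```shell
# fill_na (hypothetical helper): read a TSV on stdin and write it to stdout
# with every empty cell replaced by NA.
fill_na() {
  awk 'BEGIN { FS = OFS = "\t" }
       { for (i = 1; i <= NF; i++) if ($i == "") $i = "NA"; print }'
}
```

Unlike the sed steps, this sketch also fills an empty cell in the first column. It could take the place of the three sed commands in the pipeline, for example `tar ... | fill_na > output.tsv`.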