COG-UK / datapipe

Nextflow implementation of grapevine
GNU General Public License v3.0

Duplicate entries in cog_metadata.csv, some with differing lineages #31

Closed: AngieHinrichs closed this issue 3 years ago

AngieHinrichs commented 3 years ago

Some sequence names in cog_metadata.csv appear multiple times. Here's a quick way to find them:

tail -n+2 cog_metadata.csv | cut -d, -f1 | sort | uniq -c | awk '$1 > 1 {print $2;}' > dupIds

In many cases, the duplicate entries have the same lineage. Here's a way to get the names of the duplicates with consistent lineage:

tail -n+2 cog_metadata.csv | cut -d, -f1,7 | sort | uniq -c | awk '$1 > 1 {print $2;}' | cut -d, -f 1 > dupIdLinIds

Then we can get the names of duplicates with different lineage assignments:

comm -23 dupIds dupIdLinIds > dupDiffLinIds

And then a file with the duplicates and the different lineages assigned to them:

tail -n+2 cog_metadata.csv | cut -d, -f1,7 | grep -Fwf dupDiffLinIds | sort > dupIdDiffLins
head -2 dupIdDiffLins
England/ALDP-12BA668/2021,B.1.1.7
England/ALDP-12BA668/2021,None
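The steps above can also be sketched as a single awk pass, in case that is easier to rerun. This is only a sketch: the function name `dup_diff_lineages` is made up for illustration, and it assumes (as the `cut` commands above do) that column 1 is the sequence name, column 7 is the lineage, and no field contains a quoted comma. One difference from the `comm`-based approach: a name that appears three times, twice with one lineage and once with another, ends up in dupIdLinIds and so is dropped from dupDiffLinIds, whereas this version still reports it.

```shell
# Print every sequence name that appears with more than one distinct lineage.
# Assumptions (hypothetical helper, not part of datapipe):
#   column 1 = sequence name, column 7 = lineage, no quoted commas in fields.
dup_diff_lineages() {
  awk -F, '
    NR == 1 { next }                  # skip the header row
    {
      key = $1 SUBSEP $7
      if (!(key in seen)) {           # first time this name/lineage pair is seen
        seen[key] = 1
        nlin[$1]++                    # count distinct lineages per name
      }
    }
    END {
      for (name in nlin)
        if (nlin[name] > 1) print name
    }
  ' "$1" | sort
}
```

It could be run as `dup_diff_lineages cog_metadata.csv`, which prints one sequence name per line.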

Even when the duplicates are assigned the same lineage, the assignment details sometimes differ:

grep ALDP-12BA659 cog_metadata.csv | cut -d, -f 1-13
England/ALDP-12BA659/2021,UK,UK-ENG,Y,2021-02-14,60,B.1.1.7,PLEARN-v1.2.13,0.0,0.9901517473942366,Alpha (B.1.1.7-like),0.956500,0.043500
England/ALDP-12BA659/2021,UK,UK-ENG,Y,2021-02-14,60,B.1.1.7,PANGO-v1.2.13,,,Alpha (B.1.1.7-like),0.782600,0.130400
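To see which columns differ between duplicate rows without comparing them by eye, something like the following might help. It is a rough sketch: `dup_differing_columns` is a hypothetical helper name, and it assumes column 1 is the sequence name and that no field contains a quoted comma. Each later row for a name is compared against the first row seen for that name, and the 1-based numbers of the columns that differ are printed.

```shell
# For each repeated sequence name, list the column numbers in which a later
# row differs from the first row seen for that name.
# Assumptions (hypothetical helper): column 1 = sequence name, no quoted commas.
dup_differing_columns() {
  awk -F, '
    NR == 1 { next }                            # skip the header row
    !($1 in first) { first[$1] = $0; next }     # remember the first row per name
    {
      n = split(first[$1], a, ",")
      cols = ""
      for (i = 2; i <= NF || i <= n; i++)
        # append "" to force a string comparison, so 0.0 and 0 count as different
        if (($i "") != (a[i] ""))
          cols = cols (cols == "" ? "" : " ") i
      if (cols != "") print $1 ": " cols
    }
  ' "$1"
}
```

For the two ALDP-12BA659 rows shown above (first 13 columns only), this would report columns 8, 9, 10, 12 and 13.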

I've attached the files generated by those commands from the cog_metadata.csv I downloaded earlier today (2021-07-07), with .txt appended to the names so GitHub will accept them: dupIdDiffLins.txt, dupDiffLinIds.txt, dupIdLinIds.txt, dupIds.txt.

Thanks!

rmcolq commented 3 years ago

Fixed for tomorrow's run in commit dd2c615. I had deliberately kept the metadata rows that are labelled for exclusion so that the "why_excluded" column could be published in some pipeline outputs. But when we are not publishing the "why_excluded" column, we DO need to exclude those rows; otherwise there is no way of knowing which row corresponds to the fasta sequence entry after deduplication.
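For anyone wanting to apply the same idea to a file that still carries the column, here is a rough sketch of the filtering step. It is not the code from the commit: `drop_excluded_rows` is a hypothetical helper, and it assumes that a non-empty "why_excluded" value marks a row for exclusion and that no field contains a quoted comma. The column is located by its header name, so its position does not matter; if the column is absent, every row is kept.

```shell
# Keep the header and every row whose why_excluded field is empty.
# Assumptions (hypothetical helper, not the datapipe implementation):
#   a non-empty why_excluded value means "exclude", no quoted commas in fields.
drop_excluded_rows() {
  awk -F, '
    NR == 1 {
      for (i = 1; i <= NF; i++)
        if ($i == "why_excluded") col = i     # find the column by header name
      print
      next
    }
    col == 0 || $col == "" { print }          # no such column, or row not flagged
  ' "$1"
}
```

Running `drop_excluded_rows metadata.csv | tail -n+2 | cut -d, -f1 | sort | uniq -d` (with `metadata.csv` standing in for whichever file is being checked) should then print nothing if the remaining sequence names are unique.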

AngieHinrichs commented 3 years ago

Awesome, thanks @rmcolq!