AngieHinrichs closed this issue 3 years ago
Fixed it for tomorrow's run in commit dd2c615. I had deliberately set it not to exclude metadata rows that are labelled for exclusion, so that the "why_excluded" column could be published in some pipeline outputs. But when we are not publishing the "why_excluded" column, we DO need to exclude them; otherwise we have no way of knowing which row corresponds to the fasta sequence entry after deduplication.
Awesome, thanks @rmcolq!
Some sequence names in cog_metadata.csv appear multiple times. Here's a quick way to find them:
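One possible way to do this, as a minimal Python sketch rather than the command used in this issue. It assumes the name column is called `sequence_name`; the rows in `SAMPLE` are made-up toy data, not real entries from cog_metadata.csv.

```python
import csv
import io
from collections import Counter


def duplicate_names(csv_file, name_col="sequence_name"):
    """Return the sorted names that occur in more than one row."""
    counts = Counter(row[name_col] for row in csv.DictReader(csv_file))
    return sorted(name for name, n in counts.items() if n > 1)


# Toy rows, made up for illustration only (not real COG-UK data).
SAMPLE = """sequence_name,lineage
England/A/2021,B.1.1.7
England/A/2021,B.1.1.7
England/B/2021,B.1.1.7
England/B/2021,B.1.617.2
England/C/2021,B.1.1.7
"""

dup_ids = duplicate_names(io.StringIO(SAMPLE))
print(dup_ids)
```

To run it on the real file, pass `open("cog_metadata.csv", newline="")` in place of the `io.StringIO` object.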
In many cases, the duplicate entries have the same lineage. Here's a way to get the names of the duplicates with consistent lineage:
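A hedged Python sketch of that step (not the original command): collect the set of lineages seen for each name and keep the names that appear more than once but with a single lineage. The column names `sequence_name` and `lineage` are assumptions, and `SAMPLE` is made-up toy data.

```python
import csv
import io
from collections import defaultdict


def same_lineage_duplicates(csv_file, name_col="sequence_name", lineage_col="lineage"):
    """Return sorted names that are duplicated and always carry the same lineage."""
    row_count = defaultdict(int)
    lineages = defaultdict(set)
    for row in csv.DictReader(csv_file):
        row_count[row[name_col]] += 1
        lineages[row[name_col]].add(row[lineage_col])
    return sorted(
        name for name, n in row_count.items()
        if n > 1 and len(lineages[name]) == 1
    )


# Toy rows, made up for illustration only (not real COG-UK data).
SAMPLE = """sequence_name,lineage
England/A/2021,B.1.1.7
England/A/2021,B.1.1.7
England/B/2021,B.1.1.7
England/B/2021,B.1.617.2
England/C/2021,B.1.1.7
"""

consistent_ids = same_lineage_duplicates(io.StringIO(SAMPLE))
print(consistent_ids)
```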
Then we can get the names of duplicates with different lineage assignments:
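As a rough Python sketch (again, not the command from the issue): a name with more than one distinct lineage is necessarily a duplicate, so it is enough to group lineages by name and keep the names whose set has more than one member. The column names are assumptions and `SAMPLE` is made-up toy data.

```python
import csv
import io
from collections import defaultdict


def conflicting_lineage_names(csv_file, name_col="sequence_name", lineage_col="lineage"):
    """Return sorted names that were assigned more than one distinct lineage."""
    lineages = defaultdict(set)
    for row in csv.DictReader(csv_file):
        lineages[row[name_col]].add(row[lineage_col])
    return sorted(name for name, lins in lineages.items() if len(lins) > 1)


# Toy rows, made up for illustration only (not real COG-UK data).
SAMPLE = """sequence_name,lineage
England/A/2021,B.1.1.7
England/A/2021,B.1.1.7
England/B/2021,B.1.1.7
England/B/2021,B.1.617.2
England/C/2021,B.1.1.7
"""

conflict_ids = conflicting_lineage_names(io.StringIO(SAMPLE))
print(conflict_ids)
```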
And then a file with the duplicates and the different lineages assigned to them:
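A minimal sketch of how such a file could be produced in Python. The output layout (name, a tab, then the comma-separated lineages) is an assumption and may differ from the attached file; the column names are assumptions too, and `SAMPLE` is made-up toy data.

```python
import csv
import io
from collections import defaultdict


def write_lineage_conflicts(csv_file, out, name_col="sequence_name", lineage_col="lineage"):
    """Write one line per conflicting name: name, tab, comma-separated lineages."""
    lineages = defaultdict(set)
    for row in csv.DictReader(csv_file):
        lineages[row[name_col]].add(row[lineage_col])
    for name in sorted(lineages):
        if len(lineages[name]) > 1:
            out.write(f"{name}\t{','.join(sorted(lineages[name]))}\n")


# Toy rows, made up for illustration only (not real COG-UK data).
SAMPLE = """sequence_name,lineage
England/A/2021,B.1.1.7
England/A/2021,B.1.1.7
England/B/2021,B.1.1.7
England/B/2021,B.1.617.2
England/C/2021,B.1.1.7
"""

report = io.StringIO()
write_lineage_conflicts(io.StringIO(SAMPLE), report)
print(report.getvalue(), end="")
```

Passing an open file handle as `out` instead of the `io.StringIO` buffer writes the report to disk.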
In some cases when the same lineage is assigned, the assignment details are different:
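One way to look for these, sketched in Python under some assumptions: every column other than the name and the lineage is treated as an "assignment detail", which may be broader than what was compared originally. The `note` column and the rows in `SAMPLE` are hypothetical and made up for illustration.

```python
import csv
import io
from collections import defaultdict


def same_lineage_different_details(csv_file, name_col="sequence_name", lineage_col="lineage"):
    """Return sorted names with a single lineage but differing values elsewhere."""
    lineages = defaultdict(set)
    details = defaultdict(set)
    for row in csv.DictReader(csv_file):
        name = row[name_col]
        lineages[name].add(row[lineage_col])
        # Everything except the name and lineage counts as "details" here.
        rest = tuple(sorted(
            (col, val) for col, val in row.items()
            if col not in (name_col, lineage_col)
        ))
        details[name].add(rest)
    return sorted(
        name for name in lineages
        if len(lineages[name]) == 1 and len(details[name]) > 1
    )


# Toy rows with a hypothetical "note" column (not real COG-UK data).
SAMPLE = """sequence_name,lineage,note
England/A/2021,B.1.1.7,x
England/A/2021,B.1.1.7,y
England/B/2021,B.1.1.7,x
England/B/2021,B.1.1.7,x
England/C/2021,B.1.1.7,x
"""

detail_diff_ids = same_lineage_different_details(io.StringIO(SAMPLE))
print(detail_diff_ids)
```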
I've attached the files generated by those commands from cog_metadata.csv downloaded earlier today (2021-07-07), with .txt appended to the names so GitHub will accept them: dupIdDiffLins.txt, dupDiffLinIds.txt, dupIdLinIds.txt, dupIds.txt.
Thanks!