TAMU-CPT / training-material

A collection of Galaxy-related training material
https://training.galaxyproject.org

FAQ entry on duplicate entries in comparative Top hits output #67

Closed jrr-cpt closed 4 years ago

jrr-cpt commented 4 years ago

Problem: Top protein or nucleotide hit lists contain duplicate entries. For example:

[Image: Duplicate protein top hits]

Cause: When top hits from the BLASTp job are processed, organisms are counted separately for each unique accession and for each TaxID. For organisms with a representative genome in NCBI's RefSeq collection, the same organism is listed more than once, with identical TaxIDs but different accessions. Some organisms with many representative genomes in the database have been assigned multiple TaxIDs, each with its own accession. Both cases produce what appear to be duplicates in the Top hits list. The user should verify that the entries do in fact represent the same organism. The number of top hits displayed in the output list can be adjusted by the user when running the relatedness tool.
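To help with that verification, here is a minimal sketch of how the two cases could be told apart once the top hits are in tabular form. It is not the relatedness tool's own code. The field names (`organism`, `taxid`, `accession`), the function name `find_apparent_duplicates`, and all sample values are hypothetical placeholders; the tool's real output columns may differ.

```python
from collections import defaultdict


def find_apparent_duplicates(hits):
    """Report organism names that occur more than once in a top-hits list.

    `hits` is a list of dicts with the hypothetical keys
    'organism', 'taxid' and 'accession'.
    """
    # Collect all rows that share the same organism name.
    by_name = defaultdict(list)
    for hit in hits:
        by_name[hit["organism"]].append(hit)

    report = {}
    for name, rows in by_name.items():
        if len(rows) < 2:
            continue  # listed once, so not an apparent duplicate
        taxids = {row["taxid"] for row in rows}
        accessions = {row["accession"] for row in rows}
        # One TaxID with several accessions is the first case described above;
        # several TaxIDs under one name is the second case.
        if len(taxids) == 1:
            kind = "same TaxID, multiple accessions"
        else:
            kind = "multiple TaxIDs"
        report[name] = {
            "kind": kind,
            "taxids": sorted(taxids),
            "accessions": sorted(accessions),
        }
    return report


# Made-up placeholder rows, not real NCBI records.
example_hits = [
    {"organism": "Phage A", "taxid": 1001, "accession": "ACC_0001"},
    {"organism": "Phage A", "taxid": 1001, "accession": "ACC_0002"},
    {"organism": "Phage B", "taxid": 2001, "accession": "ACC_0003"},
    {"organism": "Phage B", "taxid": 2002, "accession": "ACC_0004"},
    {"organism": "Phage C", "taxid": 3001, "accession": "ACC_0005"},
]

for name, info in find_apparent_duplicates(example_hits).items():
    print(name, info["kind"], info["taxids"], info["accessions"])
```

Grouping by organism name is only a first pass: matching names do not prove the entries are the same organism, so the flagged rows still need to be checked by hand.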

jrr-cpt commented 4 years ago

Add this to illustrate the cause:

[Image: TaxIDsvAccessions]