RNAcentral / rnacentral-import-pipeline

RNAcentral data import pipeline
Apache License 2.0

Do not use common names from rnc_accessions #152

Closed blakesweeney closed 2 years ago

blakesweeney commented 2 years ago

Fixes https://github.com/RNAcentral/rnacentral-webcode/issues/75. I'm fairly certain the weird common names come from the parsing used to build the rnc_accessions table. For some databases (probably only ENA) we parse the names we are given to extract a common name, which is unreliable. We don't need to fix that parsing (if that is even possible), because we have a better source of taxonomy information: rnc_taxonomy. That table is based on the NCBI taxonomy and uses the official names for species, so no parsing is needed to get a common name. So we stop using the common_name from rnc_accessions and use the one from rnc_taxonomy instead.
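As a rough illustration of the idea (not the actual pipeline query), the sketch below takes common_name from rnc_taxonomy by joining on the taxid instead of reading the parsed value from rnc_accessions. The column names (`taxid`, `id`, `species`, `common_name`), the sample rows, and the helper names are assumptions made up for this example; the real schema may differ.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a simplified, hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE rnc_taxonomy (
            id INTEGER PRIMARY KEY,   -- NCBI taxid (assumed column name)
            name TEXT,
            common_name TEXT
        );
        CREATE TABLE rnc_accessions (
            accession TEXT PRIMARY KEY,
            taxid INTEGER,            -- assumed column name
            species TEXT,
            common_name TEXT          -- parsed from names, unreliable
        );
        INSERT INTO rnc_taxonomy VALUES (9606, 'Homo sapiens', 'human');
        INSERT INTO rnc_accessions VALUES
            ('ACC1', 9606, 'Homo sapiens', 'clone xyz'),
            ('ACC2', 12345, 'Unknown sp.', 'parsed junk');
        """
    )
    return conn


def common_names(conn):
    """Return (accession, species, common_name), with common_name taken
    from rnc_taxonomy rather than rnc_accessions."""
    rows = conn.execute(
        """
        SELECT acc.accession, acc.species, tax.common_name
        FROM rnc_accessions acc
        LEFT JOIN rnc_taxonomy tax ON tax.id = acc.taxid
        ORDER BY acc.accession
        """
    )
    return rows.fetchall()


if __name__ == "__main__":
    for row in common_names(build_demo_db()):
        print(row)
```

The LEFT JOIN keeps accessions whose taxid has no rnc_taxonomy entry; they simply get no common name instead of a parsed one.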

What this does not change is the use of the species or lineage from the rnc_accessions table. That should change eventually, but first we have to ensure that rnc_taxonomy has an entry for every taxid in xref/accession. This may not be true currently, and is not yet enforced, so I leave those lines alone.
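One way to check that precondition would be an anti-join that lists taxids with no rnc_taxonomy entry. This is only a sketch: the simplified `xref` table, the `taxid` and `id` columns, the sample rows, and the helper names are all hypothetical and stand in for whatever the real schema uses.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a simplified, hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE rnc_taxonomy (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE xref (ac TEXT, taxid INTEGER);  -- assumed columns
        INSERT INTO rnc_taxonomy VALUES (9606, 'Homo sapiens');
        INSERT INTO xref VALUES
            ('ACC1', 9606),
            ('ACC2', 12345),
            ('ACC3', 12345);
        """
    )
    return conn


def missing_taxids(conn):
    """Return the sorted, distinct taxids used in xref that have no
    matching row in rnc_taxonomy."""
    rows = conn.execute(
        """
        SELECT DISTINCT x.taxid
        FROM xref x
        LEFT JOIN rnc_taxonomy tax ON tax.id = x.taxid
        WHERE tax.id IS NULL
        ORDER BY x.taxid
        """
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(missing_taxids(build_demo_db()))
```

An empty result would suggest it is safe to switch species and lineage over to rnc_taxonomy as well; a foreign key constraint would be the way to enforce it afterwards.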