Closed blakesweeney closed 1 year ago
Note that some sequences (~400) don't work yet because the taxid was deleted after we imported the sequence but before we imported the taxonomic information. I can fix this, however.
This issue is stale and only partially completed. I will close it in favor of newer, smaller ones.
I've created a table called `rnc_taxonomy` that stores the taxonomic information for an NCBI taxid. It is very simple, and I'm still fixing an issue with importing data (it fails to import some organisms because of quotes in their aliases), but that only affects a few of the taxids we use (~140). The table has 5 columns:

- `id`: The primary key, which is an NCBI taxid as an integer.
- `name`: The scientific name for that taxid.
- `lineage`: The lineage (`cellular organisms; Archaea; Euryarchaeota...`) for the taxid.
- `aliases`: Array of other recognised names for that taxid.
- `replaced_by`: If this taxid has been merged into another one then this will show the new taxid, otherwise null.

The table will contain all taxids, not just species or any other level. In addition, it will track whether a taxid has been merged into another one with the `replaced_by` column. When a taxid has been merged, the name and lineage information will reflect the taxid it has been merged into. We can use this table instead of the `classification` and `species` columns in `rnc_accessions`.

For now I will fix the issues with importing and continue to populate the table each time we run the pipeline. I will also use it for search export so we can hopefully allow searches of `E. coli` to actually find the expected species. Once @BurkovBA says we have switched over to using it in the webcode I will delete the relevant columns from `rnc_accessions`.
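
To make the table layout and the `replaced_by` behaviour concrete, here is a minimal sketch in Python. It is not the actual pipeline code: an in-memory SQLite database stands in for the real one, `aliases` is stored as JSON text because SQLite has no array type, the helper names (`build_example_db`, `current_taxid`, `find_taxids`) are hypothetical, and the rows are made up for illustration (taxid 999999 is not a real merged taxid and the lineage is shortened).

```python
import json
import sqlite3

# Assumption: SQLite stands in for the real database, and aliases is JSON
# text because SQLite has no array column type.
SCHEMA = """
CREATE TABLE rnc_taxonomy (
    id          INTEGER PRIMARY KEY,          -- NCBI taxid
    name        TEXT NOT NULL,                -- scientific name
    lineage     TEXT NOT NULL,                -- '; '-separated lineage
    aliases     TEXT NOT NULL DEFAULT '[]',   -- JSON list of other names
    replaced_by INTEGER                       -- new taxid if merged, else NULL
)
"""


def build_example_db():
    """Create an in-memory table with two illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    lineage = "cellular organisms; Bacteria; Escherichia"  # shortened, illustrative
    rows = [
        (562, "Escherichia coli", lineage, json.dumps(["E. coli"]), None),
        # Made-up merged taxid: name and lineage mirror the taxid it was merged into.
        (999999, "Escherichia coli", lineage, json.dumps([]), 562),
    ]
    conn.executemany("INSERT INTO rnc_taxonomy VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def current_taxid(conn, taxid):
    """Return the taxid to use now: replaced_by if merged, else the taxid itself.

    Returns None when the taxid is not in the table.
    """
    row = conn.execute(
        "SELECT replaced_by FROM rnc_taxonomy WHERE id = ?", (taxid,)
    ).fetchone()
    if row is None:
        return None
    return row[0] if row[0] is not None else taxid


def find_taxids(conn, term):
    """Return unmerged taxids whose name or any alias equals term, ignoring case."""
    wanted = term.casefold()
    hits = []
    query = "SELECT id, name, aliases FROM rnc_taxonomy WHERE replaced_by IS NULL"
    for taxid, name, aliases in conn.execute(query):
        names = [name] + json.loads(aliases)
        if any(n.casefold() == wanted for n in names):
            hits.append(taxid)
    return sorted(hits)
```

With these example rows, `find_taxids(conn, "E. coli")` matches through the alias list rather than the scientific name, which is the kind of lookup the search export is meant to enable, and `current_taxid(conn, 999999)` follows `replaced_by` to the surviving taxid.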