fhcrc / taxtastic

Create and maintain phylogenetic "reference packages" of biological sequences.
GNU General Public License v3.0

ncbi new_database failing unique constraint on `names` #124

Closed: dhoogest closed this issue 5 years ago

dhoogest commented 5 years ago

Ongoing changes to the NCBI taxonomy are causing the primary key constraint on `names` to be violated, which makes `taxit new_database` fail. According to feedback from NCBI, this is the result of an ongoing 'upgrade' to the taxonomy system, and records do appear to be corrected incrementally (they also indicated that a "new version with more information may need to be adapted in the future").

Loading the NCBI data from the dump succeeds if the `id` primary key column is restored, replacing the combined key of `tax_id`, `tax_name`, and `name_class`. However, I'm not sure whether this approach has downstream ramifications.
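To make the trade-off concrete, here is a minimal sketch using an in-memory SQLite database. The table layout, the `load_names` helper, and the sample rows are assumptions made for illustration only; they are not taxtastic's actual schema or loading code. It shows that a composite key rejects a repeated (`tax_id`, `tax_name`, `name_class`) combination, while a surrogate `id` key loads the same data but silently keeps the duplicate.

```python
import sqlite3

# Hypothetical sample rows: (tax_id, tax_name, name_class).
# The last row repeats the first one to mimic the problem in the dump.
ROWS = [
    ("1", "root", "scientific name"),
    ("2", "Bacteria", "scientific name"),
    ("1", "root", "scientific name"),
]


def load_names(rows, use_surrogate_id):
    """Load rows into a throwaway `names` table and report what happened."""
    conn = sqlite3.connect(":memory:")
    if use_surrogate_id:
        # Surrogate key: uniqueness of the name columns is NOT enforced.
        conn.execute(
            "CREATE TABLE names ("
            "id INTEGER PRIMARY KEY, tax_id TEXT, tax_name TEXT, name_class TEXT)"
        )
    else:
        # Composite key: a repeated combination violates the constraint.
        conn.execute(
            "CREATE TABLE names ("
            "tax_id TEXT, tax_name TEXT, name_class TEXT, "
            "PRIMARY KEY (tax_id, tax_name, name_class))"
        )
    try:
        with conn:  # commits on success, rolls back on error
            conn.executemany(
                "INSERT INTO names (tax_id, tax_name, name_class) VALUES (?, ?, ?)",
                rows,
            )
    except sqlite3.IntegrityError as err:
        return "failed: {}".format(err)
    count = conn.execute("SELECT COUNT(*) FROM names").fetchone()[0]
    return "loaded {} rows".format(count)


if __name__ == "__main__":
    print(load_names(ROWS, use_surrogate_id=False))
    print(load_names(ROWS, use_surrogate_id=True))
```

In this sketch the surrogate-key variant succeeds only because nothing checks for duplicates any more, so any downstream query that expects one row per key combination would have to handle repeats itself.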

nhoffman commented 5 years ago

Well, we'd have problems if we rely on the assumption that there is only one "scientific name" for each tax_id. Let me have a look at the actual duplicates. It may actually be possible to leave the index as is and modify the source data as it is read in to ensure uniqueness.
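One way to look at the actual duplicates is to count how often each key combination occurs before loading anything. The following is only a sketch: the `duplicated_keys` helper and the assumed column order (`tax_id`, `tax_name`, unique name, `name_class`) are illustrative and are not taken from taxtastic.

```python
from collections import Counter


def duplicated_keys(rows, key_columns=(0, 1, 3)):
    """Return {key: count} for every key combination seen more than once.

    `rows` is an iterable of sequences; `key_columns` picks the positions
    that form the key (assumed here: tax_id, tax_name, name_class).
    """
    counts = Counter(tuple(row[i] for i in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Hypothetical rows: (tax_id, tax_name, unique_name, name_class)
    sample = [
        ("10", "Example name", "Example name <a>", "scientific name"),
        ("10", "Example name", "Example name <b>", "scientific name"),
        ("11", "Other name", "", "scientific name"),
    ]
    for key, n in duplicated_keys(sample).items():
        print(key, n)
```

With the made-up sample above, the two rows for `tax_id` 10 collide on the key even though their unique names differ, which is the kind of case where changing the data as it is read in could restore uniqueness without touching the index.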

dhoogest commented 5 years ago

unique name used instead of scientific name: e91cf0701ebd2e2596762ba1a58af483a6f2c297

dhoogest commented 5 years ago

Another issue with uniqueness in the dump file emerged at the end of Dec 2018. This time, an entire row was duplicated in the `names` table. The proposal is to force uniqueness of lines while reading the archive, as opposed to doing a (likely more complicated) upsert.
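A minimal sketch of that proposal, assuming the dump is read line by line: exact duplicate lines are skipped before they are parsed, so the insert never sees them. The helper names `unique_lines` and `parse_dmp_line` are hypothetical and this is not the code that was committed. The parsing relies on the NCBI `.dmp` convention of `\t|\t` as field separator and `\t|` as line terminator.

```python
def unique_lines(lines):
    """Yield each distinct line once, preserving first-seen order."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


def parse_dmp_line(line):
    """Split one NCBI .dmp line into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the trailing "\t|" terminator
    return line.split("\t|\t")


if __name__ == "__main__":
    # Hypothetical excerpt of names.dmp with one fully duplicated row.
    raw = [
        "1\t|\troot\t|\t\t|\tscientific name\t|\n",
        "1\t|\tall\t|\t\t|\tsynonym\t|\n",
        "1\t|\troot\t|\t\t|\tscientific name\t|\n",
    ]
    for fields in map(parse_dmp_line, unique_lines(raw)):
        print(fields)
```

This keeps every distinct line in memory as a set entry, which trades memory for simplicity compared with an upsert. It only removes rows that are identical in every field; rows that differ in any column are passed through unchanged.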

nhoffman commented 5 years ago

@dhoogest - this is fixed, correct?

nhoffman commented 5 years ago

@dhoogest - never mind - we fixed the original problem with scientific names; I was looking for a separate issue for the row duplication.

nhoffman commented 5 years ago

will fix in 124-duplicate-dmp-rows

nhoffman commented 5 years ago

closed by 85130273466c287a7df1f827a736e5f4efcb035a (v0.8.9)