Closed dhoogest closed 5 years ago
Well, we'd have problems if we rely on the assumption that there is only one "scientific name" for each tax_id. Let me have a look at the actual duplicates. It may actually be possible to leave the index as is and modify the source data as it is read in to ensure uniqueness.
unique name used instead of scientific name: e91cf0701ebd2e2596762ba1a58af483a6f2c297
Another uniqueness issue in the dumpfile emerged at the end of Dec 2018. This time, an entire row was duplicated in the names table. The proposal is to force uniqueness of lines while reading the archive, as opposed to doing a (likely more complicated) upsert.
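A minimal sketch of what "force uniqueness of lines during read" could look like, assuming the dump is read line by line as text. The helper names `unique_rows` and `parse_dmp_line` are hypothetical and are not taken from the taxtastic code base; the sample rows are made up for illustration and only follow the NCBI `names.dmp` layout (fields separated by tab-pipe-tab, each line ending in tab-pipe).

```python
def unique_rows(lines):
    """Yield each distinct line once, keeping first-seen order.

    Hypothetical helper: drops rows that are exact duplicates of an
    earlier row, which is the failure mode described above.
    """
    seen = set()
    for line in lines:
        if line in seen:
            continue  # exact duplicate of a row already emitted
        seen.add(line)
        yield line


def parse_dmp_line(line):
    """Split one .dmp line into its fields (hypothetical helper)."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # strip the trailing field terminator
    return line.split("\t|\t")


# Illustrative input only: the last row repeats the second one.
sample = [
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n",
    "9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n",
    "9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n",
]

rows = [parse_dmp_line(line) for line in unique_rows(sample)]
```

This keeps every distinct line in memory in a set, which is simple but not free for a large dump; an upsert would instead push the check into the database.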
@dhoogest - this is fixed, correct?
@dhoogest - never mind - we fixed the original problem with scientific names; I was looking for a separate issue for the row duplication.
will fix in 124-duplicate-dmp-rows
closed by 85130273466c287a7df1f827a736e5f4efcb035a (v0.8.9)
Changes (ongoing) to NCBI taxonomy are causing the primary key relationship on names to be violated, resulting in a failure when executing `taxit new_database`. According to feedback from NCBI, this is the result of an ongoing 'upgrade' to the taxonomy system, and it does appear that records are being corrected incrementally (they also indicated that a "new version with more information may need to be adapted in the future").

Load of ncbi data from dump succeeds if the `id` primary column is restored, replacing the combined key of `tax_id`, `tax_name`, and `name_class`; however, I'm not sure if there are downstream ramifications of this approach.
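To illustrate the trade-off, here is a small sketch using Python's built-in `sqlite3` module. The two table definitions are simplified assumptions for demonstration, not the actual taxtastic schema, and the duplicated row is invented.

```python
import sqlite3

# Invented example: the same (tax_id, tax_name, name_class) appears twice.
row = (9606, "human", "genbank common name")
duplicated = [row, row]

conn = sqlite3.connect(":memory:")

# Assumed simplified schema with the combined key.
conn.execute(
    """CREATE TABLE names_composite (
           tax_id INTEGER, tax_name TEXT, name_class TEXT,
           PRIMARY KEY (tax_id, tax_name, name_class))"""
)
# Assumed simplified schema with a surrogate id primary column.
conn.execute(
    """CREATE TABLE names_surrogate (
           id INTEGER PRIMARY KEY,
           tax_id INTEGER, tax_name TEXT, name_class TEXT)"""
)

try:
    conn.executemany(
        "INSERT INTO names_composite VALUES (?, ?, ?)", duplicated
    )
    composite_ok = True
except sqlite3.IntegrityError:
    # The second copy of the row violates the combined primary key.
    composite_ok = False

# With a surrogate key the load goes through, duplicates included.
conn.executemany(
    "INSERT INTO names_surrogate (tax_id, tax_name, name_class) "
    "VALUES (?, ?, ?)",
    duplicated,
)
surrogate_count = conn.execute(
    "SELECT COUNT(*) FROM names_surrogate"
).fetchone()[0]
```

In this sketch the combined key rejects the load, while the `id` column accepts it and silently keeps both copies. Any query that expects one row per name would then see duplicates, which may be one of the downstream ramifications to check.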