NAL-i5K / tripal_eutils

ncbi loader via the eutils interface
GNU General Public License v3.0

Assembly xrefs vs primary accessions vs linked records #194

Closed bradfordcondon closed 5 years ago

bradfordcondon commented 5 years ago

child of #192

Assembly parser currently has the following in $info['accessions']['assembly']:

In #192 we had a case where importing GCA_000188095.3 resulted in all 4 of these being imported with different values, but all of them point back to the same UID!

Which are accessions? Which are linked records, and therefore separate analyses that need to be created? WGS: I don't think that's an assembly at all, as it's in nucleotide. We don't have a "dbxref" section in the assembly parser. We should, and that's where it should go.

so for GCA_000188095.3

| ACCESSION TYPE | VALUE | CREATED | NOTES |
|---|---|---|---|
| Assembly | 6127518 | YES | This is the RefSeq UID |
| Assembly | 6049248 | YES | This is the GenBank UID |
| Assembly | GCF_000188095.2 | YES | This is the RefSeq accession |
| Assembly | AEQM02 | | This is the WGS accession. I think maybe it should be in its own section (i.e. just a dbxref, not a linked record) |

mpoelchau commented 5 years ago

As far as I can tell, the GenBank assembly is the same data as the WGS accession, just with different accession numbers. I am not sure why they are separate entities. I suspect that WGS is older, and assembly was added on with some new functionality, based on skimming this article: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4702866/

So - I agree with you that WGS should be a Dbxref of the GenBank assembly (if possible).
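One way to picture the proposed split, as a minimal Python sketch rather than the module's actual PHP parser. The function name `split_accessions`, the `dbxrefs` key, and the source labels such as `"wgs"` are all hypothetical; the values are the ones listed for GCA_000188095.3 earlier in this thread.

```python
def split_accessions(entries):
    """Separate WGS accessions from assembly accessions.

    `entries` is a list of (source, value) pairs. The source labels
    ("refseq_uid", "wgs", ...) are assumptions made for this sketch.
    """
    info = {"accessions": {"assembly": []}, "dbxrefs": []}
    for source, value in entries:
        if source == "wgs":
            # WGS lives in the nucleotide database, so record it as a
            # plain dbxref instead of an assembly accession.
            info["dbxrefs"].append({"db": "WGS", "accession": value})
        else:
            info["accessions"]["assembly"].append(value)
    return info


parsed = split_accessions([
    ("refseq_uid", "6127518"),
    ("genbank_uid", "6049248"),
    ("refseq_accession", "GCF_000188095.2"),
    ("wgs", "AEQM02"),
])
```

With this shape, the WGS accession never reaches the code path that creates linked assembly records.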

I don't think we can assume that GenBank and RefSeq are identical, so these probably aren't dbxrefs.

bradfordcondon commented 5 years ago

> I don't think we can assume that GenBank and RefSeq are identical, so these probably aren't dbxrefs.

OK, I agree, and that makes sense. But that also means I don't know what to do with them, because I don't think I can tell which is referring to the accession being imported and which is a link to another accession.

After some rooting around, we always find that the UIDs for the RefSeq and GenBank records just redirect you to the parent UID. There's no difference between them. As such, I think all of them DO qualify as xrefs, paradoxically.
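The reasoning here could be sketched as a check, written in Python for brevity and not taken from the module. `classify_accessions` and the `resolve` callback are hypothetical; in practice `resolve` would be a lookup through eutils, and the dictionary below only stands in for it. The parent UID is a placeholder string, not a real UID.

```python
def classify_accessions(values, imported_uid, resolve):
    """Split values into xrefs and possible linked records.

    A value counts as an xref when `resolve(value)` returns the UID of
    the record being imported; anything else is treated as a distinct
    record that may need its own analysis.
    """
    xrefs, linked = [], []
    for value in values:
        if resolve(value) == imported_uid:
            xrefs.append(value)
        else:
            linked.append(value)
    return xrefs, linked


PARENT = "parent-uid"  # placeholder for the parent assembly UID
# Stub lookup table standing in for a real eutils query.
lookup = {"6127518": PARENT, "6049248": PARENT, "GCF_000188095.2": PARENT}

xrefs, linked = classify_accessions(list(lookup), PARENT, lookup.get)
```

A non-empty second list would signal an assembly that references a distinct record and should not be collapsed into an xref.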

bradfordcondon commented 5 years ago

These are all now xrefs as of #199.


We MIGHT shoot ourselves in the foot if the assembly references a different, distinct assembly, but I haven't seen a case where that's actually what is going on.