NAL-i5K / tripal_eutils

ncbi loader via the eutils interface
GNU General Public License v3.0

nonunique long form accessions prevent loading data #167

Closed: mpoelchau closed this issue 5 years ago

mpoelchau commented 5 years ago

I imported bioproject 412476, and selected "create linked records". I then successfully published the Project record. The Biological Sample record, however, was not imported. Publications and organism were also not imported.

I tried again, with bioproject 167477. This time one of the 2 biosamples was also imported, but not much metadata came with it: just the accession number as the name. One empty publication record was imported. No organism.

Am I doing things in the wrong order?

bradfordcondon commented 5 years ago

Short answer: no, you've just got a knack for finding edge cases.

So I don't know what to do in these cases where the accessions returned are non-unique, assuming we can't find a way to make them unique.

412476: we expect 3 biosamples, 1 organism, 1 publication from the preview.

upon running the importer:

Inserting record into Chado: bioproject: 412476
[site http://default] [TRIPAL ERROR] [TRIPAL_EUTILS] Unable to find UID for biosample:SAMD00093090

So, it failed loading the first biosample because the accession was non-unique.

With project 167477, we expect 2 biosamples, an assembly, an organism, and a pub.

Calling: tripal_eutils_create_records(bioproject, 167477, 1)
INFO (TRIPAL_EUTILS): Inserting record into Chado: bioproject: 167477
INFO (TRIPAL_EUTILS): Inserting record into Chado: biosample: 2434893
INFO (TRIPAL_EUTILS): Inserting record into Chado: biosample: 2649412
[site http://default] [TRIPAL ERROR] [TRIPAL_EUTILS] Unable to find UID for assembly:GCA_000648675

So in both cases, we get an error because we can't identify that accession uniquely.

multiple assemblies

if we search for the assembly:

https://www.ncbi.nlm.nih.gov/assembly/?term=GCA_000648675

In this case, there are "anomalous results". Presumably we can add filter parameters to the query and get a single result.


multiple biosamples

for the biosample:

https://www.ncbi.nlm.nih.gov/biosample/?term=SAMD00093090

Interestingly, only 1 result here via the GUI. How about the API?

GET /entrez/eutils/esearch.fcgi/?db=biosample&retmode=xml&term=SAMD00093090 HTTP/1.0

Via the API we get two results:

<Id>7714098</Id>
<Id>7714100</Id>

So two samples: SAMD00093521 and SAMD00093090. Cool, 93521 is a pool of 93090 (female) and another sample, the male sample. However, NEITHER XML FILE includes this relationship in a machine-readable way. 93521 describes it in the text only. 93090 doesn't even include it in the text! How does the server even know to return 93521 if I search with 93090? It must have the information stored somewhere!
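
For anyone following along, here is a rough Python sketch of the uniqueness check that trips the importer. It is not the module's code (that is PHP); the function names are made up for illustration, and the XML is a hand-trimmed stand-in for an esearch response where only the two `<Id>` values come from the result above.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed stand-in for an esearch XML response.
# Only the two <Id> values are taken from the thread.
SAMPLE_RESPONSE = """<eSearchResult>
  <Count>2</Count>
  <IdList>
    <Id>7714098</Id>
    <Id>7714100</Id>
  </IdList>
</eSearchResult>"""


def extract_uids(esearch_xml):
    """Return every UID listed under <IdList> in an esearch XML response."""
    root = ET.fromstring(esearch_xml)
    return [node.text for node in root.findall("./IdList/Id")]


def resolve_unique_uid(esearch_xml):
    """Return the single UID, or raise if the accession maps to 0 or 2+ UIDs."""
    uids = extract_uids(esearch_xml)
    if len(uids) != 1:
        raise LookupError(f"expected exactly 1 UID, got {len(uids)}: {uids}")
    return uids[0]


uids = extract_uids(SAMPLE_RESPONSE)
print(uids)
```

With the two-ID response, `resolve_unique_uid` raises instead of guessing, which is roughly the situation behind the "Unable to find UID" error.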

bradfordcondon commented 5 years ago

OK, NCBI does indeed let us specify the field, just not on a per-parameter basis (it applies to the whole search term). Just what we need.

  `$provider->addParam('field', 'accession');`

So SAMD00093090 now returns a single result.
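
To make the change concrete outside the module, here is a small Python sketch of what the resulting esearch request could look like. The helper name `build_esearch_url` is hypothetical; the `db`, `retmode`, `term`, and `field` parameters are the ones used in this thread, and the base URL is the public E-utilities endpoint. It only builds the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Public E-utilities esearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, field=None):
    """Build an esearch URL; `field` limits the whole term to one search field."""
    params = {"db": db, "retmode": "xml", "term": term}
    if field is not None:
        # Same effect as the addParam('field', 'accession') call in the module.
        params["field"] = field
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_esearch_url("biosample", "SAMD00093090", field="accession")
print(url)
```

Leaving `field` unset reproduces the original query, the one that came back with two IDs.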

As for the multiple assemblies, we need to add filters in a similar manner.