NAL-i5K / tripal_eutils

ncbi loader via the eutils interface
GNU General Public License v3.0

unable to find UID for some bioprojects #171

Closed bradfordcondon closed 5 years ago

bradfordcondon commented 5 years ago

Some of the mega bioprojects could not be found. This is probably a similar issue to #167.

 Unable to find UID for bioproject:PRJNA243935
[site http://tripal.test] [TRIPAL ERROR] [TRIPAL_EUTILS] Unable to find UID for bioproject:PRJNA230921
[site http://tripal.test] [TRIPAL ERROR] [TRIPAL_EUTILS] Unable to find UID for bioproject:PRJNA203545
[site http://tripal.test] [TRIPAL ERROR] [TRIPAL_EUTILS] Unable to find UID for bioproject:PRJNA203209
[site http://tripal.test] [TRIPAL ERROR] [TRIPAL_EUTILS] Unable to find UID for bioproject:PRJNA203087
[site http://tripal.test] [TRIPAL ERROR] [TRIPAL_EUTILS] Unable to find UID for bioproject:PRJNA171756
Unable to find UID for bioproject:PRJNA171749
[site http://tripal.test] [TRIPAL ERROR] [TRIPAL_EUTILS] Unable to find UID for bioproject:PRJNA171748
[site http://tripal.test] [TRIPAL ERROR] [TRIPAL_EUTILS] Unable to find UID for bioproject:PRJNA168121

bradfordcondon commented 5 years ago

case 1:

https://www.ncbi.nlm.nih.gov/bioproject/243935 vs https://www.ncbi.nlm.nih.gov/bioproject/342675

The former is obviously PRJNA243935, and the other has a different accession (PRJNA342675), so why isn't it caught by our accession filter?

answer:

  </IdList>
  <TranslationSet/>
  <TranslationStack>
    <TermSet>
      <Term>PRJNA243935[All Fields]</Term>
      <Field>All Fields</Field>
      <Count>2</Count>
      <Explode>N</Explode>
    </TermSet>
    <OP>GROUP</OP>
  </TranslationStack>
  <QueryTranslation>PRJNA243935[All Fields]</QueryTranslation>
</eSearchResult>

For some reason, the translated query has no accession filter: the term was searched against All Fields, which is why it matched 2 records.
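
To check this without reading the raw XML, we can look at QueryTranslation directly. A minimal sketch follows, in Python rather than the module's PHP. The sample response is a hypothetical, trimmed reconstruction of the one above (the Id values are assumed to be the two records linked earlier), and `field_filter_applied` and `hit_ids` are illustrative names, not part of tripal_eutils.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a trimmed eSearchResult shaped like the fragment above.
# The two Id values are an assumption (the two records linked in this comment).
SAMPLE = """<eSearchResult>
  <Count>2</Count>
  <IdList><Id>342675</Id><Id>243935</Id></IdList>
  <TranslationSet/>
  <QueryTranslation>PRJNA243935[All Fields]</QueryTranslation>
</eSearchResult>"""


def field_filter_applied(xml_text, expected_field):
    """Return True if the translated query ends with [expected_field]."""
    root = ET.fromstring(xml_text)
    translation = root.findtext("QueryTranslation", default="")
    return translation.strip().endswith("[" + expected_field + "]")


def hit_ids(xml_text):
    """Return the UIDs listed in the response's IdList."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall("./IdList/Id")]


if __name__ == "__main__":
    # The accession filter was not applied, and more than one UID came back.
    print(field_filter_applied(SAMPLE, "Project Accession"))
    print(hit_ids(SAMPLE))
```

More than one UID together with an `[All Fields]` translation is the signature of this bug. The suffix check is only a heuristic for single-term queries like ours.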

And yet, we definitely pass in the accession field filter?

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi/?db=bioproject&api_key=47e0b8233ccc7a173d2159aebd9e3ecf4009&retmode=xml&term=PRJNA243935&field=accession

The query is constructed correctly, so the bioproject db apparently does not support the field parameter.

Instead, the term needs to be PRJNA243935[Project Accession].

resolution:

      if ($db == 'bioproject') {
        // Bioproject doesn't support the field parameter, so put the
        // field tag in the search term instead.
        $provider->addParam('term', $accession . '[Project Accession]');
      }
      else {
        $provider->addParam('term', $accession);
        $provider->addParam('field', 'accession');
      }

Note that if we keep adding all these exceptions by database, we're going to want to handle it more elegantly.
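
One possible shape for the "more elegant" handling is a lookup table of per-database field tags, so a new exception is one table entry instead of another branch. This is only a sketch, in Python rather than the module's PHP. `TERM_FIELD_TAGS`, `esearch_params`, and `esearch_url` are hypothetical names, and the only entry taken from this issue is bioproject.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Databases whose accession must be matched with a [Field] tag in the term.
# Only the bioproject entry comes from this issue; verify any others first.
TERM_FIELD_TAGS = {
    "bioproject": "Project Accession",
}


def esearch_params(db, accession):
    """Build esearch query parameters for an accession lookup in db."""
    params = {"db": db, "retmode": "xml"}
    tag = TERM_FIELD_TAGS.get(db)
    if tag is not None:
        # The field parameter is ignored here, so tag the term itself.
        params["term"] = f"{accession}[{tag}]"
    else:
        params["term"] = accession
        params["field"] = "accession"
    return params


def esearch_url(db, accession):
    """Return the full esearch URL (no api_key; add one as needed)."""
    return ESEARCH_URL + "?" + urlencode(esearch_params(db, accession))


if __name__ == "__main__":
    print(esearch_url("bioproject", "PRJNA243935"))
```

Databases missing from the table fall back to the existing term plus field=accession behaviour, so nothing changes for them.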

bradfordcondon commented 5 years ago

This actually resolves all the failed projects!

For the future: the advanced search page, https://www.ncbi.nlm.nih.gov/bioproject/advanced, shows which fields we are allowed to filter by. The field names differ for each db and are a bit inconsistent (i.e. bioproject => Project Accession, bioproject => accession, assembly => Assembly Accession).
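
The same information should also be available programmatically. E-utilities has an einfo endpoint that reports a database's searchable fields. The sketch below (Python, not the module's PHP) builds that URL and pulls the field names out of a response. The sample XML is a hypothetical, heavily trimmed stand-in: the field codes in it are illustrative and should be checked against a real response. `einfo_url` and `searchable_fields` are made-up helper names.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EINFO_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi"


def einfo_url(db):
    """Return the einfo URL describing one Entrez database."""
    return EINFO_URL + "?" + urlencode({"db": db, "retmode": "xml"})


# Hypothetical, trimmed response; a real one lists many more fields and
# the short codes used here are illustrative only.
SAMPLE = """<eInfoResult><DbInfo><DbName>bioproject</DbName><FieldList>
<Field><Name>ALL</Name><FullName>All Fields</FullName></Field>
<Field><Name>PRJA</Name><FullName>Project Accession</FullName></Field>
</FieldList></DbInfo></eInfoResult>"""


def searchable_fields(xml_text):
    """Map each field's full name to its short code."""
    root = ET.fromstring(xml_text)
    return {
        field.findtext("FullName"): field.findtext("Name")
        for field in root.findall("./DbInfo/FieldList/Field")
    }


if __name__ == "__main__":
    print(einfo_url("bioproject"))
    print(sorted(searchable_fields(SAMPLE)))
```

The full names from such a response are the ones that go inside the square brackets in a search term.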

bradfordcondon commented 5 years ago

Resolved in #172.