NAL-i5K / tripal_eutils

ncbi loader via the eutils interface
GNU General Public License v3.0

Discrepancy between formatter and inserted records #192

Closed bradfordcondon closed 5 years ago

bradfordcondon commented 5 years ago

Imported GCA_000188095.3. The formatter said it would import 4 assemblies, 1 organism, 2 bioprojects, and 1 biosample. Tripal published 3 bioprojects (1 may be a leftover project from before?), 0 biosamples, and 2 analyses.

this is an assembly

bradfordcondon commented 5 years ago

| Accession type | Value | Created | Notes |
|---|---|---|---|
| Assembly | 6127518 | YES | This is the RefSeq UID. |
| Assembly | 6049248 | YES | This is the GenBank UID. |
| Assembly | GCF_000188095.2 | YES | This is the RefSeq accession. |
| Assembly | AEQM02 | | This is the WGS accession. I think maybe it should be in its own section (i.e. just a dbxref, not a linked record). |
| Organism | 132113 | YES | But it's silent. Let's log. |
| Bioprojects | 61101 | YES | |
| Bioprojects | 70395 | YES | |
| Biosamples | 2953787 | NO | But should be. |

[Screenshot: 2019-02-26 at 11:19:13 AM]

Calling: tripal_eutils_create_records(assembly, GCA_000188095.3, 1)
INFO (TRIPAL_EUTILS): Inserting record into Chado: assembly: 1571891
INFO (TRIPAL_EUTILS): Inserting record into Chado: bioproject: 61101
INFO (TRIPAL_EUTILS): Inserting record into Chado: pubmed: 25908251
INFO (TRIPAL_EUTILS): Inserting record into Chado: pubmed: 9023104
INFO (TRIPAL_EUTILS): Inserting record into Chado: bioproject: 70395
INFO (TRIPAL_EUTILS): Inserting record into Chado: pubmed: 21482769
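
For anyone wanting to reproduce the comparison, here is a rough sketch (in Python, not part of the module) that tallies the `Inserting record into Chado` lines above and compares them with the counts the preview promised. The `PREVIEW` dictionary, its key names, and the helper functions are assumptions made up for illustration; only the log lines and the preview counts come from this thread.

```python
import re
from collections import Counter

# Log lines copied from the drush output in this comment.
LOG = """\
INFO (TRIPAL_EUTILS): Inserting record into Chado: assembly: 1571891
INFO (TRIPAL_EUTILS): Inserting record into Chado: bioproject: 61101
INFO (TRIPAL_EUTILS): Inserting record into Chado: pubmed: 25908251
INFO (TRIPAL_EUTILS): Inserting record into Chado: pubmed: 9023104
INFO (TRIPAL_EUTILS): Inserting record into Chado: bioproject: 70395
INFO (TRIPAL_EUTILS): Inserting record into Chado: pubmed: 21482769
"""

# Counts reported by the preview (first comment). Key names are hypothetical,
# chosen to match the record types that appear in the log.
PREVIEW = {"assembly": 4, "organism": 1, "bioproject": 2, "biosample": 1}

INSERT_RE = re.compile(r"Inserting record into Chado: (\w+): (\S+)")


def inserted_counts(log_text):
    """Count logged inserts per record type."""
    return Counter(m.group(1) for m in INSERT_RE.finditer(log_text))


def discrepancies(preview, inserted):
    """Return {type: (previewed, logged)} for every type where the two differ."""
    types = sorted(set(preview) | set(inserted))
    return {
        t: (preview.get(t, 0), inserted.get(t, 0))
        for t in types
        if preview.get(t, 0) != inserted.get(t, 0)
    }


if __name__ == "__main__":
    for rec_type, (previewed, logged) in discrepancies(PREVIEW, inserted_counts(LOG)).items():
        print(f"{rec_type}: preview={previewed} logged={logged}")
```

This only sees what the log reports, so a record that is created silently (like the organism) shows up as a discrepancy even though it exists in Chado.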
bradfordcondon commented 5 years ago

assemblies

4 assemblies vs. 2 assemblies: GCF_000188095.2 is 6049248, and AEQM02 is 6127518. Therefore, this is a formatter fix.
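
As a rough illustration of what the formatter fix could look like (a Python sketch, not the module's code), the identifiers can be grouped by the record they resolve to, so the preview counts distinct records instead of every alias. The `ALIASES` pairs restate the two equivalences in this comment; the function name and data layout are hypothetical.

```python
from collections import defaultdict

# (identifier shown in the preview, UID it resolves to).
# The pairings restate the comment above; nothing here is queried from NCBI.
ALIASES = [
    ("GCF_000188095.2", "6049248"),
    ("6049248", "6049248"),
    ("AEQM02", "6127518"),
    ("6127518", "6127518"),
]


def group_by_record(aliases):
    """Collapse identifiers that point at the same record into one group."""
    groups = defaultdict(list)
    for identifier, record in aliases:
        if identifier not in groups[record]:
            groups[record].append(identifier)
    return dict(groups)


if __name__ == "__main__":
    groups = group_by_record(ALIASES)
    print(f"{len(ALIASES)} identifiers, {len(groups)} distinct records")
    for record, identifiers in groups.items():
        print(record, "<-", ", ".join(identifiers))
```

With these inputs the four identifiers collapse to two groups, which matches the "4 vs. 2" count.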

UIDs for assemblies: 1571891, 6048924, 6127518. If you follow the GUI linkout, each of those goes to GCF_000188095.2! The actual WGS record is https://www.ncbi.nlm.nih.gov/nuccore/AEQM00000000.2/, which... well, I don't know how we're supposed to get that from AEQM02 anyway. https://www.ncbi.nlm.nih.gov/nuccore?term=AEQM02 gives us 1038 results.
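
One possible way to get from AEQM02 to the master record, sketched in Python: treat the WGS value as a four-letter prefix plus a two-digit version, and build the accession as prefix, eight zeros, then `.version`. This is an assumption inferred from the single AEQM02 / AEQM00000000.2 pair in this comment and has not been checked against NCBI's documentation or other records; the function name is hypothetical, and longer prefixes are deliberately rejected.

```python
import re

# Assumed shape of the WGS value in the assembly XML: 4 letters + 2-digit version.
WGS_PROJECT_RE = re.compile(r"^([A-Z]{4})(\d{2})$")


def wgs_master_accession(project):
    """Guess the nuccore master accession for a WGS project value like 'AEQM02'."""
    match = WGS_PROJECT_RE.match(project)
    if match is None:
        raise ValueError(f"Unrecognized WGS project value: {project!r}")
    prefix, version = match.groups()
    # Eight zeros stand in for the contig number; the version drops its leading zero.
    return f"{prefix}00000000.{int(version)}"


if __name__ == "__main__":
    accession = wgs_master_accession("AEQM02")
    print(f"https://www.ncbi.nlm.nih.gov/nuccore/{accession}/")
```

If the pattern holds, the value could be stored as a plain dbxref without a term search that returns every contig.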

biosamples

Probably related to the biosample being linked via project.

bioprojects

no problems.

bradfordcondon commented 5 years ago

I'm going to make a child issue of this for assembly. Basically, it's misleading that these are listed as additional linked records. One is a WGS xref which goes to nucleotide; I think we want that to go in analysis_dbxref. One is the input accession. The remaining keys are the RefSeq UID, GenBank UID, and RefSeq accession. These are kind of all the same analysis, so this should be figured out in a separate issue, since it's quite complicated and I want to restructure the assembly XML parser to be easier to work with for this.

bradfordcondon commented 5 years ago

Thanks for testing, @mpoelchau. You should find the problem with the biomaterial not being created resolved. The issue with the analysis records is a confusing bag of stuff, so I made #194 to figure it out.

mpoelchau commented 5 years ago

Confused because your log message above states that publication records are being imported, but they're not in the NCBI XML afaik and the preview display doesn't list them... That said, PMID 25908251 is a bumblebee paper. Those pubs don't show up in the Tripal content on the droplet if I publish publications.

mpoelchau commented 5 years ago

Or wait, is that because we decided not to import pubs? Sorry, need to look back at our comment history.

bradfordcondon commented 5 years ago

We did decide to import pubs. https://github.com/NAL-i5K/tripal_eutils/issues/141

You are right that they aren't in the XML for the assembly. They get imported and linked in the Project.

Is the expected behavior that pubs associated with the project wouldn't import because you are importing via the analysis? My thinking was that secondary records like pubs are still created, but not linked primary records.
Organisms and pubs get imported even when in a secondary record because these are kind of just "decorators". At least that was my thinking. Do you not want it to work this way?

mpoelchau commented 5 years ago

I think it's fine to import them, but the admin user needs to know that they're being imported. You can't tell from the preview. Not sure if it's sufficient to just display it in the log message - even if it's documented that to get the full picture, you need to be viewing the drush log, an admin user could still be unwittingly importing pubs without realizing it if they choose not to read the documentation (guilty as charged).