NAL-i5K / tripal_eutils

ncbi loader via the eutils interface
GNU General Public License v3.0
4 stars 3 forks source link

233 namespace conflict #235

Closed Ferrisx4 closed 2 years ago

Ferrisx4 commented 2 years ago

Upon install, newly agreed upon names for the cv and db will be installed and the NCBI terms will be inserted using those values. If an existing install is detected, the new names will be installed and the existing terms updated and not re-inserted.

Satisfies the first two objectives in this issue

dsenalik commented 2 years ago

Issues in combination with tripal_analysis_expression module

(1) Order of running the updates of the two modules matters. If the tripal_eutils update runs after the tripal_biomaterial update, we get an error. The update code in tripal_eutils moves things to the new CV but doesn't check if it already exists there. The tripal_biomaterial code just creates the new CV and populates with downloaded NCBI xml, and leaves the old one untouched, which is probably safer?

$ drush updatedb
 Tripal_biomaterial  7303  Adds the NCBI biomaterial database and adds the terms to it.  Also moves  
                           terms from the tripal vocabulary into local.                              
 Tripal_eutils       7303  Convert terms from ncbi_properties to NCBI_BioSample_Attributes
Do you wish to run all pending updates? (y/n): y
Performed update: tripal_biomaterial_update_7303                                             [ok]
Current ncbi_properties cv_id: 106
New: NCBI_BioSample_Attributes cv_id: 124
SQLSTATE[23505]: Unique violation: 7 ERROR:  duplicate key value violates unique constraint  [error]
"cvterm_c1"
DETAIL:  Key (name, cv_id, is_obsolete)=(chloride, 124, 0) already exists.
Performed update: tripal_eutils_update_7303                                                  [ok]
'all' cache was cleared.                                                                     [success]
Finished performing updates.                                                                 [ok]

That error comes from this part

// Find all  occurences of cvterms with the old cv_id and update them to the new one
  $num_updated = db_update('chado.cvterm')
  ->fields(array(
    'cv_id' => $new_cv_id,
  ))
  ->condition('cv_id', $cv_id, '=')
  ->execute();

and if you run in the other order, 140 errors. However, these errors show up because the dbxref still points to the ncbi_properties db.

(2) Since we are making new db and cv we should agree on common code so that they will be the same regardless of which module runs first. Currently we have these two versions (I favor the more verbose versions): tripal_eutils:

  chado_insert_db([
    'name' => 'NCBI_BioSample_Attributes',
    'description' => 'Attribute and property terms for NCBI.',
    'url' => 'http://www.ncbi.nlm.nih.gov/',
  ]);

  chado_insert_cv('NCBI BioSample Attributes',
    'The ncbi BioSample Attributes CV is downloaded from https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/?format=xml.');

tripal_biomaterial:

  chado_insert_db([
    'name' => 'NCBI_BioSample_Attributes',
    'description' => 'This database provides, in XML format, the listing of attribute names for biosamples housed in NCBI.',
    'url' => 'https://www.ncbi.nlm.nih.gov/biosample/docs/attributes',
    'urlprefix' => 'https://www.ncbi.nlm.nih.gov/biosample/docs/attributes#'
  ]);

  chado_insert_cv('NCBI BioSample Attributes',
                   'Attributes describe a BioSample using structured attribute name:value pairs, for example, tissue:liver. BioSample maintains a list of recognized attributes which participate in one or more BioSample packages. In addition to recognized attributes, submitters may provide any number
 of custom attributes to fully describe a sample.');

Note: the urlprefix doesn't actually work to go to the accession, that is just a web page with a table. Maybe drop that?

(3) tripal_eutils adds a very few terms to NCBI BioSample Attributes beyond what is defined by NCBI, tripal_analysis_expression uses the biomaterial_property cv for these new terms. Is this worth changing?

dsenalik commented 2 years ago

I have removed 'urlprefix' => 'https://www.ncbi.nlm.nih.gov/biosample/docs/attributes#' from the tripal_biomaterial pull request, due to the problem I noted that it doesn't work as a link directly to an attribute, rather just to a page listing all attributes.

Ferrisx4 commented 2 years ago

I will leave the urlprefix out of tripal_eutils with the aim of making this db entry match between the two modules.

Ferrisx4 commented 2 years ago

Hi @dsenalik I've made some changes and tested it with tripal_biomaterials. I've tested installing both of them in either order several times and it seems to work properly. Can you give it another pass? Thanks

For your third point above, I have not changed which CV they are inserted into, I'm not sure there's any collision with the tripal_biomaterial module and I'm trying to have a light touch.

dsenalik commented 2 years ago

I did my first tests, running in both orders, and no errors. However no terms now get migrated, I think that is now intentional from your comment above, and that is how tripal_biomaterial handles it too. So it will be up to a site admin to change existing properties if desired. I think that would call for some additional documentation for the module to explain that fact, so I can work on that.

dsenalik commented 2 years ago

Draft documentation changes can be seen at https://github.com/dsenalik/tripal_analysis_expression/tree/ncbi_biosample_space#module-updating

dsenalik commented 2 years ago

The "custom" terms are currently left out of the update process. Line 27 of tripal_eutils.install does this

  chado_insert_cvterm(
    [
      'id' => 'NCBI_BioSample_Attributes:submitter_provided_accession',
      'name' => 'submitter_provided_accession',
      'cv_name' => 'NCBI BioSample Attributes',
    ]
  );

but it is not done with update 7303. I propose you move this to tripal_eutils_insert_extra_biosample_terms() and call this from both the initial install and again in update 7303. And add an , ['update_existing' => FALSE] as we did here

Also those tripal_insert_cvterm() calls should be changed to chado_insert_cvterm().

dsenalik commented 2 years ago

I made these changes and loaded an assembly, no errors! Yay!

dsenalik commented 2 years ago

I had tested with ['update_existing' => FALSE] added to all of the calls in function tripal_eutils_insert_extra_biosample_terms() but this doesn't trigger any error, so it may not make any functional difference.

dsenalik commented 2 years ago

With the current version, I have tested loading five different assemblies and "Create Linked Records" and have no errors and successful loading.

dsenalik commented 2 years ago

With the recent merge in the tripal_biomaterial module https://github.com/tripal/tripal_analysis_expression/pull/411 an error is triggered with tripal_eutils_update_7303, example error message shown below. This occurs if the tripal_biomaterial update ran first. The reason is an incorrect parameter to chado_insert_cvterm() of 'def' instead of 'definition, and now the two modules are trying to generate the terms differently.

Can you please correct these in tripal_eutils.install lines 43, 52, 60, 84 and in includes/repositories/EUtilsAssemblyRepository.inc line 135

As promised, the error message:

Current ncbi_properties cv_id: 106
New: NCBI_BioSample_Attributes cv_id: 126
ERROR (TRIPAL_CHADO): chado_insert_record; Cannot insert duplicate record into cvterm table: Array
(
    [cv_id] => 126
    [name] => age
    [definition] => 
    [dbxref_id] => 444493
    [is_obsolete] => 0
    [is_relationshiptype] => 0
)

[site http://default] [TRIPAL ERROR] [TRIPAL_CHADO] chado_insert_record; Cannot insert duplicate record 
sizeof(): Parameter must be an array or an object that implements Countable                  [warning]
tripal.notice.api.inc:134
[site http://default] [TRIPAL WARNING] [TRIPAL_CV] Failed to insert the term: age (NCBI_BioSample_Attrib

and many further similar errors

dsenalik commented 2 years ago

I just tested drush updatedb and this change has resolved the error, looks good!

dsenalik commented 2 years ago

Okay, so there is an issue with the print_r() lines in the update. I had only run from drush command line before, where they work fine. If you run from the database updates page, those cause an ajax error, though everything has run okay I think. Screenshot_2022-06-22_08-26-06

dsenalik commented 2 years ago

I have just tested with the two print_r() commands commented out, and updates proceed as expected without error. tripal_eutils.install lines 296 and 304

Ferrisx4 commented 2 years ago

Followed up by a quick PR to remove some problematic print statements: https://github.com/NAL-i5K/tripal_eutils/pull/239