Closed Ferrisx4 closed 2 years ago
Issues in combination with tripal_analysis_expression module
(1) Order of running the updates of the two modules matters. If the tripal_eutils update runs after the tripal_biomaterial update, we get an error. The update code in tripal_eutils moves things to the new CV but doesn't check if it already exists there. The tripal_biomaterial code just creates the new CV and populates with downloaded NCBI xml, and leaves the old one untouched, which is probably safer?
$ drush updatedb
Tripal_biomaterial 7303 Adds the NCBI biomaterial database and adds the terms to it. Also moves
terms from the tripal vocabulary into local.
Tripal_eutils 7303 Convert terms from ncbi_properties to NCBI_BioSample_Attributes
Do you wish to run all pending updates? (y/n): y
Performed update: tripal_biomaterial_update_7303 [ok]
Current ncbi_properties cv_id: 106
New: NCBI_BioSample_Attributes cv_id: 124
SQLSTATE[23505]: Unique violation: 7 ERROR: duplicate key value violates unique constraint [error]
"cvterm_c1"
DETAIL: Key (name, cv_id, is_obsolete)=(chloride, 124, 0) already exists.
Performed update: tripal_eutils_update_7303 [ok]
'all' cache was cleared. [success]
Finished performing updates. [ok]
That error comes from this part
// Find all occurences of cvterms with the old cv_id and update them to the new one
$num_updated = db_update('chado.cvterm')
->fields(array(
'cv_id' => $new_cv_id,
))
->condition('cv_id', $cv_id, '=')
->execute();
and if you run in the other order, 140 errors. However, these errors show up because the dbxref still points to the ncbi_properties
db.
(2) Since we are making new db and cv we should agree on common code so that they will be the same regardless of which module runs first. Currently we have these two versions (I favor the more verbose versions): tripal_eutils:
chado_insert_db([
'name' => 'NCBI_BioSample_Attributes',
'description' => 'Attribute and property terms for NCBI.',
'url' => 'http://www.ncbi.nlm.nih.gov/',
]);
chado_insert_cv('NCBI BioSample Attributes',
'The ncbi BioSample Attributes CV is downloaded from https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/?format=xml.');
tripal_biomaterial:
chado_insert_db([
'name' => 'NCBI_BioSample_Attributes',
'description' => 'This database provides, in XML format, the listing of attribute names for biosamples housed in NCBI.',
'url' => 'https://www.ncbi.nlm.nih.gov/biosample/docs/attributes',
'urlprefix' => 'https://www.ncbi.nlm.nih.gov/biosample/docs/attributes#'
]);
chado_insert_cv('NCBI BioSample Attributes',
'Attributes describe a BioSample using structured attribute name:value pairs, for example, tissue:liver. BioSample maintains a list of recognized attributes which participate in one or more BioSample packages. In addition to recognized attributes, submitters may provide any number
of custom attributes to fully describe a sample.');
Note: the urlprefix doesn't actually work to go to the accession, that is just a web page with a table. Maybe drop that?
(3) tripal_eutils adds a very few terms to NCBI BioSample Attributes beyond what is defined by NCBI, tripal_analysis_expression uses the biomaterial_property cv for these new terms. Is this worth changing?
I have removed 'urlprefix' => 'https://www.ncbi.nlm.nih.gov/biosample/docs/attributes#'
from the tripal_biomaterial pull request, due to the problem I noted that it doesn't work as a link directly to an attribute, rather just to a page listing all attributes.
I will leave the urlprefix out of tripal_eutils with the aim of making this db entry match between the two modules.
Hi @dsenalik I've made some changes and tested it with tripal_biomaterials. I've tested installing both of them in either order several times and it seems to work properly. Can you give it another pass? Thanks
For your third point above, I have not changed which CV they are inserted into, I'm not sure there's any collision with the tripal_biomaterial module and I'm trying to have a light touch.
I did my first tests, running in both orders, and no errors. However no terms now get migrated, I think that is now intentional from your comment above, and that is how tripal_biomaterial handles it too. So it will be up to a site admin to change existing properties if desired. I think that would call for some additional documentation for the module to explain that fact, so I can work on that.
Draft documentation changes can be seen at https://github.com/dsenalik/tripal_analysis_expression/tree/ncbi_biosample_space#module-updating
The "custom" terms are currently left out of the update process. Line 27 of tripal_eutils.install
does this
chado_insert_cvterm(
[
'id' => 'NCBI_BioSample_Attributes:submitter_provided_accession',
'name' => 'submitter_provided_accession',
'cv_name' => 'NCBI BioSample Attributes',
]
);
but it is not done with update 7303. I propose you move this to
tripal_eutils_insert_extra_biosample_terms()
and call this from both the initial install and again in update 7303.
And add an , ['update_existing' => FALSE]
as we did here
Also those tripal_insert_cvterm()
calls should be changed to chado_insert_cvterm()
.
I made these changes and loaded an assembly, no errors! Yay!
I had tested with ['update_existing' => FALSE]
added to all of the calls in function tripal_eutils_insert_extra_biosample_terms()
but this doesn't trigger any error, so it may not make any functional difference.
With the current version, I have tested loading five different assemblies and "Create Linked Records" and have no errors and successful loading.
With the recent merge in the tripal_biomaterial module https://github.com/tripal/tripal_analysis_expression/pull/411 an error is triggered with tripal_eutils_update_7303
, example error message shown below. This occurs if the tripal_biomaterial
update ran first. The reason is an incorrect parameter to chado_insert_cvterm()
of 'def'
instead of 'definition
, and now the two modules are trying to generate the terms differently.
Can you please correct these in tripal_eutils.install
lines 43, 52, 60, 84
and in includes/repositories/EUtilsAssemblyRepository.inc
line 135
As promised, the error message:
Current ncbi_properties cv_id: 106
New: NCBI_BioSample_Attributes cv_id: 126
ERROR (TRIPAL_CHADO): chado_insert_record; Cannot insert duplicate record into cvterm table: Array
(
[cv_id] => 126
[name] => age
[definition] =>
[dbxref_id] => 444493
[is_obsolete] => 0
[is_relationshiptype] => 0
)
[site http://default] [TRIPAL ERROR] [TRIPAL_CHADO] chado_insert_record; Cannot insert duplicate record
sizeof(): Parameter must be an array or an object that implements Countable [warning]
tripal.notice.api.inc:134
[site http://default] [TRIPAL WARNING] [TRIPAL_CV] Failed to insert the term: age (NCBI_BioSample_Attrib
and many further similar errors
I just tested drush updatedb
and this change has resolved the error, looks good!
Okay, so there is an issue with the print_r()
lines in the update. I had only run from drush command line before, where they work fine. If you run from the database updates page, those cause an ajax error, though everything has run okay I think.
I have just tested with the two print_r()
commands commented out, and updates proceed as expected without error.
tripal_eutils.install
lines 296 and 304
Followed up by a quick PR to remove some problematic print statements: https://github.com/NAL-i5K/tripal_eutils/pull/239
Upon install, newly agreed upon names for the cv and db will be installed and the NCBI terms will be inserted using those values. If an existing install is detected, the new names will be installed and the existing terms updated and not re-inserted.
Satisfies the first two objectives in this issue