draeger-lab / refinegems

refineGEMs is a python package inteded to help with the curation of genome-scale metabolic models (GEMS).
https://refinegems.readthedocs.io/en/latest/
MIT License
10 stars 1 forks source link

Feature request: Module handling URI patterns #50

Closed GwennyGit closed 1 year ago

GwennyGit commented 1 year ago

Feature A module that can transform all URIs to the same pattern or change the used pattern in a model for the whole model. The function polish_annotations from the script polish_after_MP.py already implements this functionality and could be integrated into refineGEMs as a module. Additional functionality like changing the qualifier types for specific CVTerms could be added as well.

famosab commented 1 year ago

Another point here would be to include annotation checking and see whether the biological qualifier is correct.

GwennyGit commented 1 year ago

The functionality of the script polish_after_MP.py was extended to include the function change_qualifiers. This function can handle a list of a specified type and change the qualifier type as well as the biological/model qualifier type according to the user input. There is also the option to only change the qualifiers for a certain database prefix. Additionally, the script could be enhanced to handle multiple inputs for the qualifiers and certain prefix databases, if one would want that.
The function change_qualifiers could also be a good addition to refineGEMs.

famosab commented 1 year ago

I think we should add change_qualifiers and polish_annotations to refineGEMs. MIRIAM compliance is desirable and since you already wrote the script it should be easy to integrate.

famosab commented 1 year ago

Transfer:

Modify:

GwennyGit commented 1 year ago

add_new_curie_set: Iterates over a dictionary of database prefix and identifier key-value pairs and adds the complete new curies with the user-provided pattern specification to the model.
add_curie_set: Iterates over a set of CURIEs already existing in the model and adds these to the newly specified CVTerm. (As for change_qualifiers only the CVTerm is changed this was deemed enough. However, it could also be rewritten to adjust the pattern according to user input for the CVTerms where the qualifier types are changed)


The generate_cv_term function is necessary to be able to provide a specific qualifier as well as biological/model qualifier type.


Regarding resolve_improper_identifiers we could add it to refineGEMs, however, the table used as input was done by hand and if identifiers occur in the user's model but not in the table this table should be extended with these identifiers. The function was created due to warnings that were caused by ModelPolisher, hence, a better way would be to update the BIGG database or the annotateDB or to improve ModelPolisher. I think the main problem is the BIGG database. I have already written issues regarding these topics in the respective repositories.

famosab commented 1 year ago

Ok then we will leave out resolve_improper_identifiers for now. It can be added later if desired.


Maybe rewriting the functions I wrote for CVTerms to use the generate_cvterm function might make the code more readable. But that is optional since everything works as is.

famosab commented 1 year ago

Regarding the integration of the two new functions into polish some questions came up for me:

The branch is up-to-date and we can go in any direction.

GwennyGit commented 1 year ago

In principle, I think it makes the most sense to use the new pattern. It is the current pattern that will work, and I do not know if the old pattern will be deprecated soon. For the ontologies (as I stated in the script) I only implemented the use of the new pattern as the old one did not seem to work for the URIs I checked.
I am not sure if we could check the existing pattern. I mean probably in the function where the dictionary is generated, but that would not account or check for the whole model but only for the current CVTerm that is handled. 🤔


I think integrating change_qualifiers into polish would only make sense with the proposed wrapper function. Otherwise, I would leave this function as standalone and maybe add it to the YAML file.


Using change_qualifiers in polish to check and change the qualifiers for the model makes sense and would enhance the module in my opinion. However, I would combine this function with the function add_new_curie_set such that during the check the pattern is changed. In addition, I would keep each of the functions as standalone. What do you think?

famosab commented 1 year ago

Ok, I will write the wrapper so that both functions can be integrated into polish. The way I added the functions to the module automatically allows the user to use them separately. They will not automatically write to the new model but rather return the changed model so that the user can write it themselves. I will check the polish script again - this script is aimed at inexperienced users that just want to polish their models and that do not care so much for specific functions. change_qualifiers and polish_annotations can accessed via refinegems.polish.change_qualifiers() and refinegems.polish.polish_annotations() that should be enough. We should also mention them in the documentation or a possible sample script to make them more accessible.

famosab commented 1 year ago

What kind of Qualifiers to we set for the different entities. For model it is obvious but for the gene product for example - do we use BIOLOGICAL_QUALIFIER and BQM_IS in the wrapper function? What about the reactions which have BQM_OCCURS_IN since they are part of a pathway?

famosab commented 1 year ago

Maybe you can have a look at the way I implemented it now. polish_annotations and change_all_qualifiers are now part of the polish fucntion. I only used BQM_IS qualifiers at the moment-

I don't really see how I would add add_new_curie_set seperately now that both functions are included. But maybe I am missing something and this will provoke unwanted behavior. I will test it on one of my models later to see what happens. You might see it directly since you implemented the other functions :D

GwennyGit commented 1 year ago

I took a look and stated in #58 already that for the GeneProduct we would have to set it to isHomologTo if the model was created directly from a lab strain. Additionally, if we use BIOLOGICAL_QUALIFIER would it not be then BQB_IS? Regarding other entities, so for units and UnitDefinitions I used the MODEL_QUALIFIER as this makes more sense if one looks at the description here: https://web.archive.org/web/20211204231945/http://co.mbine.org/specifications/qualifiers. There the qualifier would be BQM_IS except for the Millimoles per gram (dry weight) per hour as there is a pubmed annotation contained which has the qualifier bqmodel:isDescribedBy. To the issue with the BQB_OCCURS_IN, are we able to check for reaction IDs to solve this?

GwennyGit commented 1 year ago

add_curie_set: Takes a set of complete MIRIAM CURIEs & adds those to a new CVTerm object → Was implemented for change_qualifiers as only the qualifiers should be changed add_new_curie_set: Takes a dictionary with the database prefix as key and the identifiers belonging to that database as set, generates complete MIRIAM CURIEs from the dictionary with the user-specified pattern & adds those to a new CVTerm → Was implemented for polish_annotations as only the MIRIAM CURIEs should be changed

GwennyGit commented 1 year ago

We could change the behaviour as such that the add_new_curie_set function is replaced with a generate_new_curie_set function which gets a dictionary with prefixes mapping to the respective identifier set as well as the boolean new_pattern. This function then generates a set of CURIEs from the provided dictionary with the provided pattern and returns it. Thus, the result of generate_new_curie_set can be added to the model with the add_curie_set function.

famosab commented 1 year ago
famosab commented 1 year ago

Problems / Questions I ran into:

  1. I do not know how to access the name of the SBase object. So I cannot check whether the entity is a Reaction or a UnitDefinition. I left it out for now and I am just checking whether the BiologicalQualifierType is 9 (BQB_OCCURS_IN) or 6 (BSB_IS_HOMOLOG_TO) and then I print to the screen that the CVTerm QualifierType is not changed.
  2. Even without this check the whole polish script seems to work on a model which had multiple different CVTerms and it did not delete any of those. Note: it does work to keep the BQB_OCCURS_IN with the test via the integer representation.
  3. Using for i in range(len(cvterms)): to iterate through the CVTerms did not work for me. I used for cvterm in cvterms:- but that makes deleting the respective CVTerm harder. I just left it out at the moment. It seems to not have any negative impact on the Model I tested this with.
GwennyGit commented 1 year ago

Tested the script on all of my three models (https://github.com/draeger-lab/Shaemolyticus: ATCC29970, JCSC1435 & IMITSC147). Found several bugs and fixed these. The script seems to run now properly and can be integrated into dev. I also improved some printouts.

famosab commented 1 year ago

Everything seems to be working. For future bugs or ideas we should open a new issue.