Enhance the `polish` module

draeger-lab / refinegems

refineGEMs is a python package inteded to help with the curation of genome-scale metabolic models (GEMS).

https://refinegems.readthedocs.io/en/latest/

MIT License

10 stars 1 forks source link

Enhance the `polish` module #58

Open GwennyGit opened 1 year ago

GwennyGit commented 1 year ago

This issue was opened to collect all enhancements for the `polish` module.

[ ] Change Regex for biomass to first result of biomass.test_biomass_presence
[ ] #36
[ ] Add EC numbers -> Function in SBOAnnotator
[x] #53
[x] Add functionality to remove NaNs if found in annotations
[x] Add checks for BiGG identifiers added via the VMH identifiers
[x] #89
[x] #91
[x] #8
[x] #10
[x] #24
[x] #39
[x] #38
[x] #50
[x] Change add_reac to handle growth|_*biomass\d*_* IDs properly

GwennyGit commented 1 year ago

I just detected that currently the function add_reac in the polish module does not handle the Growth or Biomass reaction properly. This is due to the fact that current_id is shortened - which is necessary for all other reactions - like that: current_id = current_id[2:]. However, if the ID of the reaction is for example 'Growth' the result for current_id will be 'owth'. Thus, the function cannot handle these IDs with the regex 'growth|_*biomass\d*_*' properly. The code needs to be adjusted to handle these exceptions correctly.

famosab commented 1 year ago

At least in my models the growth function has the id id="R_Growth" which would mean that removing the trailing R is still valid and would not break the regex function. We might need to insert a check whether the trailing R is really there before we try to remove it. Or am I missing something?

        # Get ID and remove 'R_'
        current_id = entity.getId()
        if current_id[:2] == 'R_':
            current_id = current_id[2:]

GwennyGit commented 1 year ago

Regarding the ID for growth, this seems to vary between models a lot. We could also just change the regex to include R_ → R_?growth|R_?_*biomass\d*_*. However, the proposed change is also good and might be better as other reactions could also exist without R_. Looking at your changes I realised one problem with the function change_all_qualifiers. If one has a lab strain model the GeneProtein qualifier should be set to isHomologTo as we discussed before.

GwennyGit commented 1 year ago

As mentioned in these comments https://github.com/draeger-lab/refinegems/issues/52#issuecomment-1426221386 and https://github.com/draeger-lab/refinegems/issues/52#issuecomment-1426225397. It would indeed make sense to add compartment specifications for all reactions which happen in the same compartment within polish.

draeger commented 1 year ago

COBRApy manipulates IDs back and forth. This prefix is only there in the SBML export. Upon import R_ is stripped away.

famosab commented 1 year ago

In libsbml the trailing R is not stripped though. That is sometimes confusing. (The same is true for genes and metabolites).

draeger commented 1 year ago

Of course not. This entire ID manipulation is nonsense and causes chaos. There should be separate fields for every piece of information that the BiGG ID ships.

GwennyGit commented 1 year ago

I think the R_ for reactions is used to clarify that this is a reaction ID within the model. R_ is not part of the BiGG IDs same as M_ for metabolites/species and G_ for groups and GeneProducts.

famosab commented 1 year ago

But for the polishing module we mostly use libsbml, I think we only even use cobrapy for the growth simulations (and a few more small things). So we always need to take care of those trailing IDs (but through string slicing that is not an issue I think). One could argue that they are obsolete since the entity itself holds the information already - but that is something we cannot change here (or shouldn't on our own).

GwennyGit commented 1 year ago

While investigating models from Heinken et al.[1] I realised that improving the function get_curie_set within the polish module could be interesting. As these models contain NaN identifiers and the function get_curie_set assembles a dictionary mapping the database of each CURIE to the correspondingly found identifier it would be helpful to also check for NaNs and remove the according entries.

[1] Heinken, A., Hertel, J., Acharya, G. et al. Genome-scale metabolic re- construction of 7,302 human microorganisms for personalized medicine. Nature Biotechnology. issn: 1546-1696. doi:10.1038/s41587-022-01628-0 (Jan. 2023).

draeger-lab / refinegems

Enhance the `polish` module #58

This issue was opened to collect all enhancements for the polish module.

This issue was opened to collect all enhancements for the `polish` module.