improve usability - Githubissues

SysBioChalmers / GECKO

Toolbox for including enzyme constraints on a genome-scale model.

http://sysbiochalmers.github.io/GECKO/

MIT License

67 stars 49 forks source link

improve usability #51

Closed edkerk closed 1 year ago

edkerk commented 5 years ago

It is not that straight-forward to generate models, integrate proteomics data and perform simulations with gecko. I have recently run into the following hurdles:

Lack of documentation. The appendix to the gecko paper is helpful, but doesn't cover all functions, while the readthedocs provides only minimal information.
Hard-coded model-specific parameters, such as reaction names and IDs, organism name, medium composition, growth simulations, fitting of GAM etc.
Hard-coded locations of files and scripts.
Nested functions, such as simulateChemostat
Related to the points above, enhanceGEM works well to generate ecYeast, but might be largely unsuitable for other models.

Besides the documentation, probably a lot can be improved by defining global variables. This includes things like reaction IDs for substrate uptake, location of data files etc. This would also simplify the nested functions, as parameters can be accessed by each function. It's imperative that model-specific functions exist, such as simulateChemostat and manualModifications, but this way it would be easier to adjust those for wider applicability.

BenjaSanchez commented 5 years ago

@edkerk thanks for pointing out these problems. Below some answers:

Lack of documentation

We will be working soon on providing more documentation, let us know which parts are the ones that could use more work.

Hard-coded model-specific parameters

A new PR will be made this week including a better way of handling Pbase and Ptot. We will also clarify in the usage section of the documentation which are the functions that should be replaced by the user and which can be ignored if desired (such as fitGAM.m, which can remain unchanged if scaleBioMass.m is modified to not call it).

Hard-coded locations

All location paths of files/scripts are relative to from where they are called, so you should not have any problems with this, as long as you don't change the directory in your modifications. Let us know if this is not the case for any specific file/script.

Nested functions

simulateChemostat is only used for fitting GAM so I don't think it's a problem that it's nested, as fitGAM.m can be modified by the user anyways.

Other models

Note that we will be migrating all models to ecModels (currently to a large extent WIP), which will make enhanceGEM.m work for any of the models that we will host there.

Global variables

I am not a fan of global variables, as they make the code less pure, create confusion and makes debugging harder. Some GECKO functions do seem to have many inputs, but perhaps a more sensitive solution for some of those inputs would be to put them in a single structure containing them as fields.

Finally, feel free to create additional issues from this discussion following up on specific topics :)

edkerk commented 5 years ago

If a bunch of scripts have to be modified for each model, doesn't this impede the automated generation of a whole set of models as soon as gecko changes? Such a system would mean individual branches in gecko for each model?

Regarding global variables, an alternative could then be to define all necessary model specific parameters as fields in one structure and use that structure as input for most functions? Then the user needs to change the field in the one structure, and not necessarily that many scripts.

Lack of documentation

Some points that need clarification:

For each of the functions it is unclear what the input and output parameters are refering to.
enhanceGEM seems to be the key script, but it's unclear what is the difference between the initial ecModel that is generated, then the "batch" version, and finally there is no example of an ecModel constrained with proteomics data.

BenjaSanchez commented 5 years ago

Such a system would mean individual branches in gecko for each model?

For models hosted in ecModels, it would actually be different folders for each organism, and in each of them you can keep in ./scripts any functions that are modified to your specific organism. Of course we want to minimize the amount of functions to manually change, but some of them are hard to get around.

define all necessary model specific parameters

That's a great idea that we could start thinking about @IVANDOMENZAIN the following fields should be included in that parameters structure (at least):

parameters.org_name: Organism scientific name
parameters.enzyme_comp: Short name of the compartment where enzymes should be located (e.g. c)
parameters.sigma: Initial average saturation as a fraction, to be fitted later
parameters.Ptot: Protein content in the cell [g/gDW]
parameters.gR_exp: Minimal growth rate that the model should grow at [1/h]
parameters.c_source: Name of the reaction that has the carbon source exchange reaction

This could be stored as a single .m file and that way enhanceGEM.m would be entirely generic + many functions would be reduced in the amount of inputs.

documentation

Noted, will hopefully get around it in the coming months :)

BenjaSanchez commented 4 years ago

With PR #76 now merged to devel, I believe all issues in this thread have been addressed with exception of better documentation and hard-coded locations, which have their own specific issues now (#87 and #55, respectively). @edkerk let me know if I might have forgotten something, otherwise this issue will be closed on the next release.

sulheim commented 4 years ago

I definitely support this issue, and I think there is still some improvements to make here. It is not straight forward to run this pipeline for a different organism than yeast. I don't think modifying 7 different matlab files is a user-friendly way of tailoring the program to the organism of choice.

My suggestion is rather that the GECKO-pipeline takes a parameter-file as the user-supplied input along with the GEM. The parameter file can also provide the paths for the required database-files, cultivation data etc. (now provided by changing files in the /database folder).

BenjaSanchez commented 4 years ago

My suggestion is rather that the GECKO-pipeline takes a parameter-file as the user-supplied input

that is what we wanted to achieve with getModelParameters, available in devel

I don't think modifying 7 different matlab files is a user-friendly way

note that 3 of those 7 functions are optional (can be skipped entirely), and one of them is the getModelParameters input script. Regarding the other 3 (manualModifications, sumProtein & scaleBioMass), IMO the changes performed inside are too specific to generalize, as they depend a lot on manual curation / naming conventions / biomass styling in the model (although perhaps manualModifications could be generalized to rely on a separate input file). Hence the easiest solution we have found so far is to just leave it to the user to modify them. However if you have a better setup in mind feel free to share here :)

parameter file can also provide the paths for the required database-files, cultivation data etc.

that is a nice idea, we can address it in #55

sulheim commented 4 years ago

I understand that it is difficult to generalize the pipeline, however I think it would improve the usability a lot in particular for other strains than yeast.

I don't have a full overview of the source code, but some suggestions:

Try to change hard-coded settings into parameter files/settings.
Is it necessary to hard-code the biomass-composition into the sumBiomass.m? This is given by the reactants in the biomass-reaction? Maybe take an input table with the classification of each biomass component? Maybe also provide an option if the biomass-recation already is separated into separate pseudo-reactions?
For the optional inputs: It is not clear what is the best way to toggle these optional features on or off. Provid a toggle in the parameter file to choose which parts of the pipeline to leave out.
- The text file with the manual modifications is nice, however there are manual modifications in manualModifications.m that naybe could fit into the text-file as well? Seems like curation_growthLimiting, curation_carbonSources, curation_topUsedEnz are just changing the kcat value of specific enzymes.

The getModelParameters.m is absolutely a good step in the right direction, so I would suggest to move all settings and choices into that file.

BenjaSanchez commented 4 years ago

Thanks for the suggestions!

Try to change hard-coded settings

Added that to #55.

Is it necessary to hard-code the biomass-composition into the sumBiomass.m

There are many ways in which people construct biomass equations:

all together or "lumped" using mmol/gDW coeffs
spliting into components or "segregated" using mmol/gDW coeffs, i.e. the final biomass eq is: prot + carb + ... -> biomass
spliting into components or "segregated" using mmol/g coeffs, i.e. the final biomass eq is: 0.45 prot + 0.4 carb + ... -> biomass
further splitting some components into yet other sections (e.g. SLIMEr for lipids)
even more I'm not familiar with?

Trying to have logics to detect any of these scenarios for any GEM that may come is IMO not worth it. Just easier if the model developer, who is well familiarized with their biomass equation, writes their own logic in sumBiomass.m. Specially considering that sumBioMass.m is optional as it's only called by scaleBioMass and sumProtein (the former which can be left empty if you don't want to rescale the model and the latter which can be replaced by a simple calculation of the protein fraction).

It is not clear what is the best way to toggle these optional features

We could have probably phrased it better: When we say "you don not need to define" a given parameter, we actually mean "you can delete these lines". So it accomplishes the same as toggling sections on/off.

there are manual modifications in manualModifications.m that could fit into the text-file as well

I removed the fixed in devel label of this issue until we implement that suggestion

edkerk commented 1 year ago

No longer relevant.