MetabolicAtlas / standard-GEM

The standard for open-source GEMs on GitHub
https://www.biorxiv.org/content/10.1101/2023.03.21.512712
Creative Commons Attribution 4.0 International

Documentation of model development #20

Open sulheim opened 4 years ago

sulheim commented 4 years ago

Description of the issue:

It is not clear to me whether this is outside the scope of standard-GEM, but I think it would be useful to come up with a language-agnostic guideline / template for documenting the development process, so that anyone can easily understand what was done during model reconstruction and how, and can reproduce the current state of the model. One common practice (that I've used) is to have a script which performs the complete model reconstruction from any given starting point. This works reasonably well, but it is still not trivial for someone else to trace the reconstruction unless the code is very well documented.

What do you think is the best practice that should be recommended to users of standard-GEM?

mihai-sysbio commented 4 years ago

What an interesting question! A git-based workflow allows for versioning of code and model. For the model, there will be an input model (prev commit), and an output model (new commit). It sounds like a great idea to follow the approach described above. I'm not sure what would be easy enough though, but I feel it ought to involve some way of glueing together the models and the code.

Midnighter commented 4 years ago

If you have the energy to guide and maintain it, I think a public gitbook could be a great place for such a guide. That way it can be continuously updated by the community. It takes work to steer such an effort and maintain a comprehensible whole, though.

haowang-bioinfo commented 4 years ago

Documentation of model curation is essential in GEM development. A well-defined Git-based workflow would help in achieving this goal and should therefore be within the scope of standard-GEM.

sulheim commented 4 years ago

@Midnighter I am not familiar with gitbook, but from a brief look it seems like it might be too much work and something that is not necessarily maintained along with the model on GitHub. Maybe a more realistic option is to create templates for model reconstruction scripts (e.g. in MATLAB or Python) that ensure a minimum of documentation along with the reconstruction.

draeger commented 4 years ago

Along those lines, we could start thinking about a minimum information requirement that should be reported about the steps taken to create a GEM. Such guidelines already exist for various other aspects of science; in systems biology, MIRIAM is a prominent example, but there are plenty of others. Of course, there is Ines Thiele's famous protocol for generating a high-quality GEM, but we could start collecting key points on what needs to go into the kind of documentation that @sulheim requests.

Midnighter commented 4 years ago

> @Midnighter I am not familiar with gitbook, but from a brief look it seems like it might be too much work and something that is not necessarily maintained along with the model on GitHub. Maybe a more realistic option is to create templates for model reconstruction scripts (e.g. in MATLAB or Python) that ensure a minimum of documentation along with the reconstruction.

I agree, gitbook was my recommendation for a more meta guide on how to construct models in general, not to serve as documentation alongside one specific model.

sulheim commented 3 years ago

In this context, I would like to discuss how one should organize model reconstruction and curation scripts as well as model files. We are currently curating and re-organizing the Sco-GEM model folder (to adhere to the standard-GEM template), see https://github.com/SysBioChalmers/Sco-GEM/pull/122.

We have encountered an issue where it is rather inconvenient to test / update curation scripts that have previously been used to update the model, as the model file in the repository is always the latest version. E.g. if you have previously written and applied a script that deletes a few model reactions, and you want to modify and rerun that script, you cannot test that script on the model file in the repository, because the reactions are already gone. One solution is to keep an archive folder with previous model versions, but I believe there might be more clever solutions to this issue.

What do you think?

edkerk commented 3 years ago

But would an archive model folder not sort of defeat the purpose of git? Meanwhile, older releases can relatively easily be extracted from the local repository with e.g.

git show refs/tags/v1.4.2:model/standard-GEM.xml > model_v1_4_2.xml

or, for the latest master version:

git show master:model/standard-GEM.xml > model_master.xml

JonathanRob commented 3 years ago

Somewhat related to this, we are implementing an approach with Human-GEM to deal with old curation/reconstruction scripts that do not work with the current model version by moving them to a deprecated folder (I also like @sulheim's archive suggestion as a name). This would separate these scripts from those that are currently maintained, so there is no expectation that they still work with the current model.

If one wanted to run an archived script, they can check out the commit in which the script was last modified or used; presumably the model version at that commit would be compatible with the script.
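A minimal sketch of that idea, combined with the `git show` approach from @edkerk above: look up the last commit that touched a script, then extract the model file as it was at that commit, without changing the working tree. The function names and the example paths are hypothetical, and the sketch assumes it is run from the repository root, since `git show <commit>:<path>` resolves the path relative to the root.

```shell
#!/usr/bin/env bash
# Sketch only: function names and paths are illustrative, not part of standard-GEM.

# Print the hash of the most recent commit that touched the given path.
last_commit_for() {
    git log -n 1 --format=%H -- "$1"
}

# Write the model file, as it was at the script's last commit, to an output file.
# Usage: model_at_script_commit <script-path> <model-path> <output-file>
model_at_script_commit() {
    local commit
    commit=$(last_commit_for "$1")
    if [ -z "$commit" ]; then
        echo "no commit found for $1" >&2
        return 1
    fi
    # <commit>:<path> addresses the file content stored in that commit.
    git show "${commit}:$2" > "$3"
}
```

For example, `model_at_script_commit deprecated/removeReactions.m model/standard-GEM.xml model_old.xml` (hypothetical script name) would write the matching model version to `model_old.xml`. One caveat: a commit that only moved the script into the deprecated folder also counts as touching it, so the extracted model may be newer than the one the script was written for.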

sulheim commented 3 years ago

I don't think an archive folder defeats the purpose of git (although I see your point @edkerk); I think git is much more than just access to previous model versions through the log. However, your suggestion of just reading the model file from the master branch seems pretty elegant. I still think that @JonathanRob has a good point, and these two solutions are not mutually exclusive. This is basically the same as what we have done with the sulheim2020 folder in the Sco-GEM repo.

haowang-bioinfo commented 3 years ago

@sulheim, a YAML-based workflow implemented in Human-GEM may provide another option for curating GEMs.

Previously, we also used scripts for adding/removing reactions and making changes to the model. As @JonathanRob mentioned, we are now archiving the old code and retiring the script-based approach. In the new workflow, only a YAML-format model file is retained in develop and other fix/feature branches. Because the YAML file is human-readable, changes made to it are evident in the diff, which makes script-independent curation possible.

For example, in the PR #213 a number of duplicated metabolites and reactions were removed by a series of commits, each of which resolves one duplicated metabolite. In particular, the metabolite malthx_s and the associated reaction EX_M02447[e] were deleted in this commit, where the annotation files were also updated. With this workflow, the changes can be made either by code or manually, and conveniently reviewed afterwards. A couple of helper functions (testYamlConversion, sanityCheck) are provided as checkpoints before and after making a PR to avoid mistakes.
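As a rough illustration of how such a commit-by-commit curation history can be reviewed, the sketch below lists the commits that changed the model file between two refs, oldest first. The function name, the branch names and the file path are hypothetical; only standard `git log` options are used.

```shell
#!/usr/bin/env bash
# Sketch only: the function name and all paths/refs are illustrative.

# List commits that touched the model file and are in <head> but not in <base>,
# oldest first, as "<short-hash> <subject>".
# Usage: curation_commits <base-ref> <head-ref> <model-path>
curation_commits() {
    git log --reverse --format='%h %s' "$1..$2" -- "$3"
}
```

For example, `curation_commits main develop model/model.yml` prints one line per curation step; each hash can then be inspected with `git show <hash> -- model/model.yml` to see the exact lines that changed.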

This workflow is still under development and refinement. But it seems that this works pretty well so far.

sulheim commented 3 years ago

That's an interesting workflow @Hao-Chalmers . Although I understand that one can reproduce the model development by going through each individual commit and redoing the manual curations, it sounds less tidy than having all edits documented in a script. I think it makes sense to only have the YAML format in the devel folder though (but is that compatible with the COBRA Toolbox in MATLAB?).

haowang-bioinfo commented 3 years ago

@sulheim The YAML format was actually adapted from COBRApy, so you'd have built-in support in a Python environment when using COBRApy.

Yes, you can keep only the YAML file in devel and feat/fix branches for tracking model changes; the scripts are then probably not necessary (but they can still be kept, for example in an archive folder for reference).

mihai-sysbio commented 3 years ago

Reading this issue again, I find it an interesting discussion to continue. However, it would be very ambitious to construct a roadmap that would enable full reproducibility, including all the curation. Therefore, I propose to add the use of the deprecated folder to .standard-GEM.md and then convert this issue into a GitHub Discussion.

sulheim commented 3 years ago

Ok. But isn't archive a better name than deprecated?

mihai-sysbio commented 3 years ago

> Ok. But isn't archive a better name than deprecated?

The message I am trying to send follows the definitions in the Merriam-Webster dictionary:

archive

a place in which public records or historical materials (such as documents) are preserved

deprecate

to withdraw official support for or discourage the use of (something, such as a software product) in favor of a newer or better alternative

That being said, I would appreciate more thoughts on the matter, especially since I feel more people are familiar with the term archive.