sulheim opened this issue 4 years ago
What an interesting question! A git-based workflow allows for versioning of code and model. For the model, there will be an input model (prev commit) and an output model (new commit). It sounds like a great idea to follow the approach described above. I'm not sure what would be easy enough though, but I feel it ought to involve some way of gluing together the models and the code.
If you have the energy to guide and maintain it, I think a public gitbook could be a great place for such a guide, as it can be continuously updated by the community. It takes effort to steer such a project and maintain a comprehensible whole, though.
Documentation of model curation is essential in GEM development. A well-defined git-based workflow would help achieve this goal, and should therefore be within the scope of standard-GEM.
@Midnighter I am not familiar with gitbook, but from a brief look it seems like it might be too much work and something that is not necessarily maintained along with the model on GitHub. Maybe a more realistic option is to create templates for model reconstruction scripts (e.g. in MATLAB or Python) that ensure a minimum of documentation along with the reconstruction.
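Such a template could, as a minimal sketch, separate the actual curation step from the bookkeeping, so that every script is forced to record what it read, what it wrote, and why. All names, paths, and the rationale string below are placeholders, not part of standard-GEM:

```python
"""Hypothetical template for a documented model curation script."""
import datetime

# Placeholder rationale; real scripts would reference an issue or PR.
RATIONALE = "Remove duplicated reactions identified in issue #NN"

def apply_changes(model_text: str) -> str:
    """Placeholder for the actual curation step (e.g. deleting reactions)."""
    return model_text  # no-op in this sketch

def curate(model_text: str, input_path: str, output_path: str) -> tuple[str, dict]:
    """Apply the change and return the curated model plus a log entry.

    The log entry is the minimum information that should accompany
    every change: which file was read, which was written, and why.
    """
    curated = apply_changes(model_text)
    log = {
        "date": datetime.date.today().isoformat(),
        "input": input_path,
        "output": output_path,
        "rationale": RATIONALE,
    }
    return curated, log
```

The log entries could be committed alongside the model, giving a machine-readable curation history independent of the scripting language.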
Along those lines we could start thinking about a minimum information requirement that should be reported about the steps taken to create a GEM. Such guidelines exist already for various other aspects of science, in systems biology MIRIAM is a prominent example but there are plenty of others. Of course, there is Ines Thiele's famous protocol for generating a high-quality GEM, but we could start collecting key points what needs to go into such a documentation that @sulheim requests.
I agree, gitbook was my recommendation for a more meta guide on how to construct models in general, not to serve as documentation alongside one specific model.
In this context, I would like to discuss how one should organize model reconstruction and curation scripts as well as model files. We are currently curating and re-organizing the Sco-GEM model folder (to adhere to the standard-GEM template), see https://github.com/SysBioChalmers/Sco-GEM/pull/122.
We have encountered an issue where it is rather inconvenient to test / update curation scripts that have previously been used to update the model, because the model file in the repository is always the latest version. E.g. if you have previously written and applied a script that deletes a few model reactions, and you want to modify and rerun that script, you cannot test it on the model file in the repository. One solution is to keep an archive folder with previous model versions, but I believe there might be more clever solutions to this issue.
What do you think?
But would an `archive` model folder not sort of defeat the purpose of git? Meanwhile, older releases can relatively easily be extracted from the local repository with e.g.
`git show refs/tags/v1.4.2:model/standard-GEM.xml > model_v1_4_2.xml`
or the latest `master` version:
`git show master:model/standard-GEM.xml > model_master.xml`
Somewhat related to this, we are implementing an approach with Human-GEM to deal with old curation/reconstruction scripts that do not work with the current model version by moving them to a `deprecated` folder (I also like @sulheim's `archive` suggestion as a name). This would separate these scripts from those that are currently maintained, so there is no expectation that they should function as expected.
If one wanted to run an archived script, they can check out the commit where the script was last modified or used; presumably the model version at that commit would be compatible with the script.
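Combining this with the `git show` trick above, the compatible model version can even be extracted without switching branches. A sketch in a throwaway repository (the folder and file names are hypothetical, not the actual Human-GEM layout):

```shell
# Set up a throwaway demo repository (placeholder content).
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email demo@example.com
git config user.name demo
mkdir -p model deprecated
echo '<sbml/>' > model/standard-GEM.xml
echo 'print("curation")' > deprecated/remove_reactions.py
git add . && git commit -qm "model + archived script"

# The actual pattern: find the last commit that touched the archived script...
last=$(git log -1 --format=%H -- deprecated/remove_reactions.py)
# ...and extract the model as it existed at that commit.
git show "$last":model/standard-GEM.xml > model_at_script_time.xml
cat model_at_script_time.xml
```

The archived script can then be run against `model_at_script_time.xml` while the working tree stays on the current model version.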
I don't think an `archive` folder defeats the purpose of git (although I see your point @edkerk); git is much more than just access to previous model versions through the log. However, your suggestion of simply reading the model file from the master branch seems pretty elegant. I still think @JonathanRob has a good point, and these two solutions are not mutually exclusive. This is basically the same as what we have done with the sulheim2020 folder in the Sco-GEM repo.
@sulheim, a YAML-based workflow implemented in Human-GEM may provide another option for curating GEMs.
Previously, we also used scripts for adding/removing reactions and making changes to the model. As @JonathanRob mentioned, we are now archiving the old code and retiring the script-based approach. In the new workflow, only a YAML-format model file is retained in develop and other fix/feature branches. Because the YAML format is human-readable, changes to the model file are evident in the diffs, which allows script-independent curation.
For example, in PR #213 a number of duplicated metabolites and reactions were removed by a series of commits, each of which resolves one duplicated metabolite. In particular, the metabolite `malthx_s` and the associated reaction `EX_M02447[e]` were deleted in this commit, where the annotation files were also updated. With this workflow, changes can be made either by code or manually, and conveniently reviewed afterwards. A couple of assisting scripts (`testYamlConversion`, `sanityCheck`) are provided as checkpoints before and after making a PR to avoid mistakes.
This workflow is still under development and refinement, but it seems to work pretty well so far.
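To illustrate why such diffs are easy to review, a hypothetical diff for removing a duplicated metabolite and its exchange reaction might look like this (field names and values are invented for illustration, not taken from Human-GEM):

```diff
 metabolites:
     ...
-    - id: malthx_s
-      name: maltohexaose
-      compartment: s
 reactions:
     ...
-    - id: EX_M02447[e]
-      name: maltohexaose exchange
```

Each commit then corresponds to one reviewable, self-explanatory change, whether it was produced by a script or by hand.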
That's an interesting workflow @Hao-Chalmers. Although I understand that one can reproduce the model development by going through each individual commit and redoing the manual curations, it sounds less tidy than having all edits documented in a script. I think it makes sense to only have the YAML format in `devel` though (but is that compatible with the COBRA toolbox in MATLAB?).
@sulheim The YAML format was actually adapted from cobrapy, so you have built-in support in a Python environment when using COBRA.
Yes, you can keep only the YAML file in `devel` and `feat`/`fix` branches for tracking model changes; then the scripts are probably not necessary (but they can still be kept, e.g. in an `archive` folder for reference).
Reading this issue again, I find it an interesting discussion to continue. However, it would be very ambitious to construct a roadmap that would enable full reproducibility, including all the curation. Therefore, I propose to add the use of the `deprecated` folder to `.standard-GEM.md` and then convert this issue into a GitHub Discussion.
Ok. But isn't `archive` a better name than `deprecated`?
The message I am trying to send is attuned to the definitions in the Merriam-Webster dictionary:

> **archive**: a place in which public records or historical materials (such as documents) are preserved

> **deprecate**: to withdraw official support for or discourage the use of (something, such as a software product) in favor of a newer or better alternative

That being said, I would appreciate more thoughts on the matter, especially since I feel more people are familiar with the term `archive`.
Description of the issue:
It is not clear to me if this is outside the scope of standard-GEM, but I think it would be useful to come up with a language-agnostic guideline / template for how to document the development process, so it is easy for anyone to understand what the model reconstruction does and how it is performed, and how one can reproduce the current state of the model. One common practice (that I've used) is to have a script which performs the complete model reconstruction from any given starting point. This works reasonably well, but it is still not trivial for someone else to trace the reconstruction unless the code is very well documented.
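One way to make such a top-level reconstruction script easier to trace is to keep each curation step in its own numbered file and have a thin driver replay them in order. This is only a sketch of the pattern, in Python with invented paths and naming conventions:

```python
"""Minimal sketch of a reconstruction driver (all paths hypothetical).

Assumes curation scripts live in a `code/` folder with a sortable
numeric prefix, e.g. `code/001_add_reactions.py`,
`code/002_fix_gprs.py`. Running them in order reproduces the model
from the chosen starting point, and the file names themselves
document the sequence of curation steps.
"""
from pathlib import Path
import subprocess
import sys

def ordered_scripts(code_dir: Path) -> list[Path]:
    """Return curation scripts sorted by their numeric prefix."""
    return sorted(code_dir.glob("[0-9]*_*.py"))

def rebuild(code_dir: Path) -> None:
    """Replay every curation script in order, stopping on failure."""
    for script in ordered_scripts(code_dir):
        print(f"running {script.name}")
        subprocess.run([sys.executable, str(script)], check=True)
```

The same pattern works equally well with MATLAB scripts; the point is that the ordering and intent are encoded in the repository layout rather than buried in one monolithic script.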
What do you think is the best practice that should be recommended to users of standard-GEM?