GitHub versus specialized platform (database)

Midnighter commented 4 years ago

This is a big discussion to have. There are other approaches to standardization and it is important to clearly lay out the pros and cons.

Platforms

Pathway Tools by Peter Karp's group
KBase & ModelSEED (just tagging some associated GitHub names @cshenry, @janakagithub, @samseaver)
MetaNetX
Not public yet but @zakandrewking was developing a kind of "GitHub for GEMs" at @SBRG before he left for Amyris
Others that I'm forgetting now

Pros

I think such platforms have a lot to offer.

They provide a lot of standardization out of the box. By users simply picking components from such a database they get all of the annotations "for free".
With an active user base, the range of known reactions will gradually increase.
There is a chance for standardizing and automating many processes as is done at, e.g., KBase. Thus model reconstruction becomes accessible to a wider audience.

Probably the platform maintainers can come up with more good reasons.

Cons

We are pouring a lot of public money into building metabolic models and maintaining the tools to work with them. As such, these platforms need to fulfil a number of criteria in my opinion:

They need to be completely open source (KBase does this, Pathway Tools and MetaNetX don't).
- Clearly the methods need to be open source otherwise we are not doing science.
- Also, the whole infrastructure needs to be open source so that different research groups around the globe can collaborate and, in unfortunate cases, take over hosting the infrastructure.
There can be no access barriers (MetaNetX does this, KBase mostly, Pathway Tools doesn't).
- I know it costs money to run infrastructure and maintain code but paid access is not an option when public money was and is continually used for a resource.
- This is where I see KBase being slightly problematic although it is otherwise very open. The tight government integration with the DOE means that the possibility to contribute to KBase is quite stringently controlled (and may change on political whims).

vs GitHub

Pros

Easy to sign up
Following your proposed model repo template is a fairly low barrier
Lots of freedom for model authors

Cons

Although GitHub appears like it is a public good, it is owned by a private company (Microsoft). They can revoke access (as was done for Iran in the past) and no one could replicate the infrastructure.
The freedom for model authors means less standardization and more work, e.g., on annotations for authors
It will be interesting to see how well your auto-generated overview will work in practice. Without a single point of entry the fractionation of efforts will be a serious downside.

These are just some preliminary thoughts. Happy to hear your thoughts and points of view. Either way, I think more effort is needed in this area and I'm glad that you're rising to the challenge.

haowang-bioinfo commented 4 years ago

@Midnighter thanks for bringing up this important topic.

IMO, one of the major advantages of using GitHub for hosting GEMs is the transparent and well-documented curation process, which is essential for long-term evolvement of the field.

mihai-sysbio commented 4 years ago

Excellent overview @Midnighter. To add to it, to me it's more about Platforms vs git and GitHub.

Although model changes can be done manually, in many cases they are done via scripts. I'm unsure which platforms allow code sharing together with model changes. By keeping these together we increase reproducibility.

git is distributed, so the repository itself can in theory be moved to another platform (GitLab, Bitbucket), even self-hosted variants. But it's true, GitHub was and continues to be privately owned. There are also some advantages to that: features, maintenance and infrastructure at no cost.

Some of the cons of GitHub can be addressed via standard-GEM, particularly Memote testing (not implemented yet but the structure is shaping up). In the beginning, we can rely on GitHub topic indexing for the overview. But it would make more sense to build a page that provides all the information the community needs.
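A minimal sketch of what topic-based indexing could look like, using GitHub's public repository search endpoint. The topic name `standard-gem` is an assumption for illustration (the thread does not say which topic would be used), and the function names are hypothetical. Only the first page of results is fetched.

```python
import json
import urllib.parse
import urllib.request

API = "https://api.github.com/search/repositories"


def topic_search_url(topic, per_page=50):
    """Build the GitHub repository-search URL for a single topic."""
    query = urllib.parse.urlencode({"q": f"topic:{topic}", "per_page": per_page})
    return f"{API}?{query}"


def list_repositories(topic):
    """Return (full_name, html_url) pairs for repositories tagged with `topic`.

    This sends an unauthenticated request, which GitHub rate-limits.
    """
    request = urllib.request.Request(
        topic_search_url(topic),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [(item["full_name"], item["html_url"]) for item in payload["items"]]
```

Calling `list_repositories("standard-gem")` would give the raw material for an overview page; anything beyond a list of names and links (model organism, latest release, test status) would have to come from the repositories themselves.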

haowang-bioinfo commented 4 years ago

@mihai-sysbio very good point. Although standard-GEM begins with GitHub, its development should aim toward more generic guidelines that can be applied for any Git-based platforms, e.g. GitLab.

Midnighter commented 4 years ago

IMO, one of the major advantages of using GitHub for hosting GEMs is the transparent and well-documented curation process, which is essential for long-term evolvement of the field.

Yes, I agree. This is a very important feature. (And one that @zakandrewking had approached beautifully.)

cshenry commented 4 years ago

I just want to clarify questions about access limits in KBase.

So KBase is completely open and free for any user to sign up to run the tools, contribute data, and retrieve data. There are no restrictions on that, and I don’t see that ever changing.

There are light restrictions on who is allowed to contribute code via our SDK mandated upon us by DOE. DOE is somewhat finicky about who is allowed to contribute code that runs on DOE machines. Basically, this process involves filling out a form (accounts.kbase.us) to get a developer account on KBase.

All that said, while I would love to see KBase serve as a model atlas, I also see the advantages of a github based system (we use github for our ModelSEED biochemistry database). KBase does do versioning on all objects in its data store, but it doesn’t currently offer the rich tooling on tracking contributions and doing diffs that github does. There’s also nothing stopping the platforms like KBase or PathwayTools from linking deeply to a github resource. I know I would be interested in adding apps in KBase to automatically import from such a site if it was created.

One thing I would consider to be of utmost importance in such a site is to properly represent the genomes linked to the models. Ideally, I would prefer to see the site maintain its own internal compressed copies of GFF and FASTA files for genomes associated with any models stored there. People routinely use genome IDs… but these IDs go away or genes get recalled and it makes things difficult. I would argue a model is nearly useless without its associated genome, and finding the exact correct genome that should be mapped to a particular published model is one of my greatest pain points in trying to use these models in my own research. You could store protein sequences in the model, which would help, but without the genome, you’re still losing some provenance on where the protein came from.
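One way to keep such compressed genome copies verifiable is to store a checksum of the uncompressed file next to each archive. The following is only a sketch: the function name, the returned fields and the directory layout are assumptions, not something KBase or standard-GEM prescribes.

```python
import gzip
import hashlib
from pathlib import Path


def archive_genome_file(source, target_dir):
    """Gzip a GFF or FASTA file into target_dir and describe the result.

    The SHA-256 is computed on the uncompressed bytes, so it identifies
    the genome file itself rather than one particular compression run.
    """
    source = Path(source)
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    data = source.read_bytes()
    target = target_dir / (source.name + ".gz")
    # filename="" and mtime=0 keep the gzip header free of local details,
    # so archiving the same input twice yields identical bytes.
    with open(target, "wb") as raw:
        with gzip.GzipFile(filename="", mode="wb", fileobj=raw, mtime=0) as gz:
            gz.write(data)

    return {
        "file": target.name,
        "sha256": hashlib.sha256(data).hexdigest(),
        "bytes": len(data),
    }
```

The returned dictionaries could be collected into a small manifest that is committed alongside the model, so that anyone can check whether the genome they downloaded is the one the model was built against.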

edit: removed the email body.

edkerk commented 4 years ago

One thing to keep in mind is that standard-GEM does not have the ambition to replace BiGG, ModelSEED, etc., and become the de facto model repository, although users could of course still decide to also version-control and distribute their model via Git, even when generated by these other platforms.

My personal experience of reading and reviewing papers is that there are many models that are not generated by any of the platforms mentioned above, but rather by COBRA, cobrapy, RAVEN, etc., using custom scripts. This is particularly the case for curation of existing models. These models are now often only distributed as a final SBML file in Supplementary Material (and perhaps submitted to BioModels Database). Some of them are already on GitHub, but the format and content of these repositories vary widely.

Regardless, an important aspect of standard-GEM is not only that it can distribute the model in a variety of standardized formats, but also that its changes and development can be tracked in a flexible and versatile way. Model files can be complemented with scripts and data, and it does not matter what software is used. The argument is not against other platforms, but if your model (project) would benefit from version control (and I would argue most model projects do), then you should consider following standard-GEM, so that e.g. the rest of the community will find it easier to contribute and track changes. I can perfectly imagine a pull request where a model was made via ModelSEED and then updated on the GitHub repo.

We currently have/work on functions to write the correct files in the right format for COBRA/RAVEN/cobrapy, but maybe this can be expanded by having such functionality for the other platforms as well.

mihai-sysbio commented 4 years ago

I would love to hear more about the philosophy behind a "GitHub for GEMs" @zakandrewking, and how that improves upon the philosophy behind eg MEMOsys.

Reading between the lines, I see some consensus between:

an expected structure, for the interest of the community eg. adoption by different database-centric websites (BiGG, BioModels, Metabolic Atlas) and tools (openCOBRA, RAVEN, ecModels) (draft use case)

generic guidelines that can be applied for any Git-based platforms (@Hao-Chalmers)

There’s also nothing stopping the platforms like KBase or PathwayTools from linking deeply to a github resource. I know I would be interested in adding apps in KBase to automatically import from such a site if it was created (@cshenry)

standard-GEM does not have the ambition to become the de facto model repository (@edkerk)

From this perspective, Platforms vs git and GitHub becomes more of a discussion around how to standardize the process of working with models openly. To this end, a standard for models on GitHub would be very useful for any sort of model-hosting infrastructure.

draeger commented 4 years ago

Just one question about line-based diff tools such as git: many standardized file formats are based on XML, which does not require a fixed order of its contained elements. For constraint-based modeling, SBML has become most effective with the package FBC (flux-balance constraints), which is only directly supported since Level 3 Version 1. While earlier SBML specifications pointed out that the order of its elements is significant, the more recent specifications (since L3V1) explicitly mention that this is no longer the case. Consequently, line-based diff tools such as git might not be able to identify and track changes if users scramble up the order of model elements. How could a "GitHub for GEMs"-approach deal with that problem?

cshenry commented 4 years ago

I would think you should create api scripts to process and check formats... which could handle sorting to make the files more diff-able. These scripts could also handle validation. Running these scripts should be a prerequisite to getting a PR accepted with modifications to a model. Running tools like memote could be bundled in to do qa/qc. This is the direction we’ve gone with ModelSEEDDatabase, which is on GitHub.
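A minimal sketch of the sorting part of such a script, using only the Python standard library. It assumes that element order inside every `listOf...` container carries no meaning (as stated above for SBML L3V1 and later) and that children can be ordered by their `id` or `species` attribute. The function names are hypothetical and this is not the code used in ModelSEEDDatabase. `ET.indent` needs Python 3.9 or newer.

```python
import xml.etree.ElementTree as ET


def local_name(name):
    """Strip the '{namespace}' prefix that ElementTree puts on names."""
    return name.rsplit("}", 1)[-1]


def sort_key(child):
    """Order children by an 'id' attribute, falling back to 'species'."""
    for wanted in ("id", "species"):
        for attribute, value in child.attrib.items():
            if local_name(attribute) == wanted:
                return value
    return ""  # no usable key: the stable sort keeps relative order


def canonical_sbml(text):
    """Return the SBML text with every 'listOf...' container sorted."""
    root = ET.fromstring(text)
    # Take a snapshot of the elements first, since children are reordered.
    for element in list(root.iter()):
        if local_name(element.tag).startswith("listOf"):
            element[:] = sorted(element, key=sort_key)
    # Reset indentation so that whitespace does not differ between inputs.
    ET.indent(root)
    # Note: ElementTree may rewrite namespace prefixes (ns0, ns1, ...).
    return ET.tostring(root, encoding="unicode")
```

Two files that differ only in the order of their species or reactions give the same output from `canonical_sbml`, so a check on a pull request could compare or diff the canonical forms instead of the raw files. Validation and memote runs would be separate steps.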

mihai-sysbio commented 4 years ago

@draeger the ordering of elements in SBML (L3V1+) is something we indeed need to be mindful of. From what I'm seeing at a glance, the history of the SBML file of the yeast-GEM has been well preserved wrt diffs. It's probably thanks to software tools exporting the file in the same order, whatever that order is. It sounds unlikely that tools would purposefully scramble the order. If, however, the order becomes scrambled between different releases, one could rely on tools like sbml-diff. We encourage model repositories to rely on other, more lightweight formats for versioning (eg yaml) and only export SBML on the main branch.

As @cshenry points out, as soon as there is a standard for repositories, one can create all sorts of systems for validation, eg the automated-validation branch of this repository. Alternatively, standard-GEM could provide workflow scripts for GitHub Actions that do this validation inside each repository, if that would be interesting.

phantomas1234 commented 3 years ago

@draeger @mihai-sysbio you could rely on memote's approach on using YAML files (generated in addition to the SBML files) to facilitate easier line-based diff?

sulheim commented 3 years ago

I support using YAML files for easier diff. I guess the yaml-file can be created by a pre-commit hook that also sorts elements and annotations.
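A rough sketch of what the body of such a hook could look like, assuming cobrapy is installed and that the YAML file should sit next to the SBML file with a `.yml` suffix (both the naming and the function names are assumptions). It relies on cobrapy's `read_sbml_model` and `save_yaml_model`, where `sort=True` asks cobrapy to write the model's components in sorted order.

```python
from pathlib import Path


def yaml_path_for(sbml_path):
    """Map e.g. 'model/yeast-GEM.xml' to 'model/yeast-GEM.yml'."""
    return Path(sbml_path).with_suffix(".yml")


def export_sorted_yaml(sbml_path):
    """Read an SBML model and write a sorted YAML copy next to it."""
    # Imported here so the path helper works without cobrapy installed.
    from cobra.io import read_sbml_model, save_yaml_model

    model = read_sbml_model(str(sbml_path))
    target = yaml_path_for(sbml_path)
    save_yaml_model(model, str(target), sort=True)
    return target


def main(filenames):
    """Export every SBML file among the given (e.g. staged) filenames."""
    for name in filenames:
        if name.endswith(".xml"):
            print(f"wrote {export_sorted_yaml(name)}")
    return 0
```

A hook script would call `main(sys.argv[1:])` with the staged filenames. The generated YAML file still has to be staged before it becomes part of the commit.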

draeger commented 3 years ago

Personally, I also think that SBtab has a lot of potential. It is supposed to be compatible with SBML but provides a view of the models suitable for exchange via spreadsheet programs such as Excel. For model development, the row-based SBtab also has the advantage that it can be directly understood and read (not only by machines but also by users) and most people who work in the lab are familiar with Excel. Changes in such a format could also be understood by line-based comparison tools such as Git. SBtab could be used for model development and be exported to SBML for analysis.

Midnighter commented 3 years ago

I think so, too, there is a lot of potential for SBtab and ObjTables (which I understand as the spiritual successor to SBtab). Indeed Wolfram Liebermeister asked whether direct support for SBtab could be added to cobrapy.

mihai-sysbio commented 3 years ago

This is a very valuable discussion to have. At the moment, however, I feel it is hard to formulate action points, so I am going to convert this to a Discussion. When actionable items arise, issues can be created from the discussion.

MetabolicAtlas / standard-GEM