MetabolicAtlas / standard-GEM

The standard for open-source GEMs on GitHub
https://www.biorxiv.org/content/10.1101/2023.03.21.512712
Creative Commons Attribution 4.0 International
18 stars 5 forks source link

Encourage the use of dedicated fields rather than semantic overloading #27

Open draeger opened 3 years ago

draeger commented 3 years ago

Description of the issue:

Traditionally, identifiers have been used to encode various types of information, e.g., a metabolite's compartment. While this approach might lead to recognizable displays, it causes several problems in information storage and is an avoidable source of error and ongoing debates. It should be noted that display names and scientific names can alternatively be encoded in different ways.

Expected feature/value/output:

Working with models would become easier if identifiers could be treated as what they are: unique String keys to identify particular objects within a model. We should avoid expecting anything else from them. Using them as display-names or for any further semantic information often leads to misunderstanding and complicated parsing routines. Trying to validate their assumed structure can lead to incompatibilities to models from scientists outside of the COBRA community.

Current feature/value/output:

Let's use only dedicated fields to store the semantics of a model. For instance, the compartmental localization of a metabolite can unambiguously be stored using the species' compartment attribute, so that we don't need to rely on any suffix of an identifier. If a display name is needed as an abbreviation, it can, for instance, be stored using the new key-value pair structure in the upcoming FBC version 3 package. To avoid redundancy, one could, for example, keep an abbreviation for a metabolite (e.g., glc) and algorithmically append _c, [c], or any other representation for the compartment depending on user preferences.

Reproducing these results:

Many sources of error could be eliminated. Identifier parsing is cumbersome and error-prone. It should be avoided. It is complicated and compromises the reproducibility and interoperability of models.

JonathanRob commented 3 years ago

Very good points, @draeger. To be sure that I understand correctly, are you suggesting that metabolite identifiers in a model should not contain any compartment information? If so, I suspect this would create problems for software/formats that expect or require that the metabolite identifiers be unique.

Or are you suggesting that the metabolite identifiers, which may contain a compartment abbreviation, should not be used to extract such information?

Midnighter commented 3 years ago

I think it's a good idea to propose a change like this for a forward looking recommendation like standard-GEM. Even though it will make it even harder for tools to read since there will be thousands of "old-style" models and a few new ones :stuck_out_tongue_closed_eyes:

draeger commented 3 years ago

@JonathanRob, an ID can, of course, contain any prefix, infix, or suffix that is commonly allowable. If it is an abbreviation for a compartment, that is fine. However, a software should not rely on a specific pattern for an identifier. If a metabolite has the ID _452oijrkjna095486foherl it is as valid as M_glc_6_p_c. The ID is not a display name. If a short, human-readable display name is needed, it can either be generated from the information stored in the object or stored as name or within the key-value pair list.

JonathanRob commented 3 years ago

Ok, yes, I absolutely agree.

edkerk commented 3 years ago

@draeger Very good point. Besides the case of compartment information in metabolite IDs, are there other cases that you are aware of?

@Midnighter Reading should be fine, use of compartment field is part of SBML L3V1 + FBCv2 (maybe even earlier?) and at least COBRA toolbox, cobrapy and RAVEN can deal with this. Once loaded, however, COBRA still appends the compartment information to metabolite ID instead of having a dedicated compartment field, but there has been some progress on including a metComps fields: https://github.com/opencobra/cobratoolbox/pull/1396. When that will be released, COBRA can import->export models without changing metabolite IDs whatsoever. I don't know what is the status of cobrapy.

Midnighter commented 3 years ago

I think @matthiaskoenig would have some ideas on this topic, too.

matthiaskoenig commented 3 years ago

Not sure what to add here.

Midnighter commented 3 years ago

I think it's worth talking about how to best organize "display names" or "symbols" for model elements. I know that the interactivity offered by using most of the BiGG identifiers in cobrapy interactively is a major selling point for it. (Basically, being able to call methods on an element such as model.metabolites.atp_c is super convenient for exploration.)

draeger commented 3 years ago

We have three different things that people seem to like to represent with identifiers or names:

  1. An actual identifier for specific objects, e.g., metabolites,
  2. An abbreviated display name that reflects a summarized version of key attribute values,
  3. In contrast, a scientific name that is typically relatively long.

Now, the identifier could be an arbitrary String, e.g., e83oiha24vuiase2 or so. The abbreviated display name (e.g., M_glc_c) could be stored in at least two different ways in SBML:

  1. Use the name attribute that all objects in SBML have now. When doing so, the scientific name has to go somewhere else. As it turns out, there is often more than one scientific name. Let's stick to the example with metabolites. They usually have a lot of synonyms. If MIRIAM annotations are used, any scientific name can be retrieved on the fly from linked resources, such as KEGG, ChEBI, or BiGG via REST APIs or similar mechanisms. Alternatively, one could store the scientific name in a key-value pair field.
  2. If using the new key-value pair list from FBC version 3 for encoding the abbreviated name, then the preferred scientific name can be stored in the name attribute.

In any case, I'd recommend only keeping a very fundamental abbreviation and algorithmically appending any information, such as the compartment code, because if the model is changed, that information is likely to get out of sync. If only an abbreviation, such as glc for D-glucose, is stored in the key-value list, all other information can be obtained from the object, e.g., the compartment.

As @matthiaskoenig points out, if a software internally stores multiple different kinds of information in one variable at once, this is a tool problem and should be solved by adding additional variables to that software (which is much easier to do than complicated and highly error-prone String processing).

To answer @edkerk's question: There can be many different things encoded in some identifiers, such as compartment, tissue, reaction types, etc.. Here is an overview: https://github.com/SBRG/bigg_models/wiki/BiGG-Models-ID-Specification-and-Guidelines.

matthiaskoenig commented 3 years ago

I know that the interactivity offered by using most of the BiGG identifiers in cobrapy interactively is a major selling point for it.

@Midnighter Yes, it would be a selling point if it would work. With the current Bigg identifiers this does not work consistently (see https://github.com/opencobra/cobrapy/issues/828). With valid SBML identifiers code completion just works, you just use model.metabolites.M_3pg_c. Another reason not to add/remove prefixes. If you use SBML SIds as they are without any changes they work as variable names and in autocompletion, e.g. in ipython.

SBML SIds are valid variable names! This is one of the big selling points for them.

Storing the additional information is not too complicated. One can use the name attribute for a long display name and store an additional label (short-name) for instance by using key-value pairs.

Midnighter commented 3 years ago

Storing the additional information is not too complicated. One can use the name attribute for a long display name and store an additional label (short-name) for instance by using key-value pairs.

I strongly suggest that if standard-GEM recommends using key-value pairs to also recommend one specific key. I don't care what it's called but better to settle on one key early on.

matthiaskoenig commented 3 years ago

@Midnighter The complete key-value pair feature is already in the open pull request for cobrapy from GSOC (https://github.com/Hemant27031999/cobrapy/blob/metadata_fbc3_group/src/cobra/core/metadata/keyvaluepairs.py) Importantly the Key-Value-Pairs have a URI attribute which solves exactly the issue of different keys everywhere. I.e. the URI allows to link to a set of defined keys with explanation. The fbc community just has to decide which keys to use and what they exactly mean, then we can define the URI for the keys and the meaning is unique throughout the community.