SysBioChalmers / Human-GEM

The generic genome-scale metabolic model of Homo sapiens
https://sysbiochalmers.github.io/Human-GEM-guide/
Creative Commons Attribution 4.0 International

More accessible format of model annotations #203

Closed · JonathanRob closed this issue 3 years ago

JonathanRob commented 4 years ago

Description of the issue:

Currently, the model annotation information (e.g., reaction and metabolite KEGG IDs, metabolite HMDB IDs, etc.) is stored in the SBML version (.xml) of the model, or in the annotation .JSON files: humanGEMMetAssoc.JSON and humanGEMRxnAssoc.JSON.

Although the information can be retrieved by loading the JSON files into Matlab, the JSON format does not seem to be the most intuitive or human-readable format for such information. I think it would be more convenient to store this data in a tabular format (e.g., .tsv) that can easily be read in its raw form, imported by other languages (Python, R), or viewed in Excel, for example. Furthermore, if a reaction or metabolite is changed, deleted, or added, the change would then show up on a single line (of the diff), rather than being spread across multiple lines.
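To illustrate, here is a rough sketch of how such a table could be consumed directly from Python without any custom parsing; the .tsv file name and column names are hypothetical, not existing files in the repo:

```python
# Minimal sketch: reading a hypothetical tab-separated annotation file.
# The file name and column names are assumptions for illustration only.
import pandas as pd

met_assoc = pd.read_csv("humanGEMMetAssoc.tsv", sep="\t")

# One row per metabolite, so looking up all IDs for a metabolite is a one-liner:
print(met_assoc.loc[met_assoc["mets"] == "m00001c", ["metKEGGID", "metHMDBID"]])
```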

I don't recall what the original justification for using the .JSON format was, so please let me know if I'm missing a major advantage. Also, I don't think standard-GEM yet advises on how to format such information.

My suggestion would be to change the .JSON files to a .tsv or similar format. However, I know that this would likely require reworking some pipelines, notably the import of the data into Metabolic Atlas. There is also an issue (#157) suggesting that we convert the gene annotation file from .tsv to .JSON; I fully support having these all in one location/format, but why not .tsv?

Another option (or in addition to the one above) could be to enhance the export of the Excel model version so that all the annotation information is appended as additional columns in that document. That would mean that we continue using the same .JSON annotation files, but users could instead extract the data from the .xlsx file (though I am generally hesitant to trust Excel not to silently change content, such as gene names to dates).
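For the Excel option, a rough sketch (not the actual Human-GEM export pipeline; the sheet name, column names, and output file name are assumptions) of how annotation columns could be appended to the exported workbook while keeping the JSON files as the source:

```python
# Hypothetical sketch: append reaction annotation columns to the exported
# Excel model. Sheet/column names ("RXNS", "ID", "rxns") are assumptions,
# and the JSON is assumed to be a dict of equal-length lists.
import json
import pandas as pd

with open("humanGEMRxnAssoc.JSON") as f:
    rxn_assoc = pd.DataFrame(json.load(f))

rxns = pd.read_excel("Human-GEM.xlsx", sheet_name="RXNS")

# Left-join so every reaction keeps its row even if it has no annotations.
merged = rxns.merge(rxn_assoc, left_on="ID", right_on="rxns", how="left")
merged.to_excel("Human-GEM-annotated.xlsx", sheet_name="RXNS", index=False)
```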

Please let me know what your thoughts are on this; in particular @Hao-Chalmers since you established the JSONs, and @pecholleyc since this will affect import into Metabolic Atlas, and @mihai-sysbio if anything in the context of standard-GEM relates to this issue.

Expected feature/value/output:

A more accessible/human-readable annotation file format.


haowang-bioinfo commented 4 years ago

@JonathanRob the initial adoption of the JSON format for archiving annotation info was proposed in #75, based on the following considerations:

Btw, I'm open to changing either to .tsv or to any other format, if that's more convenient.

mihai-sysbio commented 4 years ago

Thanks to the refactoring of the data import process into Metabolic Atlas done by @pecholleyc, it is now much easier to adapt to new formats. The latest import format for model annotations is indeed .tsv, as shown below:

@id       db_name    ext_id       link
m00001c   BiGG       carveol      https://identifiers.org/bigg.metabolite:carveol
m00001c   KEGG       C00964       https://identifiers.org/kegg.compound:C00964
m00001c   ChEBI      CHEBI:15389  https://identifiers.org/CHEBI:15389
m00001c   Recon3D    carveol      https://identifiers.org/vmhmetabolite:carveol
m00001c   MetaNetX   MNXM45735    https://identifiers.org/metanetx.chemical:MNXM45735
m00001s   BiGG       carveol      https://identifiers.org/bigg.metabolite:carveol

From the perspective of standard-GEM, the closest recommendation at the moment is to use the SBtab format.

I believe the issue with the JSON format is its current structure, where all IDs of the same type are grouped together, instead of being grouped by reaction/metabolite. Perhaps there are other Matlab built-in ways of producing a JSON, in addition to cell arrays.
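For illustration, the difference between the two groupings could look like the sketch below; the field names are assumptions (not taken verbatim from the repo), and only one metabolite is shown, using the IDs from the table above:

```python
# Illustrative sketch of the two JSON layouts; field names are assumptions and
# only one metabolite is shown (the real files list thousands).
import json

# Current layout: one array per ID type, grouped by annotation source.
column_wise = {
    "mets":       ["m00001c"],
    "metBiGGID":  ["carveol"],
    "metKEGGID":  ["C00964"],
    "metChEBIID": ["CHEBI:15389"],
}

# Alternative layout: one object per metabolite, which maps directly onto a
# single tsv row per metabolite.
record_wise = [dict(zip(column_wise, row)) for row in zip(*column_wise.values())]
print(json.dumps(record_wise, indent=2))
```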

haowang-bioinfo commented 4 years ago

I wonder if #157 can be continued, or should it wait until this issue is settled?

JonathanRob commented 4 years ago

@Hao-Chalmers we can continue with #157, though I'm still in support of converting all such annotation files (genes, metabolites, and reactions) to a tsv-like format. We/I have received many questions and requests from users for association and annotation information, which tells me that the JSON format is not very intuitive for people to use.

haowang-bioinfo commented 4 years ago

Taking it step by step usually makes for good progress, so I agree to first solve #157, then #203.

The JSON files were initially adopted for storing annotation info. They work pretty well for script-based manipulation (by Matlab and Python) and for tracking changes on GitHub. A workflow has since evolved around these JSON files and their associated functions/scripts. Therefore, tsv annotation files can be conveniently, routinely, and automatically extracted from these JSON files with every release, as sketched below.
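A minimal sketch of such an extraction, assuming the JSON files are structured as a dict of equal-length lists (one list per ID type); the output file name is hypothetical:

```python
# Minimal sketch: regenerate a tsv annotation table from the JSON source at
# release time. Assumes the JSON is a dict of equal-length lists; the output
# file name is hypothetical.
import json
import pandas as pd

with open("humanGEMMetAssoc.JSON") as f:
    met_assoc = json.load(f)

# Each key becomes a column; each metabolite becomes a single row.
pd.DataFrame(met_assoc).to_csv("humanGEMMetAssoc.tsv", sep="\t", index=False)
```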

I'm not sure whether to deposit the tsv annotation files in the Human-GEM repo, or in Metabolic Atlas, which was designed for intuitive viewing.

mihai-sysbio commented 4 years ago

The way I see things, I would recommend that the tsv annotation files be stored in the repository, such that any other website/tool can have access to them. Ideally, such a recommendation would come from an approach like standard-GEM.

Repeating Jon's point, though: is it really necessary to have both json and tsv? If yes, I would suggest that both be derived from the same source, rather than one being derived from the other.

JonathanRob commented 4 years ago

@mihai-sysbio I completely agree - duplicating this information in two locations (formats) risks the copies becoming inconsistent, which would require yet more consistency checks.

Since there is no "source" to generate the annotation files from, my suggestion is to convert the json files to tsv format, update any functions/workflows that use these json files to instead use the tsv files, and then delete the json files. The tsv files would then effectively serve as an "annotation database" for Human-GEM, as the json files did.

JonathanRob commented 4 years ago

If there are no objections, I volunteer to implement the json --> tsv conversion, including updating any functions that use these files.

If preferred, we can have a brief "transition period" during which both the json and tsv files exist, after which the json files are removed.

JonathanRob commented 3 years ago

This issue was resolved in PR #212.