JonathanRob closed this issue 3 years ago
@JonathanRob the initial adoption of the JSON format for archiving annotation info was proposed in #75, based on the following considerations:
Btw, I'm open to changing either to .tsv or to any other format, if more convenient.
Thanks to the refactoring of the data import process into Metabolic Atlas done by @pecholleyc, it is much easier now to adapt to new formats. The latest import format for model annotation is indeed .tsv, like below:
@id	db_name	ext_id	link
m00001c	BiGG	carveol	https://identifiers.org/bigg.metabolite:carveol
m00001c	KEGG	C00964	https://identifiers.org/kegg.compound:C00964
m00001c	ChEBI	CHEBI:15389	https://identifiers.org/CHEBI:15389
m00001c	Recon3D	carveol	https://identifiers.org/vmhmetabolite:carveol
m00001c	MetaNetX	MNXM45735	https://identifiers.org/metanetx.chemical:MNXM45735
m00001s	BiGG	carveol	https://identifiers.org/bigg.metabolite:carveol
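To illustrate how a long-format file like this could be consumed, here is a minimal Python sketch. It assumes the columns are tab-separated and named exactly as in the excerpt (`@id`, `db_name`, `ext_id`, `link`); the function name `load_annotations` is hypothetical and not part of any Human-GEM or Metabolic Atlas code.

```python
import csv
import io
from collections import defaultdict

# A few rows taken from the excerpt; a real file would be opened from disk instead.
SAMPLE_TSV = (
    "@id\tdb_name\text_id\tlink\n"
    "m00001c\tBiGG\tcarveol\thttps://identifiers.org/bigg.metabolite:carveol\n"
    "m00001c\tKEGG\tC00964\thttps://identifiers.org/kegg.compound:C00964\n"
    "m00001s\tBiGG\tcarveol\thttps://identifiers.org/bigg.metabolite:carveol\n"
)

def load_annotations(handle):
    """Return {model_id: {db_name: [ext_id, ...]}} from a long-format TSV."""
    by_id = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(handle, delimiter="\t"):
        by_id[row["@id"]][row["db_name"]].append(row["ext_id"])
    return by_id

annotations = load_annotations(io.StringIO(SAMPLE_TSV))
print(annotations["m00001c"]["KEGG"])  # ['C00964']
```

Because each external ID sits on its own row, adding or removing one annotation touches exactly one line of the file.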
From the perspective of standard-GEM, the closest recommendation at the moment is to use the sbtab format.
I believe the issue with the json format is its current structure, where all ids of the same type are grouped together, instead of being grouped by reaction/metabolite. Perhaps there are other built-in Matlab ways of obtaining a json besides going through a cell array.
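The difference between the two groupings can be sketched in Python. The field names below (`mets`, `metBiGGID`, `metKEGGID`) and the empty KEGG entry are assumptions made for illustration only; the actual keys in the Human-GEM JSON files may differ.

```python
import json

# Hypothetical "grouped by id type" layout: one parallel list per identifier type.
COLUMN_ORIENTED = json.loads("""
{
  "mets": ["m00001c", "m00001s"],
  "metBiGGID": ["carveol", "carveol"],
  "metKEGGID": ["C00964", ""]
}
""")

def group_by_entity(columns, id_key="mets"):
    """Regroup parallel lists into one record per reaction/metabolite."""
    ids = columns[id_key]
    fields = {name: values for name, values in columns.items() if name != id_key}
    for name, values in fields.items():
        if len(values) != len(ids):
            raise ValueError(f"{name} has {len(values)} entries, expected {len(ids)}")
    return {
        entity: {name: values[i] for name, values in fields.items()}
        for i, entity in enumerate(ids)
    }

records = group_by_entity(COLUMN_ORIENTED)
print(json.dumps(records["m00001c"], sort_keys=True))
```

In the regrouped form, everything known about one metabolite sits together, which is closer to what a reader expects when looking up a single entry.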
I wonder if #157 can be continued or wait until this issue is settled?
@Hao-Chalmers we can continue with #157, though I'm still in support of converting all such annotation files (genes, metabolites, and reactions) to a tsv-like format. We/I have received many questions and requests from users for association and annotation information, which tells me that the JSON format is not very intuitive for people to use.
Working step by step usually makes good progress, so I agree to solve #157 first, then #203.
The JSON format files were initially adopted for storing annotation info. They work pretty well for script-based manipulation (by Matlab and Python) and for tracking changes on GH. So far, a workflow has evolved around these JSON files and the associated functions/scripts. Therefore, tsv format annotation files can be conveniently, routinely and automatically extracted from these JSON files with every release.
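As a rough sketch of what such an automated extraction could look like in Python: the function `columns_to_tsv` and the field names are hypothetical, and the input is assumed to be a JSON object holding one parallel list per field. This is not the existing Human-GEM workflow.

```python
import csv
import io

def columns_to_tsv(columns, id_key, out):
    """Write parallel lists as a wide TSV: one row per model entity."""
    fields = [name for name in columns if name != id_key]
    expected = len(columns[id_key])
    for name in fields:
        # Guard against silently truncated rows when lists differ in length.
        if len(columns[name]) != expected:
            raise ValueError(f"{name} has {len(columns[name])} entries, expected {expected}")
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([id_key] + fields)
    for row in zip(columns[id_key], *(columns[name] for name in fields)):
        writer.writerow(row)

# Assumed example content; real data would come from json.load() on the annotation file.
example = {
    "mets": ["m00001c", "m00001s"],
    "metBiGGID": ["carveol", "carveol"],
    "metKEGGID": ["C00964", ""],
}
buffer = io.StringIO()
columns_to_tsv(example, "mets", buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

A script like this could run as part of each release, so the tsv files never need to be edited by hand.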
Not sure whether to deposit the tsv annotation files in the Human-GEM repo, or in Metabolic Atlas, which was designed for intuitive viewing.
The way I see things, I would recommend that the tsv annotation files be stored in the repository, such that any other website/tool can have access to them. Ideally, such a recommendation would come from an approach like standard-GEM.
Repeating Jon's point, is it really necessary to have both json and tsv though? If yes, I would suggest that both are derived from the same source, rather than one being derived from the other.
@mihai-sysbio I completely agree - duplicating this information in two locations (formats) risks that they become inconsistent, requiring even further consistency checks.
Since there is no "source" to generate the annotation files from, my suggestion is to convert the json files to tsv format, update any functions/workflow that uses these json files to instead use the tsv files, then delete the json files. The tsv files then effectively serve as an "annotation database" for Human-GEM, as did the json files.
If there are no objections, I volunteer to implement the json --> tsv conversion, including updating any functions that use these files.
We can have a brief "transition period" where we have both json and tsv files, after which the json files are removed, if that would be preferred.
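If both formats do coexist for a while, a small check could flag entries where they disagree. The Python sketch below is only an illustration: the function names, the `mets` key, and the assumption that the json holds parallel lists while the tsv has one row per entity are all hypothetical.

```python
import csv
import io
import json

def json_rows(text, id_key):
    """Parse a parallel-list JSON document into {entity: {field: value}}."""
    columns = json.loads(text)
    fields = [name for name in columns if name != id_key]
    return {
        entity: {name: columns[name][i] for name in fields}
        for i, entity in enumerate(columns[id_key])
    }

def tsv_rows(text, id_key):
    """Parse a wide TSV document into {entity: {field: value}}."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {
        row[id_key]: {k: v for k, v in row.items() if k != id_key}
        for row in reader
    }

def find_mismatches(json_text, tsv_text, id_key="mets"):
    """Return the sorted ids whose annotations differ or exist in only one file."""
    from_json = json_rows(json_text, id_key)
    from_tsv = tsv_rows(tsv_text, id_key)
    all_ids = from_json.keys() | from_tsv.keys()
    return sorted(e for e in all_ids if from_json.get(e) != from_tsv.get(e))

# Assumed sample content in which the two files disagree on one KEGG entry.
JSON_TEXT = '{"mets": ["m00001c", "m00001s"], "metKEGGID": ["C00964", ""]}'
TSV_TEXT = "mets\tmetKEGGID\nm00001c\tC00964\nm00001s\tC00964\n"
print(find_mismatches(JSON_TEXT, TSV_TEXT))  # ['m00001s']
```

Running such a check in CI during the transition would catch drift early; once the json files are deleted it is no longer needed.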
This issue was resolved in PR #212.
Description of the issue:
Currently, the model annotation information (e.g., reaction and metabolite KEGG IDs, metabolite HMDB IDs, etc.) is stored in the SBML version (.xml) of the model, or the annotation .JSON files: humanGEMMetAssoc.JSON and humanGEMRxnAssoc.JSON.

Although the information can be retrieved by loading the JSON files into Matlab, the JSON format does not seem to be the most intuitive or human-readable format for such information. I think it would be convenient to instead store this data in a tabular format (e.g., .tsv) that can easily be read in its raw form, imported by other languages (python, R), or viewed in Excel, for example. Furthermore, if a change is made to a reaction or metabolite (or a metabolite is deleted/added), it would be nice if it all ends up on a single line, rather than being spread across multiple lines.

I don't recall what the original justification of using the .JSON format was, so please let me know if I'm missing a major advantage. Also, I don't think standard-GEM yet advises on how to format such information.

My suggestion would be to change the .JSON files to a .tsv or similar format. However, I know that this would likely require reworking some pipelines, namely with importing the data into Metabolic Atlas. There is also an issue (#157) suggesting that we convert the gene annotation file from .tsv to .JSON - I fully support having these all in one location/format, but why not .tsv?

Another option (or in addition to the one above) could be to enhance the export of the Excel model version so that all the annotation information is appended as additional columns in that document. That would mean that we continue using the same .JSON annotation files, but users could instead extract the data from the .xlsx file (though I am generally hesitant to trust Excel not to silently change content, such as gene names to dates).

Please let me know what your thoughts are on this; in particular @Hao-Chalmers since you established the JSONs, and @pecholleyc since this will affect import into Metabolic Atlas, and @mihai-sysbio if anything in the context of standard-GEM relates to this issue.

Expected feature/value/output:
A more accessible/human-readable annotation file format.