json / yaml export - complicated data structure

liebermeister commented 4 years ago

Hi Jonathan,

again a comment regarding file conversion (same data file, examples/biochemical_models/data.xlsx); now it concerns json and yaml export via

obj-tables convert schema.csv data.xlsx data.json
obj-tables convert schema.csv data.xlsx data.yml

From the original tables, I had expected a data structure such as

[ [model_1], [compound1, compound2, compound3, ..], [reaction1, reaction2] ]

where the entries point to each other via their ids: e.g., reaction1 would have an attribute "Model", with value "e_coli", which matches the id of model_1.

Now I saw that the attributes (by which objects point to other objects) do not contain ids (or "variable names"), but the objects themselves. Specifically, the data structure starts with compound1, which contains (as an attribute) a data structure describing model_1, which in turn contains all compounds and reactions (which then, again, contain "simple" instances of the model). At the end, there are all the other compounds (with no model or reaction information at all).

I imagine that this data structure does the job for exporting / importing the python data objects, but it is difficult to make sense of. Can you have a look at this again and see if my solution (described above) would also work for you? (that is, representing a table by a list of relatively "flat" objects, whose attributes can be strings or lists, but not objects themselves, and which instead point to other objects via ids?)
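For illustration, the layout proposed above could look roughly like this in Python. This is only a sketch of the idea, not the format that obj_tables writes; the attribute name `model` and the helper `find_by_id` are made up for the example, and it assumes every object has a unique string id.

```python
import json

# Hypothetical "flat" document: one list per table, and objects refer to
# each other only through their human-readable ids (plain strings).
document = {
    "Model": [{"id": "e_coli"}],
    "Compound": [
        {"id": "D_Glucose", "model": "e_coli"},
        {"id": "Pyruvate", "model": "e_coli"},
    ],
    "Reaction": [
        {"id": "PGI_R02740", "model": "e_coli"},
    ],
}


def find_by_id(document, table, object_id):
    """Return the first object in `table` whose `id` equals `object_id`."""
    for obj in document[table]:
        if obj["id"] == object_id:
            return obj
    raise KeyError(f"{table} has no object with id {object_id!r}")


# Following a reference means looking up the id in the target table.
reaction = document["Reaction"][0]
model = find_by_id(document, "Model", reaction["model"])

# Every value is a string or a list, so the document serializes directly.
text = json.dumps(document, indent=2)
```

The lookup only works if each table has an id column, which is the assumption this layout depends on.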

Thank you!

All the best, Wolf

jonrkarr commented 4 years ago

This scheme won't work. These formats are not intended to be human-readable. I don't think JSON or YAML is well-suited for that purpose. The tabular formats are better suited for human readability. These formats are intended to make it easy for a machine to reconstruct a dataset, including all of the relationships. These formats cannot rely on the human-readable ids because classes aren't required to have such ids. The formats have to encode relationships more generically. The formats also have to communicate information about the type of each object so the files are independent from schemas.

liebermeister commented 4 years ago

Thank you!! I see the point.

The problem I have is that I thought I could import the json / yaml files into matlab and directly obtain a good data structure. At the moment, the way information is arranged seems a bit arbitrary: information about model(s), compounds, and reactions is scattered over the tree structure, and some information (e.g. id: e_coli for the model) appears many times, while other information is not duplicated at all.

What I don't understand: in the yaml tree, each object already has an id and a type (e.g., "id: 0" and "type: Compound" for the fructose 6 phosphate). So why can't these objects be ordered in lists (one list for each type, and within the list, objects would be ordered by their id)? That would be much closer to what I had in mind.

Basically, I think it would be nice to have a "symmetric" data structure, in which all compounds appear in the same way, reactions appear in the same way, and so on. When importing the yaml structure into matlab, I will have to convert it into such a form - so why not also structure the yaml file like this?

Maybe we can talk in the next few days?

All the best, Wolf

jonrkarr commented 4 years ago

Because JSON can't represent circular relationships among objects and because JSON doesn't support custom classes, we have to use custom code and design choices to encode ObjTables data into JSON and decode ObjTables data out of JSON. Even if we change how the data is encoded into JSON, we'd still need custom code to decode the type information and references. This encoding/decoding is encapsulated by the obj_tables.io.JsonReader and obj_tables.io.JsonWriter classes. Because these design choices are encapsulated, and because the JSON isn't intended to be human-readable, how the data is encoded into JSON isn't important to users.
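The circular-reference limitation is easy to reproduce with Python's standard `json` module. The sketch below uses plain dictionaries rather than obj_tables classes, and the id-based encoding at the end is only an illustration (the key names `__id` and `__type` follow the ones that appear later in this thread), not the JsonWriter implementation.

```python
import json

# Two plain dictionaries that refer to each other, like a model and its compound.
model = {"id": "e_coli", "compounds": []}
compound = {"id": "D_Glucose", "model": model}
model["compounds"].append(compound)

# The standard library refuses to serialize the cycle.
try:
    json.dumps(model)
    cycle_error = None
except ValueError as error:
    cycle_error = str(error)

# One way around this: give each object an internal id and write references
# as small dictionaries that carry only that id.
encoded = [
    {"__id": 0, "__type": "Model", "id": "e_coli", "compounds": [{"__id": 1}]},
    {"__id": 1, "__type": "Compound", "id": "D_Glucose", "model": {"__id": 0}},
]
text = json.dumps(encoded)
```

Whatever encoding is chosen, the reader still has to turn the id references back into shared objects, which is why custom decoding code is needed either way.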

My thinking is that obj_tables.io.JsonReader needs to be reimplemented in MATLAB (and in any other language where obj_tables is used). This should take ~100 lines of code.

liebermeister commented 4 years ago

OK, I see. I will try to implement the JSON reader in MATLAB (without validation, i.e., assuming that the json file has been directly generated by obj_tables).

Do you think we need a JSON writer class for matlab? Probably not, right? The matlab -> python direction can always be accomplished through csv I guess.

jonrkarr commented 4 years ago

JsonReader

If you assume that the data has already been validated, then the JSON doesn't need to be further validated. I wouldn't recommend implementing table parsing or validation in MATLAB. Even with MATLAB object-oriented programming, this would likely take more than 10,000 lines of code.

JsonWriter

Do you want to programmatically create objects in MATLAB (i.e., programmatically generate species, reactions, rate laws, etc.)? If so, a JsonWriter class could be useful. Then you could create structs that represent species and reactions, convert them to JSON, and use ObjTables to write them to tables rather than writing the same information directly to tables with MATLAB (structs are easier to manipulate than tables).

However, making this really useful would require implementing MATLAB classes for each type of object (a class for each table and each relationship) rather than using structs. ObjTables uses Python metaprogramming to make such classes easy to generate. This is implemented in ~1,000 lines of code. Because MATLAB doesn't have metaprogramming, this would likely take at least an order of magnitude more code.

liebermeister commented 4 years ago

I think matlab classes are not necessary.

I would like to continue working with structs, for example one structure that I'm using already: a document is a struct containing the tables ("models"); each table is a struct containing the table rows ("objects"); and each table row is a struct containing the individual table cells ("attributes").

This structure could be directly converted into JSON, but it would not be the "asymmetric" JSON structure used by obj_tables, which I would not know how to generate reliably.

Another - rather pragmatic - option for matlab would be, given a csv file to be imported, to use python to validate the file, and if the file is correct, to simply read the original csv file into matlab (knowing that it is correct). Then I can easily generate my own matlab structs. For validating an SBtab document, I would export it to a csv and then run the python validator.

Do you think this makes sense?

jonrkarr commented 4 years ago

MATLAB classes aren't necessary, but it's easier to build a user-friendly interface with custom classes. Having such an interface is more important when there are relationships among objects that need to be managed and when there are more data types. This is less necessary for SBtab since it largely ignores relationships and only has a few data types.

Yes, the ObjTables validation could be accessed by (1) saving structs to CSV and (2) using ObjTables to validate the CSV.

jonrkarr commented 4 years ago

I looked at the JSON output again to remind myself how I designed it. It's pretty simple. It's a flat list of objects. The type of each object is indicated by the key __type. Each object is also assigned an internal id (__id), which is used to encode relationships among objects. This __id should be used to decode the relationships. To decode the relationships, you need to know which attributes represent relationships. This can be obtained from the schema.

If you wish to have a more hierarchical structure, you can group the objects based on __type. I can add an option to the Python code to return the objects grouped by type (as well as to read in objects encoded in this alternative encoding).
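Grouping by __type can also be done on the reading side in a few lines. A minimal sketch in Python, assuming the flat list has already been parsed into dictionaries; the function name `group_by_type` is made up and is not part of obj_tables.

```python
from collections import defaultdict


def group_by_type(objects):
    """Group a flat list of encoded objects into {type name: [objects]}."""
    grouped = defaultdict(list)
    for obj in objects:
        grouped[obj["__type"]].append(obj)
    return dict(grouped)


flat = [
    {"__id": 0, "__type": "Compound", "id": "D_Fructose_6_phosphate"},
    {"__id": 1, "__type": "Compound", "id": "D_Glucose"},
    {"__id": 5, "__type": "Model", "id": "e_coli"},
    {"__id": 6, "__type": "Reaction", "id": "PGI_R02740"},
]
grouped = group_by_type(flat)
```

Here `grouped` has one key per type, and the objects within each type keep the order they had in the flat list.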

liebermeister commented 4 years ago

No, that sounds good .. but it doesn't really match what I see. Here's the yaml code I obtain (which, I expect, has the same structure as the JSON code): it has several levels, with some information appearing multiple times. Can you check again if this is the structure you meant to design?

- __id: 0
  __type: Compound
  id: D_Fructose_6_phosphate
  identifiers: kegg.compound::C00085
  is_constant: true
  model:
    __id: 1
    __type: Model
    compounds:
    - __id: 0
      __type: Compound
      id: D_Fructose_6_phosphate
    - __id: 2
      __type: Compound
      id: D_Glucose
      identifiers: kegg.compound::C00031
      is_constant: true
      model:
        __id: 1
        __type: Model
        id: e_coli
      name: D-Glucose
    - __id: 3
      __type: Compound
      id: D_Glucose_6_phosphate
      identifiers: kegg.compound::C00092
      is_constant: true
      model:
        __id: 1
        __type: Model
        id: e_coli
      name: D-Glucose 6-phosphate
    - __id: 4
      __type: Compound
      id: Phosphoenolpyruvate
      identifiers: kegg.compound::C00074
      is_constant: true
      model:
        __id: 1
        __type: Model
        id: e_coli
      name: Phosphoenolpyruvate
    - __id: 5
      __type: Compound
      id: Pyruvate
      identifiers: kegg.compound::C00022
      is_constant: true
      model:
        __id: 1
        __type: Model
        id: e_coli
      name: Pyruvate
    id: e_coli
    name: ''
    reactions:
    - __id: 6
      __type: Reaction
      equation: -1.0 D_Glucose_6_phosphate; 1.0 D_Fructose_6_phosphate
      gene: PGI
      id: PGI_R02740
      identifiers: kegg.reaction::R02740
      is_reversible: true
      model:
        __id: 1
        __type: Model
        id: e_coli
      name: ''
    - __id: 7
      __type: Reaction
      equation: -1.0 D_Glucose; -1.0 Phosphoenolpyruvate; 1.0 Pyruvate; 1.0 D_Glucose_6_phosphate
      gene: PTS
      id: PTS_RPTSsy
      identifiers: kegg.reaction::RPTSsy
      is_reversible: true
      model:
        __id: 1
        __type: Model
        id: e_coli
      name: ''
  name: D-Fructose 6-phosphate
- __id: 2
  __type: Compound
  id: D_Glucose
- __id: 3
  __type: Compound
  id: D_Glucose_6_phosphate
- __id: 4
  __type: Compound
  id: Phosphoenolpyruvate
- __id: 5
  __type: Compound
  id: Pyruvate
- __id: 1
  __type: Model
  id: e_coli
- __id: 6
  __type: Reaction
  id: PGI_R02740
- __id: 7
  __type: Reaction
  id: PTS_RPTSsy

jonrkarr commented 4 years ago

How did you generate this?

jonrkarr commented 4 years ago

The information isn't repeated. What looks like repetition is the encoding of a relationship.

liebermeister commented 4 years ago

I generated this by

obj-tables convert schema.csv data.xlsx data.yml

with the data files from obj_tables/examples/biochemical_models

Ok, maybe it's necessary to repeat this, but I thought that writing

model:
  __id: 1
  __type: Model
  id: e_coli

multiple times is redundant, because, for example

model:
  __id: 1

should do the job. But my main worry is not the repetition, but the fact that a lot of information about model, compounds, and reactions appears inside the first compound element, and not where I would expect it - in the respective elements in the outer list.

jonrkarr commented 4 years ago

I flattened out the encoding.

Example:

- __id: 0
  __type: Compound
  id: D_Fructose_6_phosphate
  identifiers: kegg.compound::C00085
  is_constant: true
  model:
    __id: 5
    __type: Model
    id: e_coli
  name: D-Fructose 6-phosphate
- __id: 1
  __type: Compound
  id: D_Glucose
  identifiers: kegg.compound::C00031
  is_constant: true
  model:
    __id: 5
    __type: Model
    id: e_coli
  name: D-Glucose
- __id: 2
  __type: Compound
  id: D_Glucose_6_phosphate
  identifiers: kegg.compound::C00092
  is_constant: true
  model:
    __id: 5
    __type: Model
    id: e_coli
  name: D-Glucose 6-phosphate
- __id: 3
  __type: Compound
  id: Phosphoenolpyruvate
  identifiers: kegg.compound::C00074
  is_constant: true
  model:
    __id: 5
    __type: Model
    id: e_coli
  name: Phosphoenolpyruvate
- __id: 4
  __type: Compound
  id: Pyruvate
  identifiers: kegg.compound::C00022
  is_constant: true
  model:
    __id: 5
    __type: Model
    id: e_coli
  name: Pyruvate
- __id: 5
  __type: Model
  compounds:
  - __id: 0
    __type: Compound
    id: D_Fructose_6_phosphate
  - __id: 1
    __type: Compound
    id: D_Glucose
  - __id: 2
    __type: Compound
    id: D_Glucose_6_phosphate
  - __id: 3
    __type: Compound
    id: Phosphoenolpyruvate
  - __id: 4
    __type: Compound
    id: Pyruvate
  id: e_coli
  name: ''
  reactions:
  - __id: 6
    __type: Reaction
    id: PGI_R02740
  - __id: 7
    __type: Reaction
    id: PTS_RPTSsy
- __id: 6
  __type: Reaction
  equation: -1.0 D_Glucose_6_phosphate; 1.0 D_Fructose_6_phosphate
  gene: PGI
  id: PGI_R02740
  identifiers: kegg.reaction::R02740
  is_reversible: true
  model:
    __id: 5
    __type: Model
    id: e_coli
  name: ''
- __id: 7
  __type: Reaction
  equation: -1.0 D_Glucose; -1.0 Phosphoenolpyruvate; 1.0 Pyruvate; 1.0 D_Glucose_6_phosphate
  gene: PTS
  id: PTS_RPTSsy
  identifiers: kegg.reaction::RPTSsy
  is_reversible: true
  model:
    __id: 5
    __type: Model
    id: e_coli
  name: ''

jonrkarr commented 4 years ago

I think I can make this a bit more flexible so that the structure of the JSON/YML can be controlled by the user:

Make JsonWriter serialize any instance of Model, list of instances of Model, or dictionary which contains instances of Model.
Make JsonReader deserialize anything created by JsonWriter.

This will give the user control over the many semantically-equivalent ways of encoding the same data into JSON/YML.

Then I can use this to generate JSON with the structure you're expecting.

That said, I don't think it's necessary to make the JSON human-readable. The JSON just has to capture the semantic meaning of the objects and their relationships.

jonrkarr commented 4 years ago

The relationships to other objects are represented as dictionaries. E.g.,

{
  "__id": 7,
  "__type": "Reaction",
  "id": "PTS_RPTSsy"
}

The only information that must be included is __id. I have chosen to include __type and the primary attribute (e.g., id) because I think this makes it more readable. However, this isn't necessary.
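To make the role of __id concrete, here is a minimal decoding sketch in Python. It is not the code in obj_tables; it assumes that every dictionary carrying an `__id` key stands for an object and that dictionaries with the same `__id` stand for the same object, and it rebuilds plain dictionaries rather than typed instances. The function name `decode` is made up.

```python
def decode(encoded_objects):
    """Rebuild shared objects from a flat list of encoded dictionaries.

    Assumption: every dictionary that has an `__id` key stands for an object,
    and all dictionaries with the same `__id` stand for the same object.
    """
    objects = {}  # __id -> the single decoded object

    def resolve(value):
        if isinstance(value, list):
            return [resolve(item) for item in value]
        if isinstance(value, dict) and "__id" in value:
            # Reuse the object for this __id, creating it on first sight.
            obj = objects.setdefault(value["__id"], {})
            for key, item in value.items():
                obj[key] = resolve(item)
            return obj
        return value

    return [resolve(obj) for obj in encoded_objects]


encoded = [
    {"__id": 0, "__type": "Compound", "id": "D_Glucose",
     "model": {"__id": 1, "__type": "Model", "id": "e_coli"}},
    {"__id": 1, "__type": "Model", "id": "e_coli",
     "compounds": [{"__id": 0, "__type": "Compound", "id": "D_Glucose"}]},
]
compound, model = decode(encoded)
```

After decoding, `compound["model"]` and `model` are the same Python object, so the circular relationship is restored even though the JSON itself contains no cycles.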

liebermeister commented 4 years ago

Fantastic!!! That completely solves the problem I had.

Thank you!

jonrkarr commented 4 years ago

The default output will now be grouped by class as illustrated at the bottom of this comment. This structure is more similar to the way that objects of different types are represented by different tables.

FYI, the Python code which generates the JSON/YAML is more flexible than this:

It can encode any object that is composed of instances of Model, list, dict, and scalars (None, str, bool, int, float) into JSON/YAML.
This allows customization of how objects are encoded into JSON/YAML.
In particular, it allows extra information to be encoded into JSON/YAML. This information could be thought of as the analog of the comments in tables.

The Python code which decodes the JSON/YAML is equally flexible:

Regardless of how objects are encoded in JSON/YAML, they can be converted into tables. In that case, all other data (i.e., the "comments") is ignored.

Note, this flexibility is not extended to the command line program and REST API. The command line program and REST API can only encode data into JSON/YAML as illustrated below. I don't think it makes sense to extend this flexibility to the command line program and REST API; this would require users to specify the output format, which seems unnecessarily complicated. One thing that would be easy to extend to the command line program and REST API would be an option to encode the data into JSON/YAML as a flat list (as illustrated 3 comments above) rather than as a dictionary.

Compound:
- __id: 0
  __type: Compound
  id: D_Fructose_6_phosphate
  identifiers: kegg.compound::C00085
  is_constant: true
  model:
    __id: 5
    __type: Model
    id: e_coli
  name: D-Fructose 6-phosphate
- __id: 1
  __type: Compound
  id: D_Glucose
  identifiers: kegg.compound::C00031
  is_constant: true
  model:
    __id: 5
    __type: Model
    id: e_coli
  name: D-Glucose
- __id: 2
  __type: Compound
  id: D_Glucose_6_phosphate
  identifiers: kegg.compound::C00092
  is_constant: true
  model:
    __id: 5
    __type: Model
    id: e_coli
  name: D-Glucose 6-phosphate
- __id: 3
  __type: Compound
  id: Phosphoenolpyruvate
  identifiers: kegg.compound::C00074
  is_constant: true
  model:
    __id: 5
    __type: Model
    id: e_coli
  name: Phosphoenolpyruvate
- __id: 4
  __type: Compound
  id: Pyruvate
  identifiers: kegg.compound::C00022
  is_constant: true
  model:
    __id: 5
    __type: Model
    id: e_coli
  name: Pyruvate
Model:
- __id: 5
  __type: Model
  compounds:
  - __id: 0
    __type: Compound
    id: D_Fructose_6_phosphate
  - __id: 1
    __type: Compound
    id: D_Glucose
  - __id: 2
    __type: Compound
    id: D_Glucose_6_phosphate
  - __id: 3
    __type: Compound
    id: Phosphoenolpyruvate
  - __id: 4
    __type: Compound
    id: Pyruvate
  id: e_coli
  name: ''
  reactions:
  - __id: 6
    __type: Reaction
    id: PGI_R02740
  - __id: 7
    __type: Reaction
    id: PTS_RPTSsy
Reaction:
- __id: 6
  __type: Reaction
  equation: -1.0 D_Glucose_6_phosphate; 1.0 D_Fructose_6_phosphate
  gene: PGI
  id: PGI_R02740
  identifiers: kegg.reaction::R02740
  is_reversible: true
  model:
    __id: 5
    __type: Model
    id: e_coli
  name: ''
- __id: 7
  __type: Reaction
  equation: -1.0 D_Glucose; -1.0 Phosphoenolpyruvate; 1.0 Pyruvate; 1.0 D_Glucose_6_phosphate
  gene: PTS
  id: PTS_RPTSsy
  identifiers: kegg.reaction::RPTSsy
  is_reversible: true
  model:
    __id: 5
    __type: Model
    id: e_coli
  name: ''
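A consumer that prefers the flat list can recover it from the grouped form with one comprehension. A small sketch in Python, assuming the file has already been parsed into a dictionary of lists; the function name `flatten` is made up and is not an obj_tables option.

```python
def flatten(grouped):
    """Turn {type name: [objects]} into one flat list, keeping the order."""
    return [obj for objects in grouped.values() for obj in objects]


grouped = {
    "Compound": [{"__id": 0, "__type": "Compound", "id": "D_Fructose_6_phosphate"}],
    "Model": [{"__id": 5, "__type": "Model", "id": "e_coli"}],
    "Reaction": [{"__id": 6, "__type": "Reaction", "id": "PGI_R02740"}],
}
flat = flatten(grouped)
ids = [obj["__id"] for obj in flat]
```

No information is lost in either direction, because each object also carries its own __type.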

jonrkarr commented 4 years ago

Here's example code for decoding JSON in another language without access to the obj_tables Python package (< 50 lines): https://github.com/KarrLab/obj_tables/tree/master/examples/decode_json.py

Here's the unit test for the code: https://github.com/KarrLab/obj_tables/blob/master/tests/test_examples.py#L222

For MATLAB,

list should be replaced by a cell array
struct should be replaced by a custom class which behaves like a struct but also supports handles (references/pointers). The class below should work (copied from MATLAB Central), although my MATLAB knowledge is rusty.

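classdef hstruct < handle
  % Wraps a struct in a handle object, so that copies of the object
  % refer to the same underlying data (reference semantics).
  properties
    data  % the wrapped struct
  end

  methods
    function obj = hstruct(data)
      % Store the struct passed to the constructor.
      obj.data = data;
    end
  end
end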

KarrLab / obj_tables

json / yaml export - complicated data structure #102
