gafusion / omas

Ordered Multidimensional Array Structure
http://gafusion.github.io/omas
MIT License

Tool for converting to nested schemas #215

Closed ZedThree closed 1 year ago

ZedThree commented 1 year ago

I've written a small tool for converting the flat OMAS schemas to nested JSON-schema. This is useful for ingesting the schema in other tools (in my case, Invenio-RDM), or further converting to other ORMs or even Python dataclasses.

Here's a (very cut down) sample of what the output looks like:

"code": {
    "description": "Generic decription of the code-specific parameters for the code that has produced this IDS",
    "properties": {
        "commit": {
            "description": "Unique commit reference of software",
            "type": "string"
        },
        "library": {
            "description": "List of external libraries used by the code that has produced this IDS",
            "items": {
                "properties": {
                    "commit": {
                        "description": "Unique commit reference of software",
                        "type": "string"
                    },
                    "name": {
                        "description": "Name of software",
                        "type": "string"
                    }
                },
                "type": "object"
            },
            "type": "array"
        },
        "name": {
            "description": "Name of software generating IDS",
            "type": "string"
        },
        "version": {
            "description": "Unique version (tag) of software",
            "type": "string"
        }
    },
    "type": "object"
}

Would this be useful to add to OMAS, or is it too specialised?

smithsp commented 1 year ago

What is the difference between this and the json output omas currently produces? You say flat, but do you just mean indentation?

ZedThree commented 1 year ago

To be clear, this is only about the schema, not output.

Compare the above to the equivalent part of the gyrokinetics schema (again, heavily cut down for illustration):

 "gyrokinetics.code": {
  "data_type": "STRUCTURE",
  "documentation": "Generic decription of the code-specific parameters for the code that has produced this IDS"
 },
 "gyrokinetics.code.commit": {
  "data_type": "STR_0D",
  "documentation": "Unique commit reference of software"
 },
 "gyrokinetics.code.library": {
  "coordinates": [
   "1...N"
  ],
  "data_type": "STRUCT_ARRAY",
  "documentation": "List of external libraries used by the code that has produced this IDS"
 },
 "gyrokinetics.code.library[:].commit": {
  "data_type": "STR_0D",
  "documentation": "Unique commit reference of software"
 },
 "gyrokinetics.code.library[:].name": {
  "data_type": "STR_0D",
  "documentation": "Name of software"
 },
 "gyrokinetics.code.name": {
  "data_type": "STR_0D",
  "documentation": "Name of software generating IDS"
 },
 "gyrokinetics.code.version": {
  "data_type": "STR_0D",
  "documentation": "Unique version (tag) of software"
 }

This version is flat because every property is a top-level key. If you want to know what properties gyrokinetics.code can have, you need to iterate over all of the keys in the schema, pick out those that start with gyrokinetics.code., and take the next path segment of each one, to eventually find ["commit", "library", "name", "version"].
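
As a rough illustration of that search, here is a minimal sketch. The `direct_children` helper is a hypothetical name, not part of OMAS, and the dictionary only mirrors the cut-down sample above.

```python
# Cut-down flat schema mirroring the sample above (documentation omitted).
flat_schema = {
    "gyrokinetics.code": {"data_type": "STRUCTURE"},
    "gyrokinetics.code.commit": {"data_type": "STR_0D"},
    "gyrokinetics.code.library": {"data_type": "STRUCT_ARRAY"},
    "gyrokinetics.code.library[:].commit": {"data_type": "STR_0D"},
    "gyrokinetics.code.library[:].name": {"data_type": "STR_0D"},
    "gyrokinetics.code.name": {"data_type": "STR_0D"},
    "gyrokinetics.code.version": {"data_type": "STR_0D"},
}


def direct_children(flat, parent):
    """Return the names of the immediate children of `parent` (hypothetical helper)."""
    prefix = parent + "."
    children = set()
    for key in flat:  # every key has to be inspected
        if key.startswith(prefix):
            # Keep only the first path segment after the parent, minus any "[:]".
            first_segment = key[len(prefix):].split(".")[0]
            children.add(first_segment.replace("[:]", ""))
    return sorted(children)


print(direct_children(flat_schema, "gyrokinetics.code"))
# ['commit', 'library', 'name', 'version']
```

Every query scans the whole key list, which is the cost the nested form avoids.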

In the nested version in my first comment, they are exactly the keys of schema["code"]["properties"].
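
For a sense of what the conversion involves, here is a minimal sketch and not the actual tool. It assumes every parent key is present in the flat schema, and the `TYPE_MAP` entries cover only the three data types shown in the sample; the names `TYPE_MAP`, `make_node` and `flat_to_nested` are made up for this illustration.

```python
# Assumed mapping from the data types shown above to JSON-schema types.
TYPE_MAP = {"STRUCTURE": "object", "STRUCT_ARRAY": "array", "STR_0D": "string"}

flat_schema = {
    "gyrokinetics.code": {"data_type": "STRUCTURE", "documentation": "Code parameters"},
    "gyrokinetics.code.library": {"data_type": "STRUCT_ARRAY", "documentation": "Libraries"},
    "gyrokinetics.code.library[:].name": {"data_type": "STR_0D", "documentation": "Name of software"},
    "gyrokinetics.code.name": {"data_type": "STR_0D", "documentation": "Name of software generating IDS"},
}


def make_node(info):
    """Translate one flat entry into a JSON-schema-like node."""
    node = {}
    if "documentation" in info:
        node["description"] = info["documentation"]
    json_type = TYPE_MAP.get(info.get("data_type"))
    if json_type is not None:
        node["type"] = json_type
    if json_type == "object":
        node["properties"] = {}
    elif json_type == "array":
        node["items"] = {"type": "object", "properties": {}}
    return node


def flat_to_nested(flat, root):
    """Nest every key below `root`; assumes all parent keys exist in `flat`."""
    nested = {"type": "object", "properties": {}}
    prefix = root + "."
    # A parent key is a strict prefix of its children, so sorting visits it first.
    for key in sorted(flat):
        if not key.startswith(prefix):
            continue
        parts = key[len(prefix):].split(".")
        node = nested
        for part in parts[:-1]:
            node = node["properties"][part.replace("[:]", "")]
            if part.endswith("[:]"):
                node = node["items"]  # array elements live under "items"
        node["properties"][parts[-1]] = make_node(flat[key])
    return nested


schema = flat_to_nested(flat_schema, "gyrokinetics")
print(sorted(schema["properties"]["code"]["properties"]))
# ['library', 'name']
```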

JSON-schema is a standard format with tools in lots of different languages, so one could then use it to validate IMAS files in C++, for instance.
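
To make the validation idea concrete without pulling in a dependency, here is a hand-rolled sketch in Python. It is not a JSON-schema implementation: it understands only `type`, `properties` and `items`, and in practice an existing validator library would do this job. The `check` function and the sample values are hypothetical.

```python
code_schema = {
    "type": "object",
    "properties": {
        "commit": {"type": "string"},
        "library": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"commit": {"type": "string"}, "name": {"type": "string"}},
            },
        },
        "name": {"type": "string"},
        "version": {"type": "string"},
    },
}

PY_TYPES = {"object": dict, "array": list, "string": str}


def check(instance, schema, path="$"):
    """Return a list of error strings; handles only type/properties/items."""
    expected = schema.get("type")
    if expected in PY_TYPES and not isinstance(instance, PY_TYPES[expected]):
        return [f"{path}: expected {expected}"]
    errors = []
    if expected == "object":
        for name, sub in schema.get("properties", {}).items():
            if name in instance:  # absent properties are treated as optional
                errors += check(instance[name], sub, f"{path}.{name}")
    elif expected == "array":
        for i, item in enumerate(instance):
            errors += check(item, schema.get("items", {}), f"{path}[{i}]")
    return errors


good = {"name": "my_code", "library": [{"name": "some_lib"}]}
bad = {"name": 1, "library": [{"commit": 3}]}
print(check(good, code_schema))
print(check(bad, code_schema))
```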

smithsp commented 1 year ago

Thanks. I see the difference. @orso82 will have to comment on why he chose the flat schema for storage of the standard.

smithsp commented 1 year ago

Although, thinking more, the downside of the nested schema is that if you look for a variable R, you will find it in many different places, and you don't know which part of the structure you are in when looking at the text version of the nested json without manually picking your way up the hierarchy.

ZedThree commented 1 year ago

True, so you'd probably want to still keep the flattened version for documentation -- or generate more structured documentation, perhaps? Again, that would probably also be easier to generate from the nested version.
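
As a sketch of how the flattened listing could be regenerated from the nested form, the walk below carries the full path along, so each entry knows where it sits in the hierarchy. The `iter_paths` name and the `[:]` notation for array elements are assumptions chosen to match the flat sample above.

```python
code_schema = {
    "type": "object",
    "properties": {
        "commit": {"description": "Unique commit reference of software", "type": "string"},
        "library": {
            "description": "List of external libraries used by the code that has produced this IDS",
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "commit": {"description": "Unique commit reference of software", "type": "string"},
                    "name": {"description": "Name of software", "type": "string"},
                },
            },
        },
        "name": {"description": "Name of software generating IDS", "type": "string"},
        "version": {"description": "Unique version (tag) of software", "type": "string"},
    },
}


def iter_paths(schema, prefix):
    """Yield (full_path, description) for every property below `schema`."""
    for name, sub in schema.get("properties", {}).items():
        path = f"{prefix}.{name}"
        yield path, sub.get("description", "")
        if sub.get("type") == "array":
            # Descend into the element schema, marking the array level with "[:]".
            yield from iter_paths(sub.get("items", {}), path + "[:]")
        else:
            yield from iter_paths(sub, path)


for path, description in iter_paths(code_schema, "gyrokinetics.code"):
    print(f"{path}: {description}")
```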

orso82 commented 1 year ago

Sorry for the delay here. I got distracted and forgot to reply. The reasons OMAS stores the data dictionary as flattened JSON are:

  1. Faster/easier to find info (e.g. with the omas_info_node() function)
  2. Easier to read and edit (e.g. extending the data dictionary via add_extra_structures(); see https://gafusion.github.io/omas/auto_examples/extra_structures.html#sphx-glr-auto-examples-extra-structures-py)
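
Both points can be illustrated with plain dictionaries. This sketch does not call OMAS at all: the `lookup_nested` helper and the extra `licence` entry are hypothetical and exist only to contrast the two layouts.

```python
flat = {
    "gyrokinetics.code.library": {"data_type": "STRUCT_ARRAY"},
    "gyrokinetics.code.library[:].name": {
        "data_type": "STR_0D",
        "documentation": "Name of software",
    },
}

# 1. Finding info: one dictionary access by full path.
info = flat["gyrokinetics.code.library[:].name"]

# 2. Editing: an extension is a single new top-level key (hypothetical entry).
flat["gyrokinetics.code.library[:].licence"] = {
    "data_type": "STR_0D",
    "documentation": "Hypothetical extra field, for illustration only",
}

# The nested layout needs a walk through "properties" and "items" instead.
nested_code = {
    "type": "object",
    "properties": {
        "library": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"description": "Name of software", "type": "string"}
                },
            },
        }
    },
}


def lookup_nested(schema, path):
    """Follow a dotted path such as 'library[:].name' through a nested schema."""
    node = schema
    for part in path.split("."):
        node = node["properties"][part.replace("[:]", "")]
        if part.endswith("[:]"):
            node = node["items"]
    return node


print(info["documentation"])
print(lookup_nested(nested_code, "library[:].name")["description"])
```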

That said, @ZedThree if you have written up a function, we could make it available as a utility in OMAS. Feel free to open a pull request!
