ess-dmsc / kafka-to-nexus

(Mirror: moved to https://gitlab.esss.lu.se/ecdc/ess-dmsc/kafka-to-nexus) Writes NeXus files from experiment data streamed through Apache Kafka.

JSON schema format #17

Closed: matthew-d-jones closed this issue 6 years ago

matthew-d-jones commented 7 years ago

I have been looking at your example JSON NeXus file schemas and have a few questions and comments.

Here is an example format which would solve many of the potential problems, at the expense of greater verbosity:


{
  "name": "raw_data_1",
  "type": "group",
  "attributes": [{"NX_class": "NXentry"}],
  "children": [
    {
      "name": "instrument_1",
      "type": "group",
      "attributes": [{"NX_class": "NXinstrument"}],
      "children": [
        {
          "name": "detector_1",
          "type": "group",
          "attributes": [{"NX_class": "NXdetector"}],
          "children": [
            {
              "name": "pixel_shape",
              "type": "group",
              "attributes": [{"NX_class": "NXsolid_geometry"}],
              "children": [
                {
                  "name": "cylinder",
                  "type": "uint32",
                  "values": [0, 1, 2]
                },
                {
                  "name": "vertices",
                  "type": "float",
                  "values": [[-0.001, 0, 0], [0, 0.004, 0], [0.001, 0, 0]],
                  "attributes": [{"units": "m"}]
                }
              ]
            }
          ]
        },
        {
          "name": "detector_2",
          "type": "group",
          "attributes": [{"NX_class": "NXdetector"}],
          "children": [
            {
              "name": "pixel_shape",
              "type": "hardlink",
              "path": "/raw_data_1/instrument_1/detector_1/pixel_shape"
            }
          ]
        }
      ]
    }
  ]
}

For the attention of @zjttoefs @SkyToGround @dominikwerder @michele-brambilla

dominikwerder commented 7 years ago

I don't mind the extra verbosity and I like the explicit "children" key in your proposal. As far as I can see, the "attributes" key is functionally the same as the "NX_attributes" key in the current schema, and I like the shorter "attributes". You're right that so far attributes have only been possible to specify in the json for groups.
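
Just to illustrate the renaming (glossing over the rest of the group definition, which differs between the two), the same group attribute in the current schema and in your proposal:

{
  "name": "instrument_1",
  "NX_attributes": {"NX_class": "NXinstrument"}
}

{
  "name": "instrument_1",
  "type": "group",
  "attributes": [{"NX_class": "NXinstrument"}]
}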

So far, the shape and type of a dataset were inferred from the first data packet. This will change because we want to specify the type and shape of a dataset up front, without waiting for the first data for type deduction. I have therefore also added a "type" key.

After merging with your naming, a dataset could look something like this:

{
  "name": "a-simple-dataset",
  "attributes": [
    {"NX_class": "NXdetector"},
    {"this_will_be_a_double": 0.123},
    {"this_will_be_a_int64": 123}
  ],
  "type": {
    "space": "simple",
    "type": "uint64",
    "size": ["unlimited", 4, 2]
  }
}

If one also supplies values for the dataset, as in your example, one can leave the detailed "type" information out and let the shape be deduced:

{
  "name": "a-simple-dataset",
  "type": "float",
  "values": [
    [
      [1, 2],
      [2, 3],
      [3, 4],
      [4, 5]
    ],
    [
      [1, 2],
      [2, 3],
      [3, 4],
      [4, 5]
    ]
  ]
}

As you mentioned, ragged arrays would not be fully compatible. I would say that it is an error. Or is there a use case where one wants to partially fill a non-unlimited dimension?
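
For example, something like this, where the inner arrays have different lengths, is what I would treat as an error:

{
  "name": "a-ragged-dataset",
  "type": "float",
  "values": [
    [1, 2, 3],
    [4, 5]
  ]
}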

Also, when supplying values in the json without specifying the exact shape, did you think of this dataset as still being unlimited along the major dimension by default?

A scalar dataset could be simply

{
  "name": "a-scalar-dataset",
  "type": {
    "space": "scalar",
    "type": "double"
  },
  "values": 42.0
}

What is your opinion?

matthew-d-jones commented 7 years ago

ragged arrays would not be fully compatible. I would say that it is an error.

Exactly, I just wanted to point out that there is some error handling to do, due to the differences.

when supplying values in the json without specifying the exact shape, did you think of this dataset as still being unlimited along the major dimension by default?

I think my expectation would be for the shape to match the provided values by default.

I'm not sure what the role of space is. Is there a difference between "size": [1] and "space": "scalar"?

I put together a small Python script which outputs a JSON schema for a nexusformat tree. The tree can come from an existing file or be built using nexusformat: https://github.com/ess-dmsc/nexus-json

SkyToGround commented 7 years ago

File writing meta data can be logically split into three groups:

  • Group attributes, e.g. "NX_class". These should probably not be handled by the individual writing modules/plugins directly.
  • Data set attributes, e.g. "units". These will need to be handled by the file writing module/plugin.
  • Settings used by the file writing module/plugin itself, e.g. Kafka topic and chunk size.

The last two types of meta data will be handled by the individual file writing module and thus I do not see a problem with grouping them. I do not think that individual data sets should be specified in the JSON structure as the file writing module/plugin will sometimes need to create additional "helper" data sets (as described in the NeXus documentation). Thus I suggest a structure something like the following:

{
    "name": "some_data",
    "type": "group",
    "attributes": {
        "NX_class": "NXlog",
        "note": "Some note"
    },
    "children": [
        {
            "name": "name_is_not_used",
            "attributes": {
                "attributes_here_are": "not_used"
            },
            "type": "file_writer",
            "settings": {
                "topic": "event_data",
                "data_name": "some_PV_or_detector_name",
                "module": "f142_v2",
                "shape": [
                    1
                ],
                "chunk_size": 50,
                "data_unit": "v",
                "time_unit": "s",
                "cue_time_unit": "ISO8601"
            }
        }
    ]
}

In this example, "topic", "module" and possibly "data_name" are extracted and used by the main kafka-to-nexus code while the rest of the keys are passed on directly to the file writing module when allocating it.
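
To make that split explicit, here is the "settings" block from above annotated accordingly (where exactly "data_name" belongs is still open):

    "settings": {
        "topic": "event_data",                     // used by the main kafka-to-nexus code
        "data_name": "some_PV_or_detector_name",   // possibly also used by the main code
        "module": "f142_v2",                       // used by the main code to select the file writing module
        "shape": [1],                              // passed on to the file writing module
        "chunk_size": 50,                          // passed on to the file writing module
        "data_unit": "v",                          // passed on to the file writing module
        "time_unit": "s",                          // passed on to the file writing module
        "cue_time_unit": "ISO8601"                 // passed on to the file writing module
    }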

SkyToGround commented 7 years ago

One more thing: My opinion is that the shape of the arrays as sent to the file writer and not the shape of the arrays in the HDF5 file should be the one defined under "shape".

matthew-d-jones commented 7 years ago

@SkyToGround If I understand correctly, your main point here is that it should be clear in the schema what is part of the static structure which is known at the start of a run and what plugins are responsible for writing from streamed data.

I do not think that individual data sets should be specified in the JSON structure as the file writing module/plugin will some times need to create additional "helper" data sets

Could you elaborate on what you mean by helper datasets? Also, do you mean that you think we shouldn't specify datasets in groups which a plugin is responsible for, or in general anywhere in the schema?

My opinion is that the shape of the arrays as sent to the file writer and not the shape of the arrays in the HDF5 file should be the one defined under "shape".

I thought that this definition of shape is a requirement of the file writing process, given that we are writing across multiple file systems. Maybe Dominik can comment on that.

dominikwerder commented 7 years ago

@matthew-d-jones https://github.com/ess-dmsc/kafka-to-nexus/issues/17#issuecomment-317753754

I think my expectation would be for the shape to match the provided values by default.

I like that as well.

I'm not sure what the role of space is. Is there a difference between "size": [1] and "space": "scalar"?

Yes, HDF5 distinguishes them. There is even a null dataspace which is always empty, but we can add that in the future if the need arises.
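
As a sketch with the keys from above, these two are not the same thing in HDF5; the first creates a scalar dataspace, the second a one-dimensional simple dataspace with a single element:

{
  "name": "a-scalar-dataset",
  "type": {"space": "scalar", "type": "double"},
  "values": 42.0
}

{
  "name": "a-one-element-dataset",
  "type": {"space": "simple", "type": "double", "size": [1]},
  "values": [42.0]
}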

@SkyToGround https://github.com/ess-dmsc/kafka-to-nexus/issues/17#issuecomment-317786957

File writing meta data can be logically split into three groups: Group attributes, e.g. "NX_class". These should probably not be handled by the individual writing modules/plugins directly.

This is how it works today.

Data set attributes, e.g. "units". These will need to be handled by the file writing module/plugin.

This is also how it is today. But if we adopt the new schema, the json command would be free to also set those. The writer module is of course still free to overwrite them. Do you see the need for a blacklist of attribute names that the json command cannot set?

I think that the json command should not explicitly define the entire HDF structure in terms of groups, datasets and attributes. But it must contain enough information such that the file writer modules can create their datasets and attributes without waiting for the first data to arrive.

So, merging with the above and with the discussion at https://jira.esss.lu.se/browse/DM-416, I arrive at something like this:

{
  "nexus_structure": {
    "children": [

      {
        "name": "some_group",
        "type": "group",
        "attributes": [{"NX_class": "NXinstrument"}],
        "children": [

          {
            "name": "A simple static dataset, not written by a module",

            // This dataspace is created by the file writer and filled with the
            // given values.
            "type": {
              "space": "simple",
              "type": "uint64",
              "size": ["unlimited", 4, 2]
            },
            "values": [
              [ [1, 2], [2, 3], [3, 4], [4, 5] ],
              [ [1, 2], [2, 3], [3, 4], [4, 5] ]
            ],

            // These attributes are created by the file writer.
            "attributes": [
              {"NX_class": "NXdetector"},
              {"this_will_be_a_double": 0.123},
              {"this_will_be_a_int64": 123}
            ]
          },

          {
            "type": "stream",

            // NOTE:
            // Because this is type "stream", the filewriter itself will not
            // create any of these `attributes` or a dataset. The file writer
            // core will expose helper functions such that the file writer
            // module can create those attributes easily as if done by the file
            // writer core itself.
            // The file writer module gets, like it is today, the current HDF
            // group in this tree.
            // The module is free, like it is today, to create one or more
            // datasets, attributes, other groups, ...
            // The module gets also, like today, this json object in which this
            // comment stands and can do whatever it wants with it, e.g. create
            // the attributes on its 'main' dataset (or not).

            "attributes": [
              {"this_will_be_a_double": 0.123},
              {"this_will_be_a_int64": 123}
            ],
            "type": {
              "space": "simple",
              "type": "uint64",
              "size": ["unlimited", 4, 2]
            },
            "stream": {
              "topic": "some.topic.with.multiple.sources",
              "source": "for_example_motor01",
              "arbitrary_other_options_here": "can be used by writer module"
            }
          }

        ]
      }

    ]
  }
}

One more thing: My opinion is that the shape of the arrays as sent to the file writer and not the shape of the arrays in the HDF5 file should be the one defined under "shape".

There is no need to define the shape of the arrays which are sent to the file writer in the flat buffers.

On the other hand, we need to define the shape of the HDF structure up front. So that is what type and size are for.

dominikwerder commented 7 years ago

@matthew-d-jones

  • Would we send the geometry description as part of this schema? For some instruments it is likely to exceed what we can put in a single message.

I think it's worth trying to keep the command messages small. Would it be possible to also send that data via a stream?
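
As a rough sketch (topic and source names here are only placeholders), the command message would then contain just a stream placeholder and the geometry itself would arrive on its own topic:

{
  "name": "pixel_shape",
  "type": "stream",
  "stream": {
    "topic": "instrument.geometry",
    "source": "detector_1_pixel_shape"
  }
}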

matthew-d-jones commented 7 years ago

@dominikwerder Your example looks good. Is there an example of a plugin in the codebase yet?

Yes, sending the geometry information in a separate topic makes the most sense. Your comments have clarified that a plugin can write multiple groups, so it should work fine.

dominikwerder commented 7 years ago

@matthew-d-jones Yes, the code which is responsible for writing some flatbuffer schema id is what I call a plugin (or module). An example where the writer for ev42 uses some options from the main configuration file is shown here: https://github.com/ess-dmsc/kafka-to-nexus/blob/master/src/schemas/ev42/ev42_rw.cxx#L86

In that same spot, I could also access FBSchemaWriter::config_stream for the stream-specific configuration options that came via the json command.

SkyToGround commented 7 years ago

@SkyToGround: I do not think that individual data sets should be specified in the JSON structure as the file writing module/plugin will sometimes need to create additional "helper" data sets

@matthew-d-jones: Could you elaborate on what you mean by helper datasets? Also, do you mean that you think we shouldn't specify datasets in groups which a plugin is responsible for, or in general anywhere in the schema?

As an example, NXlog and NXevent_data both define a number of different data sets, all of which are optional. Thus some file writing module/plugin might implement raw_value whereas another might implement value in an NXlog.
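
So, as a rough sketch (topic and source names here are only placeholders), the JSON would only place a stream entry under the NXlog group, and the f142 module would then decide whether it creates value, raw_value or both:

    {
        "name": "some_data",
        "type": "group",
        "attributes": {
            "NX_class": "NXlog"
        },
        "children": [
            {
                "type": "stream",
                "stream": {
                    "topic": "sample_environment",
                    "source": "some_PV_name",
                    "module": "f142_v2"
                }
            }
        ]
    }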

@SkyToGround: My opinion is that the shape of the arrays as sent to the file writer and not the shape of the arrays in the HDF5 file should be the one defined under "shape".

@matthew-d-jones: I thought that this definition of shape is a requirement of the file writing process, given that we are writing across multiple file systems. Maybe Dominik can comment on that.

@dominikwerder: There is no need to define the shape of the arrays which are sent to the file writer in the flat buffers. On the other hand, we need to define the shape of the HDF structure up front. So that is what type and size are for.

Yes, the shape of data sets for non-scalar data has to be known when creating the HDF5 file in order for SWMR functionality to work. However, the final shape can (and should) be determined by the file writing module/plugin. Forcing whoever creates the JSON structure to define which dimension is the "unlimited" one is more error-prone and user-unfriendly, and requires the author of the file writing module/plugin to write more code to account for different positions of the "unlimited" dimension.

@dominikwerder The example you gave has a few ambiguities. Modifying it slightly yields:

{
    "nexus_structure": {
        "children": [
            {
                "name": "some_group",
                "type": "group",
                "attributes": {
                    "NX_class": "NXinstrument"
                },
                "children": [
                    {
                        "type": "dataset",
                        "name": "A simple static dataset, not written by a module",
                        "dataset": {
                            "space": "simple",
                            "type": "uint64",
                            "size": ["unlimited", 4]
                        },
                        "values": [
                            [1, 2],
                            [2, 3],
                            [3, 4],
                            [4, 5]
                        ],
                        "attributes": {
                            "NX_class": "NXdetector",
                            "this_will_be_a_double": 0.123,
                            "this_will_be_a_int64": 123
                        }
                    },
                    {
                        "type": "stream",
                        "attributes": {
                            "this_will_be_a_double": 0.123,
                            "this_will_be_a_int64": 123
                        },
                        "dataset": {
                            "space": "simple",
                            "type": "uint64",
                            "size": [4, 2]
                        },
                        "stream": {
                            "topic": "some.topic.with.multiple.sources",
                            "source": "for_example_motor01",
                            "module": "ev42_v1",
                            "arbitrary_other_options_here": "can be used by writer module"
                        }
                    }
                ]
            }
        ]
    }
}

Some notes on this:

  • Do we really need to use a list of dictionaries when defining attributes? HDF5 requires that attributes have unique keys (names) anyways so why not simply use a dictionary of attributes?
  • A static data set (one which is not modified after its initial creation) should probably have its own "type" even though it probably could be inferred automatically. I have used the type "dataset" in this example.
  • As we are NOT actually having the "core kafka-to-nexus" code create any data sets in the "stream" type, I think that it is a bad interface to locate the "dataset" dictionary where it is. I think it should be moved to the "stream" dictionary though I have not yet done so here. See my previous comment.

dominikwerder commented 7 years ago

@dominikwerder: There is no need to define the shape of the arrays which are sent to the file writer in the flat buffers. On the other hand, we need to define the shape of the HDF structure up front. So that is what type and size are for.

@SkyToGround: Yes, the shape of data sets for non-scalar data has to be known when creating the HDF5 file in order for SWMR functionality to work. However, the final shape can (and should) be determined by the file writing module/plugin. Forcing whoever creates the JSON structure to define which dimension is the "unlimited" one is more error-prone and user-unfriendly, and requires the author of the file writing module/plugin to write more code to account for different positions of the "unlimited" dimension.

Well, I did say in the // NOTE: comment in the proposal that the file writer will not create anything for the "type": "stream" children, so yes, the file writer module does have full control and responsibility. The shape given in the json can be ignored by the author of the file writer plugin, or it can be given in a different way in the json.

Do we really need to use a list of dictionaries when defining attributes? HDF5 requires that attributes have unique keys (names) anyways so why not simply use a dictionary of attributes?

Currently, we do use a dictionary of attributes, but I think @matthew-d-jones liked the list form better.

A static data set (one which is not modified after its initial creation) should probably have its own "type" even though it probably could be inferred automatically. I have used the type "dataset" in this example.

It did already have its own type: It was a json-object with a member space = simple/scalar.

As we are NOT actually having the "core kafka-to-nexus" code create any data sets in the "stream" type, I think that it is a bad interface to locate the "dataset" dictionary where it is. I think it should be moved to the "stream" dictionary though I have not yet done so here. See my previous comment.

I thought about that first, but then I liked the other way better because of its similarity with the non-module-created datasets; I have no strong opinion here.

matthew-d-jones commented 7 years ago

I think I went for a list for attributes because I was unsure if we need multiple fields in each attribute. If we don't, the dictionary is fine.

SkyToGround commented 7 years ago

A static data set (one which is not modified after its initial creation) should probably have its own "type" even though it probably could be inferred automatically. I have used the type "dataset" in this example.

It did already have its own type: It was a json-object with a member space = simple/scalar.

This means that for some types the value of "type" is a string whereas for other types the value is a dictionary, which in my opinion is an inconsistent design. It is better to explicitly state what type a type is. It also ties into the problem of having multiple identical keys in a dictionary.
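
To spell out the inconsistency, in the earlier proposal these two children both use the key "type", but with completely different kinds of values:

    {
        "name": "some_group",
        "type": "group"
    }

    {
        "name": "a-simple-dataset",
        "type": {
            "space": "simple",
            "type": "uint64",
            "size": ["unlimited", 4, 2]
        }
    }

With an explicit "type": "dataset" and the shape information under a separate "dataset" key, the value of "type" is always a string.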

I think I went for a list for attributes because I was unsure if we need multiple fields in each attribute. If we don't, the dictionary is fine.

I am not convinced that we need to implement the ability to use complex attributes, but if we wanted to, it would still be possible. See the following simple example:

"attributes": {
    "this_will_be_a_double": {"value":3.15, "type":"float64"},
    "this_will_be_an_uint8": {"value":42, "type":"uint8"}
}