DiamondLightSource / httomo

High-throughput tomography pipeline
https://diamondlightsource.github.io/httomo/
Other
7 stars 4 forks source link

YAML with a colon but no space is valid YAML, but produces a string instead of a map #524

Open yousefmoazzam opened 3 weeks ago

yousefmoazzam commented 3 weeks ago

The problem

In httomo pipeline files mappings are used fairly frequently, such as when defining the parameters and parameter values for a method:

- method: remove_stripe_based_sorting
  module_path: httomolibgpu.prep.stripe
  parameters:
    size: 11
    dim: 1

From some unexpected behaviour when running httomo at a beamline, investigations have led to the discovery that a mapping in YAML must have either:

For the first case, if there is no space after the colon character, then the text following is parsed as a python str, rather than as a python dict.

This means that if a user were to miss a space after a colon character, then this can cause the YAML to be parsed differently to how it was intended, and thus cause runtime errors.

Explanation

For example, for the following YAML (note the spaces after the colon characters after start and stop):

- method: standard_tomo
  module_path: httomo.data.hdf.loaders
  parameters:
    data_path: entry1/tomo_entry/data/data
    image_key_path: entry1/tomo_entry/instrument/detector/image_key
    rotation_angles:
      data_path: /entry1/tomo_entry/data/rotation_angle
    preview:
      detector_y:
        start: 100
        stop: 120

the detector_y field and its value parses to the following python data structure (note that the value of the detector_y key in the dict is a dict):

{
  'detector_y': {'start': 100, 'stop': 120}
}

However, if the spaces after the two colon characters are omitted (the change in syntax highlighting compared to the previous example is suggestive of there being a change in meaning of the value):

- method: standard_tomo
  module_path: httomo.data.hdf.loaders
  parameters:
    data_path: entry1/tomo_entry/data/data
    image_key_path: entry1/tomo_entry/instrument/detector/image_key
    rotation_angles:
      data_path: /entry1/tomo_entry/data/rotation_angle
    preview:
      detector_y:
        start:100
        stop:120

the detector_y field and its value parses to the following python data structure (notice that the value of the detector_y key in the dict is a string):

{
  'detector_y': 'start:100 stop:120'
}

I haven't found anything that explicitly states that without a space after a colon, the value is interpreted as a string. The closest I can find is in the YAML spec here where it states that a whitespace character needs to follow a colon in order to define a mapping (but it doesn't say what happens if you omit the colon, it may be somewhere else in the YAML spec, but it's a long document to sift through...).

What can be done?

I'm not sure yet. I took at a look to see if any YAML linters would allow the catching of if a colon was missing a space after it, such as yamllint, which has some options for configuring the rules when encountering a colon character.

However, it seems like it's not possible to catch this (partly due to it being valid YAML to have a missing space after a colon), see https://github.com/adrienverge/yamllint/issues/563#issuecomment-1508763231 and the rest of that issue.

In particular, in that comment, it's suggested that conversion to JSON, followed by defining a JSON schema, could solve the problem.

With the development of the web GUI moving forward and it involving this very idea of using a JSON schema to perform validation of parameters in pipeline files, this may well be an issue that could be resolved in that larger discussion.

dkazanc commented 3 weeks ago

Thanks @yousefmoazzam for looking into this. It looks like we're too vulnerable by exposing just YAML to the users. The major issue is that we cannot pick up most of the problems before the run. To me it feels like we need to start our developing on JSON schema as soon as possible and it should be in place before HTTomo in production. Without proper parameters validation, we should expect lots of problems related to YAML errors from inexperienced users.