Open yousefmoazzam opened 3 weeks ago
Thanks @yousefmoazzam for looking into this. It looks like we're too vulnerable by exposing just YAML to the users. The major issue is that we cannot pick up most of the problems before the run. To me it feels like we need to start developing the JSON schema as soon as possible, and it should be in place before HTTomo goes into production. Without proper parameter validation, we should expect lots of problems related to YAML errors from inexperienced users.
The problem
In httomo pipeline files, mappings are used fairly frequently, such as when defining the parameters and parameter values for a method:
From some unexpected behaviour when running httomo at a beamline, investigations have led to the discovery that a mapping in YAML must have either:
For the first case, if there is no space after the colon character, then the text following is parsed as a Python `str`, rather than as a Python `dict`.

This means that if a user were to miss a space after a colon character, then this can cause the YAML to be parsed differently to how it was intended, and thus cause runtime errors.
Explanation
For example, for the following YAML (note the spaces after the colon characters after `start` and `stop`):

the `detector_y` field and its value parse to the following Python data structure (note that the value of the `detector_y` key in the dict is a `dict`):

However, if the spaces after the two colon characters are omitted (the change in syntax highlighting compared to the previous example is suggestive of there being a change in meaning of the value):
the `detector_y` field and its value parse to the following Python data structure (notice that the value of the `detector_y` key in the dict is a string):

I haven't found anything that explicitly states that, without a space after a colon, the value is interpreted as a string. The closest I can find is in the YAML spec, which states that a whitespace character needs to follow a colon in order to define a mapping. It doesn't say what happens if you omit the space; that may be covered somewhere else in the YAML spec, but it's a long document to sift through.
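The difference described above can be reproduced with a short script. This is only a sketch: it assumes PyYAML (`yaml.safe_load`) as the parser, and the block-style snippet and the values `30` and `60` are illustrative rather than taken from a real pipeline file.

```python
import yaml  # PyYAML (third-party), assumed here as the YAML parser

# Illustrative snippet, not taken from a real pipeline file.
WITH_SPACES = """\
detector_y:
  start: 30
  stop: 60
"""

# The same snippet with the space after each inner colon removed.
WITHOUT_SPACES = """\
detector_y:
  start:30
  stop:60
"""

with_spaces = yaml.safe_load(WITH_SPACES)
without_spaces = yaml.safe_load(WITHOUT_SPACES)

# With the spaces, the value of detector_y is a nested mapping.
print(type(with_spaces["detector_y"]).__name__)     # dict
# Without them, the value is no longer a mapping but a plain string.
print(type(without_spaces["detector_y"]).__name__)  # str
```

A plausible reading of the second result is that, since a colon only acts as a key/value separator when whitespace follows it, the two indented lines are taken as one multi-line plain scalar and so end up as a single string. Both snippets load without any error, which is why the mistake only shows up later at runtime.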
What can be done?
I'm not sure yet. I took a look to see whether any YAML linters could catch a colon that is missing a space after it. One candidate is `yamllint`, which has some options for configuring the rules applied when encountering a colon character.

However, it seems like it's not possible to catch this (partly because a missing space after a colon is valid YAML); see https://github.com/adrienverge/yamllint/issues/563#issuecomment-1508763231 and the rest of that issue.
In particular, in that comment, it's suggested that conversion to JSON, followed by defining a JSON schema, could solve the problem.
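To make the validation idea concrete, here is a minimal sketch of checking the parsed data before a run. It uses only the standard library and a hand-written check instead of a real JSON schema validator. The function name `check_detector_y`, the expected shape (a mapping with integer `start` and `stop`) and the example values are all assumptions for illustration.

```python
import json

# Hypothetical expectation for this sketch: `detector_y` must be a mapping
# whose `start` and `stop` entries are integers.
EXPECTED_INT_KEYS = ("start", "stop")


def check_detector_y(params):
    """Return a list of human-readable problems found in `params`."""
    value = params.get("detector_y")
    if not isinstance(value, dict):
        # This is the failure mode from this issue: a string instead of a mapping.
        return [
            f"detector_y should be a mapping, got {type(value).__name__}: {value!r}"
        ]
    problems = []
    for key in EXPECTED_INT_KEYS:
        entry = value.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(entry, int) or isinstance(entry, bool):
            problems.append(f"detector_y.{key} should be an integer, got {entry!r}")
    return problems


# JSON stand-ins for the parsed pipeline data: the intended nested mapping ...
good = json.loads('{"detector_y": {"start": 30, "stop": 60}}')
# ... and an illustrative string value of the kind a missing space produces.
bad = json.loads('{"detector_y": "start:30 stop:60"}')

print(check_detector_y(good))  # []
print(check_detector_y(bad))
```

A real JSON schema would express the same constraints declaratively (an object type with required integer properties), so the missing-space case would be reported as a type error before the pipeline starts rather than as a runtime failure.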
Development of the web GUI is moving forward, and it involves this very idea of using a JSON schema to validate parameters in pipeline files, so this issue may well be resolved as part of that larger discussion.