crdoconnor / strictyaml

Type-safe YAML parser and validator.
https://hitchdev.com/strictyaml/
MIT License

Candidate for the 'why not X?' list #136

Open wolandscat opened 3 years ago

wolandscat commented 3 years ago

ODIN - Object Data Instance Notation. See here: https://github.com/openEHR/odin/blob/master/README.md The main spec is linked near the top of that page. Might give you some ideas on one of the reasons we don't use any of the existing languages: the need for a more comprehensive set of leaf types, including Intervals of ordered types (i.e. Interval<Integer>, Interval<Date> etc), and distinct Date, Time, DateTime and Duration types (with ISO8601 serialisations); also a Uri type. These types are ubiquitous in healthcare and other domains, and are horribly verbose in JSON and fairly bad in YAML. I do like the YAML approach and we use it a bit. I think however you are wrong on anchors and refs! Hope this is of interest.

crdoconnor commented 3 years ago

StrictYAML (and JSON, with the right code) would be compatible with your set of leaf types by embedding custom validation for strings. It was always my intention to make this type of thing (e.g. intervals) easy to work with by writing a small amount of code.

I actually deliberately avoided going into too much detail on scalar types simply because there's such an enormous variety out there, so many thousands of different edge cases, everybody's needs are different and parsers of these strings can be wildly complicated in and of themselves.

I could have included a country code type, for instance (GB, FR, NO) - enough people certainly want it - but even that embeds a whole set of issues, like which set of country codes you use, what happens when South Sudan breaks away from Sudan, or whether the USSR should be included. This type is also pretty ubiquitous, but I still wanted to decouple the complication of handling things like this from the complication of just parsing markup with lists and mappings and strings. I don't know a lot about medical coded terms but I would imagine there are similar issues for those.

Instead my aim was to easily allow the creation of custom validators for all the weird and wonderful things you could parse from a string (your intervals or coded terms being a great example of that) and to decouple that from what I was trying to do - maintain a clear, easy to read and parse syntax around them. StrictYAML's core specification (not ready yet, but on its way) will not try to encompass ISO8601 but will instead define how parsing can interoperate with it.
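As a rough illustration of keeping that kind of decision outside the parser, a validator can take the accepted set of codes as a parameter. This is a plain Python sketch, not StrictYAML API: `make_code_validator` and the three-code set are hypothetical names and data chosen for the example.

```python
def make_code_validator(allowed_codes):
    """Build a validator for a caller-supplied set of country codes."""
    allowed = frozenset(allowed_codes)

    def validate(text):
        # Normalise case and whitespace, then check against the chosen set.
        code = text.strip().upper()
        if code not in allowed:
            raise ValueError(f"unknown country code {text!r}")
        return code

    return validate


# Which codes count as valid is the caller's decision, not the parser's.
validate_country = make_code_validator({"GB", "FR", "NO"})
```

Whether a historical code such as the USSR's belongs in the set then becomes a one-line change for whoever owns the schema.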

I think not trying to go into too much detail on these things in the spec, and allowing just string, int and boolean, was a clever move by JSON that I wanted to replicate. YAML, TOML and, unfortunately, ODIN fell into the same trap, I think, by creating a set of "all things for all people" types.

Your post has generated a few ideas (I've started thinking about how to manage standardized libraries of scalar types), so thanks for making it and thanks for introducing me to ODIN.

crdoconnor commented 3 years ago

> I think however you are wrong on anchors and refs!

How so?

wolandscat commented 3 years ago

On the main question of representing more complex leaf types, I would be interested to see a form of YAML that can represent things like intervals of date; indeed, even an interval of integer in a single line would be very helpful. For us an interval is |x..y|. We can do List<Interval<Integer>> like this: attr1: <|0..60|, |60..90|, |90..110|, |110..130|, |>130|>. That's about 40 lines to do in JSON. A lot more examples here (just look for the || parts between the {} - it's within another syntax).

On the question of anchors and refs - you need them. Just replicating copies of content is not an option for a serious serialisation format - it's a maintenance issue: if you want to make a change to re-used content, if it's only there once, that's where you make the change. If it's copied, how do you tell that the intention is that they are all the same, or perhaps some copies are meant to be distinct instances? There's no way to tell. Quite apart from the problem of actually having to apply a change N times.

Don't get me wrong - you have to follow your own design intentions with your format; I'm just providing a few extra things to think about - even if it ends up being something you explain 'why we didn't do X'.

wolandscat commented 3 years ago

BTW I meant to say that one of our design goals was to be able to have a parser recognise a lot of leaf types without any type information, just by the syntax. In ODIN, the list of leaf types recognised as atoms (not requiring multiple lines) is:

That is very useful - once you get coverage of all those types, file readability greatly improves and text size vastly reduces.

crdoconnor commented 3 years ago

> BTW I meant to say that one of our design goals was to be able to have a parser recognise a lot of leaf types without any type information, just by the syntax.

Ah ok. That's exactly the opposite of my goal - to extract type information out of markup and leave that up to the schema. This was for three reasons:

> On the main question of representing more complex leaf types, I would be interested to see a form of YAML that can represent things like intervals of date, indeed even interval of integer in a single line would be very helpful. For us an interval is |x..y|. We can do List<Interval<Integer>> like this: attr1: <|0..60|, |60..90|, |90..110|, |110..130|, |>130|>. That's about 40 lines to do in JSON.

This can be done in StrictYAML with a custom validator.

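from strictyaml import CommaSeparated, Int, Map, load

# Interval is a custom validator you would write yourself; it is not part of StrictYAML.
withintervals = """x: 0..60, 60..90, 90..110"""

sy = load(withintervals, Map({"x": CommaSeparated(Interval(Int()))}))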

Interval would have to be coded up, of course, but that's pretty easy. Also, you'd need to make a decision about what this parses to (there's no native interval type in Python) and write some code to create that object and do the reverse - take the object and turn it into text.
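As a rough sketch of that "what does this parse to" decision, the snippet below uses a small frozen dataclass plus helpers for the round trip between text and objects. The names `IntInterval`, `parse_interval`, `parse_intervals` and `format_intervals` are hypothetical, the `lower..upper` text form is assumed from the example above, and none of this is StrictYAML API; in StrictYAML the same logic would sit inside a custom scalar validator.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class IntInterval:
    """Hypothetical value object: a closed interval of integers."""
    lower: int
    upper: int

    def __post_init__(self):
        if self.lower > self.upper:
            raise ValueError(f"lower bound {self.lower} exceeds upper bound {self.upper}")


def parse_interval(text: str) -> IntInterval:
    """Parse text such as '0..60' into an IntInterval."""
    lower, sep, upper = text.strip().partition("..")
    if not sep:
        raise ValueError(f"expected 'lower..upper', found {text!r}")
    return IntInterval(int(lower), int(upper))


def parse_intervals(text: str) -> List[IntInterval]:
    """Parse a comma-separated run of intervals."""
    return [parse_interval(part) for part in text.split(",")]


def format_intervals(intervals: List[IntInterval]) -> str:
    """Reverse direction: turn the objects back into text."""
    return ", ".join(f"{i.lower}..{i.upper}" for i in intervals)


intervals = parse_intervals("0..60, 60..90, 90..110")
round_tripped = format_intervals(intervals)
```

A single-dot slip such as `90.110` is rejected with a `ValueError` here rather than being silently read as a number.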

Similarly, you could build something atop JSON to parse this:

{"x": "0..60, 60..90, 90..110"}

In much the same way, just as you would have to do with an ISO8601 date in JSON.
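A minimal sketch of that layering on the JSON side, using only the standard library: the JSON parser yields plain strings, and a second pass turns them into typed values. The key `when`, the helper `to_interval` and the tuple representation of an interval are assumptions made for illustration.

```python
import json
from datetime import date

raw = '{"x": "0..60, 60..90, 90..110", "when": "2021-03-01"}'
doc = json.loads(raw)  # JSON itself only sees two strings here


def to_interval(text):
    """Turn 'lower..upper' into an (int, int) tuple."""
    lower, sep, upper = text.strip().partition("..")
    if not sep:
        raise ValueError(f"expected 'lower..upper', found {text!r}")
    return int(lower), int(upper)


# Second pass: apply the string-level parsers that the schema calls for.
typed = {
    "x": [to_interval(part) for part in doc["x"].split(",")],
    "when": date.fromisoformat(doc["when"]),  # ISO 8601 calendar date
}
```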

> On the question of anchors and refs - you need them. Just replicating copies of content is not an option for a serious serialisation format - it's a maintenance issue: if you want to make a change to re-used content, if it's only there once, that's where you make the change.

That kind of bypasses my argument, which was that very repetitive content is indicative of a need for a schema redesign. This is explained in my "why not nodes and refs" page, where I took the example of nodes and refs from Wikipedia and showed how it could be refactored into a more readable, more typesafe schema design that didn't require them.

I've been through this iterative process of schema redesign on several StrictYAML markup languages and the net result is always better in the end.

The downside of this approach is that you need a feedback loop between your data and your schema for it to really evolve properly since it's rarely clear in advance what is going to become repetitive and why.
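One shape such a redesign could take is sketched below. It is an illustration with hypothetical keys, not the example from the "why not nodes and refs" page: the repeated content is hoisted into an explicitly named section and the code that consumes the document applies it. The dictionary stands in for an already-parsed document.

```python
# Stand-in for a parsed document. The keys "defaults" and "databases",
# and all values, are hypothetical.
parsed = {
    "defaults": {"port": 5432, "ssl": True},
    "databases": {
        "primary": {},
        "replica": {"port": 5433},  # overrides one shared value
    },
}


def resolve_databases(document):
    """Apply the shared defaults to every entry, letting each entry override them."""
    defaults = document["defaults"]
    return {
        name: {**defaults, **overrides}
        for name, overrides in document["databases"].items()
    }


databases = resolve_databases(parsed)
```

The shared values appear once, and the fact that every entry inherits them is stated by the schema rather than implied by which nodes happen to be referenced.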

Sometimes use of nodes and refs is indicative of... well, somebody trying to make a full-blown programming language in YAML (e.g. most build pipelines). That's also a disgusting thing, in my mind. I wish they'd just used Python :/

> If it's copied, how do you tell that the intention is that they are all the same, or perhaps some copies are meant to be distinct instances? There's no way to tell.

That's actually an even bigger red flag that your schema is in need of a redesign. If there are actually implicit semantics in your nodes and refs, they would be so much better represented explicitly.

> Don't get me wrong - you have to follow your own design intentions with your format; I'm just providing a few extra things to think about - even if it ends up being something you explain 'why we didn't do X'.

Of course. I'm actually pretty happy we've had this discussion, it's given me a lot to think about. While I'm not about to try and do a kitchen sink approach to custom validators and start including coded terms, I might publish some code snippets to show how things like intervals and coded terms and country codes could be represented cleanly and tightly.