crdoconnor / strictyaml

Type-safe YAML parser and validator.
https://hitchdev.com/strictyaml/
MIT License

Clarification on tags justification, e.g. AWS cloudformation's shorthand private tags #37

Open simonbuchan opened 6 years ago

simonbuchan commented 6 years ago

First off, I really like this library, and the design choices you've made, so thanks!

I was looking at the removed features, and it lists explicit tags as being a form of syntax typing, which is absolutely bad when it's defined by the schema, yes! But tags don't have to be used that way, they can be used as a reserved syntax for alternate ways to provide a value. In particular, AWS CloudFormation uses them as a shorthand for its "function" syntax:

Without tag shorthand (or flow):

Parameters:
  HostedZoneName: ...
  RecordName: ...
  RecordComment: ...
  ...

Resources:
  LoadBalancer: ...

  RecordSet:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName:
        Fn::Sub: '${HostedZoneName}.'
      Comment:
        Ref: RecordComment
      Name:
        Fn::Sub: '${RecordName}.${HostedZoneName}.'
      Type: A
      AliasTarget:
        DNSName:
          Fn::GetAtt:
          - LoadBalancer
          - DNSName
        HostedZoneId:
          Fn::GetAtt:
          - LoadBalancer
          - CanonicalHostedZoneNameID

With tag shorthands:

  RecordSet:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: !Sub '${HostedZoneName}.'
      Comment: !Ref RecordComment
      Name: !Sub '${RecordName}.${HostedZoneName}.'
      Type: A
      AliasTarget:
        DNSName: !GetAtt LoadBalancer.DNSName
        HostedZoneId: !GetAtt LoadBalancer.CanonicalHostedZoneNameID

Embedding a sub-syntax in strings, as suggested in #20, is a bad idea here: the transformation is generic across the whole document (even in places where it's not valid), and plenty of values (like Comment) permit arbitrary content. So I think AWS has the right long-hand syntax, but as you can see it quickly gets unwieldy, which is why the private tags are so heavily used. In this sense, tags are an already reserved syntax that can safely escape an embedded string syntax (rather than adding another level of escaping).
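To illustrate why the long-hand form is unambiguous, here is a minimal Python sketch of an evaluator for the single-key "function" mappings. It works on plain dicts standing in for an already parsed document. The `resolve` function and the parameter values are hypothetical and only cover `Ref` and `Fn::Sub`; this is not how CloudFormation itself is implemented.

```python
import re


def resolve(node, params):
    """Recursively replace single-key function mappings with their values."""
    if isinstance(node, dict):
        if len(node) == 1:
            (key, arg), = node.items()
            if key == "Ref":
                # The referenced value is returned as-is, never re-interpreted.
                return params[arg]
            if key == "Fn::Sub":
                # Substitute each ${Name} with the matching parameter.
                return re.sub(r"\$\{(\w+)\}", lambda m: params[m.group(1)], arg)
        return {k: resolve(v, params) for k, v in node.items()}
    if isinstance(node, list):
        return [resolve(item, params) for item in node]
    return node  # plain scalars pass through untouched


# Hypothetical parameter values for the Route53 example above.
params = {
    "HostedZoneName": "example.com",
    "RecordName": "www",
    "RecordComment": "costs $5, see ${README}",
}
properties = {
    "HostedZoneName": {"Fn::Sub": "${HostedZoneName}."},
    "Comment": {"Ref": "RecordComment"},
    "Name": {"Fn::Sub": "${RecordName}.${HostedZoneName}."},
    "Type": "A",
}
resolved = resolve(properties, params)
print(resolved["Name"])
```

Because substitution only happens inside a mapping whose single key names a function, a free-form value such as the comment can contain `${...}` without any extra escaping.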

As another example that is closer to the original justification for removal, you can also (and it is probably the original intention of tags) use tags to provide types better syntax and lower (user) implementation cost where it's not directly providable by the schema, for example:

shape:
  fill: !LinearGradient
    from: 0 10
    to: 30 10
    stops: red 0.0, blue 0.2, green 1.0
  path:
  - !Move 0 0
  - !Arc 20 0 20 20
  - !Line 20 0

That said, I'm perfectly OK with strictyaml not supporting tags for implementation or compatibility complexity, or other such reasons, but the justifications only talk about using them for syntax typing, and then only for yaml built-in types, which on its own doesn't seem to me like sufficient reason to remove the feature.

To be clear, just updating the docs would be fine, though I won't refuse adding tags support 😇

crdoconnor commented 6 years ago

Hi Simon,

Thanks for the comments, kind words, etc. Sorry I haven't answered earlier. This is a good comment and I'll link to it in the docs.

> I was looking at the removed features, and it lists explicit tags as being a form of syntax typing, which is absolutely bad when it's defined by the schema, yes! But tags don't have to be used that way, they can be used as a reserved syntax for alternate ways to provide a value

I think this is a valid way of using them, but equally I think that they are not a necessary feature in order to implement that. I follow the rule of least power pretty assiduously when I define DSLs. The corollary of that principle is that unless I consider a powerful new feature necessary, I leave it out: "usefulness" is not enough of a prerequisite on its own.

In your examples, I think that the benefits of using tags could still be achieved fairly easily without using them, so it fails the necessity test. The schema language I have defined with strictyaml could possibly make parsing your example above a bit easier (and I'd be happy to make improvements of that kind), but there's nothing intrinsically stopping it from being done even right now.**

> As another example that is closer to the original justification for removal, you can also (and it is probably the original intention of tags) use tags to provide types better syntax and lower (user) implementation cost where it's not directly providable by the schema

This absolutely happens. However, where this is being done I'd consider it a bug in the schema that needs to be fixed - and not a bug that the schema language should attempt to work around.

I feel like your second example is actually slightly confusing to a non-programmer - the notion that exclamation points should be used in one place but not the other, for instance.

> That said, I'm perfectly OK with strictyaml not supporting tags for implementation or compatibility complexity

It's mainly because I don't feel like the schema language should have knowledge or opinion of types beyond string, mapping and list because doing so opens up such an incredible can of worms. The problem I had that actually kicked this project off was largely due to some of those worms.

Where there is a need for users to have options on how they supply data that have type implications I feel like moving the problem of handling types to the programmer writing the parser is a better solution - it keeps the schema language from being overloaded with cruft that will confuse things.

** There is a minor exception in that it will forcefully reject the use of unquoted ! because it intentionally disallows this feature. If you truly wanted a strictyaml schema that had a smart interpretation of strings that start with ! it would have to start with a quote (') or be done using a multiline string (|).
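As a rough sketch of what a "smart interpretation" of quoted `!` strings could look like on the application side, the Python below splits a string of the form `!Name argument` into its parts. The convention, the regular expression, and the `split_tagged` helper are all assumptions made for illustration; nothing here is part of strictyaml.

```python
import re

# Hypothetical convention: a string of the form "!Name argument" is treated
# as a (name, argument) pair by the application, not by the YAML parser.
TAGGED = re.compile(r"^!(\w+)\s+(.*)$", re.DOTALL)


def split_tagged(value):
    """Return (name, argument) for "!Name argument", else (None, value)."""
    match = TAGGED.match(value)
    if match is None:
        return None, value
    return match.group(1), match.group(2)


print(split_tagged("!GetAtt LoadBalancer.DNSName"))
print(split_tagged("just a plain string"))
```

One consequence of this design is that any literal value that happens to start with `!` followed by a word would be picked up as well, so the application would need its own escape rule for such values.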

simonbuchan commented 6 years ago

Yeah, there is definitely a syntax boost with correctly used tags, but they definitely aren't required. As a personal preference, I would probably keep them, since the syntax guides the semantics, a property I like to preserve, but I'm not the one that did the work of creating a language, so I don't get to complain too much!

Your footnote implied you were thinking "!some-type some-value"? I would definitely avoid being that close to a yaml feature in a yaml subset. Further, in the first example, it's a bit sucky to shadow valid values and require another level of escaping 🤷‍♂️. I'm happiest with the existing non-tag syntax AWS has, an object that has one property, e.g. !foo bar -> foo: bar. The biggest problem with that is YAML (and thus strictyaml) doesn't support nesting property names on one line, which makes cases where every property has a "typed" value much noisier and harder to read, as seen in the examples.

I've also seen (including in the same AWS example!) the object with a type property, which can also work out well if you're going to have an object anyway. Supporting this generically is possible with type-unions (and, in typed implementation languages, discriminator support), but it gets tricky to give good error messaging (e.g. in your own example for unions it confusingly reports "expected an integer" for Bool() | Int()), so it probably makes more sense to support this at the schema level directly if this is recommended usage.

If you were thinking of additions to cover this kind of usage, the schema supporting (some generalization of?) "exactly one of these properties" would be the thing I would suggest - perhaps the key validator can do this?
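A minimal sketch of the "exactly one of these properties" check, written in plain Python over an already parsed mapping rather than as a strictyaml validator. The `exactly_one_of` helper and the `PATH_COMMANDS` set are hypothetical names; the point is only that a dedicated check can name the full set of alternatives in its error message.

```python
def exactly_one_of(mapping, allowed):
    """Return the (key, value) pair if mapping has exactly one allowed key."""
    if not isinstance(mapping, dict):
        raise ValueError(f"expected a mapping with one of {sorted(allowed)}")
    if len(mapping) != 1:
        raise ValueError(
            f"expected exactly one of {sorted(allowed)}, found keys {sorted(mapping)}"
        )
    (key, value), = mapping.items()
    if key not in allowed:
        raise ValueError(f"unknown key {key!r}; expected one of {sorted(allowed)}")
    return key, value


# Hypothetical command names, borrowed from the shape example above.
PATH_COMMANDS = {"Move", "Arc", "Line"}

print(exactly_one_of({"Arc": "20 0 20 20"}, PATH_COMMANDS))
```

With this shape, `!Move 0 0` in the tagged example corresponds to the one-property mapping `Move: 0 0`.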

> I feel like your second example is actually slightly confusing to a non-programmer - the notion that exclamation points should be used in one place but not the other, for instance.

I am always wrong about what confuses and doesn't confuse non-programmers, so it's quite possible 😅. That said, I think it's handy that the exclamation point is "loud" here, saying how it's different. e.g., "you can only have one name here, but it can be one of a set of valid names, each of which has different contents". Only usability testing would say for sure, but I doubt the difference would be so significant that it would outweigh other concerns, either design principles or the library vs (code) usage complexity concerns on the other side.

witten commented 5 years ago

Piggybacking on this issue, although perhaps I should open a new one. Here's another potentially valid use of tags: YAML file includes. Here's an actual example from the wild:

retention:
    !include /etc/borgmatic/common_retention.yaml

The idea is that a common YAML fragment gets dynamically included into the YAML document in question at runtime. The main rationale is reuse, so as to avoid having to repeat common configuration in multiple documents. (Usable by non-programmers? Perhaps, although it is admittedly a little advanced.)

But wait, I hear you say, can't you do that outside of strictyaml, and then still use strictyaml for the other aspects of YAML parsing and validation? You can, but not without problems. Two alternative approaches I can think of:

  1. Prior to feeding the YAML document to strictyaml, pre-process it (e.g. with raw ruamel.yaml) to inline all includes and produce a single document. This works, but then any line numbers in strictyaml error messages are completely bogus in relation to the source YAML files.
  2. Or, escape the include tags (as suggested by a comment above), and then after strictyaml parses and validates the YAML document, post-process the include directives at the application level. This doesn't really work well though, because then you give up strictyaml schema validation on any part of the YAML document that's pulled in by an include. And in fact, schema validation may simply not work if a required part of the document is hidden behind an include tag that strictyaml doesn't understand.

To be clear, I'm not making a feature request here for file include functionality in strictyaml (although that'd be pretty great). Rather, I'm making the case that support for custom tags in strictyaml would be pretty darn useful — even necessary for some use cases.

crdoconnor commented 5 years ago

Hi @witten, thanks for your comment.

> Prior to feeding the YAML document to strictyaml, pre-process it (e.g. with raw ruamel.yaml) to inline all includes and produce a single document. This works, but then any line numbers in strictyaml error messages are completely bogus in relation to the source YAML files.

If you write your own processing step which picks up a filename from the 'master' document and then tries to read it with another schema from the 'child' included document then the line number of any schema violation for the child document would be correct, would it not?

> Or, escape the include tags (as suggested by a comment above), and then after strictyaml parses and validates the YAML document, post-process the include directives at the application level. This doesn't really work well though, because then you give up strictyaml schema validation on any part of the YAML document that's pulled in by an include.

Well, you could validate them separately, could you not?

witten commented 5 years ago

Thanks for the quick response.

> If you write your own processing step which picks up a filename from the 'master' document and then tries to read it with another schema from the 'child' included document then the line number of any schema violation for the child document would be correct, would it not?

Yes, but not easily! With the particular include approach I happen to be using, a user can decide to factor out and include any arbitrary portion of the main YAML document. So I don't necessarily have a separable schema for just the fragment that they've put in a separate file. I suppose, before feeding the YAML with escaped includes to strictyaml, I could try to dynamically split the main schema into separate portions based on where the includes are. But then that'd require both a pre-processing step (to locate the includes and split up the schema) and a post-processing step (to interpret the includes and apply the sub-schemas at the application-level).
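As a rough illustration of picking the sub-schema by position, here is a Python sketch that walks a parsed document alongside a schema and, on reaching an escaped include string, validates the fragment against whatever sub-schema applies at that point. Everything in it is an assumption made for the example: the schema is a toy nested dict whose leaves are Python types (not strictyaml's schema language), documents are plain dicts of strings, and `load_fragment` stands in for reading and parsing the included file.

```python
INCLUDE_PREFIX = "!include "


def validate(node, schema, source, load_fragment, path=""):
    """Validate node against schema, following escaped include strings."""
    if isinstance(node, str) and node.startswith(INCLUDE_PREFIX):
        # Apply the sub-schema for this position to the fragment, and
        # report any errors against the fragment's own file name.
        target = node[len(INCLUDE_PREFIX):].strip()
        return validate(load_fragment(target), schema, target, load_fragment, path)
    if isinstance(schema, dict):
        if not isinstance(node, dict) or set(node) != set(schema):
            raise ValueError(f"{source}: {path or '/'}: expected keys {sorted(schema)}")
        return {
            key: validate(node[key], schema[key], source, load_fragment, f"{path}/{key}")
            for key in schema
        }
    try:
        return schema(node)  # leaf: convert the scalar with the given type
    except (TypeError, ValueError):
        raise ValueError(
            f"{source}: {path}: expected {schema.__name__}, got {node!r}"
        ) from None


# Illustrative schema and data; the key names are examples only.
main_schema = {
    "location": {"repository": str},
    "retention": {"keep_daily": int, "keep_weekly": int},
}
fragments = {
    "/etc/borgmatic/common_retention.yaml": {"keep_daily": "7", "keep_weekly": "4"},
}
document = {
    "location": {"repository": "backup.borg"},
    "retention": "!include /etc/borgmatic/common_retention.yaml",
}
result = validate(document, main_schema, "config.yaml", fragments.__getitem__)
print(result)
```

The sketch reports only a file name and a key path. Reporting line numbers within the fragment would need the positions recorded by whichever parser loads it.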

crdoconnor commented 5 years ago

Is there a particular issue with that approach? I have quite a few systems using strictyaml which do multipass validation.

witten commented 5 years ago

It just seems like a lot of work (three passes, and a fair amount of complexity to make the schemas separable at runtime) to do something that could in theory be done in one pass. But I'll give it a shot anyway. 😃

For comparison, my current non-strictyaml code does one pass to load, and then does validation on the resulting data structure in memory. (I do realize that performance is not one of strictyaml's main goals.)