clj-commons / clj-yaml

YAML encoding and decoding for Clojure
Other
122 stars 26 forks source link

Handle unknown tags via PassthroughConstructor. #38

Closed grzm closed 2 years ago

grzm commented 2 years ago

Here's a port of the PassthroughConstructor from https://github.com/owainlewis/yaml

A couple of notes:

grzm commented 2 years ago

This probably addresses https://github.com/clj-commons/clj-yaml/issues/23

borkdude commented 2 years ago

Why is :constructor a no-arg function?

There's no examples of using the keyword args for parse-string in the README. I'll do that if it's thought to be useful as well.

Sure, maybe a different PR?

borkdude commented 2 years ago

Let's assume MIT (https://github.com/owainlewis/yaml/blob/master/LICENSE) until otherwise proven.

grzm commented 2 years ago

Why is :constructor a no-arg function?

I can think of three reasons:

I could very well be misunderstanding your question, though. Is there an alternative you'd prefer?

borkdude commented 2 years ago

@grzm I'm not deeply familiar with this library, but since you're porting from owainlewis's repo, I checked there and he seems to pass the constructor like this: :constructor yaml.reader/passthrough-constructor. This is why I wondered you chose for a thunk.

borkdude commented 2 years ago

Oh, now I see it: yaml.reader/passthrough-constructor is also a thunk 🤦

Well, this PR looks good to me then. What about you @lread ?

lread commented 2 years ago

Hiya! 👋 Ramping up on this one, so please bear with me!

Here's the original commit message from owainlewis/yaml:

In Clojure-land, we should allow unknown tags to be represented using the underlying data type. This is preferrable to failing when working with unknown YAML.

So:

I don't see docs around unsafe and mark options. They were implemented as booleans that create and then internally have snakeyaml use a snakeyaml constructor. I suppose they could now technically be replaced by the :constructor support.

The snakeyaml docs mention:

Note: if you want to limit objects to standard Java objects like List or Long you need to use SafeConstructor.

Ok. So we use the safe constructor by default.

And the mark constructor comes from clj-yaml and is explained by its tests.

As a general user, I'd probably just like to just choose features. Maybe I'd like position tracking (mark) and parsing to any old java object (unsafe) and to not barf on unknown tags. I don't know or care what a "constructor" is. (I realize these are all mutually exclusive features today).

As an advanced user, I might want to provide my own snakeyaml constructor. But then I might also want to provide my own snakeyaml other-thing?

These are not strong opinions, just my thoughts and notes while learning about all this.

Also: have we thunk (pun intended) through our naming? If we go this route, I wonder if :constructor-fn would clearer?

lread commented 2 years ago

Another thought: exposing a snakeyaml constructor would tie us to snakeyaml and make a change like moving to snakeyaml-engine a potentially breaking change for users for clj-yaml.

borkdude commented 2 years ago

@lread True, but at that point we could just release an all new clj-yaml 2.0 (with an effort to keep much of the API the same)

lread commented 2 years ago

I wonder, does clj-yaml currently expose snakeyaml implementation details anywhere else yet? If not, I'm not sure if the benefit of doing so outweighs the implementation detail leak.

More questions:

  1. is there a benefit to throwing on unknown tags? Should we just always read them? (It might be useful to know what the unknown tag was)
  2. I think that it is maybe incidental that unsafe and mark and this pass-through feature are mutually exclusive. If we were to remove this restriction, is there a logical reason these features could not be combined? Does unsafe and reading unknown tags make any sense? Probaly? If the unsafe object cannot be loaded, then it would be read as an unknown tag.

If nobody is in any great hurry, I am interested in two experiments (I'd volunteer to do the work).

  1. taking a feature-based approach and removing the mutal exclusivity on unsafe, mark and (maybe we'd call it) read-unknown-tags
  2. a proof-of-concept of moving to snakeyaml-engine, just to see how'd that would go and if my implementation detail leak concern is moot. (Not suggesting we make this move, just trying to gather info).

I don't mean to bog this PR down with so many questions. Hopefully, I am being helpful and not a pain in the butt.

borkdude commented 2 years ago

I wonder, does clj-yaml currently expose snakeyaml implementation details anywhere else yet?

One such thing is exceptions. I needed to expose this one in bb for people to properly handle YAML exception. I don't know if snake-yaml-engine has exactly the same exception type.

lread commented 2 years ago

Ah thanks, @borkdude, good to know, clj-yaml leaks snakeyaml exceptions.

No, it does not seem that snakeyaml-engine shares the same exception classes.

Snakeyaml:

image

Snakeyaml-engine:

image

So being new to this project, I should probably become aware of what clj-yaml wants to be:

  1. a YAML encoder/decoder that happens to be using snakeyaml
  2. a snakeyaml wrapper

Not suggesting one is better than the other but important to understand.

borkdude commented 2 years ago

Regardless of what it wants to be, I think it would almost certainly breaking if we silently moved to some other implementation: the risk is too high. We could at some point release a breaking change version 2.0.0 or so which does that and is almost a drop-in replacement so people could move over without much ado.

grzm commented 2 years ago

I'm starting to lean towards the idea that we should move forward with cli-commons/yayama (think Yo-Yo Ma + yayaml), but name TBD) on snakeyaml-engine. It looks like it has a smaller surface area (json types only). I think we could probably lean into the choice of underlying library: I think there's a lot of abstraction work that needs to be done to truly make the underlying implementation changeable.

Then again, what's the value if it's just a convenience wrapper? I've generally been displeased with wrappers, especially when the ergonomics of the underlying java lib aren't that bad: the layer of indirection gets in the way.

grzm commented 2 years ago

Because I can't leave well-enough alone. I hate the names and I'm not happy with the keyword args, and neither round trip, but the UnknownTagConstructor does this:

Parameters:
  paramEnvironmentType: # ask a user to define whether it is 'dev' or 'prod'
    Description: Environment type
    Default: dev # by default it is 'dev' environment
    Type: String
    AllowedValues: [dev, prod]
    ConstraintDescription: Must specify 'dev' or 'prod'

Conditions:
  isProd: !Equals [!Ref paramEnvironmentType, prod] # if 'prod' then TRUE, otherwise FALSE

Resources:
  myVolume: # create a new EBS volume only if environment is 'prod'
    Type: 'AWS::EC2::Volume'
    Condition: isProd # conditionally create EBS volume (only if environment is 'prod')
    Properties:
      Size: 100
      # etc.

with :unknown-tag (wraps them in maps with namespaced keys)

{:Parameters
  {:paramEnvironmentType
   {:Description "Environment type",
    :Default "dev",
    :Type "String",
    :AllowedValues ("dev" "prod"),
    :ConstraintDescription "Must specify 'dev' or 'prod'"}},
  :Conditions
  {:isProd
   #:clj-yaml.core{:tag "!Equals",
                   :value
                   (#:clj-yaml.core{:tag "!Ref",
                                    :value "paramEnvironmentType"}
                    "prod")}},
  :Resources
  {:myVolume
   {:Type "AWS::EC2::Volume",
    :Condition "isProd",
    :Properties {:Size 100}}}}

with :passthrough (drops/strips unknown tags)

{:Parameters
  {:paramEnvironmentType
   {:Description "Environment type",
    :Default "dev",
    :Type "String",
    :AllowedValues ("dev" "prod"),
    :ConstraintDescription "Must specify 'dev' or 'prod'"}},
  :Conditions {:isProd ("paramEnvironmentType" "prod")},
  :Resources
  {:myVolume
   {:Type "AWS::EC2::Volume",
    :Condition "isProd",
    :Properties {:Size 100}}}}

(That reminds me of another thing I'd like to have in version 2: YAML sequences as Clojure vectors, not sequences/lists)

grzm commented 2 years ago

(The signing GPG failure isn't me, right?)

borkdude commented 2 years ago

The GPG failure is @slipset's - perhaps he can come up with something that doesn't make PRs seem like they fail? ;)

borkdude commented 2 years ago

@grzm I like these solutions. Perhaps the name :unknown-tag could be improved. Is the argument to :unknown-tag a function which receives this node and then converts it and did you call it with identity here?

About sequences: these are sequences for a reason, because YAML allows recursive / infinite / circular references. There was an issue in the past to coerce to vector. I would want to have this as an option since the 99% case is that you won't have these circular references. Perhaps we could have :seq-fn or so which coerces sequences. See https://github.com/clj-commons/clj-yaml/issues/29 and https://github.com/clj-commons/clj-yaml/pull/18. Although the previous conclusion was to not add this option, I think we should consider it again and do it.

lread commented 2 years ago

@grzm I just merged a ci change to master that should fix these GPG deploy failures we are getting on PRs.

grzm commented 2 years ago

Okay, I'm adequately happy with the new names. However, the names have an interesting contrast with the passthrough-constructor from owainlewis/yaml:

The reason I've landed on these names is that these are for handing unknown tags, and I wanted names that reflect this. There may be a bit of confusion for those moving from owainlewis/yaml, but they're going to need to understand the differences between the two libraries anyway.

I've removed the generic constructor option to avoid exposing more snakeyaml implementation details.

I did have a thought of passing a function that handles a node with an unknown tag, but I think that ends up exposing some of the snakeyaml internals anyway.

This should handle my personal use case (I'll be using the strip-unknown-tags? option), and includes an option to know what unknown tag is and potentially handle them however you want.

One minor thing is whether the namespaced should just be :clj-yaml/tag and :clj-yaml/value instead of :clj-yaml.core/tag and :clj-yaml.core/value, but we're already requiring theclj-yaml.corenamespace, so in the common case one would likely use a::yamlalias forcli-yaml.core` anyway, which is even shorter.

If this looks acceptable, I'll add some notes to the README and submit a single-commit PR.

borkdude commented 2 years ago

@grzm Feedback:

Design

I think having just one option which is a function which gets a map with {:tag .. :value ..} is more flexible and we can add values to that over time. I don't really see the need for using fully qualified keywords there: having simple keywords would improve the UX by passing just a simple keyword for the "strip tag" behavior.

:default-tag-fn (fn [m] (:value m))

or simply:

:default-tag-fn :value

would be the same as :strip-unknown-tags true.

Style

I'd like to avoid using question marks for boolean keyword options. Question marks are usually used for predicate functions, not boolean values. E.g. Clojure metadata uses :macro true, not :macro? true. Also, the option might become more than a boolean in the future, e.g. a function or a set of possible keyword. So avoiding question marks in option names gives more flexibility.

grzm commented 2 years ago

One more try: minor difference from suggestion of passing a map with :tag and :value keys; Using positional arguments allows us to not have to specify these names and is suitably unsurprising.

One motivation for this is that I was hoping to pass identity as the function to allow the generic pass-through of the unknown tags: in that case, with keywords un-namespaced by cli-yaml or cli-yaml.core, the returned maps would be indistinguishable from data. Expecting the unknown-tag-fn handler to expected namespaced keywords seemed unnecessarily cumbersome.

borkdude commented 2 years ago

I don't really like the idea of using a positional function: this is really inflexible when it comes to future changes.

https://twitter.com/borkdude/status/1572203023367131138

borkdude commented 2 years ago

the returned maps would be indistinguishable from data

Yes, you would have to provide some function to a sensible transformation, but the 99% use case will probably be :default-tag-fn :value

grzm commented 2 years ago

I beginning to feel that the bike shed really should be yellow. ;)

This particular handler is specific to its use case. It's not a generic transform function that will be used in any other pipeline.

I did consider following the equivalent for functions that take MapEntries, and use a single argument consisting of a two-element tuple (defn unknown-tag-fn [[tag value]] ,,,)

compare:

(map (fn [[k v]] ,,,) {:foo :bar, :baz :bat})
grzm commented 2 years ago

(Also for cli-yaml v2: follow clojure.data.json and cheshire for handing keywords; default to not keywordizing, and take :keyword-fn instead of a boolean)

borkdude commented 2 years ago

@grzm The reason I'm also suggesting :default-tag-fn is that it is very similar to other clojure APIs:

user=> (edn/read-string {:default (fn [tag val] [tag val])} "#foo [1 2 3]")
[foo [1 2 3]]

but the usage of positional functions has a serious drawback of breaking future callers when you add things, that is why I suggested a map. It's pretty standard API design practice nowadays.

grzm commented 2 years ago

Yes. Maps everywhere is common pattern. Another common pattern is

(defn unknown-tag-fn 
  ([tag value] (unknown-tag-fn tag value {})
  ([tag value opts] ,,,))

This also allows extension, while providing ease for the base case.

I use both, which one depending on what is more ergonomic in a given context. To me, it's not one-size-fits-all.

Re: default-tag-fn: I didn't realized the parallels with edn/read-string. Thanks for the additional color. That said, the default function in edn/read-string does have exactly the function signature I'm proposing: two args, tag and value.

borkdude commented 2 years ago

That said, the default function in edn/read-string does have exactly the function signature I'm proposing: two args, tag and value.

This is why I added:

but the usage of positional functions has a serious drawback of breaking future callers when you add things, that is why I suggested a map. It's pretty standard API design practice nowadays.

borkdude commented 2 years ago

Having a multi-arg fn option isn't a good API imo, since you cannot detect (without hacks) if a user provided a 2-arg or 3-arg function or both. Yes, you could try calling the function in a try/catch with 2 or 3 arguments first, but why bother if you can implement it robustly from the start using a 1-arg map-receiving function.

borkdude commented 2 years ago

To summarize:

I think I agree 95% with your design, it's just that:

borkdude commented 2 years ago

Thanks for processing the feedback!

If this looks acceptable, I'll add some notes to the README and submit a single-commit PR.

I think we reached this point now. Single commit isn't necessary, we can squash-merge this, but if you prefer, single commit is ok.

grzm commented 2 years ago

I updated the README in the last commit. Feel free to squash merge. Thanks!

borkdude commented 2 years ago

@grzm Will do, but there are conflicts now. Is it possible that you fix these? Could also do this locally, if you want.

borkdude commented 2 years ago

@grzm Already on it. Merged version pushed here now: https://github.com/clj-commons/clj-yaml/tree/grzm-passthrough-constructor

Will merge from there.

grzm commented 2 years ago

Cheers! I was working on fixing the conflicts but haven't been able to steal enough time between sessions at the conference. Thanks everyone for their feedback!

borkdude commented 2 years ago

I figured as much. Enjoy the conf!