dtao / safe_yaml

Parse YAML safely
MIT License
217 stars 63 forks source link

Support for custom deserializer? #21

Open zwily opened 11 years ago

zwily commented 11 years ago

Here's the situation - for our app, safe_yaml works for everything except for a massive number of serialized HashWithIndifferentAccess. With safe_yaml, we have to choose between:

https://github.com/zwily/safe_yaml/commit/036c2cf206cc0da75ff8fef4b603e72fa8f55a30

Lots of people have been talking about doing whitelists/schemas/whatever to handle situations like this in Psych, but that support still seems like a ways off. It seems like to me a lightweight way to handle a simple use case like this would be to be able to specify in the SafeYAML configuration a mapping of tags -> blocks. So I could do:

SafeYAML::CONFIG[:custom_initializers] = {
    "tag:yaml.org,2002:map:HashWithIndifferentAccess" => Proc.new {
        HashWithIndifferentAccess.new
    }
}

Then when instantiating objects, if the node has one of the supplied tags, it would call the supplied Proc instead of using [] or {} (for example).

I can code up a pull request but I wanted to see if this seemed like a good approach first. If it's not appropriate for the library, then we'll probably continue using my forked version.

dtao commented 11 years ago

I do think something like this makes sense, and I don't have any particular objection to your suggested approach. I guess part of me doesn't want to get too crazy as I do think this sort of functionality will eventually be added to Psych itself; but in the meantime, I'm definitely open to adding this. If you want to submit a pull request, I'll merge it; otherwise, I can take a stab at implementing it myself later on—possibly tonight or tomorrow.

dtao commented 11 years ago

Take a look and let me know how this looks to you. It feels a little iffy to me—particularly how tags are defined differently in Syck and Psych—but then, since an application's likely to only use one or the other, maybe that isn't a big deal. Still, part of me wishes that in your example the key could be "!map:HashWithIndifferentAccess" since that's how it will be declared in the YAML. (I thought about coding SyckResolver and PsychHandler in such a way that they would work consistently, but that brought me back to the part about an application only using one.)

zwily commented 11 years ago

Yeah, I was experimenting with a similar approach last night, and the whole tag issue in Syck vs Psych was bumming me out too. I don't see a clean way around it though - we'll probably just define the initializer for both tags (to be ready for when we switch to Psych).

At some point in the Syck and Psych code, they must transform from tag -> class name, but I suspect that's a complicated process and not worth duplicating here.

codekitchen commented 11 years ago

I'm a bit hesitant to suggest extending this hack even further, but what about inverting responsibilities here and passing the built hash/array/scalar into the block. That'll allow classes that don't follow hash or array semantics to be whitelisted, though it feels like a lot of duplication of yaml internals... For example:

SafeYAML::CONFIG[:custom_initializers] = {
    "tag:ruby.yaml.org,2002:object:MyClass" => Proc.new { |hash|
        MyClass.initialize_from_hash(hash)
    }
}

My goal here is to be able to use safe_yaml in the short term even though we have a handful of classes that we need to be able to still deserialize.

codekitchen commented 11 years ago

Hmm I took a stab at implementing my suggestion, and while it was quite easy in SafeYAML::Syck, the nature of Psych makes it very difficult to handle things like recursive usage of anchors correctly with my suggested callback scheme. So it may be a non-starter.

Another alternative would be to do the same type of blank object allocation and reaching into objects that YAML normally does, but I feel like at that point we'd be re-implementing significant portions of YAML.

codekitchen commented 11 years ago

OK, we've decided to switch gears completely (zwily and I work together btw). This is a big departure from how safe_yaml works right now, so I'll run it by you and let you decide whether it's something you want as an option in safe_yaml, or if we should create a different project:

We're going to switch to a safe_load that just gathers a set of all tags used in the YAML document, it won't actually build up any deserialized objects at all. After parsing, if the set of tags used is included in a configurable set of whitelisted tags, we'll parse the YAML using unsafe_load and return it. If not, we'll raise an error.

dtao commented 11 years ago

@codekitchen: Thanks for giving it a shot. I was going to mention the challenge here, but it sounds like you encountered it on your own. I still think it would be possible, and I suspect that others would find something like this beneficial.

However my instinct is to go with something like a whitelist of tags, which would default to Syck or Psych's built-in functionality for deserializing trusted typed (as opposed to support custom user-defined deserialization methods). I actually think this would be the solution most consistent with my primary goal for SafeYAML: to make it easy to incorporate into existing apps.

dtao commented 11 years ago

@zwily and @codekitchen: Let me know how you feel about the state of SafeYAML now, with its whitelisted tag and custom initializer support.

I was tempted to pull out the custom initializer functionality completely after adding the tag whitelisting; but then I discovered that there are some tags that Psych actually doesn't deserialize in the way you might want. For example, at least on certain versions of Ruby, I couldn't seem to do a round-trip serialize/deserialize with a simple Hashie::Mash object using pure Psych (even with a seemingly correct tag, the doc would deserialize to a Hash). So I'm thinking custom initializers/deserializers might still be desirable for your own custom tags that you want tight control over.

It's also possible Syck and/or Psych already have this functionality built in, and I just don't know about it. (My knowledge of these libraries has been built basically through tinkering and reverse-engineering, as you can probably tell from the code.)

In any case, @codekitchen, do you think it still makes sense to implement your suggestion from a few days ago? It should be easier to do at this point, now that SafeYAML does the same thing with Syck and Psych (resolves a pre-parsed tree of nodes).

codekitchen commented 11 years ago

Yeah I ran into the same Hashie::Mash issue when updating the specs, it appears that Ruby 1.9.2's Psych simply has bugs, as far as I could tell. Probably why Syck was still the default until 1.9.3.

I think the changes look good. We ended up having to extend the whitelist even further in https://github.com/instructure/safe_yaml/commit/d0d2d00e476c7545b806f8f91341f7d587127b76 after discovering that we needed to allow deserialization of a couple Class objects (not instances, the Class itself), and all Class objects have the same YAML tag, so we added whitelist-with-a-block support so you could allow deserialization of some Classes and not all.

If we added my original suggestion for the custom initializer though, and passed the right information to the block, I think you could get the same functionality that way. It's a tradeoff, more flexibility and a little more complexity using the lib.

So as it is today, we still need to use our fork because of that last extension, but I don't think there's any reason we can't make it work.

jimmydjwu83 commented 9 years ago

Open