crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org
Apache License 2.0

[RFC] Out of the box JSON/YAML serialization #6309

Open asterite opened 6 years ago

asterite commented 6 years ago

Even though JSON::Serializable is a big improvement over JSON.mapping, having to include JSON::Serializable in every type feels like a burden.

In many other languages, for example, C#, Go, and I guess Java, serialization works out of the box through runtime reflection. We can do the same in Crystal, with compile-time reflection.

The way to do it is this: we define Object.from_json(string_or_io) on every Object. It creates an instance with allocate (unsafe), creates a JSON::PullParser and passes it to an instance method named initialize_from_json. This method is very similar to the initialize(pull_parser) we have today. Because it raises if a required property (one without a default value, and not nilable) is missing from the JSON source, it's totally safe.
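
A minimal sketch of what this could look like (the initialize_from_json name and the allocate-based flow come from the description above; the rest is assumed, not an actual implementation):

require "json"

class Object
  # Hypothetical out-of-the-box deserialization (sketch, not stdlib code)
  def self.from_json(string_or_io)
    parser = JSON::PullParser.new(string_or_io)
    instance = allocate # unsafe: instance vars are not initialized yet
    # initialize_from_json would raise for any missing required property,
    # so we never hand out a partially initialized instance
    instance.initialize_from_json(parser)
    GC.add_finalizer(instance) if instance.responds_to?(:finalize)
    instance
  end
end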

The benefit is not having to include JSON::Serializable in every object that needs to be serialized. You could do:

require "json"

record Point, x : Int32, y : Int32

# Works!
Point.from_json(%({"x": 1, "y": 2}))

and it will work right away. Serialization can still be controlled via attributes.
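
For example, with the @[JSON::Field] annotation that JSON::Serializable already understands today (under this proposal the include would become unnecessary, but the annotation would work the same way):

require "json"

class Tagged
  include JSON::Serializable

  getter x : Int32

  # Map the "Y" JSON key to the y property in both directions
  @[JSON::Field(key: "Y")]
  getter y : Int32

  def initialize(@x, @y)
  end
end

point = Tagged.from_json(%({"x": 1, "Y": 2}))
puts point.y      # => 2
puts point.to_json # => {"x":1,"Y":2}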

Take for example https://github.com/crystal-lang/crystal/pull/6308 . It's nice, but having to manually include the JSON::Serializable module feels redundant, especially for records and simple types.

Another benefit is that no module is included, which means #6194 won't trigger (for this case).

Yet another benefit is that imagine you define some simple types in your library. If people want to serialize them, they can do it right away, no need to reopen them and include JSON::Serializable in case the author forgot.

The diff to achieve this is very simple, ~~except there's a compiler bug that prevents this from working (I can work on that)~~ I have it working locally.

I'm pretty sure many will say it might be "dangerous" to allow serialization on every type. I don't see it that way. And given that there's precedent in basically every other language out there, I think it's fine.

Thoughts?

asterite commented 6 years ago

Another thing, right now the compiler complains in this case:

class Foo
  property x : Int32
end

The error is:

instance variable '@x' of Foo was not initialized directly in all of the 'initialize' methods, rendering it nilable

That's cool!

However, once we include JSON::Serializable that error is gone:

require "json"

class Foo
  include JSON::Serializable

  property x : Int32
end

# No error!

The reason is that the include defines an initialize method that makes sure to initialize all instance vars. So there's already an initialize doing that and the ability for the compiler to check that x wasn't initialized is lost. In fact... I just found a bug:

require "json"

class Foo
  include JSON::Serializable

  property x : Int32
end

p Foo.new # => #<Foo:0x104ff0ff0 @x=0>

This happens because of a compiler bug (what a surprise) related to these magic initializers that use @type inside them. There's a rule for making those initialize magically work, bypassing the usual checks. And it seems the checks are lost in other cases. I coded that specifically with JSON::Serializable in mind, and now I regret it because it's a complex thing to check, and it's magical.

By removing that "feature", the only way we can achieve this functionality is the way I propose above. It can be with a JSON::Serializable module or without one (I prefer it without one, less typing and works-out-of-the-box).

Even if we somehow fix the above bug, when we do Foo.new we'll get an error saying no overload matches... there's new(JSON::PullParser). That including JSON::Serializable defines an initialize method for you is a bit annoying, because it now participates in overloads too! And because the lookup for initialize methods is a bit different than for regular methods (initialize is not inherited unless the child type doesn't define new constructors), we have to do that macro included/inherited hack.

I think this change will simplify things a lot:

Thoughts again? :-)

asterite commented 6 years ago

Ah... and the downside is that if you are defining models for consuming an API, you'll actually have to write a bit more:

require "json"

class SomeModel
  include JSON::Serializable

  property x : Int32
  property y : Int32
end

# Compiles

With the new way:

require "json"

class SomeModel
  property x : Int32
  property y : Int32
end

# Doesn't compile: neither x nor y were initialized in an `initialize`

The way to fix it is to define a constructor:

require "json"

class SomeModel
  property x : Int32
  property y : Int32

  # For example:
  def initialize(*, @x, @y)
  end
end

# Now compiles

If the model has a lot of properties, you might have to write a big constructor... Alternatively, define default values for some properties (if the API is guaranteed to return some properties, giving default values for them is acceptable because you know you'll get correct responses). Anyway, defining a constructor isn't that bad, you can just do def initialize(*, put all required properties here).

So, yeah, in some cases you'll have to write more: you won't have to write include JSON::Serializable, but you will have to write a constructor. But I think this is good, as it makes you think about how the class should be initialized outside of JSON. Imagine someone decides to use that class for some reason, maybe in a test; parsing it from JSON all the time might not be ideal, so having a constructor is good.

straight-shoota commented 6 years ago

Thinking on a broader scale: JSON/YAML are probably the most important serialization formats, but there are many similar use cases. Should this be the recommended approach for other mappings (XML, MessagePack, DB bindings, etc.) as well? I don't think there would necessarily be any downsides to this, other than a bunch of methods available on every object which will often not be useful unless the type is actually supposed to support serialization.

RX14 commented 6 years ago

I think this is too magic. I'd like a way to do generic serialization/deserialization, but I don't think this is the way.

RX14 commented 6 years ago

The way Serde does it is by describing a "universal data model": everything maps to that data model, and then that data model is mapped to JSON or YAML or whatever. I think we should do the same. And Serde requires an explicit #[derive(Serialize, Deserialize)] to work, so why shouldn't we require explicit includes?

bcardiff commented 6 years ago
  1. Regarding a constructor hack, maybe a better way to go is to allow a different way of declaring constructors that are not checked by the compiler, where it is the programmer's responsibility to comply with some sound contract. That way there is no hack in the compiler, but a convention.

  2. I don't think serialization should be enabled for all types. I'm OK with adding serialization by default to records, for example.

  3. Using the out-of-the-box serialization to avoid triggering #6194 is not worth it IMO.

RX14 commented 6 years ago

@bcardiff

1. I don't like that - self.allocate is the unsafe way of declaring a constructor. We just need to automate adding finalizers so that all constructors are effectively obj = Class.allocate; obj.initialize(*args); obj. Then that becomes the API.

2 and 3. Absolutely agree

asterite commented 6 years ago

I'm thinking that allocate should add the GC finalizer, maybe. Then that problem will be solved.

I think Rust needs a Serializable attribute because they don't have reflection. With reflection you don't need an attribute.

Serde looks nice, but I can't understand how it works. It seems you have to define how a language type maps to a serde type, and then how to serialize that type into the output, and maybe also the other way around. And then you have generic serialization attributes.

If we could go that way, I think it would be awesome. Less repetition if a type is to be serialized into multiple formats.

I wonder how's the efficiency, though. If you have an Array, you have to map it to a serde array? And then serialize that array? Maybe Rust can do that efficiently, I don't think we can, so it will double the number of allocations...

However, I still think that being able to serialize any type without including a specific module would be nice. Again, the reason some languages don't support this is that they lack reflection and the power to do it.

Alternatively, if it can be implemented without a compiler hack, that's another solution.

RX14 commented 6 years ago

@asterite if you think of it in the JSON::Emitter and JSON::PullParser level - there's no efficiency problem. The "data model" is just a standardized interface to the pull parsers and emitters. So we'd have a Serialization::PullParser module with abstract methods which JSON::PullParser would implement and same for emitters. Then you could just do JSON.build do |emitter|; my_obj.serialize(emitter); end and ditto for pull parsers. Then shortcut that up in from_json and to_json and you don't even really have a big breaking change.
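
A rough sketch of that idea (the Serialization::Emitter module and its method names are invented here for illustration; nothing like this exists in the stdlib):

# Hypothetical format-agnostic emitter interface (invented names)
module Serialization::Emitter
  abstract def int(value : Int64)
  abstract def string(value : String)
  abstract def start_object
  abstract def end_object
end

# A format's builder would implement it, e.g. (sketch):
#
#   class JSON::Builder
#     include Serialization::Emitter
#     # start_object/end_object map onto the existing JSON object calls
#   end
#
# so my_obj.serialize(emitter) works with any format's emitter,
# with no intermediate data structure and no extra allocations.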

And +1 to allocate adding the finalizer.

asterite commented 6 years ago

@RX14 I still don't understand how you would avoid that indirection and those extra allocations. I guess I'd have to see a proof of concept implementation to understand it :-)

I was wrong about allocate adding the GC finalizer. new invokes allocate, then initialize, then adds the finalizer. But if initialize raises then the finalizer must not be added. So I guess we need a method that does all of that at once, as @bcardiff suggests.

kostya commented 6 years ago

I think include is not that bad; it's also the Ruby style, where you extend a class with some new behavior by including a module into it. Adding it to Object would be more magical.

RX14 commented 6 years ago

@asterite because there's no indirection. It's just a set of methods pull parsers and emitters implement.

asterite commented 6 years ago

Then, how about we keep JSON::Serializable but the deserialization doesn't happen through a constructor? That way we don't have to use the {{@type}} hack, which also currently breaks other constructors (as I showed above). And that way we don't need to use the included and inherited hacks either.
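
Something like this, perhaps (a sketch with invented names, reusing the allocate-based approach from the top of this issue; initialize_from_json is hypothetical):

# Hypothetical JSON::Serializable that deserializes without a constructor
module JSON::Serializable
  macro included
    def self.from_json(string_or_io)
      parser = ::JSON::PullParser.new(string_or_io)
      instance = allocate # unsafe until initialize_from_json completes
      instance.initialize_from_json(parser) # raises on missing required properties
      instance
    end
  end
end

Since no new(JSON::PullParser) is generated, nothing extra participates in constructor overloads.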

kostya commented 6 years ago

Sounds cool, but how would it look? Any example?

asterite commented 6 years ago

@kostya This commit from this branch has it working. As you can see:

kostya commented 6 years ago

I think initialize is not a big problem, because it can also be generated: https://github.com/kostya/auto_initialize

asterite commented 6 years ago

@kostya Looks good! So simple :-)

Well, everyone (cc @RX14 @straight-shoota @ysbaddaden): let me know if you think this is a good direction (using from_json instead of new), and I can send a PR for JSON and YAML.

RX14 commented 6 years ago

I'm not sure I like it. I'm not sure why we need to start adding all these fake constructors with this method. I also don't like that any other library using this approach now has to generate custom constructors too, so at the very least it should be coupled with a generic multi-format serialization interface. And making serialization implicit is a no-no; we should have to include something every time. And then there's the cost of everything changing around all the time. People do write custom self.new(PullParser) methods; now either those manually written methods have to be unsafe and write a self.from_json and an initialize_from_json, or they have to write a self.from_json alias for their existing self.new.

And what about default initialized variables such as @foo = "" which were ignored in the parsing? Wouldn't they break with this new method?

And what are we trying to avoid here? Fixing a compiler bug? I don't get why we have to change things and write unsafe custom-named constructors.

asterite commented 6 years ago

@RX14 I already said that JSON::Serializable would stay, it just wouldn't be a constructor.

This is the main problem:

Alternatively, someone can try to fix that compiler hack, but the included/inherited hacks will still be there... and my bet is that this won't be fixed anytime soon (I wrote the code and I don't understand it anymore, so I doubt someone will understand it quickly).

Also, serialization in other languages doesn't involve adding new constructors to types. It's a bit weird that a constructor is injected for you. What you want is just a way to create that object from JSON, and that doesn't have to necessarily be a constructor.

Anyway, this is just a minor thing. People are doing fine with some existing compiler bugs and segfaults, so this has no rush.

RX14 commented 6 years ago

So the issue is:

Okay, that's fine, but why does the new constructor need to be called self.from_json? Why not self.new, and rename the initialize to initialize_from_json just the same? Is there anything special about methods called self.new? I thought there wasn't.

This way it's not a breaking change, just an implementation change.

asterite commented 6 years ago

Yes, if a type defines an initialize method, then all initialize and self.new methods are not inherited. So this doesn't compile:

class Foo
  # Let's assume this is the constructor that gets injected in JSON::Serializable
  def self.new(x, y)
  end

  def initialize(x : Int32)
  end
end

class Bar < Foo
  def initialize(x, y, z)
  end
end

Bar.new(1, 2) # This doesn't work

I'm not sure it's the right logic. It's hard because there's both new and initialize, where in many languages there's just new and the constructor is a fixed construct (there's no self.new in other languages). But, well, it's how it works.

I think I'll close this issue... maybe someone will have to fix the compiler issues/hacks regarding this and we'll be fine.

RX14 commented 6 years ago

@asterite No, I really can't imagine how a constructor which is also able to inspect all the instance variables using macros could work. I think this should be removed.

I think that manually-defined self.new should be inherited normally, only the autogenerated ones from def initialize should not be inherited.

Besides, I don't know of any other language which inherits global ("static") methods anyway. What's the use case?

RX14 commented 6 years ago

And how can serialization make sense with inheritance anyway? I think it only makes sense for Java and C# to have "magic" serialization together with inheritance because they can always stick null in all of their variables. A subclass magically getting all its ivars serialized without asking is a bit weird to me.

I think I'm starting to dislike JSON::Serializable entirely. Perhaps the only problem with JSON.mapping was people treating the mappings like objects: adding ivars which aren't serialized, inheritance, and weird things. Maybe we should just have a json_record macro which is literally JSON.mapping but generates the whole class, and force people to use that. Perhaps that would make people stop adding behaviour and using inheritance with JSON stuff. If they need all of that, they can always wrap it. But then there's all that extra typing.

Sija commented 6 years ago

Besides, I don't know any other language which inherits global ("static") methods anyway.

What do you mean? Ruby does it, PHP also.

RX14 commented 6 years ago

We only do it because ruby does it. PHP is a joke.

Sija commented 6 years ago

@RX14 Java and Python do too... Please, check your facts before posting.

RX14 commented 6 years ago

Oh, I thought that java didn't inherit static methods. Never mind then. My bad.

asterite commented 6 years ago

I'm closing this. The state of the art for JSON/YAML looks good so far, even though there are some corner cases. Sorry for the noise.

bew commented 6 years ago

@asterite maybe wait to have other opinions from e.g @ysbaddaden and @bcardiff, before closing abruptly like you did?

asterite commented 6 years ago

Nah. I opened it, I can close it.

asterite commented 6 years ago

I mean, I'm no longer interested in this happening.

RX14 commented 6 years ago

I'm definitely interested in fixing the bug. And I think reverting the compiler patch that made the current JSON::Serializable work and then doing it manually using def initialize_with_json is fine. It's doable now.

asterite commented 6 years ago

Then you should open a separate issue. The original issue was to have from_json/to_json available in all objects. And you are right that initialize_with_json is unsafe, so that shouldn't be done.

straight-shoota commented 6 years ago

I like the idea of taking a step back from an immediate implementation. Let's think about this for some time. The custom annotations feature is quite young (and awesome!), so let's experiment with that. We can explore alternate ways of implementing serialization - it couldn't hurt to look more deeply at other implementations. For now, the current way should be fine. I'm pretty sure we'll end up with some changes sooner or later, but time will tell.

straight-shoota commented 6 years ago

@RX14

Maybe we should just have a json_record macro which is literally JSON.mapping but generates the whole class, and force people to use that.

This sounds a bit like Java Beans. Might not be a bad idea. But maybe it is. 🤷‍♂️

It's definitely beneficial to question the current (and proposed) approaches and think about new ideas.

ysbaddaden commented 6 years ago

An aside (not exactly this topic) about Rust's Serde: its "data model" is an API interface to a fixed list of supported data types (29 in total) such as integers, floats, strings, arrays and hashes (but also structs, enums, tuples, ...).

There are two traits (interfaces) to be implemented: Serialize and Deserialize. They don't have to allocate more memory than needed; they're just an interface. It's similar to our X::PullParser and X::Emitter, but with a standard interface (e.g. emitter.map instead of json.object or yaml.mapping or ...).

Having a standard interface for (de)serialization would be as wonderful as having a standard interface for IO.

ysbaddaden commented 6 years ago

Going back to this topic: Rust's Serde can automatically generate the implementation using an explicit #[derive(Serialize, Deserialize)], similar to include JSON::Serializable, but it doesn't have to generate a constructor, because Rust's structs have a built-in initializer: a struct Point { x: i32, y: i32 } is created with Point { x: 1, y: 2 }.

j8r commented 5 years ago

Following the forum discussion, I got this basic working draft in this gist, which may or may not be the right approach.

At least, it shows an idea of a possible API.

struct Point
  getter x : Int32
  @[Serialization(key: "YYY")]
  getter y : String

  def initialize(@x, @y)
  end
end

data = %({"x": 1, "YYY": "abc"})

point = Point.from(JSON, data)
puts point # => Point(@x=1, @y="abc")

{JSON, YAML}.each do |type|
  puts point.serialize type
  # => {"x":1,"YYY":"abc"}
  # ---
  # x: 1
  # YYY: abc
end

Or, do we want to go through bytes for serializing, like Go marshaling?

Additional note

Having multiple serialization mechanisms (two, actually) isn't ideal, even less so when adding another one. Having a plan to keep at most one would be good to have before 1.0.

al6x commented 3 years ago

It's really inconvenient to write lots of include JSON::Serializable statements in the code base, especially for data objects.

Take some API, for example: there are like 15 data objects, and for every one you have to write include JSON::Serializable. It contributes zero meaning to the code and is just annoying boilerplate noise.

straight-shoota commented 3 years ago

It contributes zero meaning to the code

That's not true. It explicitly declares these types as being serializable. That certainly does not apply to any type.

Automatically adding serialization methods to all types, including non-serializable ones, is questionable at the least. It taints the API with methods that shouldn't be used.

I understand the record macro is typically used for simple data types that can also be expected to be serializable. So I think we could make types created by record automatically serializable. This was suggested before in this discussion. Record types themselves should generally be suitable for serialization. But that must also be true for the types of their instance variables, which simply can't be expected in the general case.

Maybe that's still okay, I'm not sure. There will be compile-time errors if some instance variable types are not serializable (although they might not be very user-friendly). Some basic serialization methods are even defined on Object 🤷 (#5695).

jhass commented 3 years ago

macro json_record(*args, **kwargs, &block)
  record({{*args}}, {{**kwargs}}) do
    include JSON::Serializable
    {{block && block.body}}
  end
end

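
Assuming a macro like the one above (reproduced here so the example is self-contained), usage would look something like:

require "json"

macro json_record(*args, **kwargs, &block)
  record({{*args}}, {{**kwargs}}) do
    include JSON::Serializable
    {{block && block.body}}
  end
end

# Generates the whole record and makes it JSON-serializable in one line
json_record Point, x : Int32, y : Int32

point = Point.from_json(%({"x": 1, "y": 2}))
puts point.x       # => 1
puts point.to_json # => {"x":1,"y":2}
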
kostya commented 3 years ago

Auto-serialization is bad at least for security: imagine someone accidentally stores in some object a link to an internal object (with passwords or something else) for debug purposes. It would silently be serialized, without anyone ever noticing it.

straight-shoota commented 3 years ago

@kostya This is an interesting point, but I'm not sure I agree. Configuration data like passwords is very much serializable, and serialization is often employed to store such data. There may be some advantage if serialization is only explicitly made available, as that could make the developer aware of it. But I don't think that's very relevant. It's really a problem of protecting sensitive data inside the application, not of sensitive data being serializable. There's not much security benefit in my opinion. Every object has #to_s and #inspect, and they could already leak information if the object is available in an unsafe scope. There are legitimate reasons for serializing private data, so depending on serializability to indicate safety is insufficient.

straight-shoota commented 3 years ago

@jhass Yes, that's an easy solution. Do you think it should be part of stdlib or a tool for user code?

jhass commented 3 years ago

I wouldn't be 100% against json_record and yaml_record in stdlib, but I also think it's simple enough to include in application code as needed. The latter also gives the potential to tailor it further to the application's domain (supporting multiple serializations, adding some common helper methods, etc.).

asterite commented 3 years ago

When I code in Haskell it's so tedious to say that such and such type can be mapped to and from JSON (there are two typeclasses for that in Haskell, ugh).

However, if any object is serializable to JSON then you can accidentally pass an incorrect object to something that serializes it, and it will compile but it won't work as expected. Of course you should test all code, or as much as possible, but explicitly marking things as serializable is more type-safe than not doing it.

kostya commented 3 years ago

@straight-shoota to_s and inspect are known to be unsafe; no one would use them in APIs. But serialization should be safe: you get an object from the db and serialize it to the web.

straight-shoota commented 3 years ago

@kostya I wouldn't be so sure that #to_s is generally considered unsafe. It's used in string interpolation, for example. And serialized data is certainly not necessarily safe to put anywhere. The db object could contain exactly that kind of data you mentioned in your previous comment (confidential credentials for example).

Sija commented 3 years ago

@straight-shoota What @kostya is trying to tell you (excuse my mind-reading!) is that you don't want to unconditionally serialize all fields of a given record - which might come from the db, for example - when it has fields like password, salt, or personally identifiable information (PII).

al6x commented 3 years ago

Not insisting, just a personal opinion, feel free to ignore.

The fear that some non-serializable object will be serialized reminds me of a talk by DHH, where he says, about the fear of method redefinition: "People say that methods shouldn't be allowed to be redefined. What if a programmer redefines Array.map to shoot fireworks and someone calls it unknowingly?"

When was the last time you saw a serious bug caused by serializing some object that's not designed to be serialized? I can't remember such a case in my Ruby, JavaScript, or Java practice.

The only serious concern is security, for things like passwords. But as someone mentioned, objects containing passwords are usually serializable anyway. And if this concern is real, such objects could be explicitly marked as non-serializable, to avoid being visible in dumps, logs, etc.