serialization

CeylonMigrationBot commented 11 years ago

[@gavinking] We need to define:

a set of rules that determine if a class is serializable (restrictions on its attribute types, annotations, etc), and
an API, as part of the metamodel, which supports Serializers and serialization.

I'm thinking something along the lines of this:

A class is serializable iff:

it is annotated serializable,
it has no forward-declared methods or getters,
all its reference attributes are either serializable or shared, and
all its reference attributes are of serializable type.

All superclasses and all subclasses of a serializable class must also be serializable, except for Object and Anything.

But I'm sure I'm missing some details.

[Migrated from ceylon/ceylon-spec#704] [Closed at 2015-09-30 13:26:30]

CeylonMigrationBot commented 10 years ago

[@wdrai] Every kind of deserializer will probably have to do something like Class.forName(...) at some point to provide it to reference(id, clazz). Having a "String typeName" argument would just avoid having it in each and every deserializer library (and possibly optimize the "forName" by caching it in the core ceylon serialization). Anyway compile-time type safety is artificial at this point as the deserializer only knows what type of object it reads at runtime.
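A minimal sketch of the caching idea, written in Java because `Class.forName` is the JVM call in question. `TypeNameCache` and `resolve` are hypothetical names used only for illustration; they are not part of any proposed Ceylon API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: resolves type names once and reuses the result,
// so each deserializer library does not repeat the lookup itself.
final class TypeNameCache {
    private final Map<String, Class<?>> cache = new ConcurrentHashMap<>();

    Class<?> resolve(String typeName) {
        return cache.computeIfAbsent(typeName, name -> {
            try {
                return Class.forName(name);
            } catch (ClassNotFoundException e) {
                // Nothing is cached for an unknown name.
                throw new IllegalArgumentException("unknown type: " + name, e);
            }
        });
    }

    int size() {
        return cache.size();
    }
}
```

Whether such a cache belongs in the core serialization support or in each library is exactly the design question raised above.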

CeylonMigrationBot commented 10 years ago

[@fwolff] @FroMage: I'm trying to understand your JSON Printer code and I'm a bit confused by the redefinition of the Object class. Does it mean the user has to convert all regular objects into JSON objects before giving them to the Printer.printObject method?

Based on this sample usage, I guess I'm right:

String getJSON(){
    value json = Object {
        "name" -> "Introduction to Ceylon",
        "authors" -> Array {
            "Stef Epardaud",
            "Emmanuel Bernard"
        }
    };
    return json.string;
}

Question: with the help of the metamodel API, what would be the Ceylon code to iterate over the properties of a regular Object? Or, to put it another way, do we need to create a Deconstructed instance for each object to be serialized (which can lead to performance / memory issues) while we could simply iterate over its properties one by one?

CeylonMigrationBot commented 10 years ago

[@gavinking] @wdrai I don't think that's quite correct. At least, while it might ultimately turn out to be correct, we don't quite have sufficient proof of it yet.

You see, just because you have an unknown T doesn't mean that you're not gaining typesafety. For example, the API I proposed enforces that you can only pass Attribute<T,X> to a Deconstructed<T>, even if the client doesn't know what T is. I must admit I have not tried to put together an entire system to convince myself that this amounts to meaningful end-to-end typesafety and not just a theatrical use of generics, but my guess is that for at least some clients it would be meaningful.

For example, if the serializable attributes are identified by annotations, then the Attribute<T,X> refs are obtained directly from a Class<T> object, and even though we might not know what T is at compile time, we can still tell whether an Attribute<T,X> belongs to the Class<T>.

Finally, it's to me entirely imaginable that someone might want to use this API to serialize/deserialize objects using handwritten code, or machine-generated code (think ceylon.ast), or even, in future, a macro, where types are known statically! Reflective clients are not the only usecase for this stuff.

So, while it is indeed possible that this design might not quite pan out and that either the generics get in the way, or are of only theatrical value, that's definitely not clear to me yet.

CeylonMigrationBot commented 10 years ago

[@gavinking] @fwolff This code prints all the shared attributes of the Person class:

class Person(shared String name, shared String address) {}

print(type(Person("gavin", "")).getAttributes<Person>());

But this is almost completely useless to you right now, because we're actually also interested in private members, which is something I still need to take properly into account. Hence my question above asking:

@FroMage can the Attribute interface here be the same one we already have, even though in this case it will often be representing private attributes, or do we need a new one?

CeylonMigrationBot commented 10 years ago

[@FroMage] getDeclaredAttributes will also return private attributes, but only on the current type.

CeylonMigrationBot commented 10 years ago

[@FroMage] @fwolff: the JSON API only deals with JSON types, not arbitrary types. At least, ATM.

CeylonMigrationBot commented 10 years ago

[@fwolff] @gavinking & @FroMage got it, thanks.

I'm wondering if it would be possible to define a serializer as a kind of visitor:

interface Serializer {

    "Tell the serializer to start writing the given object.
     Returns true if the serializer needs to proceed with
     the properties of the object or false if it is going to
     write only a reference.
      eg. JSON will only write the '{' character while other
      serializers would write some binary id, followed by
      the fully qualified name of the object class and the
      property count"
    shared formal Boolean startObject(Object o, Integer propertyCount);

    "Tell the serializer to write the Integer [[value]] for the property
     identified by [[name]] of the current object.
     eg. JSON would write '<name>: <value>[,]'"
    shared formal void writeIntegerProperty(String name, Integer value);
    // Other *primitive* types...

    "Tell the serializer to start writing the given object property.
     This call would be followed by a call to startObject with the
     value of the property"
    shared formal void startObjectProperty(String name);

     "Tell the serializer to end the writing of the current object.
      eg. JSON will write the '}' character while others
      would possibly write some other kind of marker or
      nothing"
    shared formal void endObject();
}

Then, in the user code, we would have something like this:

// JSONSerializer satisfies the Serializer interface.
JSONSerializer serializer = JSONSerializer(out);
SerializationContext context = SerializationContext(serializer);
// The context iterates over the object properties and
// calls the serializer methods accordingly.
context.serialize(theObjectToSerialize);

What do you think? What I like with this visitor-like idea is that the various serializers would only deal with how to write things and not anymore with how to introspect objects, etc.
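To make the visitor idea concrete, here is a rough Java rendering of the interface above together with a toy JSON implementation. It is only a sketch: the Java signatures are an assumption (Ceylon's `Integer` is mapped to `long`), `JsonSerializer` and `result()` are hypothetical, property names are not escaped, and `propertyCount` is ignored because JSON does not need it.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Java rendering of the visitor-style interface (signatures are assumptions).
interface Serializer {
    boolean startObject(Object o, int propertyCount);
    void writeIntegerProperty(String name, long value);
    void startObjectProperty(String name);
    void endObject();
}

// Toy JSON writer: it only knows how to write, not how to introspect.
final class JsonSerializer implements Serializer {
    private final StringBuilder out = new StringBuilder();
    // One flag per open object: true while no property has been written yet.
    private final Deque<Boolean> firstProperty = new ArrayDeque<>();

    @Override
    public boolean startObject(Object o, int propertyCount) {
        out.append('{');
        firstProperty.push(true);
        return true; // JSON always writes the properties, never a reference
    }

    private void writeName(String name) {
        if (!firstProperty.pop()) {
            out.append(',');
        }
        firstProperty.push(false);
        out.append('"').append(name).append("\":");
    }

    @Override
    public void writeIntegerProperty(String name, long value) {
        writeName(name);
        out.append(value);
    }

    @Override
    public void startObjectProperty(String name) {
        writeName(name);
    }

    @Override
    public void endObject() {
        out.append('}');
        firstProperty.pop();
    }

    String result() {
        return out.toString();
    }
}
```

The context would drive these calls while walking the object; the serializer never touches the metamodel.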

CeylonMigrationBot commented 10 years ago

[@gavinking] @fwolff This was my first idea, but I think it's fundamentally much less powerful. Basically, the language module would need to take on almost all responsibility for implementing serialization, just leaving the most uninteresting details of writing characters to a string to the serialization lib. And I doubt it would be usable for people implementing an ORM library, for example.

CeylonMigrationBot commented 10 years ago

[@gavinking] Plus, it would require that we have annotations in the language module for defining which attributes are transient, and how would you distinguish transience between different externalization formats then, etc, etc. Much less flexible, it seems to me.

CeylonMigrationBot commented 10 years ago

[@fwolff] @gavinking you're right, I also think it would be less flexible. Let's forget about this one.

Well, to sum up my current understanding / thinking:

We basically need a reflection API and the Deconstructed interface can serve this purpose.

"The flattened state of an instance of [[Class]]."
interface Deconstructed<Class> 
        satisfies {[Attribute<Class>,Anything]*} {

    "Get the value of the given attribute.
     (no references here!)"
    throws (`class AssertionError`,
        "if the value is missing")
    shared formal 
    Type/*|Reference<Type>*/ get<Type>(
        Attribute<Class,Type> attribute);
}
interface SerializationContext {

    "Introspect the given [[instance]] and returns its
     properties, so the serializer library can iterate on
     them and persist the values."
    shared formal Deconstructed<Class> deconstruct<Class>(Object instance);
}

The Deconstructed returned here should be immutable and shouldn't contain any references. It is the responsibility of the serialization library to decide how to deal with references: JSON wouldn't do anything about references, but it should at least check that the graph isn't circular and throw an error if it is. Other implementations would write references based on their own specific identity policy: strict identity with common objects, equality with strings and possibly id equality with entities.
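The circularity check a JSON-style library would need can be sketched as follows. This is Java rather than Ceylon, and it assumes the object graph is already available as nested `Map` and `List` values; `CycleCheck` is a hypothetical name. Identity, not equality, decides whether an object has been seen.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CycleCheck {
    // Returns true if following Map values and List elements from root
    // leads back to a container that is still being visited.
    static boolean hasCycle(Object root) {
        Set<Object> onPath = Collections.newSetFromMap(new IdentityHashMap<>());
        return visit(root, onPath);
    }

    private static boolean visit(Object node, Set<Object> onPath) {
        Iterable<?> children;
        if (node instanceof Map) {
            children = ((Map<?, ?>) node).values();
        } else if (node instanceof List) {
            children = (List<?>) node;
        } else {
            return false; // leaf value, cannot be part of a cycle
        }
        if (!onPath.add(node)) {
            return true; // already on the current path: circular graph
        }
        for (Object child : children) {
            if (visit(child, onPath)) {
                return true;
            }
        }
        onPath.remove(node); // shared but acyclic references are fine
        return false;
    }
}
```

An object referenced twice without a loop is not reported, which matches the case where JSON would simply write the nested structure twice.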

Deserialization

That's the tricky part. Here we need the help of a core Ceylon module that is able to create a graph of objects from an intermediate representation (which can contain references, even circular).

The serializer would feed the context with the content of its input stream and finally ask the context to create the entire native graph that reflects the intermediate representation. I'm not sure the Deconstructed / [Stateful]Reference interfaces would adequately serve this purpose here.

From the deserializer's perspective, it would be great to have this kind of DeserializationContext:

interface DeserializationContext {

    shared formal
    Reference createReference(Object id, String className);

    shared formal
    Reference getReference(Object id);

    shared formal
    Anything resolve();
}

The Reference here doesn't have to be generic because there is nothing we can do in the deserializer with the actual type it is going to be resolved to at the end. The Reference interface basically represents a mutable collection of members (object properties, collection items, tuples, etc.). The implementation of the Reference interface would of course hold its id and type, but they don't have to be exposed externally.

interface Reference {

    "add a property or item to this object or collection reference"
    shared formal
    void addMember(Member member);
}

The Member interface would be a root marker interface, which should be extended to represent either an object property or a collection item (or a map entry, etc.) Of course, the member's value could be itself a reference.

At the end, the intermediate representation would then be a collection of References, the first one representing the root object. The resolve method would iterate on these references, creating a graph of native Ceylon objects.

This is not very different from the original proposal in a way: it is a two-phase process and it ends by iterating over a collection of references in order to create the final representation of what was serialized. I just don't think that the last transformation should be implemented (and replicated) in each serialization library.

To put it another way: the DeserializationContext should expose methods that allow a deserializer (JSON or other) to construct a standard and untyped intermediate representation of what it finds in its input stream.
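The two-phase process can be sketched in Java as below. This is an illustration under strong assumptions, not the proposed API: members are plain name/value pairs instead of a `Member` hierarchy, and the "native" objects are mutable maps, which sidesteps the hard part of restoring state into real, possibly immutable, instances.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Untyped intermediate representation of one object read from the stream.
final class Reference {
    final Object id;
    final String className;
    final Map<String, Object> members = new LinkedHashMap<>();

    Reference(Object id, String className) {
        this.id = id;
        this.className = className;
    }

    // The value may itself be a Reference.
    void addMember(String name, Object value) {
        members.put(name, value);
    }
}

final class DeserializationContext {
    private final Map<Object, Reference> refs = new LinkedHashMap<>();

    Reference createReference(Object id, String className) {
        Reference ref = new Reference(id, className);
        if (refs.putIfAbsent(id, ref) != null) {
            throw new IllegalStateException("duplicate id: " + id);
        }
        return ref;
    }

    Reference getReference(Object id) {
        Reference ref = refs.get(id);
        if (ref == null) {
            throw new IllegalStateException("unknown id: " + id);
        }
        return ref;
    }

    // Phase 2: allocate one empty instance per reference, then fill in the
    // members, so circular references resolve to already allocated instances.
    // The first registered reference is treated as the root.
    Map<String, Object> resolve() {
        Map<Object, Map<String, Object>> instances = new LinkedHashMap<>();
        for (Reference ref : refs.values()) {
            Map<String, Object> instance = new LinkedHashMap<>();
            instance.put("_class", ref.className);
            instances.put(ref.id, instance);
        }
        for (Reference ref : refs.values()) {
            Map<String, Object> instance = instances.get(ref.id);
            for (Map.Entry<String, Object> member : ref.members.entrySet()) {
                Object value = member.getValue();
                if (value instanceof Reference) {
                    value = instances.get(((Reference) value).id);
                }
                instance.put(member.getKey(), value);
            }
        }
        return instances.isEmpty() ? null : instances.values().iterator().next();
    }
}
```

A deserializer would call `createReference` and `addMember` while reading its input and call `resolve` once at the end.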

CeylonMigrationBot commented 10 years ago

[@emmanuelbernard] I have only looked at Gavin's original proposal (12 days ago). Apologies if these concerns are already addressed.

In the case of things that don't handle references nor circular references (say XML or JSON), things are probably harder than they should be. Keeping an artificial reference id that has no real meaning is not easy. Especially reconstructing the reference at deserialization time.

Can StatefulReference.reconstruct() be called multiple times? Will it actually recreate several instances of the same object? To be clear, what does StatefulReference.instance return before I call reconstruct()? And who is supposed to call it, and when?

I think you might need to add SerializationContext.reference(Object id) to get back an already registered StatefulReference while walking through the second phase (in the case of a JSON-like approach where references are not references but nested structures). I guess one could navigate the SerializationContext sequence manually, but that looks like a bunch of work at first sight.

During deserialization, it seems that for each "reference" in the stream, you need to call DeserializationContext.reference(id).deserialize(myDeconstructedStateimpl) and keep the returned StatefulReferences, as there is no direct way to get access to the StatefulReference. BTW, is it correct that during deserialization, the library would provide its implementation of Deconstructed?

It's not clear to me if you could have a one pass implementation at deserialization time assuming your structure does not support explicit references. That would be a bit prohibitive for a JSON implementation.

I still think that a renaming of serialize / deserialize into hydrate / dehydrate makes a clearer distinction between what is presented here and what people mean by serialization.

Can a class influence which field is considered for persistence? Would it provide a Deconstructed implementation, and how would it play with references?

The feedback is a bit disorganised but I hope it's still useful.

CeylonMigrationBot commented 10 years ago

[@gavinking]

In the case of things that don't handle references nor circular references (say XML or JSON), things are probably harder than they should be. Keeping an artificial reference id that has no real meaning is not easy. Especially reconstructing the reference at deserialization time.

Agreed. The API is optimized for cases with identity. That's something that needs more thinking through.

Now, I happen to believe that there are always (natural) keys, even when they are not made explicit in a data format like XML or JSON. Of course, I realize that this makes me a member of the tiny minority who have actually taken the time to understand data modeling at a superficial level, while the entire rest of the industry is busy following the lemming in front of them over the "schemaless" cliff. Pity the folks who will have to come along in 5-10 years time and clean up the mess of lemming carcasses that is the inevitable consequence of this phenomenon.

Ah, doesn't this just take you back to ye olde days of the Hibernate forums, and all the guys with tables "with no primary key"?

Can StatefulReference.reconstruct() be called multiple times?

Sure, subsequent invocations are noops.

Will it actually recreate several instances of the same object?

No. It is the responsibility of the context to manage identity.

To be clear, what does StatefulReference.instance return before I call reconstruct()?

instance is documented to call reconstruct() by side-effect. The client never sees an incompletely constructed object. That's one of the main goals of the API.
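The contract described in these answers (repeated `reconstruct()` calls are no-ops, and reading `instance` triggers reconstruction) is essentially memoization. A minimal Java sketch, using the hypothetical name `LazyReference` so it is not mistaken for the real StatefulReference, and assuming construction is a single supplier call:

```java
import java.util.function.Supplier;

final class LazyReference<T> {
    private final Supplier<T> factory;
    private T instance;
    private boolean reconstructed;

    LazyReference(Supplier<T> factory) {
        this.factory = factory;
    }

    // Builds the instance once; later invocations do nothing.
    void reconstruct() {
        if (!reconstructed) {
            instance = factory.get();
            reconstructed = true;
        }
    }

    // Reconstructs as a side effect, so callers never observe
    // a missing or half-built object.
    T instance() {
        reconstruct();
        return instance;
    }
}
```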

I think you might need to add SerializationContext.reference(Object id) to get back an already registered StatefulReference while walking through the second phase

Agreed.

BTW, is it correct that during deserialization, the library would provide its implementation of Deconstructed?

Yes, correct.

It's not clear to me if you could have a one pass implementation at deserialization time assuming your structure does not support explicit references. That would be a bit prohibitive for a JSON implementation.

OK, we need to think about that.

Can a class influence which field is considered for persistence? Would it provide a Deconstructed implementation, and how would it play with references?

Well, we need to think about what the rules for that are going to be. I have not really got down that far into the details. In principle, yes, that is one of the goals.

CeylonMigrationBot commented 10 years ago

[@FroMage] Actually even if we serialise to JSON we'll want to deal with circular references, and there are standardish ways to do that.

CeylonMigrationBot commented 10 years ago

[@fwolff] Based on @FroMage code (parse.ceylon), what would be a concrete JSON implementation of the parseObject method with the new API if it must support the _class: "path.to.MyBean" convention and return a path.to.MyBean instance?

I think a kind of POC based on some actual code would be very helpful at this point.

CeylonMigrationBot commented 10 years ago

[@FroMage] Well, I'm not sure at all if the standard generic JSON parser of ceylon.json must support serialisation of Ceylon types. I think the two parsers should be separate, since they serve different purposes and work differently.

CeylonMigrationBot commented 10 years ago

[@fwolff] I'm not saying that the standard JSON parser must support direct (de)serialization of Ceylon types. But if it could, what would be the implementation of the parseObject method?

Basically, I think it would:

Get the class of the object to deserialize based on a String (eg. the _class field).
Get a blank Deconstructed for the given class and populate it with the values, coerced to the strongly typed Ceylon properties.
Deal with references...

A short code snippet would help clarify what we have to do in a concrete implementation of a serialization library, whether it would lead to code redundancy, etc.

CeylonMigrationBot commented 10 years ago

[@gavinking] @fwolff FYI, @tombentley has started work on implementing this API. It would be good if you guys could sync up somehow.

CeylonMigrationBot commented 10 years ago

[@fwolff] @gavinking I'm currently in the middle of nowhere (here). I'll be back next week on Wednesday and see how we can sync with @tombentley.

CeylonMigrationBot commented 10 years ago

[@gavinking] OK, cool, thanks.

CeylonMigrationBot commented 10 years ago

[@tombentley] @fwolff just ping me here when you're back. I'm hampered by terrible network connectivity right now, but maybe it'll be sorted by then.

CeylonMigrationBot commented 10 years ago

[@tombentley] As mentioned on IRC, I've been implementing this API and a serialization library based upon it. The API more or less works, though there are a few things I think could be improved, or are at least worth discussing.

I've used a separate Deconstructor interface for serialization and just used Deconstructed for deserialization. These interfaces are implemented by the serialization library.
I think it might be worth having separate interfaces SerializableReference and DeserializableReference, the latter's deserialize() method returning a RealizableReference (or InstantiableReference or something). Right now I have two implementations of StatefulReference, one for serialization and one for deserialization; it doesn't make sense to call serialize() on a StatefulReference obtained from a DeserializationContext.
In general it's possible for state in a super class to be visible even when the attribute is refined by a subclass, via super, like this:
```
class Super() {
  shared default String a = "super";
}
class Sub() extends Super() {
  shared actual String a="sub";
  shared String b => super.a;
}
```
This means that where Gavin has Attribute in his API I'm using ValueDeclaration. That makes the API slightly less typesafe than it was.

Anyway, this isn't really what I'm wanting to talk about right now...

Generic classes

I've implemented support for serializing generic classes. From the PoV of the API that means adding a method to Deconstructed for representing type arguments in the serialized state.

shared formal Type getTypeArgument(TypeParameter typeParameter);

(that's ceylon.language.meta.declaration::TypeParameter and ceylon.language.meta.model::Type, btw). From the PoV of the serialization library, it has to serialize those Types (so that on deserialization I can obtain corresponding TypeDescriptors and restore the reified type arguments). Right now my serialization library is sort of cheating: I've written a little parser and I serialize the Type.string representation, which I parse upon deserialization. I could in principle decompose the Type into ClassDeclaration, InterfaceDeclaration, unions and intersections, but it would be nicer if those things were themselves serializable.

aside: There seems to be no way, using the metamodel API, to intersect and union arbitrary Types.

serializing native classes

The problems with making the different Types serializable are:

they're all interfaces and
their implementations are not really a public part of the API and
those implementations are native/platform dependent, in particular they're chock-full of platform dependent fields which need initializing.

In other words, we need a way to give them a well-defined serializable form (that works cross-platform) which isn't based on obtaining their underlying state directly. We could use annotations on some attribute(s) to declare what this state is (at serialization time). The problem comes in knowing how to transform that state into a properly constructed instance at deserialization time. In particular, in the presence of cycles between these things we would need a way to restore the state of partially constructed instances. That needs to be under user control, and yet without exposing the user to uninitialized instances, which is a paradox. In the absence of cycles there's no fundamental problem, but we'd need a way to construct and initialize the instance from the serialized form in one go.

Sequences

While we would probably expect serialization libraries to cope natively with things like Integer and String (decomposing them to Bytes if necessary), it starts getting patchy when we get to things like ArraySequence. One of the use cases for the API is serialization to relational databases, and collections present a bit of a problem there. Consider things like ArraySequence<Integer|String> or ArraySequence<Person|Organization>: we'd need one table per ArraySequence type. Possible, I suppose, but it gets really messy when we come to Tuple.

We could say that Tuple is not serializable, but that seems quite a restriction and other serialization libraries wouldn't have a problem with it. So the serializability of a class depends not just on the nature of the class itself, but also on the capabilities of the serialization format (in the form of the serialization library). Really this is just a point about the compatibility of different type systems.

To do, and other random thoughts

Support for instances of member classes.
Do we need some kind of interface for classes which need to perform some kind of post-deserialization logic? Specifically I'm thinking of the case where:
1. An instance of Foo is serialized to disk.
2. The initializer of Foo is changed, e.g. an assert is added, the code is recompiled
3. The instance of Foo is deserialized. Bang! What the programmer thought was an invariant is violated because the deserialized instance avoided the new assertion.
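One possible shape for such a hook, sketched in Java under stated assumptions: `PostDeserializationCheck`, `Deserializer.finish` and `restoreWithoutInitializer` are hypothetical names, and the private constructor merely stands in for a deserializer that restores state without running the initializer.

```java
// Hypothetical hook: a class re-checks its invariants after its state
// has been restored.
interface PostDeserializationCheck {
    void checkAfterDeserialization();
}

final class Foo implements PostDeserializationCheck {
    final int count;

    // The normal initializer enforces the invariant.
    Foo(int count) {
        this(count, true);
    }

    private Foo(int count, boolean check) {
        this.count = count;
        if (check) {
            checkInvariant();
        }
    }

    // Stands in for a deserializer that bypasses the initializer.
    static Foo restoreWithoutInitializer(int count) {
        return new Foo(count, false);
    }

    private void checkInvariant() {
        if (count < 0) {
            throw new IllegalStateException("count must not be negative: " + count);
        }
    }

    @Override
    public void checkAfterDeserialization() {
        checkInvariant();
    }
}

final class Deserializer {
    // Called once an instance is fully populated, before it is handed out.
    static <T> T finish(T instance) {
        if (instance instanceof PostDeserializationCheck) {
            ((PostDeserializationCheck) instance).checkAfterDeserialization();
        }
        return instance;
    }
}
```

With this in place, the stale instance from step 3 fails loudly at deserialization time instead of silently violating the new assertion.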

CeylonMigrationBot commented 10 years ago

[@gavinking]

Right now my serialization library is sort of cheating: I've written a little parser and I serialize the Type.string representation, which I parse upon deserialization.

To me this is not just OK, it's actually preferable, unless there's some reason to believe that performance would be much worse, which I doubt it would be.

Advantages to the string representation include:

it can't change (this is essentially language syntax defined in the spec)
it's very simple and compact, at least compared to serializing Types (though perhaps we should find a way to compress the package names)
it's much easier to read in a string-based serialized format like json.

So I don't think we should be trying to serialize the model objects.
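For illustration, the core of a "little parser" for such strings is splitting at the top level while respecting type-argument nesting. The Java sketch below assumes the string uses only `<`, `>`, `|` and `&` as structural characters (no abbreviations such as `->` or `[]`); `TypeStrings` is a hypothetical helper, and the sample strings in the usage note are assumed rather than taken from actual `Type.string` output.

```java
import java.util.ArrayList;
import java.util.List;

final class TypeStrings {
    // Splits a type string at separators that are not nested inside
    // angle brackets, e.g. the members of a union ('|') or intersection ('&').
    static List<String> splitTopLevel(String type, char separator) {
        List<String> parts = new ArrayList<>();
        int depth = 0;
        int start = 0;
        for (int i = 0; i < type.length(); i++) {
            char c = type.charAt(i);
            if (c == '<') {
                depth++;
            } else if (c == '>') {
                depth--;
            } else if (c == separator && depth == 0) {
                parts.add(type.substring(start, i));
                start = i + 1;
            }
        }
        parts.add(type.substring(start));
        return parts;
    }
}
```

Splitting `a::Sequence<a::Integer|a::String>|a::Null` on `|` yields two parts, because the inner `|` sits inside type arguments.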

CeylonMigrationBot commented 10 years ago

[@gavinking]

One of the use cases for the API is serialization to relational databases, and collections present a bit of a problem there.

We've discussed this before, Tom, and I think what we concluded is that there are two very different usecases here:

serialization to an "internal" format, where we expect the concrete class of collections to be preserved, and
serialization to formats like relational data or XML where there is no first-class notion of a "collection", and you need to change your approach to start thinking in terms of "associations" between "entities". In this case you have annotations driving the handling of associations and there is no expectation on the part of the user that the concrete class of a collection would be preserved.

In the context of 2, something like a Tuple is a special case that is better modeled as an entity, not an association, even though in the language it hangs off of the collection type hierarchy.

CeylonMigrationBot commented 10 years ago

[@sgalles] With this work, how far are we from being able to transport objects between the JS and JVM backends? Are there other missing parts? Do you think this could make it for 1.1? @tombentley can we already test this work? (I didn't see any commit related to this in the repos.)

CeylonMigrationBot commented 10 years ago

[@gavinking] Slipping to a 1.1.5 release in October.

CeylonMigrationBot commented 10 years ago

[@gavinking] @tombentley I'm trying to figure out how to use the Deconstructor API, but I'm a little stuck. You can take a look here:

https://gist.github.com/gavinking/ca2fba39c73d9dc376ee

Basically, when I get to a contained object, I have a choice between:

creating a reference to it by id, or
embedding it in the current object.

In the first case, I could do it, but I would have to keep track of ids of things in my own Map. My original API used to let me obtain the id of a previously registered object, IIRC, but that doesn't seem to be possible now.

In the second case, I need to recurse the Deconstructor on the referenced object, but I can't see any obvious way to do that.

CeylonMigrationBot commented 10 years ago

[@tombentley] If SerializationContext had a getId() method then I never saw it, but I agree it should be possible to query the serialization context to get the id of an instance that's already been registered. So I assume we're talking about this:

"Gets the id that the given instance has been registered with, 
 or null if the given instance has not been registered."
shared Object? getId<Instance>(Instance instance);

As for embedding objects, due to the design of my proof-of-concept serialization library that never occurred to me as a requirement (or I reasoned that it was the responsibility of the Deconstructor, since the API just sees everything as identified References).
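The bookkeeping in question is small. A Java sketch of an identity-based registry, with `IdRegistry` and the sequential integer ids being assumptions made for illustration (the proposed `getId` returns `Object?`, so any id type would do):

```java
import java.util.IdentityHashMap;
import java.util.Map;

final class IdRegistry {
    // Identity, not equality: two equal but distinct objects get distinct ids.
    private final Map<Object, Integer> ids = new IdentityHashMap<>();

    // Registers the instance if needed and returns its id.
    int register(Object instance) {
        Integer existing = ids.get(instance);
        if (existing != null) {
            return existing;
        }
        int id = ids.size();
        ids.put(instance, id);
        return id;
    }

    // Returns the id of a registered instance, or null if it is unknown.
    Integer getId(Object instance) {
        return ids.get(instance);
    }
}
```

A serializer that writes references by id would consult `getId` when it meets a contained object and fall back to `register` the first time it sees it.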

CeylonMigrationBot commented 10 years ago

[@gavinking]

If SerializationContext had a getId() method then I never saw it

No it didn't have a getId() method, but I think it had a way to get the Deconstructed for an object without assigning an id.

CeylonMigrationBot commented 9 years ago

[@tombentley] @gavinking check out the serialization branch of ceylon.language. There's also https://github.com/tombentley/jsonsl/ which you may or may not be interested in. Note that @chochos hasn't yet had a chance to update the JS language module, so it's JVM only right now.

CeylonMigrationBot commented 9 years ago

[@quintesse] I like how easy the use of the library is! Nice work.

CeylonMigrationBot commented 9 years ago

[@EricSL] If jsonsl is any indication of what this is intended to support, this proposal seems to be going in the wrong direction.

There are a few key things serialization libraries need to get right:

Cross version compatibility -- think about things like what happens when the code changes; can a field change from type T? to Sequence<T>, how do you specify default values in case it's missing.
Cross language compatibility -- a JSON parser needs to work with JSON the way others are using JSON. Representing things in a Ceylon-specific way, by for example including the names of the Ceylon types, is a bad pattern here. Likewise, serialized names that aren't valid Ceylon field names need to be supported, and case insensitive field names may be desirable.
Security -- Assume the serialized version could be maliciously created; don't instantiate any objects that aren't expected.

(Maybe you're shooting for something more along the lines of Python's pickling, but if so it will be of much more limited use. If not, these concerns need to override completeness concerns.)

What's not so important is serializing graphs. There isn't one conventional way to do this, so you'll have trouble with compatibility with the various ways people encode them now. Instead expect that developers will translate their graphs to/from a serializable acyclic intermediate form that just represents the data, and the serialization library will make it easy to serialize that.

What's not so important is supporting inheritance. Due to security concerns, if the schema says you are deserializing type T, it is not okay to deserialize a subclass of T. It can be okay if the schema explicitly lists the supported subtypes, but if so you need a cross-language compatible way of specifying what the type is. For example, annotations might say that field foo will contain type T if the JSON contains a field "foo_t", or a type U if the JSON contains a field "foo_u". Something equivalent to the oneof feature in protobuf could be an alternative to inheritance: https://developers.google.com/protocol-buffers/docs/proto#oneof Inheritance seems natural in XML but like I said you want to be explicit about what classes you are expecting to deserialize.
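A sketch of the "explicitly listed subtypes" idea in Java: type tags from the input are looked up in an allow-list and never passed to a reflective lookup. `TypeAllowList` and the tag names are hypothetical; how a schema or annotation would populate the list is left open.

```java
import java.util.HashMap;
import java.util.Map;

final class TypeAllowList {
    private final Map<String, Class<?>> allowed = new HashMap<>();

    // Declares that the given tag may be deserialized as the given type.
    TypeAllowList allow(String tag, Class<?> type) {
        allowed.put(tag, type);
        return this;
    }

    // Only tags declared up front are accepted; input is never used to
    // load an arbitrary class by name.
    Class<?> resolve(String tag) {
        Class<?> type = allowed.get(tag);
        if (type == null) {
            throw new SecurityException("unexpected type tag: " + tag);
        }
        return type;
    }
}
```

For the example above, the list would map "foo_t" to T and "foo_u" to U, and anything else in the input would be rejected.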

It may be helpful to distinguish between root-serializable and field-serializable. I don't think the standard collections should support serialization directly. However, it should obviously be supported to have serializable classes with fields of type Sequence<T> where T is also serializable.

It may be helpful to start from another language's API just because then there's a decent chance you'll interoperate with that language. C# has an annotation based serialization API, and it was designed around XML, but it works quite well for JSON: https://msdn.microsoft.com/en-us/library/bb410770%28v=vs.110%29.aspx

CeylonMigrationBot commented 9 years ago

[@gavinking] @EricSL

If jsonsl is any indication of what this is intended to support, this proposal seems to be going in the wrong direction. .... There are a few key things serialization libraries need to get right:

I don't understand this comment at all. The current API externalizes the listed concerns (among others) to the serialization library itself, and quite deliberately avoids addressing this kind of concern in the language module!

Maybe you're shooting for something more along the lines of Python's pickling, but if so it will be of much more limited use.

Well, that is what jsonsl is, but it is not the only thing that the language module serialization API is capable of supporting.

I don't think you've understood the architecture of this.

It may be helpful to start from another language's API

Yew. All other languages handle serialization really badly, in my experience.

C# has an annotation based serialization API

Sure, and the current API can certainly support serialization libraries which are annotation driven. Indeed, that is a central goal. But we certainly don't want to bloat out the language module with annotations for controlling serialization!

CeylonMigrationBot commented 9 years ago

[@gavinking] Note that indications at present are that the current API is too general-purpose, and perhaps won't be capable of supporting reasonable performance. So we might need to scale back our vision here, and provide something much less general-purpose.

CeylonMigrationBot commented 9 years ago

[@sirinath] One thing I would request is to make it flexible enough to support either compile-time or runtime resolution. You can look at the AST of the object to be serialised and map it into the serialisation format at compile time, while at runtime you will need some form of fast reflection.

Ref: https://github.com/scala/pickling, https://github.com/heathermiller/spores. This might be a good starting point to see how to do this.

CeylonMigrationBot commented 9 years ago

[@gavinking] I'm closing this because, in principle, it's done.

We can open new issues for any additional tasks, which probably don't anyway affect ceylon-spec.

eclipse-archived / ceylon

serialization #3810
