serialization

gavinking commented 11 years ago

We need to define:

a set of rules that determine if a class is serializable (restrictions on its attribute types, annotations, etc), and
an API, as part of the metamodel, which supports Serializers and serialization.

I'm thinking something along the lines of this:

A class is serializable iff:

it is annotated serializable,
it has no forward-declared methods or getters,
all its reference attributes are either serializable or shared, and
all its reference attributes are of serializable type.

All superclasses and all subclasses of a serializable class must also be serializable, except for Object and Anything.

But I'm sure I'm missing some details.
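For readers more at home in Java, here is a rough sketch of how the first and last rules could be checked reflectively. It is only an illustration under stated assumptions: the `Ser` annotation is a hypothetical stand-in for the proposed `serializable` annotation, the set of "leaf" types is arbitrary, and the rules about forward declarations, superclasses/subclasses and generic type arguments are not modelled at all.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.HashSet;
import java.util.Set;

// Hypothetical marker, standing in for the proposed `serializable` annotation.
@Retention(RetentionPolicy.RUNTIME)
@interface Ser {}

final class SerializabilityCheck {
    // Assumption: these types count as serializable without needing the marker.
    private static final Set<Class<?>> LEAVES = Set.of(
        Long.class, Double.class, String.class, Character.class);

    static boolean isSerializable(Class<?> type) {
        return check(type, new HashSet<>());
    }

    private static boolean check(Class<?> type, Set<Class<?>> inProgress) {
        if (type.isPrimitive() || LEAVES.contains(type)) return true;
        // Rule: the class itself must carry the marker.
        if (!type.isAnnotationPresent(Ser.class)) return false;
        // A class already under inspection is assumed fine, so self-referential types terminate.
        if (!inProgress.add(type)) return true;
        // Rule: every instance attribute must itself be of serializable type.
        for (Field f : type.getDeclaredFields()) {
            if (Modifier.isStatic(f.getModifiers()) || f.isSynthetic()) continue;
            if (!check(f.getType(), inProgress)) return false;
        }
        return true;
    }
}

@Ser class Point { long x; long y; }
@Ser class Node { String label; Node next; }   // refers to itself
@Ser class Holder { Thread worker; }           // Thread carries no marker, so Holder fails
```

With these definitions, `SerializabilityCheck.isSerializable(Point.class)` and `isSerializable(Node.class)` are true, while `isSerializable(Holder.class)` is false because of its `Thread` attribute.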

gavinking commented 11 years ago

Note that Integer, Float, String, Character, Entry and Sequential would all be serializable.

RossTate commented 11 years ago

What about generic classes? Won't their serializability depend on the type argument?

gavinking commented 11 years ago

Hrm. That's interesting. Indeed, not every Sequential is serializable. :/

gavinking commented 11 years ago

Sadly, this issue is slipping to 1.1 :-(

lucaswerkmeister commented 10 years ago

What would a serialized object look like? Could ceylon/ceylon-sdk#125 maybe be the deserializer?

akberc commented 10 years ago

Some thoughts based on testing clustered web apps and commissioning full clustered test environments to test apps. Based on Ceylon philosophy of doing the obvious and declaring the rest:

serializable annotation at the module or package level
all objects (except Object and Anything) are serializable. All Ceylon modules and SDK will be.
compiler and IDE warnings/errors if a program element is not readily serializable and occurs in a package/module that should be serializable.
a schema annotation that can override any built-in serialization. This would be in line with current state of serialization like Thrift, Avro etc. This would allow two versions of the same class to have the same serialization even if a field had been added.

This approach would be better than Java and would save much money and effort for enterprises, as well as enable cross-VM as well as cross-language serialization.

lucaswerkmeister commented 10 years ago

all objects (except Object and Anything) are serializable. All Ceylon modules and SDK will be.

Can that work in practice? What’s the meaning of a serialized File.Writer, TestRunner or Callable?

akberc commented 10 years ago

Sorry, I meant ceylon.collections and other data-like modules or packages.

lucaswerkmeister commented 10 years ago

Oh, I see. Still, those can only be serializable iff their elements are serializable.

lucaswerkmeister commented 10 years ago

[collections] can only be serializable iff their elements are serializable.

I wonder if that can be represented in the type system?

interface Collection<Element, Serializability=Anything>
        satisfies {Element*}&Serializability
        given Element satisfies Object&Serializability
        given Serializability of Serializable|Anything {
    // ...
}

(That looks like some deranged monstrosity. Is there a better way?)

gavinking commented 10 years ago

I wonder if that can be represented in the type system?

I think this is a good use for annotations, not for inheritance.

lucaswerkmeister commented 10 years ago

I think this is a good use for annotations, not for inheritance.

True, but that means that the serializer can’t use the object’s internals. OTOH, the _de_serializer needs to be external anyways, and without constructors the internals aren’t of much use as well. (This is assuming @akberc’s schema – if the (de)serializer is language-internal and can’t be modified, it can be as internal as it wants, of course.)

pthariensflame commented 10 years ago

This is really the ideal kind of situation for using type classes: see Haskell's binary package, for example.

gavinking commented 10 years ago

Finally, here's a very strawman proposal for the central interfaces:

import ceylon.language.meta.model {
    Attribute
}

"A reference to an instance of [[Class]], with a certain 
 [[identifier|id]]."
interface Reference<Class> {

    "The unique identifier of the instance."
    shared formal 
    Object id;

    "Associate the given [[state]] with the instance, 
     returning a [[StatefulReference]]."
    shared formal 
    StatefulReference<Class> deserialize(
        Deconstructed<Class> state);

}

interface StatefulReference<Class> 
        satisfies Reference<Class> {

    "Get the flattened state of the instance."
    shared formal 
    Deconstructed<Class> serialize();

    "Get the instance. During deserialization, could force 
     reconstruction"
    throws (`class AssertionError`,
            "if there is a problem reconstructing the object
             or any object it references")
    shared formal 
    Class instance;

    "Force reconstruction of the instance."
    throws (`class AssertionError`,
        "if there is a problem reconstructing the object
         or any object it references")
    shared formal void reconstruct();

}

"The flattened state of an instance of [[Class]]."
interface Deconstructed<Class> 
        satisfies {[Attribute<Class>,Anything]*} {

    "Get the value of the given attribute."
    throws (`class AssertionError`,
        "if the value is missing")
    shared formal 
    Type|Reference<Type> get<Type>(
        Attribute<Class,Type> attribute);

}

"A context representing serialization of many objects to a 
 single output stream. The client is responsible for 
 registering the objects to be serialized with the context, 
 assigning them each a unique identifier. Then, the 
 serialization library is responsible for iterating the 
 registered objects in the context and persisting their 
 [[deconstructed states|Deconstructed]] to the output 
 stream."
interface SerializationContext 
        satisfies {StatefulReference<Object>*}{

    "Create a reference to the given [[instance]] of 
     [[Class]], assigning it the given [[identifier|id]]."
    throws (`class AssertionError`,
        "if there is already an instance with the given
         identifier")
    shared formal 
    StatefulReference<Class> reference<Class>(Object id, 
        Class instance);

}

"A context representing deserialization of many objects from
 a given input stream. The serialization library is 
 responsible for processing the stream and registering the
 [[deconstructed states|Deconstructed]] of the objects with
 the context. Then, it may obtain a reference to a fully
 reconstructed object via [[StatefulReference.instance]],
 and return it to the client."
interface DeserializationContext 
        satisfies {Reference<Object>*} {

    "Obtain a reference to the instance of [[Class]] with 
     the given [[identifier|id]]."
    shared formal 
    Reference<Class> reference<Class>(Object id);

}

Note:

The implementations of SerializationContext and DeserializationContext will be provided by the language module.
Possibly even with the help of compiled code under the covers if that would improve performance!

Questions:

Does this look basically sane?
@FroMage can the Attribute interface here be the same one we already have, even though in this case it will often be representing private attributes, or do we need a new one?
Is the use of generics here actually adding any substantial typesafety, or is it just "generics theater"?
Is there anything here that would make performance suck ass?

quintesse commented 10 years ago

Sane? I'm not sure, could you give some basic usage examples perhaps? For example I find it not immediately obvious why DeserializationContext.reference() would return a Reference<> which has a deserialize() method which returns a StatefulReference<> which itself is another Reference<>.

lucaswerkmeister commented 10 years ago

I’m somehow having a very hard time understanding this. One question: Where does any user code come into play? You say that “the” (not “default”?) De/SerializationContext implementations are provided by the language module; in addition, you have completely detached the serialization mechanism from the classes it serializes (they don’t need to have a serialize() method or something like that… which is probably good). So how does SerializationContext create a StatefulReference?

In fact, I don’t see where this ends at all. Deconstructed’s get returns a Type|Reference<Type>, and a Reference is deserialized with another Deconstructed – so it seems it’s Deconstructeds “all the way down.” If, for example, I want to serialize any data structure to a String – how do I do that?

gavinking commented 10 years ago

The goal of this API is to flatten a graph of objects into a set of tuples of their attributes, or unflatten a set of tuples of attributes into objects.

Will you guys stop obsessing over how to write strings? We've been through this before. Writing strings is the easy part. Anyone can turn a bunch of tuples into a string. The hard part is deconstructing a graph of objects, or constructing one, while bypassing the initializers of the objects and visibility checks of the language.

I'm not even interested in strings per se. For me the most interesting kind of (de)serialization is from/to a database.

gavinking commented 10 years ago

One question: Where does any user code come into play?

I also don't care about user code. This provides support for frameworks. For example, JSON libraries, ORM libraries, whatever.

lucaswerkmeister commented 10 years ago

One question: Where does any user code come into play?

I also don't care about user code. This provides support for frameworks. For example, JSON libraries, ORM libraries, whatever.

(Within ceylon-spec, I consider these user code as well.) It seems to me the intended use is

value toSerialize = theThingIWantToSerialize;
SerializationContext context = TheCeylonLanguageImplementationOfSerializationContext();

value deconstructed = context.reference(1, toSerialize).serialize();
// where do I put deconstructed?

// elsewhere

DeserializationContext deContext = TheCeylonLanguageImplementationOfDeserializationContext();

value deserialized = deContext.reference(1).deserialize(deconstructed).instance;
// where did I get deconstructed from?

Where did the JSON library come in?

Can you give me a usage example?

gavinking commented 10 years ago

Yes, that's exactly right. This is code that occurs in your JSON library.

lucaswerkmeister commented 10 years ago

Ah, so

these interfaces aren’t supposed to be used by the “end user”
instead, the JSON library offers a serialization function that repeatedly references what it gets from the Deconstructed until the returned Type is “native enough” that it can be serialized directly?

And get returns Reference<Type> iff the object was already referenced (so the JSON library would know its ID)?

I’m not sure how useful Object id is… if it really was an arbitrary object, in order to save something completely, I’d have to serialize the id as well, wouldn’t I? Type parameter perhaps? (And most people would use Integer or String.)

gavinking commented 10 years ago

these interfaces aren’t supposed to be used by the “end user”

No, not really.

I’m not sure how useful Object id is… if it really was an arbitrary object, in order to save something completely, I’d have to serialize the id as well, wouldn’t I? Type parameter perhaps? (And most people would use Integer or String.)

In the flattened form you work with References to instances, not the instances themselves, since you might have a partial graph at any point in time.

And get returns Reference<Type> iff the object was already referenced (so the JSON library would know its ID)?

The language module doesn't care what you use for ids, so Object is fine here.

quintesse commented 10 years ago

I’m not sure how useful Object id is

I was thinking the same. What's the use-case for serializing something that's basically a Map of id->object and then supporting random-access deserialization of individual objects from that Map? Supposedly they are interrelated so possibly you can't cherry-pick that easily.

To me serialization and deserialization seem to be one-shot operations. If you need to serialize a graph of objects you pass the "root" of that graph and the rest gets pulled in automatically. If there's no real root but you still want to serialize a bunch of objects you put them in a collection and serialize that. If you need to recognize them somehow you put them in a Map and serialize that.

I'm guessing you have an entirely different idea about all of this @gavinking but from looking at the API I can't guess what it is, I need more information before I can opine if this is a sane basis for our serialization.

gavinking commented 10 years ago

Huh? How can you deserialize a graph of objects if you can't access an instance by id while reconstructing the graph? This thing has to support referential identity!

sgalles commented 10 years ago

Speaking of difficult problems, this one has now been a pain for ten years in the JDK http://bugs.java.com/view_bug.do?bug_id=4957674 Just wondering if this problem of unstable hashcode could affect this interface, or if it is just a matter of implementation in this case.

gavinking commented 10 years ago

Well I don't think my proposal is vulnerable to that problem, since I require the client code to assign an identifier to each instance. I never use its hashcode.

quintesse commented 10 years ago

I don't think that problem has anything to do with client-assigned identifiers or not. It's about complex objects that need to do internal (re-)initialization based on incomplete data. Part of that could be prevented by first re-creating as much of the object graph as possible and then have a special initializer on each object do the rest of the work, I guess.

quintesse commented 10 years ago

I'm not really sure about the DeserializationContext, it seems to have too little information to perform a deserialization. First, where do the IDs come from that you use to obtain a Reference? But then to get an actual object you have to pass it the Deconstructed related to it, but if that has any references to other objects how will it be able to reconstruct those references?

gavinking commented 10 years ago

First, where do the IDs come from that you use to obtain a Reference?

From the serialized format.

But then to get an actual object you have to pass it the Deconstructed related to it, but if that has any references to other objects how will it be able to reconstruct those references?

You have their ids, and you obtain a reference from the deserialization context. That's why a Deconstructed holds values and references.

quintesse commented 10 years ago

Please, just give an example how you see this work. Just something simple with steps how you see the round trip from object to DB/File/whatever and back, because I just fail to see the whole picture here.

gavinking commented 10 years ago

You get a bunch of ids with related values that you read from some input stream. You turn your ids into references by calling DeserializationContext.reference(), then you construct tuples (Deconstructeds) comprising ids of related objects and primitive values, and call deserialize on the references one by one. At the end, you have a bunch of StatefulReferences, and you can turn any one of them into an object by calling instance on it, which reconstructs the part of the object graph that is referenced from that instance.

Serialization is the same thing in reverse. You register instances one by one with the SerializationContext, and then, when you're done registering, you can turn them into tuples by calling serialize().
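The deserialization steps described above can be sketched in Java roughly as follows. This is an illustration, not the proposed API: `Ref` merges Reference and StatefulReference into one hypothetical class, state is a plain `Map` rather than a Deconstructed, and because Java cannot bypass constructors the caller supplies a factory function (an assumption made only for this sketch). It handles acyclic graphs only.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for Reference/StatefulReference, merged for brevity.
final class Ref<T> {
    final Object id;
    private Map<String, Object> state;                 // null until deserialize() is called
    private Function<Map<String, Object>, T> factory;  // assumption: caller says how to build T
    private T instance;                                // cached to preserve referential identity

    Ref(Object id) { this.id = id; }

    Ref<T> deserialize(Map<String, Object> state, Function<Map<String, Object>, T> factory) {
        this.state = state;
        this.factory = factory;
        return this;
    }

    // Rebuilds this object and, recursively, everything it references (acyclic graphs only).
    T instance() {
        if (instance == null) {
            if (state == null) throw new IllegalStateException("no state for id " + id);
            Map<String, Object> resolved = new LinkedHashMap<>();
            for (Map.Entry<String, Object> e : state.entrySet()) {
                Object v = e.getValue();
                resolved.put(e.getKey(), v instanceof Ref ? ((Ref<?>) v).instance() : v);
            }
            instance = factory.apply(resolved);
        }
        return instance;
    }
}

final class DeContext {
    private final Map<Object, Ref<?>> refs = new HashMap<>();

    // Returns the one reference for this id, creating an empty one on first use.
    @SuppressWarnings("unchecked")
    <T> Ref<T> reference(Object id) {
        return (Ref<T>) refs.computeIfAbsent(id, k -> new Ref<Object>(k));
    }
}

final class Address {
    final String city;
    Address(String city) { this.city = city; }
}

final class Person {
    final String name;
    final Address home;
    Person(String name, Address home) { this.name = name; this.home = home; }
}

final class RoundTripDemo {
    // Two people whose states both mention id 1 end up sharing a single Address object.
    static boolean sharedAddressIsPreserved() {
        DeContext dc = new DeContext();
        Ref<Address> home = dc.reference(1);
        Ref<Person> ann = dc.reference(2);
        Ref<Person> bob = dc.reference(3);
        // States may arrive in any order: ids 2 and 3 mention id 1 before it has state.
        ann.deserialize(Map.of("name", "Ann", "home", home),
            s -> new Person((String) s.get("name"), (Address) s.get("home")));
        bob.deserialize(Map.of("name", "Bob", "home", home),
            s -> new Person((String) s.get("name"), (Address) s.get("home")));
        home.deserialize(Map.of("city", "Paris"), s -> new Address((String) s.get("city")));
        return ann.instance().home == bob.instance().home
            && ann.instance().home.city.equals("Paris");
    }
}
```

The point of the id-keyed context is visible in `sharedAddressIsPreserved()`: both `Person` states hold the same `Ref`, so reconstruction yields one shared `Address` rather than two copies.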

fwolff commented 10 years ago

Hello guys,

Gavin invited me to give some feedback about this API proposal.

I have tried to understand it by implementing a small prototype in Java. You can find the project here: https://github.com/fwolff/jeylon (Javanized Ceylon?!). This small prototype is very limited but it could serve as a concrete sample of what could be a working API / implementation. There is a Junit test case that shows a full (de)serialization process, handling both object and string references: https://github.com/fwolff/jeylon/blob/master/src/test/TestAlpha.java.

The prototype takes care of one of the biggest issues (features?) of Ceylon when deserializing objects: Ceylon doesn't allow the creation of "blank" objects, i.e. without all concrete properties passed to the constructor. Deserialization must therefore be a two-phase process: the first phase collects all references and destructured properties, the second actually creates the graph of objects returned to the user.

However, I think a two-phase process isn't required during serialization: reference handling can be purely internal to the serializer library; it doesn't need to be exposed to the low-level Ceylon serialization API.

Based on my prototype and my current understanding of Ceylon (near zero) and this API, I would suggest simplifying it as follows:

"A reference to an instance of [[Class]], with a certain 
 [[identifier|id]]."
interface Reference<Class> {

    "The unique identifier of the instance."
    shared formal 
    Object id;

    "Associate the given [[state]] with the instance, 
     returning a [[StatefulReference]]."
    shared formal 
    StatefulReference<Class> deserialize(
        Deconstructed<Class> state);
}
interface StatefulReference<Class> 
        satisfies Reference<Class> {

    "Get the instance. During deserialization, could force 
     reconstruction"
    throws (`class AssertionError`,
            "if there is a problem reconstructing the object
             or any object it references")
    shared formal 
    Class instance;
}
"The flattened state of an instance of [[Class]]."
interface Deconstructed<Class> 
        satisfies {[Attribute<Class>,Anything]*} {

    "Get the value of the given attribute."
    throws (`class AssertionError`,
        "if the value is missing")
    shared formal 
    Type|Reference<Type> get<Type>(
        Attribute<Class,Type> attribute);

}
interface SerializationContext {

    "Introspect the given [[instance]] and returns its
     properties, so the serializer library can iterate on them and
     persist the values."
    Deconstructed<Class> deconstruct<Class>(Object instance);
}
interface DeserializationContext {

    "Obtain a reference to the instance of [[Class]] with 
     the given [[identifier|id]]."
    shared formal 
    Reference<Class> reference<Class>(Object id);

}

I'm pretty sure I'm missing many things here, but I hope this kind of concrete feedback can be helpful.

Note: this prototype cannot deal with circular references. I didn't try to emulate the late keyword in Java, so the Parent/Child model used in my test case isn't circular and the deserialization of a circular graph will eventually fail miserably.

Franck.

gavinking commented 10 years ago

@fwolff So the big difference is that the SerializationContext returns the Deconstructed directly in one step?

So the issue with that is how does it go about deconstructing references that the object has to other objects? It has to have some way to figure out what the ids of the associated objects are. If they have not yet been registered with the serialization context, and assigned an id, you're going to have to have some per-class getId() strategy that you register with the context. Well, perhaps that's better; I'm not sure.

fwolff commented 10 years ago

At serialization time, there are no references: the Deconstructed returned by the SerializationContext only contains the actual property values of the bean to be serialized. It is the serialization library's responsibility to figure out if and how it is going to persist references instead of the full state of the bean (e.g. a JSON library will certainly not persist references, while other libraries could).

In my prototype, the id strategy (which is very common) is a HashMap<String, Integer> for string references and an IdentityHashMap<Object, Integer> for objects (see https://github.com/fwolff/jeylon/blob/master/src/alpha/AlphaSerializer.java). There is no need to delegate to Ceylon the handling of such ids.
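A minimal sketch of the identity-based id strategy, for illustration only: it is not the code from the linked prototype, and `IdAssigner` and `IdDemo` are hypothetical names. The only library call relied on is `java.util.IdentityHashMap`, which compares keys with `==` instead of `equals()`.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

final class IdAssigner {
    // Keys are compared with ==, so two equal-but-distinct objects get different ids.
    private final Map<Object, Integer> ids = new IdentityHashMap<>();

    boolean seen(Object o) { return ids.containsKey(o); }

    // Returns the id already assigned to this exact object, or assigns the next free one.
    int idOf(Object o) {
        Integer known = ids.get(o);
        if (known != null) return known;
        int next = ids.size();
        ids.put(o, next);
        return next;
    }
}

final class IdDemo {
    // Emits "obj#n" the first time an object is met and "ref#n" on every later occurrence.
    static List<String> encode(List<Object> items) {
        IdAssigner ids = new IdAssigner();
        List<String> out = new ArrayList<>();
        for (Object item : items) {
            boolean repeat = ids.seen(item);
            out.add((repeat ? "ref#" : "obj#") + ids.idOf(item));
        }
        return out;
    }
}
```

For example, given `String a = new String("x")` and `String b = new String("x")`, encoding the list `[a, b, a]` yields `[obj#0, obj#1, ref#0]`: `a` and `b` are equal but distinct, so only the second `a` becomes a back-reference.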

So basically, the SerializationContext is just a Reflection / Introspector utility.

Sorry if I speak Java here, my knowledge of Ceylon is, as I said, near zero... Question: does Ceylon have something like HashMap / IdentityHashMap that could be used in a serialization API?

gavinking commented 10 years ago

At serialization time, there are no references: the Deconstructed returned by the SerializationContext only contains the actual property values of the bean to be serialized. It is the serialization library's responsibility to figure out if and how it is going to persist references instead of the full state of the bean (e.g. a JSON library will certainly not persist references, while other libraries could).

Well OK, the point about JSON is well-taken. But I think this is still more a question about division of responsibilities, that is, what code has the responsibility for "linearizing" the object graph. I was assuming that this would be the job of the SerializationContext (which is why, incidentally, mine is iterable). But your point is that:

some kinds of output streams don't linearize (JSON)
some kinds of output streams have very specialized rules for linearization (relational databases)

Interesting.

lucaswerkmeister commented 10 years ago

I have a different question: Is it useful that you can deserialize a StatefulReference? The following hierarchy would make more sense to me:

interface Reference<Class> of StatelessReference<Class> | StatefulReference<Class> {
    shared formal Object id;
}
interface StatelessReference<Class> satisfies Reference<Class> {
    shared formal StatefulReference<Class> deserialize(Deconstructed<Class> state);
}
interface StatefulReference<Class> satisfies Reference<Class> {
    shared formal Class instance;
    shared formal Deconstructed<Class> serialize();
    shared formal void reconstruct();
}

(EDIT: the important point being the separation of StatelessReference and StatefulReference, and that StatefulReference no longer has deserialize().)
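The same separation, sketched in Java purely as an illustration (the class names `BaseRef`, `EmptyRef` and `FilledRef` are hypothetical, and a plain `Map` stands in for Deconstructed): because only the empty reference declares `deserialize()`, pushing more state into a reference that already has state is a compile-time error rather than a runtime exception.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical Java rendering of the split; none of these classes belongs to a real API.
abstract class BaseRef<T> {
    final Object id;
    BaseRef(Object id) { this.id = id; }
}

// Plays the role of StatelessReference: it has an id but no state yet.
final class EmptyRef<T> extends BaseRef<T> {
    EmptyRef(Object id) { super(id); }

    // Supplying state produces a different type, keeping the same id.
    FilledRef<T> deserialize(Map<String, Object> state) {
        return new FilledRef<>(id, state);
    }
}

// Plays the role of StatefulReference: deliberately has no deserialize() method.
final class FilledRef<T> extends BaseRef<T> {
    private final Map<String, Object> state;

    FilledRef(Object id, Map<String, Object> state) {
        super(id);
        // Defensive copy, so later changes to the caller's map do not leak in.
        this.state = Collections.unmodifiableMap(new LinkedHashMap<>(state));
    }

    // Hands back the flattened state that was registered earlier.
    Map<String, Object> serialize() { return state; }
}
```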

fwolff commented 10 years ago

@lucaswerkmeister: no, it doesn't make sense to me either. That's why my implementation throws an UnsupportedOperationException (https://github.com/fwolff/jeylon/blob/master/src/org/jeylon/serial/impl/StatefulReferenceImpl.java).

gavinking commented 10 years ago

@lucaswerkmeister well, you already have its state as a tuple, why prevent them from getting at it?

lucaswerkmeister commented 10 years ago

WDYM? I don’t prevent any getting, I prevent you from stuffing even more state into a Reference that already has state.

Or do you mean that my StatefulReference lost serialize()? That’s just because I copied + adapted @fwolff’s code instead of yours (less scrolling).

gavinking commented 10 years ago

Oh, ok, sure. Fine. I misunderstood.

lucaswerkmeister commented 10 years ago

Okay, I added it again (+reconstruct()) to avoid confusion.

fwolff commented 10 years ago

I'm still trying to further simplify the API and I'm thinking about something like this for the deserialization (serialization isn't the main problem here):

"The flattened state of an instance of [[Class]]."
interface Deconstructed<Class> 
        satisfies {[Attribute<Class>,Anything]*} {

    "Get the value of the given attribute."
    throws (`class AssertionError`,
        "if the value is missing")
    shared formal 
    Type|Reference<Type> get<Type>(
        Attribute<Class,Type> attribute);
}
interface Reference<Class> {

    "The unique identifier of the instance."
    shared formal 
    Object id;

    "The class of the instance."
    shared formal 
    Class type;

    "The flattened state of the instance."
    shared formal 
    Deconstructed<Class> state;
}
interface DeserializationContext {

    "create and return a new reference with an empty state,
     after adding it to the context"
     throws (`class AssertionError`,
        "if there is already a reference with the given [[id]]")
    shared Reference<Class> add(Object id, String typeName);

    "get the reference from the context with the
     given [[id]]"
     throws (`class AssertionError`,
        "if there is no reference with the given [[id]]")
    shared Reference<Class> get(Object id);

    "resolve all references and returned the
     first reference as a fully qualified object"
     throws (`class AssertionError`,
        "if any reference can't be resolved")
    shared Object /* Anything? */ resolve();
}

The idea is that the deserializer implementation is just filling the context with new references (id, className, state) and then, after reaching the end of the input stream, simply asking the context to create the graph of objects.

From my previous code, that would be something like:

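public Object read() throws Exception {
    Object o = readNoInstance();
    // A reference means the stream held an object graph: let the context rebuild it.
    if (o instanceof Reference)
        return context.resolve();
    // Otherwise the value read is already a plain object.
    return o;
}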

...

private Object readNoInstance() throws Exception {
    int type = in.readByte();

    if (type == REFERENCE_TYPE) {
        int ref = in.readInt();
        return context.get(ref);
    }
    if (type == PLAIN_TYPE) {
        String className = (String)readNoInstance();
        Reference reference = context.add(referenceIndex++, className);

        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            String name = (String)readNoInstance();
            Object value = readNoInstance();
            reference.deconstructed.add(new AttributeImpl(cls, name), value);
        }

        return reference;
    }
    throw new RuntimeException("Huh...");
}

It is then the job of the context to reconstruct and instantiate the whole graph of the objects, with or without circular references, and the implementation of serialization library would be much simpler.
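To make the "fill the context, then resolve" idea concrete, here is a self-contained Java sketch. It is an assumption-laden illustration rather than the prototype's code: `PendingRef` and `ResolvingContext` are hypothetical names, and since plain Java cannot create objects of unknown classes without constructors, the sketch asks the caller to register, per type name, one function that allocates a blank object and one that populates it. Allocating everything before wiring anything is what lets circular references resolve.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.Supplier;

// Hypothetical pending reference: an id, a type name, and attribute values or other references.
final class PendingRef {
    final Object id;
    final String typeName;
    final Map<String, Object> state = new LinkedHashMap<>();
    Object instance;  // set during resolve()

    PendingRef(Object id, String typeName) { this.id = id; this.typeName = typeName; }
}

final class ResolvingContext {
    private final Map<Object, PendingRef> refs = new LinkedHashMap<>();
    private final Map<String, Supplier<Object>> blanks = new LinkedHashMap<>();
    private final Map<String, BiConsumer<Object, Map<String, Object>>> fillers = new LinkedHashMap<>();

    // Assumption: the library registers, per type name, how to allocate and how to populate.
    void register(String typeName, Supplier<Object> blank,
                  BiConsumer<Object, Map<String, Object>> filler) {
        blanks.put(typeName, blank);
        fillers.put(typeName, filler);
    }

    PendingRef add(Object id, String typeName) {
        if (refs.containsKey(id)) throw new IllegalStateException("duplicate id " + id);
        if (!blanks.containsKey(typeName)) throw new IllegalStateException("unknown type " + typeName);
        PendingRef ref = new PendingRef(id, typeName);
        refs.put(id, ref);
        return ref;
    }

    PendingRef get(Object id) {
        PendingRef ref = refs.get(id);
        if (ref == null) throw new IllegalStateException("no reference with id " + id);
        return ref;
    }

    // Phase 1 allocates every object; phase 2 wires attributes, so cycles are harmless.
    Object resolve() {
        if (refs.isEmpty()) throw new IllegalStateException("nothing to resolve");
        for (PendingRef ref : refs.values()) ref.instance = blanks.get(ref.typeName).get();
        for (PendingRef ref : refs.values()) {
            Map<String, Object> values = new LinkedHashMap<>();
            for (Map.Entry<String, Object> e : ref.state.entrySet()) {
                Object v = e.getValue();
                values.put(e.getKey(), v instanceof PendingRef ? ((PendingRef) v).instance : v);
            }
            fillers.get(ref.typeName).accept(ref.instance, values);
        }
        return refs.values().iterator().next().instance;  // the first reference added
    }
}

final class Employee {
    String name;
    Employee peer;  // mutable, playing the role of a Ceylon `late` attribute
}

final class CycleDemo {
    // Builds two employees that point at each other and returns the first one.
    static Employee twoPeers() {
        ResolvingContext ctx = new ResolvingContext();
        ctx.register("Employee", Employee::new, (obj, values) -> {
            Employee e = (Employee) obj;
            e.name = (String) values.get("name");
            e.peer = (Employee) values.get("peer");
        });
        PendingRef a = ctx.add(0, "Employee");
        PendingRef b = ctx.add(1, "Employee");
        a.state.put("name", "Ann");
        a.state.put("peer", b);
        b.state.put("name", "Bob");
        b.state.put("peer", ctx.get(0));
        return (Employee) ctx.resolve();
    }
}
```

In `CycleDemo.twoPeers()` the returned employee is "Ann", her `peer` is "Bob", and `peer.peer` is the very same "Ann" object again.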

What do you think? Does it make sense?

F.

gavinking commented 10 years ago

@fwolff This looks less typesafe than my original version, doesn't it?

String typeName (!)
Deconstructed<Class> state; what if this has not been set yet? What happens? AssertionError? Null??

gavinking commented 10 years ago

i.e. what I liked about Reference/StatefulReference is that they captured the state of the deserialization of an instance in the type system.

You could even do:

    Reference<Foo> ref = dc.reference<Foo>(id);
    if (is StatefulReference<Foo> ref) {
        //already had its state deserialized 
        Deconstructed<Foo> tuple = ref.serialize();  //retrieve the previously registered state
    }
    else {
        //ref is an "empty" Reference<Foo>
    }

So there was a whole nice protocol for interaction between the context and the client. I think you've lost that.

FroMage commented 10 years ago

Typesafe or not, as long as a framework can serialise types it doesn't know about (Anything), then it should be fine.

gavinking commented 10 years ago

P.S. This kind of thing:

shared Reference<Class> add(Object id, String typeName);

Is not usually right. Ceylon has reified generics, so you can write:

shared Reference<Clazz> add<Clazz>(Object id);

And inside the body of add(), Clazz is a real reified type that you can inspect. You can even do:

Map<[ClassDeclaration,Object],Reference<Object>> refs = .... ;
assert (is Class<Clazz> clazz = `Clazz`);
refs.put([clazz.declaration, id], ref);

gavinking commented 10 years ago

Typesafe or not, as long as a framework can serialise types it doesn't know about (Anything), then it should be fine.

Ah yes in fact the calling code doesn't know what the type is at compile time, so it should be:

Reference<Clazz> reference<Clazz>(Object id, Class<Clazz> clazz);

gavinking commented 10 years ago

Not sure if the variance of that is correct though.

gavinking commented 10 years ago

Oh yes, it's correct.

gavinking commented 10 years ago

Ah yes in fact the calling code doesn't know what the type is at compile time, so it should be:
Reference<Clazz> reference<Clazz>(Object id, Class<Clazz> clazz);

In fact, something that we don't currently allow in the language, but I can't think why not, is:

Type<T> type = .... ;
Reference<T> ref = context.reference<type>(id);

i.e. we should be able to pass a Type object as a type argument. (We would need a slightly more special syntax than what I show above though.)
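For comparison, Java's generics are erased, so the closest Java analogue of the reference<Clazz>(Object id, Class<Clazz> clazz) signature has to carry an explicit `Class<T>` token. The sketch below is an illustration only; `TypedRef`, `TokenContext` and `TokenDemo` are hypothetical names. It also shows the case where the class is known only at run time, which in Java works through wildcard capture of `Class<?>`.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical reference that remembers its class token alongside its id.
final class TypedRef<T> {
    final Object id;
    final Class<T> type;

    TypedRef(Object id, Class<T> type) { this.id = id; this.type = type; }

    // Runtime-checked narrowing: throws ClassCastException if value is not a T.
    T accept(Object value) { return type.cast(value); }
}

final class TokenContext {
    private final Map<Object, TypedRef<?>> refs = new HashMap<>();

    // Returns the reference registered under id, creating it on first use.
    <T> TypedRef<T> reference(Object id, Class<T> type) {
        TypedRef<?> existing = refs.get(id);
        if (existing == null) {
            TypedRef<T> created = new TypedRef<>(id, type);
            refs.put(id, created);
            return created;
        }
        if (!existing.type.equals(type)) {
            throw new IllegalArgumentException(
                "id " + id + " already bound to " + existing.type.getName());
        }
        @SuppressWarnings("unchecked")
        TypedRef<T> typed = (TypedRef<T>) existing;  // safe: the tokens are equal
        return typed;
    }
}

final class TokenDemo {
    // The class is only known at run time here, e.g. a name read from the serialized stream.
    static String typeNameFor(String className, Object id) {
        try {
            Class<?> type = Class.forName(className);
            return new TokenContext().reference(id, type).type.getSimpleName();
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("unknown class " + className, e);
        }
    }
}
```

The token does at run time what a reified type parameter would do implicitly: asking for the same id with a different class is rejected instead of silently producing a mistyped reference.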

ceylon / ceylon-spec

serialization #704