Wiki should document references

GoogleCodeExporter commented 8 years ago

The wiki doesn't mention calling reference(). It is completely out of date as 
it mentions Serializer having a create method!

Original issue reported on code.google.com by nathan.s...@gmail.com on 30 Nov 2012 at 5:44

GoogleCodeExporter commented 8 years ago

Hi Nate,

While I totally agree that current semantics of references should be 
documented, I think something could be done in the core Kryo logic to improve 
the situation and make using them less error-prone for users and serializer 
writers. I haven't looked too deep into it yet, but my impression was that Kryo 
registers references too late (after object was completely read). This leads to 
problems in situations, where objects have sub-objects, etc. 

I think something could be done, if Kryo would "pre-register" objects very 
early (i.e. almost  after reference was read from a stream) and later patch 
them with proper objects. Or something along these lines. What do you think?

-Leo

Original comment by romixlev on 30 Nov 2012 at 6:35

GoogleCodeExporter commented 8 years ago

I'm all for improving it if we can come up with something that works better. :)

The complexity is mostly during deserialization because Serializer#read() both 
creates the new object and may use Kryo to deserialize other objects that may 
reference the new object.

I'll list some use cases. I hope I get the current behavior correct!

1) Serializer#read() creates a new object and returns it without deserializing 
other objects via Kryo.
Current behavior: Nothing special needs to be done, Kryo detects reference() 
was not called and stores a reference to the object.

2) read() creates a new object then deserializes other objects via Kryo. The 
other objects are sure to never reference the new object.
Current behavior: Nothing special needs to be done, Kryo detects reference() 
was not called and stores a reference to the object.

3) read() creates a new object then deserializes other objects via Kryo. The 
other objects may reference the new object.
Current behavior: reference() has to be called before deserialization of the 
other objects.

4) read() deserializes other objects via Kryo then creates a new object (eg 
DefaultSerializers.TreeMapSerializer, but update first). The objects 
deserialized first cannot reference the new object.
Current behavior: other objects are deserialized via Kryo, then the new object 
created, then see use cases 1, 2, or 3.

Are there other use cases? Let's list them all so we can see where the current 
behavior is less than ideal.

These seemed to be solved pretty well currently, though there is the downside 
that if you fail to call reference() for an object and then try to serialize 
objects that reference the object it will fail. At one point calling 
reference() before using Kryo to deserialize other objects was enforced with an 
exception, but this ended up getting removed. I don't remember the reason why, 
but I think it was too clunky for use case #4, which required references to be 
disabled temporarily.

I didn't understand what you meant by "pref-register objects very early and 
later patch them"?

Original comment by nathan.s...@gmail.com on 30 Nov 2012 at 7:57

GoogleCodeExporter commented 8 years ago

"then try to serialize objects" in second to last paragraph should be "then try 
to deserialize objects".

Original comment by nathan.s...@gmail.com on 30 Nov 2012 at 8:01

GoogleCodeExporter commented 8 years ago

> I didn't understand what you meant by "pre-register objects very early and 
later 
> patch them"?

IIRC, most of the problems users recently reported regarding references were 
due to the fact that numbering of references (i.e. assignment of ids to 
references) when de-serializing objects went wrong and all subsequent reference 
ids got screwed. 

According to my analysis, for this to happen it is not always necessary that 
sub-objects reference the parent object. It is enough that this parent object 
does not call reference before reading sub-objects. 

As we know, it is too easy to forget calling reference() in a custom serializer 
at the right place :-(

More over, sometimes a parent object cannot be constructed before all 
sub-objects are read. In such cases, I used the trick where at the beginning of 
read() I first "pre-registered" a fake object instead of this parent object, 
which is not constructed yet. This pre-registration preserved a proper 
assignment of reference ids for sub-objects. Later, when I read all those 
sub-objects and constructed the real parent object, I changed/redefined the 
pre-registered mapping from the parent object reference id to a fake parent 
object to point to a real parent object. And this last step is what I called 
"patching" or more precisely back-patching. I use this terminology because it 
is similar to how you generate code for gotos that refer to forward label "L", 
whose address is not known at the time when you process a "goto L" statement. 

I hope I managed to explain what I had in mind. 
With all that, my impression is that Kryo could do this pre-registration 
automatically and after object is read it could "patch" the mapping to point to 
a  properly constructed object, if it has not happened yet.

Original comment by romixlev on 30 Nov 2012 at 8:45

GoogleCodeExporter commented 8 years ago

To give you an example of what I mean:
https://github.com/romix/scala-kryo-serialization/blob/master/src/main/scala/com
/romix/scala/serialization/kryo/ScalaCollectionsSerializer.scala

Have a look at the ScalaCollectionSerializer#read method and tricks I use there 
to pre-register a fake object and later replace the mapping with a proper 
object.

Original comment by romixlev on 30 Nov 2012 at 9:18

GoogleCodeExporter commented 8 years ago

BTW, the reason for producing a wrong assignment of ids to references when 
reading is easy to explain: to be correct, this assignment has to happen in 
exactly the same way as we do reference ids assignments during serialization. 
And this means, assign an id as soon as you read a reference header from the 
stream, even before the whole object is read. 

But this is not what we do now: Currently we first read the whole object and 
only  then assign a reference id and register a mapping from this id to the 
object.

Original comment by romixlev on 30 Nov 2012 at 9:23

GoogleCodeExporter commented 8 years ago

Ah, I see, your proposal is to hold the reference ID at the appropriate time, 
then later associate it with the correct object.

Let me take a run at describing the current system, as a refresher for myself 
if anything. When serializing, Kryo writes the reference ID just before calling 
Serializer#write(). When deserializing, Kryo reads the reference ID just before 
calling Serializer#read(). So far, so good, these will always match up.

Next, the reference ID that was read is stored in Kryo#readReferenceIds (a 
stack) because we don't yet have an object to associate with the ID. 
reference() associates an object with the top ID in the stack. If reference() 
is not called, Kryo calls it immediately after Serializer#read(). If 
reference() is not called and Kryo is used to deserialize child objects, the 
parent reference ID remains unassociated in the readReferenceIds stack, the 
child reference IDs go on the stack and are popped off, then the parent 
reference ID is popped off and associated with the right object.

I may be missing something, but there shouldn't be a way for the reference IDs 
to get off just because reference() is not called. The only ramification of not 
calling reference() when using Kryo to deserialize child objects should be that 
ReferenceResolver#getReadObject() will be called with an ID that it has not 
been added with addReadObject(). In fact, I believe in this case we can have 
getReadObject() throw an exception saying that an unknown reference ID has been 
encountered.

You are saying that for use case #2, reference() must be called even if the 
child objects don't reference the parent? Can you show this in a unit test? I 
did look at the Scala but it isn't clear as to why you are doing what you are 
doing. Also, Scala is rough on my poor eyes. ;)

Original comment by nathan.s...@gmail.com on 30 Nov 2012 at 9:33

GoogleCodeExporter commented 8 years ago

Nate, have you followed this thread on the mailing list:
https://groups.google.com/forum/?fromgroups=#!searchin/kryo-users/leo/kryo-users
/Eu5V4bxCfws/YjHGa59x5QwJ

This is an example of what can happen and I gave a rather long explanation 
there. Please have a look at the original example and explanations. I think 
this example is rather reproducible - at least it was for me. And stepping it 
through under debugger makes some things rather clear.

Original comment by romixlev on 30 Nov 2012 at 9:56

GoogleCodeExporter commented 8 years ago

> I did look at the Scala but it isn't clear as to why you are doing what you 
are 
> doing.

The idea is that in Scala, different from Java, in some cases I cannot 
construct a collection object before I start reading its elements. I can 
construct it only after I've read all the elements. 
But I have to call reference() anyway before reading sub-elements to preserve 
proper assignments of reference ids. Since a real collection does not exist 
yet, I pass a fake object. Later I refine this mapping with a proper collection 
object.

Original comment by romixlev on 30 Nov 2012 at 10:03

GoogleCodeExporter commented 8 years ago

Yes, I read all the messages on the list, though I don't always have time to 
respond.

From your explanation on that thread, "This nested call tries to read a 
reference to the parent list and assign a reference id to it." If 
deserialization of a child object can have a reference to the parent, then 
reference() must be called for the parent before deserialization of child 
objects.

I think this is complicated enough that I need some code that shows the 
problem. We can whip up some fake list classes that behave like Scala lists, 
where the items have to be deserialized first. I think this would make for the 
simplest and easiest to follow test. I'll do this today.

Original comment by nathan.s...@gmail.com on 30 Nov 2012 at 10:10

GoogleCodeExporter commented 8 years ago

The code provided by the reporter of the issue on that thread is such an 
example. Just take his serializer and his code for reproducing the problem. 
You'll see that it results in exactly the same error that he described.
Then try to look at the debug output and step through. You'll see pretty 
quickly what goes wrong.

Original comment by romixlev on 30 Nov 2012 at 10:24

GoogleCodeExporter commented 8 years ago

The real problem is most likely the situation where we have nested calls of 
readObject, e.g. when a sub-object is read using readObject inside 
Serializer#read. In such cases, if reference() is not called before the nested 
call, reference ids will get screwed, because the reference for the nested 
object will get the id that should have been assigned to the outer object.

Original comment by romixlev on 30 Nov 2012 at 10:27

GoogleCodeExporter commented 8 years ago

I've added a test to ReferenceTest that shows the issue and committed a fix. By 
making ReferenceResolver#nextReadId reserve the ID instead of just returning 
the next one, everything works like it should. As described above, child 
objects can be read before or after reference() is called, with the only 
ramification being that child objects read before reference() won't be able to 
reference the parent object.

I'll leave this issue open as it is a reminder to clean up the wiki. We can 
still carry on discussion about the references stuff here if needed though. :)

Original comment by nathan.s...@gmail.com on 4 Dec 2012 at 12:25

GoogleCodeExporter commented 8 years ago

Hi Nate,

Good that you looked into this issue and fixed at least the problem mentioned 
on the mailing list. I'll perform some further tests during this week using my 
private use-cases that had similar issues and let you know if they work 
properly now.

In the meantime, I have a question:
We have now in Kryo#reference() this snippet:

int id = readReferenceIds.pop();
if (id != NO_REF) referenceResolver.setReadObject(id, object);

As far as I can see, it is not quite idempotent, i.e. if in I call it (may be 
unintentionally) multiple times from the same serializer for the same object, 
it may result in a strange behavior due to a call of pop(). IMHO, it would be 
safer, if multiple calls of reference() when processing the same object would 
lead to the same results. Basically, would should try to avoid by all means 
that readReferenceIds and referenceResolver.seenObjects get misaligned and 
inconsistent with each other. IIRC, in some of my test-cases this was a reason 
of some problems.

Another question:
Do we still need to call reference() in the read method of a serializer for an 
enclosing object even if sub-objects do not refer to their enclosing object? If 
we don't call reference(), e.g. because it was forgotten, what happens? It 
looks like nextReadId is not called then, or? Or reference() is called 
automatically by e.g. Kryo#readClassAndObject. But this may happen only after 
the whole enclosing object was read, i.e. it may happen after sub-objects and 
their references were processed and reference() was called for them. I'm still 
wondering if it is always OK in such cases that reference() for the enclosing 
object is called after reference() for sub-objects. Doesn't it change the order 
as compared to the write order? IMHO, it would be safer if in the following 
code that we use everywhere, we would invoke reference(fakeObj) automatically 
right after readReferenceOrNull call and remember the id, because this would be 
exactly in the same order, as during writes. This would reserve the id at the 
right time. And later, we can change the call reference(object) in the last 
line to referenceResolver.setReadObject(id, object) thus creating a proper 
mapping to a fully build object.

if (references) {
    int stackSize = readReferenceOrNull(input, type, false);
    if (stackSize == REF) return readObject;
    object = registration.getSerializer().read(this, input, type);
    if (stackSize == readReferenceIds.size) reference(object);}

Of course, with this proposed change, we need also to take care of potential 
calls to reference() from Serializer#read(). Eventually, Serializer#read needs 
to know the id of the reference belonging to the object being read, so that it 
can set a mapping from this id to a proper object.

Overall, I think that in the ideal case, serializer should only call 
reference() explicitly if sub-objects may refer to the enclosing object. In all 
other cases Kryo should do everything automatically. This would be the safest 
approach.

Are my concerns unjustified and I overcomplicate things? I just seem to 
remember that when I looked into different failing tests I noticed that 
reference handling problems turned out to be more serious and deeper than they 
seemed at the first glance.

Original comment by romixlev on 4 Dec 2012 at 2:58

GoogleCodeExporter commented 8 years ago

It is true that calling reference() multiple times will cause an extra pop(), 
which will crash. This will crash every time though, it will never fail 
silently. I agree it would be better if reference() could be called multiple 
times without a problem, but in this case I don't see any easy way to guard 
against it. Also, I think it would be quite abnormal to call reference() twice.

> Do we still need to call reference() in the read method of a serializer 
> for an enclosing object even if sub-objects do not refer to their 
> enclosing object?

No, you should only have to call reference() if you are reading child objects 
that could reference the parent. If the child objects won't reference the 
parent, everything will still work, even if the child objects call reference(). 
It is definitely a tricky piece of code, but I think the implementation is 
sound. It works basically as you describe: the ID is read from the bytes, a 
null is stored for the ID in ReferenceResolver#nextReadId(), and later the null 
is replaced with the actual object for that ID.

You are right that the parent object reference at serialization time may be 
stored sooner than at read time. This is because for serialization we always 
have the object and can store it first. During read time we have to create the 
object, which may involve reading child objects first. However by definition 
the child objects can't reference the parent object if they are required to 
create it. If they did, they would appear in the serialized bytes before the 
parent object (ie, they wouldn't be child objects).

I don't think we ever need to let the serializer know about the reference IDs. 
Ideally we can keep it hidden and just have a simple rule: if a serializer 
needs to deserialize child objects that may reference the parent, call 
reference(). Otherwise, don't worry about it.

I look forward to hearing if your tests are solved by the latest fix. :)

Original comment by nathan.s...@gmail.com on 5 Dec 2012 at 8:08

GoogleCodeExporter commented 8 years ago

Note I've done a couple commits to simplify the stock ReferenceResolvers 
slightly. Hopefully this doesn't cause any bugs. :)

Original comment by nathan.s...@gmail.com on 6 Dec 2012 at 1:44

GoogleCodeExporter commented 8 years ago

Hi Nate,

I finally found some time to test your fixes. It seems to work fine for my 
tests and also for my Scala tests in akka-kryo-serialization! Now I don't need 
those strange workarounds for Scala collections that I mentioned earlier in 
this thread. 

Good job!

Original comment by romixlev on 11 Dec 2012 at 10:34

GoogleCodeExporter commented 8 years ago

Fantastic news! The references code is some hairy stuff, but I think the 
current approach is sound and the API is kept nice -- just call reference() 
before deserializing children, otherwise don't worry about it. Hopefully no 
more issues come up because changes to this part of the code causes gray hairs. 
:D

Original comment by nathan.s...@gmail.com on 12 Dec 2012 at 2:57

Nov11 / kryo

Wiki should document references #98