Open GoogleCodeExporter opened 8 years ago
Hi Nate,
While I totally agree that current semantics of references should be
documented, I think something could be done in the core Kryo logic to improve
the situation and make using them less error-prone for users and serializer
writers. I haven't looked too deep into it yet, but my impression was that Kryo
registers references too late (after object was completely read). This leads to
problems in situations, where objects have sub-objects, etc.
I think something could be done, if Kryo would "pre-register" objects very
early (i.e. almost after reference was read from a stream) and later patch
them with proper objects. Or something along these lines. What do you think?
-Leo
Original comment by romixlev
on 30 Nov 2012 at 6:35
I'm all for improving it if we can come up with something that works better. :)
The complexity is mostly during deserialization because Serializer#read() both
creates the new object and may use Kryo to deserialize other objects that may
reference the new object.
I'll list some use cases. I hope I get the current behavior correct!
1) Serializer#read() creates a new object and returns it without deserializing
other objects via Kryo.
Current behavior: Nothing special needs to be done, Kryo detects reference()
was not called and stores a reference to the object.
2) read() creates a new object then deserializes other objects via Kryo. The
other objects are sure to never reference the new object.
Current behavior: Nothing special needs to be done, Kryo detects reference()
was not called and stores a reference to the object.
3) read() creates a new object then deserializes other objects via Kryo. The
other objects may reference the new object.
Current behavior: reference() has to be called before deserialization of the
other objects.
4) read() deserializes other objects via Kryo then creates a new object (eg
DefaultSerializers.TreeMapSerializer, but update first). The objects
deserialized first cannot reference the new object.
Current behavior: other objects are deserialized via Kryo, then the new object
created, then see use cases 1, 2, or 3.
Are there other use cases? Let's list them all so we can see where the current
behavior is less than ideal.
These seemed to be solved pretty well currently, though there is the downside
that if you fail to call reference() for an object and then try to serialize
objects that reference the object it will fail. At one point calling
reference() before using Kryo to deserialize other objects was enforced with an
exception, but this ended up getting removed. I don't remember the reason why,
but I think it was too clunky for use case #4, which required references to be
disabled temporarily.
I didn't understand what you meant by "pref-register objects very early and
later patch them"?
Original comment by nathan.s...@gmail.com
on 30 Nov 2012 at 7:57
"then try to serialize objects" in second to last paragraph should be "then try
to deserialize objects".
Original comment by nathan.s...@gmail.com
on 30 Nov 2012 at 8:01
> I didn't understand what you meant by "pre-register objects very early and
later
> patch them"?
IIRC, most of the problems users recently reported regarding references were
due to the fact that numbering of references (i.e. assignment of ids to
references) when de-serializing objects went wrong and all subsequent reference
ids got screwed.
According to my analysis, for this to happen it is not always necessary that
sub-objects reference the parent object. It is enough that this parent object
does not call reference before reading sub-objects.
As we know, it is too easy to forget calling reference() in a custom serializer
at the right place :-(
More over, sometimes a parent object cannot be constructed before all
sub-objects are read. In such cases, I used the trick where at the beginning of
read() I first "pre-registered" a fake object instead of this parent object,
which is not constructed yet. This pre-registration preserved a proper
assignment of reference ids for sub-objects. Later, when I read all those
sub-objects and constructed the real parent object, I changed/redefined the
pre-registered mapping from the parent object reference id to a fake parent
object to point to a real parent object. And this last step is what I called
"patching" or more precisely back-patching. I use this terminology because it
is similar to how you generate code for gotos that refer to forward label "L",
whose address is not known at the time when you process a "goto L" statement.
I hope I managed to explain what I had in mind.
With all that, my impression is that Kryo could do this pre-registration
automatically and after object is read it could "patch" the mapping to point to
a properly constructed object, if it has not happened yet.
Original comment by romixlev
on 30 Nov 2012 at 8:45
To give you an example of what I mean:
https://github.com/romix/scala-kryo-serialization/blob/master/src/main/scala/com
/romix/scala/serialization/kryo/ScalaCollectionsSerializer.scala
Have a look at the ScalaCollectionSerializer#read method and tricks I use there
to pre-register a fake object and later replace the mapping with a proper
object.
Original comment by romixlev
on 30 Nov 2012 at 9:18
BTW, the reason for producing a wrong assignment of ids to references when
reading is easy to explain: to be correct, this assignment has to happen in
exactly the same way as we do reference ids assignments during serialization.
And this means, assign an id as soon as you read a reference header from the
stream, even before the whole object is read.
But this is not what we do now: Currently we first read the whole object and
only then assign a reference id and register a mapping from this id to the
object.
Original comment by romixlev
on 30 Nov 2012 at 9:23
Ah, I see, your proposal is to hold the reference ID at the appropriate time,
then later associate it with the correct object.
Let me take a run at describing the current system, as a refresher for myself
if anything. When serializing, Kryo writes the reference ID just before calling
Serializer#write(). When deserializing, Kryo reads the reference ID just before
calling Serializer#read(). So far, so good, these will always match up.
Next, the reference ID that was read is stored in Kryo#readReferenceIds (a
stack) because we don't yet have an object to associate with the ID.
reference() associates an object with the top ID in the stack. If reference()
is not called, Kryo calls it immediately after Serializer#read(). If
reference() is not called and Kryo is used to deserialize child objects, the
parent reference ID remains unassociated in the readReferenceIds stack, the
child reference IDs go on the stack and are popped off, then the parent
reference ID is popped off and associated with the right object.
I may be missing something, but there shouldn't be a way for the reference IDs
to get off just because reference() is not called. The only ramification of not
calling reference() when using Kryo to deserialize child objects should be that
ReferenceResolver#getReadObject() will be called with an ID that it has not
been added with addReadObject(). In fact, I believe in this case we can have
getReadObject() throw an exception saying that an unknown reference ID has been
encountered.
You are saying that for use case #2, reference() must be called even if the
child objects don't reference the parent? Can you show this in a unit test? I
did look at the Scala but it isn't clear as to why you are doing what you are
doing. Also, Scala is rough on my poor eyes. ;)
Original comment by nathan.s...@gmail.com
on 30 Nov 2012 at 9:33
Nate, have you followed this thread on the mailing list:
https://groups.google.com/forum/?fromgroups=#!searchin/kryo-users/leo/kryo-users
/Eu5V4bxCfws/YjHGa59x5QwJ
This is an example of what can happen and I gave a rather long explanation
there. Please have a look at the original example and explanations. I think
this example is rather reproducible - at least it was for me. And stepping it
through under debugger makes some things rather clear.
Original comment by romixlev
on 30 Nov 2012 at 9:56
> I did look at the Scala but it isn't clear as to why you are doing what you
are
> doing.
The idea is that in Scala, different from Java, in some cases I cannot
construct a collection object before I start reading its elements. I can
construct it only after I've read all the elements.
But I have to call reference() anyway before reading sub-elements to preserve
proper assignments of reference ids. Since a real collection does not exist
yet, I pass a fake object. Later I refine this mapping with a proper collection
object.
Original comment by romixlev
on 30 Nov 2012 at 10:03
Yes, I read all the messages on the list, though I don't always have time to
respond.
From your explanation on that thread, "This nested call tries to read a
reference to the parent list and assign a reference id to it." If
deserialization of a child object can have a reference to the parent, then
reference() must be called for the parent before deserialization of child
objects.
I think this is complicated enough that I need some code that shows the
problem. We can whip up some fake list classes that behave like Scala lists,
where the items have to be deserialized first. I think this would make for the
simplest and easiest to follow test. I'll do this today.
Original comment by nathan.s...@gmail.com
on 30 Nov 2012 at 10:10
The code provided by the reporter of the issue on that thread is such an
example. Just take his serializer and his code for reproducing the problem.
You'll see that it results in exactly the same error that he described.
Then try to look at the debug output and step through. You'll see pretty
quickly what goes wrong.
Original comment by romixlev
on 30 Nov 2012 at 10:24
The real problem is most likely the situation where we have nested calls of
readObject, e.g. when a sub-object is read using readObject inside
Serializer#read. In such cases, if reference() is not called before the nested
call, reference ids will get screwed, because the reference for the nested
object will get the id that should have been assigned to the outer object.
Original comment by romixlev
on 30 Nov 2012 at 10:27
I've added a test to ReferenceTest that shows the issue and committed a fix. By
making ReferenceResolver#nextReadId reserve the ID instead of just returning
the next one, everything works like it should. As described above, child
objects can be read before or after reference() is called, with the only
ramification being that child objects read before reference() won't be able to
reference the parent object.
I'll leave this issue open as it is a reminder to clean up the wiki. We can
still carry on discussion about the references stuff here if needed though. :)
Original comment by nathan.s...@gmail.com
on 4 Dec 2012 at 12:25
Hi Nate,
Good that you looked into this issue and fixed at least the problem mentioned
on the mailing list. I'll perform some further tests during this week using my
private use-cases that had similar issues and let you know if they work
properly now.
In the meantime, I have a question:
We have now in Kryo#reference() this snippet:
int id = readReferenceIds.pop();
if (id != NO_REF) referenceResolver.setReadObject(id, object);
As far as I can see, it is not quite idempotent, i.e. if in I call it (may be
unintentionally) multiple times from the same serializer for the same object,
it may result in a strange behavior due to a call of pop(). IMHO, it would be
safer, if multiple calls of reference() when processing the same object would
lead to the same results. Basically, would should try to avoid by all means
that readReferenceIds and referenceResolver.seenObjects get misaligned and
inconsistent with each other. IIRC, in some of my test-cases this was a reason
of some problems.
Another question:
Do we still need to call reference() in the read method of a serializer for an
enclosing object even if sub-objects do not refer to their enclosing object? If
we don't call reference(), e.g. because it was forgotten, what happens? It
looks like nextReadId is not called then, or? Or reference() is called
automatically by e.g. Kryo#readClassAndObject. But this may happen only after
the whole enclosing object was read, i.e. it may happen after sub-objects and
their references were processed and reference() was called for them. I'm still
wondering if it is always OK in such cases that reference() for the enclosing
object is called after reference() for sub-objects. Doesn't it change the order
as compared to the write order? IMHO, it would be safer if in the following
code that we use everywhere, we would invoke reference(fakeObj) automatically
right after readReferenceOrNull call and remember the id, because this would be
exactly in the same order, as during writes. This would reserve the id at the
right time. And later, we can change the call reference(object) in the last
line to referenceResolver.setReadObject(id, object) thus creating a proper
mapping to a fully build object.
if (references) {
int stackSize = readReferenceOrNull(input, type, false);
if (stackSize == REF) return readObject;
object = registration.getSerializer().read(this, input, type);
if (stackSize == readReferenceIds.size) reference(object);}
Of course, with this proposed change, we need also to take care of potential
calls to reference() from Serializer#read(). Eventually, Serializer#read needs
to know the id of the reference belonging to the object being read, so that it
can set a mapping from this id to a proper object.
Overall, I think that in the ideal case, serializer should only call
reference() explicitly if sub-objects may refer to the enclosing object. In all
other cases Kryo should do everything automatically. This would be the safest
approach.
Are my concerns unjustified and I overcomplicate things? I just seem to
remember that when I looked into different failing tests I noticed that
reference handling problems turned out to be more serious and deeper than they
seemed at the first glance.
Original comment by romixlev
on 4 Dec 2012 at 2:58
It is true that calling reference() multiple times will cause an extra pop(),
which will crash. This will crash every time though, it will never fail
silently. I agree it would be better if reference() could be called multiple
times without a problem, but in this case I don't see any easy way to guard
against it. Also, I think it would be quite abnormal to call reference() twice.
> Do we still need to call reference() in the read method of a serializer
> for an enclosing object even if sub-objects do not refer to their
> enclosing object?
No, you should only have to call reference() if you are reading child objects
that could reference the parent. If the child objects won't reference the
parent, everything will still work, even if the child objects call reference().
It is definitely a tricky piece of code, but I think the implementation is
sound. It works basically as you describe: the ID is read from the bytes, a
null is stored for the ID in ReferenceResolver#nextReadId(), and later the null
is replaced with the actual object for that ID.
You are right that the parent object reference at serialization time may be
stored sooner than at read time. This is because for serialization we always
have the object and can store it first. During read time we have to create the
object, which may involve reading child objects first. However by definition
the child objects can't reference the parent object if they are required to
create it. If they did, they would appear in the serialized bytes before the
parent object (ie, they wouldn't be child objects).
I don't think we ever need to let the serializer know about the reference IDs.
Ideally we can keep it hidden and just have a simple rule: if a serializer
needs to deserialize child objects that may reference the parent, call
reference(). Otherwise, don't worry about it.
I look forward to hearing if your tests are solved by the latest fix. :)
Original comment by nathan.s...@gmail.com
on 5 Dec 2012 at 8:08
Note I've done a couple commits to simplify the stock ReferenceResolvers
slightly. Hopefully this doesn't cause any bugs. :)
Original comment by nathan.s...@gmail.com
on 6 Dec 2012 at 1:44
Hi Nate,
I finally found some time to test your fixes. It seems to work fine for my
tests and also for my Scala tests in akka-kryo-serialization! Now I don't need
those strange workarounds for Scala collections that I mentioned earlier in
this thread.
Good job!
Original comment by romixlev
on 11 Dec 2012 at 10:34
Fantastic news! The references code is some hairy stuff, but I think the
current approach is sound and the API is kept nice -- just call reference()
before deserializing children, otherwise don't worry about it. Hopefully no
more issues come up because changes to this part of the code causes gray hairs.
:D
Original comment by nathan.s...@gmail.com
on 12 Dec 2012 at 2:57
Original issue reported on code.google.com by
nathan.s...@gmail.com
on 30 Nov 2012 at 5:44