EsotericSoftware / kryo

Java binary serialization and cloning: fast, efficient, automatic
BSD 3-Clause "New" or "Revised" License

Kryo IndexOutOfBoundsException in MapReferenceResolver #428

Closed tenstriker closed 6 years ago

tenstriker commented 8 years ago

I see people are facing this issue quite often. I am facing it as well during deserialization. It looks like MapReferenceResolver.getReadObject is trying to access an incorrect index.

Job aborted due to stage failure: Task 16 in stage 9.0 failed 10 times, most recent failure: Lost task 16.9 in stage 9.0 (TID 28614, hdn10.mycorptcorporation.local): com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index: 100, Size: 6
Serialization trace:
familyMap (org.apache.hadoop.hbase.client.Put)
    at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
    at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
    at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
    at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:192)
    at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:181)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:966)
    at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:972)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at com.mycorpt.myprojjobs.spark.jobs.hbase.HbaseUtils$$anonfun$writeRddToHBase2$1.apply(HbaseUtils.scala:80)
    at com.mycorpt.myprojjobs.spark.jobs.hbase.HbaseUtils$$anonfun$writeRddToHBase2$1.apply(HbaseUtils.scala:75)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:902)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:902)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.IndexOutOfBoundsException: Index: 100, Size: 6
    at java.util.ArrayList.rangeCheck(ArrayList.java:635)
    at java.util.ArrayList.get(ArrayList.java:411)
    at com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:42)
    at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:773)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:727)
    at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:134)
    at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
    at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
    at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
    ... 26 more
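For readers unfamiliar with the failing frames: the trace ends in ArrayList.get called from MapReferenceResolver.getReadObject, which suggests that an id read from the stream is used as an index into a list of objects deserialized so far. The sketch below is only a simplified model of that idea using the JDK alone. `ReadReferenceTable` and `ReferenceLookupDemo` are hypothetical names, not Kryo classes, and the real resolver may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified, hypothetical model of a read-side reference table; not Kryo's actual class. */
final class ReadReferenceTable {
    private final List<Object> readObjects = new ArrayList<>();

    /** Registers an object the first time it is deserialized and returns its id. */
    int register(Object object) {
        readObjects.add(object);
        return readObjects.size() - 1;
    }

    /** Resolves a back-reference id taken from the byte stream. */
    Object resolve(int id) {
        return readObjects.get(id); // throws if the id was never registered
    }
}

final class ReferenceLookupDemo {
    /** Returns true if resolving the given id fails after `registered` objects were seen. */
    static boolean failsToResolve(int registered, int id) {
        ReadReferenceTable table = new ReadReferenceTable();
        for (int i = 0; i < registered; i++) {
            table.register("object-" + i);
        }
        try {
            table.resolve(id);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Mirrors the numbers in the trace: 6 objects registered, id 100 requested.
        System.out.println("id 100 of 6 fails: " + failsToResolve(6, 100));
        System.out.println("id 5 of 6 fails: " + failsToResolve(6, 5));
    }
}
```

In this model the exception means the reader and the writer disagree about which objects have been seen, for example because the bytes are misaligned or the table was reset or shared unexpectedly. It does not by itself say which of those happened.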

magro commented 8 years ago

I guess we don't need to reopen #255 then, right? Can you provide an SSCCE?

tenstriker commented 8 years ago

I use Kryo as part of the Spark 1.5.2 library, which uses twitter-chill underneath. From my Spark application's point of view it is just configuration, with almost no code to register classes with Kryo. Could this be related to reading a particular serialized object, org.apache.hadoop.hbase.client.Put? It does throw the exception when reading that object back.

SiphonSquirrel commented 8 years ago

I ran into this earlier using Spark 1.5 + HBase 1.0. The version of Kryo that is being brought in by our CDH 5.5 version of Spark is 2.21. Inside the HBase Put object is a TreeMap, which led me to find #112, which was fixed in commit 00ffc7e. I believe what might be happening is a race condition where the TreeMap serializer is messing with setReferences (setting it to false, then setting it back).

So I worked around this by setting the Spark property spark.kryo.referenceTracking (found here: http://spark.apache.org/docs/latest/configuration.html ) to false. This seems to have fixed the problem for now.
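One way to apply that property from Java is sketched below. It assumes the driver later builds its SparkConf with the default constructor, which loads any spark.* JVM system properties. `ReferenceTrackingWorkaround` is a hypothetical helper name. Passing `--conf spark.kryo.referenceTracking=false` to spark-submit is the more common route.

```java
/** Hypothetical helper: sets the workaround property before any SparkConf is created. */
final class ReferenceTrackingWorkaround {
    static final String KEY = "spark.kryo.referenceTracking";

    /** Must run before the SparkConf is constructed, otherwise it has no effect. */
    static void apply() {
        System.setProperty(KEY, "false");
    }

    public static void main(String[] args) {
        apply();
        // A SparkConf created after this point with its default constructor is
        // assumed to pick the value up from the system properties.
        System.out.println(KEY + "=" + System.getProperty(KEY));
    }
}
```

Note that the Spark configuration page describes reference tracking as necessary for object graphs that contain loops, so this workaround is only suitable when the serialized data has no cyclic references.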

magro commented 8 years ago

@SiphonSquirrel great, thanks for stepping in! Btw, there was also #312 reported for 2.21, which we merged and released as 2.21.1 (which might make sense to use if you can change the pulled-in kryo version).

tenstriker commented 8 years ago

Thanks @SiphonSquirrel for the workaround. I found that my Spark assembly jar uses Kryo 2.21. I will also try to override that with the latest version somehow.

magro commented 8 years ago

@tenstriker spark upgraded recently to the latest kryo version, not sure if that's released already. See https://github.com/apache/spark/pull/12076

magro commented 8 years ago

Is this still an issue that needs a resolution from kryo, or can we close it?

magro commented 8 years ago

Closed for now, please reopen if still relevant.

David-hod commented 7 years ago

I am facing the same issue. Did upgrading solve this issue of serializing HBase 'Put' objects? Is there any recommended workaround?

mooncake020 commented 7 years ago

I am facing the same issue too.

kratos1308 commented 6 years ago

Same issue. It does not seem the problem was solved anywhere. What has happened since September 2017?

tjormola commented 6 years ago

Hi,

I just encountered this problem, too, while experimenting with the distributed state machine support of Spring State Machine. I'm using org.springframework.statemachine:spring-statemachine-zookeeper:2.0.1.RELEASE, which pulls in com.esotericsoftware:kryo-shaded:3.0.3. I also tried compiling and running against Kryo 4.0.2, but the problem remains.

I have no idea what the root cause in Kryo might be and I have no interest in digging into the Kryo internals any further, but here are some screenshots showing how this is triggered for me, with some notes made during debugging. Hopefully someone finds this useful...

  1. https://drive.google.com/open?id=1Q4e2YL4LJA-jpdYkHrXJ7eRwjNMlweF1 We start here in the SSM code, the intention is to restore the state machine state from the blob read from Zookeeper.
  2. https://drive.google.com/open?id=1Dfj0hdTblAv4BFAxuwoLcA9Vajyano71 SSM Zookeeper code that pulls the blob from Zookeeper and deserializes eventually using Kryo
  3. https://drive.google.com/open?id=1BJbOC8R4FOQ6aHatcL0E7s0eKll7Xl78 ...Continues
  4. https://drive.google.com/open?id=1n56GXkKYN4SJFtBVXDFIm9aJ1FldmDHB Jump to Kryo deserialization code
  5. https://drive.google.com/open?id=1cLHTmUAlbk3Z68T4l_sPTmjT8HNUy1nO ...Continues. The breakpoint here is interesting because this subroutine returns the initial id value that is eventually used as an index/key to a list that triggers the exception in question.
  6. https://drive.google.com/open?id=1oZz5_v2J0IaMNul7JWAQPS7gwwiQBirR Code that returns the id integer value discussed above
  7. https://drive.google.com/open?id=1ituv53jtgQW1SggkQzCVyXccX6ieYzJq About to call the Kryo code that uses the id
  8. https://drive.google.com/open?id=1An3wNzb4KDlW24Iop8CniLIRSW7K7Xdj Bang, the integer id is used as a list index into an empty list, which naturally triggers the IndexOutOfBoundsException in question. What is this referenceResolver thing and why is it empty at this stage? Supposedly org.springframework.statemachine.StateMachineContext should be stored there and retrievable by the id, but it isn't. Why, I do not know and have no desire to research ;)
  9. https://drive.google.com/open?id=1esesZf_GSpMw8_hJm6r8FfRNNPp-Iaod After throwing the exception, we're back to SSM code, exception handler to be more precise.
  10. https://drive.google.com/open?id=1L-zADRBO8-IuRQhjd2VM_JNmNRlcAKys Finally, the SSM code happily continues with broken internal reference objects... The state wrapper object currentStateWrapper is never created due to this exception, so the reference objects stateRef and notifyRef continue to reference null instead of the state object initialized with the data loaded from Zookeeper and wrapped for use. The state machine engine will try to continue anyway, but naturally problems will ensue down the road.

tjormola commented 6 years ago

Reopen the issue, please?

magro commented 6 years ago

@NathanSweet Maybe you have an idea regarding this reference resolver issue here? Otherwise I guess we need a reproducer that isolates the issue.

NathanSweet commented 6 years ago

A serialization failure can happen for all kinds of reasons. Without a reproduction case, I'm afraid it's anybody's guess. The stacktrace in the issue is very old and it isn't clear what line that would be in the latest version. If someone can reproduce this, it should be tested against the kryo-5.0.0-dev branch to see if it is already fixed.

saimonsez commented 5 years ago

In case anyone is still suffering from this: It's related to https://github.com/EsotericSoftware/kryo#thread-safety. I experienced the exact same IndexOutOfBoundsException using kryo with hazelcast in a spring boot app. Under heavy load ... wait ... heavy load? Thread safety? Let me check my code.

    //use ThreadLocal because Kryo is not thread safe
    private static final InheritableThreadLocal<Kryo> kryoThreadLocal = new InheritableThreadLocal<Kryo>() {
        @Override
        protected Kryo initialValue() {
            Kryo kryo = new KryoReflectionFactorySupport();
            //Kryo uses its own class loader
            kryo.setClassLoader(Thread.currentThread().getContextClassLoader());
            UnmodifiableCollectionsSerializer.registerSerializers(kryo);
            return kryo;
        }
    };

So the only kryo instance is bound at startup to the current thread. This will do, as long as there are no more threads involved. Say, you use spring's @Async (my case) or something similar. Busted.
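The sharing can be reproduced with the JDK alone. The sketch below assumes the cause is the inheritance behaviour of InheritableThreadLocal: a thread started from a parent that already holds a value receives that same object rather than a fresh one. `FakeSerializer` is a hypothetical stand-in for a non-thread-safe Kryo instance, and `InheritanceDemo` is likewise an illustrative name.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for a non-thread-safe serializer such as a Kryo instance. */
final class FakeSerializer {}

final class InheritanceDemo {
    static final InheritableThreadLocal<FakeSerializer> INHERITED =
        new InheritableThreadLocal<FakeSerializer>() {
            @Override
            protected FakeSerializer initialValue() {
                return new FakeSerializer();
            }
        };

    static final ThreadLocal<FakeSerializer> PER_THREAD =
        ThreadLocal.withInitial(FakeSerializer::new);

    /** Returns true if a thread started by the caller sees the caller's own instance. */
    static boolean childSharesInstance(ThreadLocal<FakeSerializer> holder) {
        FakeSerializer parent = holder.get(); // the parent thread touches the value first
        AtomicReference<FakeSerializer> seenByChild = new AtomicReference<>();
        Thread child = new Thread(() -> seenByChild.set(holder.get()));
        child.start();
        try {
            child.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seenByChild.get() == parent;
    }

    public static void main(String[] args) {
        System.out.println("InheritableThreadLocal shared: " + childSharesInstance(INHERITED));
        System.out.println("ThreadLocal shared: " + childSharesInstance(PER_THREAD));
    }
}
```

With the inheritable variant, parent and child end up using one instance concurrently, which is the kind of sharing a non-thread-safe serializer cannot tolerate. A plain ThreadLocal gives each thread its own instance.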

Again, check out https://github.com/EsotericSoftware/kryo#thread-safety, use kryo's pooling and you are done.
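To illustrate the pooling idea without depending on Kryo, here is a minimal borrow/return pool built from the JDK. It is a stand-in for the pooling described in the README linked above, not Kryo's own pool class. `BorrowPool` and `BorrowPoolDemo` are hypothetical names, and a StringBuilder plays the role of the non-thread-safe instance.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.function.Supplier;

/** Minimal borrow/return pool; a stdlib stand-in, not Kryo's own pool class. */
final class BorrowPool<T> {
    private final ConcurrentLinkedQueue<T> idle = new ConcurrentLinkedQueue<>();
    private final AtomicInteger created = new AtomicInteger();
    private final Supplier<T> factory;

    BorrowPool(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Lends one instance to the caller for the duration of work, then returns it. */
    <R> R run(Function<T, R> work) {
        T instance = idle.poll();
        if (instance == null) {
            instance = factory.get(); // no idle instance: create a new one
            created.incrementAndGet();
        }
        try {
            return work.apply(instance);
        } finally {
            idle.offer(instance); // hand the instance back even if work threw
        }
    }

    /** Number of instances created so far. */
    int created() {
        return created.get();
    }
}

final class BorrowPoolDemo {
    public static void main(String[] args) {
        BorrowPool<StringBuilder> pool = new BorrowPool<>(StringBuilder::new);
        String first = pool.run(sb -> { sb.setLength(0); return sb.append("a").toString(); });
        String second = pool.run(sb -> { sb.setLength(0); return sb.append("b").toString(); });
        System.out.println(first + second + " created=" + pool.created());
    }
}
```

The point of the pattern is that an instance is held by exactly one caller between borrow and return, so no two threads ever touch it at the same time, regardless of how the threads were created.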

Cheers.

alisha287 commented 3 years ago

I was facing this issue. On debugging I found this was happening while reading large objects. My guess is incorrect deserialization was causing this, because there weren't enough bytes, leading to incorrect reference-id resolution. Increasing the buffer size worked for me.