cleanzr / dblink

Distributed Bayesian Entity Resolution in Apache Spark

SHIW0810 #9

Open YathishK opened 5 years ago

YathishK commented 5 years ago

d-blink throws an error on the SHIW0810 data set with a sample size of 10K, and the application crashes.

On the same data set with a sample size of 1000, it raises a soft error, but the application still produces results.

Both driver logs are attached here.

hard_error_driver_log.txt soft_error_driver_log.txt

Thank you, Yathish

ngmarchant commented 5 years ago

I haven't encountered this error before. A java.io.OptionalDataException typically occurs when the code used to serialize an object differs from the code used to deserialize it. This could be caused by a mismatch in Scala and/or Spark versions. Is the same version of Scala/Spark deployed on the driver and worker nodes? Did you build the JAR files against versions that match your deployment?
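One way to check the runtime side of this is to ask the driver and each executor which Spark and Scala versions they are running. The snippet below is only a sketch: the app name `version-check` and the task count are arbitrary choices, and it assumes it is pasted into spark-shell or submitted to the same cluster that runs d-blink. It reports runtime versions only, so the versions the JAR was built against still need to be compared with its build definition separately.

```scala
import java.net.InetAddress
import org.apache.spark.SPARK_VERSION
import org.apache.spark.sql.SparkSession
import scala.util.Properties

// Reuses the existing session in spark-shell, or creates one under spark-submit.
// The app name is an arbitrary placeholder.
val spark = SparkSession.builder().appName("version-check").getOrCreate()
val sc = spark.sparkContext

// Versions seen by the driver JVM.
val driverVersions = (SPARK_VERSION, Properties.versionNumberString)
println(s"driver: Spark ${driverVersions._1}, Scala ${driverVersions._2}")

// Versions seen by the executor JVMs. Using many small partitions makes it
// likely, though not guaranteed, that every executor runs at least one task.
val numTasks = math.max(sc.defaultParallelism * 4, 16)
val executorVersions = sc
  .parallelize(1 to numTasks, numTasks)
  .mapPartitions { _ =>
    // Evaluated on the executor, so these are the executor's own versions.
    Iterator((InetAddress.getLocalHost.getHostName, SPARK_VERSION, Properties.versionNumberString))
  }
  .distinct()
  .collect()

executorVersions.sorted.foreach { case (host, sparkV, scalaV) =>
  val status = if ((sparkV, scalaV) == driverVersions) "matches driver" else "MISMATCH"
  println(s"$host: Spark $sparkV, Scala $scalaV ($status)")
}
```

If the versions differ badly enough, this job may itself fail with a serialization error, which would also point to a mismatch.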

Regarding the "soft" error:

WARN AsyncEventQueue: Dropped * events from eventLog since *.

This may be an instance of this upstream issue. It's apparently fixed in Spark 2.4.0.