Closed firestarman closed 3 weeks ago
build
I spoke with @jlowe and I think we really want to understand this better. https://github.com/NVIDIA/spark-rapids/issues/11649
The problem is that if a retry happens outside of a checkpoint/restore, then we will technically get data corruption. It is not strictly data corruption because the value is random, so we just get a slightly different random number compared to Spark on the CPU, which is the only reason I am not blocking this from going in. But I really want to understand the situation where this happened.
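To illustrate the concern, here is a minimal sketch of why a retry that does not restore the generator state yields different values. The `XorShift` class, its `checkpoint`/`restore` methods, and the seed are hypothetical and are not the spark-rapids or Spark implementation; they only model "save the RNG state before an attempt, put it back before the retry".

```java
import java.util.Arrays;

public class RetryRngDemo {
    // Hypothetical minimal xorshift generator, for illustration only.
    static class XorShift {
        private long state;

        XorShift(long seed) {
            // xorshift must not start from an all-zero state
            this.state = (seed == 0L) ? 1L : seed;
        }

        long next() {
            state ^= state << 13;
            state ^= state >>> 7;
            state ^= state << 17;
            return state;
        }

        long checkpoint() { return state; }          // save the current state
        void restore(long saved) { state = saved; }  // roll back to a saved state
    }

    // Draw n values, advancing the generator as a side effect.
    static long[] draw(XorShift rng, int n) {
        long[] out = new long[n];
        for (int i = 0; i < n; i++) {
            out[i] = rng.next();
        }
        return out;
    }

    public static void main(String[] args) {
        XorShift rng = new XorShift(42L);
        long saved = rng.checkpoint();

        long[] firstAttempt = draw(rng, 3);

        // Retry WITH restore: the same values are produced again.
        rng.restore(saved);
        long[] restoredRetry = draw(rng, 3);

        // Retry WITHOUT restore: the generator has already advanced,
        // so the retried attempt sees the next values in the sequence.
        long[] unrestoredRetry = draw(rng, 3);

        System.out.println("restored retry matches:   "
            + Arrays.equals(firstAttempt, restoredRetry));   // true
        System.out.println("unrestored retry matches: "
            + Arrays.equals(firstAttempt, unrestoredRetry)); // false
    }
}
```

The output of the unrestored retry is still a valid random sequence, which is why the result differs from the first attempt without being obviously wrong.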
close https://github.com/NVIDIA/spark-rapids/issues/11646
`curXORShiftRandomSeed` is marked as `transient`, so it will be null on executors without a retry-restore context, leading to this NPE. This fix removes the `transient` marker from `curXORShiftRandomSeed`, `seed` and `previousPartition`, which are used by the computation on executors.

I verified the fix against the customer case, and it works well. The fix is simple, so I did not add any tests.
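For readers unfamiliar with the failure mode, the following sketch shows how a transient field comes back as null after a Java serialization round trip, which is roughly what happens when an object is shipped from the driver to an executor. Scala's `@transient` annotation maps to the JVM `transient` modifier, so plain Java is used here. The `RandLike` class and its field names are hypothetical stand-ins, not the actual spark-rapids code, and this assumes Java serialization is the mechanism involved.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class TransientDemo {
    // Hypothetical stand-in for an expression object that carries RNG state.
    static class RandLike implements Serializable {
        private static final long serialVersionUID = 1L;

        transient Long transientSeed; // skipped by Java serialization
        Long keptSeed;                // serialized normally

        RandLike(long seed) {
            this.transientSeed = seed;
            this.keptSeed = seed;
        }
    }

    // Serialize to bytes and read back, mimicking driver-to-executor shipping.
    static RandLike roundTrip(RandLike in) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(in);
        }
        try (ObjectInputStream ois =
                 new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray()))) {
            return (RandLike) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        RandLike copy = roundTrip(new RandLike(42L));
        System.out.println("transientSeed = " + copy.transientSeed); // transientSeed = null
        System.out.println("keptSeed = " + copy.keptSeed);           // keptSeed = 42
    }
}
```

Any code on the receiving side that dereferences the transient field without re-initializing it first will throw a `NullPointerException`, whereas the non-transient field arrives intact.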