NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

Fix a NPE issue in GpuRand #11647

Closed firestarman closed 3 weeks ago

firestarman commented 1 month ago

close https://github.com/NVIDIA/spark-rapids/issues/11646

curXORShiftRandomSeed is marked as transient, so it is null on executors when there is no retry-restore context, which leads to this NPE. This fix removes the transient marker from curXORShiftRandomSeed, seed, and previousPartition, all of which are used by the computation on executors.
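To illustrate the mechanism, here is a minimal sketch of how a transient field comes back as null after JVM serialization, which is roughly what happens when an expression is shipped from the driver to an executor. It is written in plain Java rather than the plugin's own code, and the class and field names (RandState, curSeed, TransientDemo) are hypothetical stand-ins, not the real GpuRand members.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for an expression that carries random-number state.
class RandState implements Serializable {
    private static final long serialVersionUID = 1L;

    final long seed;         // regular field: written to the serialized form
    transient Long curSeed;  // transient field: skipped by Java serialization

    RandState(long seed) {
        this.seed = seed;
        this.curSeed = seed;
    }
}

class TransientDemo {
    // Serialize and deserialize in memory, mimicking a copy sent to another JVM.
    static RandState roundTrip(RandState in) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(in);
            }
            try (ObjectInputStream ois = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (RandState) ois.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        RandState copy = roundTrip(new RandState(42L));
        System.out.println(copy.seed);     // prints 42
        System.out.println(copy.curSeed);  // prints null
        // Unboxing copy.curSeed at this point would throw a NullPointerException.
    }
}
```

Dropping the transient modifier from curSeed makes the value survive the round trip, which is the shape of the change described above.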

I verified it with the customer case, and it works well. The fix is simple, so I didn't add any tests.

firestarman commented 1 month ago

build

revans2 commented 1 month ago

I spoke with @jlowe and I think we really want to understand this better. https://github.com/NVIDIA/spark-rapids/issues/11649

The problem is that if a retry happens and it is not in a checkpoint/restore, then we will technically get data corruption. It is not 100% data corruption because it is a random number, so we just get a slightly different random number compared to Spark on the CPU. That is the only reason I am not blocking this from going in. But I really want to understand the situation where this happened.
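The concern can be sketched with a toy generator: if the state is restored from a checkpoint before a retry, the retry reproduces the same values; if it is not, the retry silently continues from the advanced state and produces different ones. This is only an illustration under assumed names (CheckpointedRng, checkpoint, restore, take are all hypothetical), using a simple xorshift step. It is not the plugin's retry framework and not Spark's random number generator.

```java
import java.util.Arrays;

// Toy random generator whose state can be saved and rolled back.
class CheckpointedRng {
    private long state;  // current generator state
    private long saved;  // state captured by the most recent checkpoint

    CheckpointedRng(long seed) {
        // xorshift must not start from zero, so substitute 1 in that case
        this.state = (seed == 0L) ? 1L : seed;
        this.saved = this.state;
    }

    // One xorshift step; advances the state and returns it.
    long next() {
        state ^= state << 21;
        state ^= state >>> 35;
        state ^= state << 4;
        return state;
    }

    void checkpoint() { saved = state; }

    void restore() { state = saved; }

    // Draw n values in sequence.
    static long[] take(CheckpointedRng rng, int n) {
        long[] out = new long[n];
        for (int i = 0; i < n; i++) {
            out[i] = rng.next();
        }
        return out;
    }

    public static void main(String[] args) {
        CheckpointedRng rng = new CheckpointedRng(42L);

        rng.checkpoint();
        long[] firstAttempt = take(rng, 3);

        // Retry with restore: the same values are produced again.
        rng.restore();
        long[] retryWithRestore = take(rng, 3);
        System.out.println(Arrays.equals(firstAttempt, retryWithRestore)); // prints true

        // Retry without restore: the generator keeps going from the advanced state.
        long[] retryWithoutRestore = take(rng, 3);
        System.out.println(Arrays.equals(firstAttempt, retryWithoutRestore)); // prints false
    }
}
```

The second case is the one raised here: the output is still a valid random sequence, but it no longer matches what a single uninterrupted run would have produced.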