NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[FEA] Verify unescapedQuoteHandling with CSV reader #1524

Open mythrocks opened 3 years ago

mythrocks commented 3 years ago

This arises from an audit of https://github.com/apache/spark/commit/433ae9064f.

Spark 3.1 changed the behaviour of the CSV reader. When it encounters an unescaped quote inside a value, it now decides whether to stop parsing at the delimiter based on the value of the unescapedQuoteHandling option.

spark-rapids needs to ensure that reading CSV tables through the plugin will honour the settings for unescapedQuoteHandling.

More info in the JIRA: https://issues.apache.org/jira/browse/SPARK-33566

revans2 commented 3 years ago

Just for information, our CSV parsing does not really match Spark's all that closely. We should test it, but we might just end up documenting an incompatibility.
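
As a rough starting point for such a test, here is a minimal PySpark-style sketch that reads the same CSV file once with the plugin disabled and once with it enabled for each value of unescapedQuoteHandling, then reports whether the two outcomes agree. The helper names (`read_outcome`, `compare_modes`) are hypothetical and not part of spark-rapids. The sketch assumes that `spark.rapids.sql.enabled` can be toggled at runtime on an existing session, and that a failed read raises the same exception type on CPU and GPU, which may not hold in practice.

```python
# Values Spark 3.1+ accepts for the CSV option `unescapedQuoteHandling`.
UNESCAPED_QUOTE_MODES = [
    "STOP_AT_CLOSING_QUOTE",
    "BACK_TO_DELIMITER",
    "STOP_AT_DELIMITER",  # Spark's default
    "SKIP_VALUE",
    "RAISE_ERROR",
]


def read_outcome(spark, path, mode, use_gpu):
    """Read `path` as CSV with the given mode; return rows or the error type."""
    # Assumption: this config switches the RAPIDS plugin on or off at runtime.
    spark.conf.set("spark.rapids.sql.enabled", str(use_gpu).lower())
    try:
        df = spark.read.option("unescapedQuoteHandling", mode).csv(path)
        # Sort by repr so row order and None values do not affect comparison.
        rows = sorted((tuple(r) for r in df.collect()), key=repr)
        return ("rows", rows)
    except Exception as exc:  # e.g. RAISE_ERROR on malformed input
        return ("error", type(exc).__name__)


def compare_modes(spark, path):
    """Return {mode: bool}, True when CPU and GPU outcomes are identical."""
    results = {}
    for mode in UNESCAPED_QUOTE_MODES:
        cpu = read_outcome(spark, path, mode, use_gpu=False)
        gpu = read_outcome(spark, path, mode, use_gpu=True)
        results[mode] = cpu == gpu
    return results
```

Calling `compare_modes(spark, "/path/to/sample.csv")` on a session with the plugin loaded would give a per-mode map of matches. Any mode that maps to False is a candidate for either a fix or a documented incompatibility.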