NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[BUG] rereading a written csv file drops rows #6917

Open eordentlich opened 2 years ago

eordentlich commented 2 years ago

Describe the bug
Reading a csv file results in a dataframe with 120000 rows. Writing that dataframe to a new csv file and then rereading it with spark-rapids results in a dataframe with fewer rows (119470 if the write is done by spark-rapids, 117308 if the write is done with plain pyspark). No error is thrown.

Steps/Code to reproduce bug
This was observed with the csv file downloadable here.

Note that this csv file quotes the second column to allow commas in the fields, and it also contains escaped quotes.
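
For illustration only, a hypothetical row in that format (not taken from the file, but following the quoting and backslash-escaping described above) would look like:

category,description
Business,"Stocks rallied today, with analysts calling it \"a very strong quarter\"."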

Use the pyspark API; this was tested in a Databricks python notebook. Haven't checked a local setup.

Download and save the file to dbfs:/news_category_train.csv.
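
In a Databricks notebook, one way to stage the file (the local source path here is an assumption) is with the standard dbutils copy utility:

# Copy the downloaded file into DBFS; the source path is hypothetical.
dbutils.fs.cp("file:/tmp/news_category_train.csv", "dbfs:/news_category_train.csv")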

Run the following:

news_df = spark.read.option("header", True).csv("dbfs:/news_category_train.csv")
print(news_df.count())  # 120000

news_df.write.csv("dbfs:/news_category_train_rapids.csv", header=True)

news_df_reread = spark.read.option("header", True).csv("dbfs:/news_category_train_rapids.csv")
print(news_df_reread.count())  # 119470 when spark-rapids does the write and the reread

The two counts were observed to differ: 120000 for the first and 119470 for the second. The behavior is similar if the first read and the write are done without spark-rapids and only the second read goes through spark-rapids.

Baseline Spark yields 120000 rows in all cases, even when reading the spark-rapids-written csv.
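
For reference, a minimal sketch of how that baseline comparison can be done in a single session, toggling the spark.rapids.sql.enabled config that also appears in the shell sessions further down:

# Toggle the plugin off for the CPU (baseline) count, then back on for the GPU count.
spark.conf.set("spark.rapids.sql.enabled", "false")
print(spark.read.option("header", True).csv("dbfs:/news_category_train_rapids.csv").count())  # 120000 on CPU
spark.conf.set("spark.rapids.sql.enabled", "true")
print(spark.read.option("header", True).csv("dbfs:/news_category_train_rapids.csv").count())  # 119470 on GPU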

Expected behavior
No rows should be dropped when a dataframe originally read from a csv file is written back to csv and reread.


firestarman commented 2 years ago

The root cause is probably the same as in https://github.com/NVIDIA/spark-rapids/issues/6435#issuecomment-1285842837.

Some rows are changed in the written CSV file compared to the original file, and some of the changed rows now contain strings like x"", which the GPU read cannot handle correctly, as mentioned in https://github.com/NVIDIA/spark-rapids/issues/6435#issuecomment-1285842837.
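
If the backslash-escaped quotes in the rewritten file are indeed the trigger, one untested sketch of a workaround is to write with the quote character as the escape, so embedded quotes come out doubled ("") rather than as \". The escape option is a standard Spark CSV writer option, but whether this actually avoids the GPU parsing issue is only an assumption, and the output path below is hypothetical:

# Untested: double embedded quotes on write instead of backslash-escaping them.
news_df.write.option("escape", '"').csv("dbfs:/news_category_train_escaped.csv", header=True)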

Here is a minimal repro case. The 4 rows are picked from the written file; the first row ends with \"", which causes the following two rows to be skipped in the GPU read.

category,description
Sci/Tech,The U.S. Forest Service on Wednesday rejected environmentalists' appeal of a plan to poison a stream south of Lake Tahoe to aid what wildlife officials call \"the rarest trout in America.\""
Sci/Tech,One of the pleasures of    stargazing is noticing and enjoying the various colors that stars display in    dark skies. These hues offer direct visual evidence of how stellar temperatures    vary.
Sci/Tech,"Britain granted its first license for human cloning Wednesday, joining South Korea on the leading edge of stem cell research, which is restricted by the Bush administration and which many scientists believe may lead to new treatments for a range of diseases."
Sci/Tech,Meteorologists at North Carolina State University are working on a way to more accurately measure rainfall in small areas.

CPU

scala> spark.conf.set("spark.rapids.sql.enabled", "false")

scala> spark.read.option("header", "true").csv("/data/tmp/6917/test.csv").show()
+--------+--------------------+
|category|         description|
+--------+--------------------+
|Sci/Tech|The U.S. Forest S...|
|Sci/Tech|One of the pleasu...|
|Sci/Tech|Britain granted i...|
|Sci/Tech|Meteorologists at...|
+--------+--------------------+

GPU

scala> spark.conf.set("spark.rapids.sql.enabled", "true")

scala> spark.read.option("header", "true").csv("/data/tmp/6917/test.csv").show()
+--------+--------------------+
|category|         description|
+--------+--------------------+
|Sci/Tech|The U.S. Forest S...|
|Sci/Tech|Meteorologists at...|
+--------+--------------------+