NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[BUG] CSV/JSON data sources should avoid globbing paths when inferring schema #11158

Open thirtiseven opened 1 month ago

thirtiseven commented 1 month ago

Describe the bug

Spark PR https://github.com/apache/spark/pull/29659 (SPARK-32810) fixed an issue with the CSV and JSON data sources in Spark SQL that occurs when no user-specified schema is provided and the file paths contain escaped glob metacharacters. The plugin does not handle this case, so the Spark unit test "SPARK-32810: JSON data source should be able to read files with escaped glob metacharacter in the paths" fails when run with the plugin.
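
For context, glob metacharacters in a directory name have to be backslash-escaped when that path is handed back to a Spark reader; otherwise Spark treats the string as a glob pattern. A minimal sketch of that escaping, run on a normal Spark session (the escapeGlob helper and the /tmp path are illustrative, not Spark's or the plugin's actual code):

// Hypothetical helper: backslash-escape glob metacharacters so the path is treated literally.
def escapeGlob(path: String): String =
  path.replaceAll("""([\[\]{}*?\\])""", """\\$1""")

val out = "/tmp/[abc].json"  // illustrative output directory containing glob metacharacters
spark.range(1).selectExpr("id AS age", "'Alice' AS name")
  .write.mode("overwrite").json(out)

// Reading back requires the escaped form "/tmp/\[abc\].json"; this succeeds on the CPU
// but triggers the failure shown below when the plugin's JSON reader is enabled.
spark.read.json(escapeGlob(out)).show()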

Steps/Code to reproduce bug

scala> val data = Seq(("Alice", 28), ("Bob", 34), ("Cathy", 23))
data: Seq[(String, Int)] = List((Alice,28), (Bob,34), (Cathy,23))

scala> val df = data.toDF("name", "age")
df: org.apache.spark.sql.DataFrame = [name: string, age: int]

scala> df.write.mode("OVERWRITE").json("""[abc].json""")
24/07/09 08:02:19 WARN GpuOverrides:
!Exec <DataWritingCommandExec> cannot run on GPU because not all data writing commands can be replaced
  !Output <InsertIntoHadoopFsRelationCommand> cannot run on GPU because JSON output is not supported
  ! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec
    @Expression <AttributeReference> name#7 could run on GPU
    @Expression <AttributeReference> age#8 could run on GPU

scala> spark.conf.set("spark.rapids.sql.format.json.enabled", true)

scala> spark.conf.set("spark.rapids.sql.format.json.read.enabled", true)

scala> val dfRead = spark.read.json("""\[abc\].json""")
24/07/09 08:02:43 WARN GpuOverrides:
!Exec <FileSourceScanExec> cannot run on GPU because unsupported file format: org.apache.spark.sql.execution.datasources.text.TextFileFormat

dfRead: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> dfRead.show()
24/07/09 08:02:48 WARN GpuOverrides:
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  *Exec <ProjectExec> will run on GPU
    *Expression <Alias> cast(age#21L as string) AS age#27 will run on GPU
      *Expression <Cast> cast(age#21L as string) will run on GPU
    *Exec <FileSourceScanExec> will run on GPU

24/07/09 08:02:48 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 8)
org.apache.spark.sql.execution.QueryExecutionException: Encountered error while reading file file:///home/haoyangl/%5Babc%5D.json/part-00000-2707a4f7-da03-40b0-a78c-8f21b2f463a5-c000.json. Details:
    at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotReadFilesError(QueryExecutionErrors.scala:713)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:283)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
    at org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:474)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:288)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
    at com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:287)
    at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:257)
    at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:304)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.FileNotFoundException: File file:/home/haoyangl/%5Babc%5D.json/part-00000-2707a4f7-da03-40b0-a78c-8f21b2f463a5-c000.json does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:779)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1100)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:769)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:462)
    at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.estimatedHostBufferSize$lzycompute(GpuTextBasedPartitionReader.scala:346)
    at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.estimatedHostBufferSize(GpuTextBasedPartitionReader.scala:342)
    at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.$anonfun$readPartFile$1(GpuTextBasedPartitionReader.scala:368)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
    at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.readPartFile(GpuTextBasedPartitionReader.scala:362)
    at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.$anonfun$readToTable$1(GpuTextBasedPartitionReader.scala:467)
    at com.nvidia.spark.rapids.GpuMetric.ns(GpuExec.scala:180)
    at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.readToTable(GpuTextBasedPartitionReader.scala:467)
    at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.$anonfun$readBatch$1(GpuTextBasedPartitionReader.scala:393)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
    at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.readBatch(GpuTextBasedPartitionReader.scala:391)
    at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.next(GpuTextBasedPartitionReader.scala:612)
    at com.nvidia.spark.rapids.PartitionReaderWithBytesRead.next(dataSourceUtil.scala:62)
    at com.nvidia.spark.rapids.ColumnarPartitionReaderWithPartitionValues.next(ColumnarPartitionReaderWithPartitionValues.scala:36)
    at com.nvidia.spark.rapids.PartitionReaderIterator.hasNext(PartitionReaderIterator.scala:44)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:274)
    ... 23 more
24/07/09 08:02:48 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 8) (spark-haoyang executor driver): org.apache.spark.sql.execution.QueryExecutionException: Encountered error while reading file file:///home/haoyangl/%5Babc%5D.json/part-00000-2707a4f7-da03-40b0-a78c-8f21b2f463a5-c000.json. Details:
    at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotReadFilesError(QueryExecutionErrors.scala:713)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:283)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
    at org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:474)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:288)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
    at com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:287)
    at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:257)
    at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:304)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.FileNotFoundException: File file:/home/haoyangl/%5Babc%5D.json/part-00000-2707a4f7-da03-40b0-a78c-8f21b2f463a5-c000.json does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:779)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1100)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:769)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:462)
    at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.estimatedHostBufferSize$lzycompute(GpuTextBasedPartitionReader.scala:346)
    at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.estimatedHostBufferSize(GpuTextBasedPartitionReader.scala:342)
    at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.$anonfun$readPartFile$1(GpuTextBasedPartitionReader.scala:368)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
    at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.readPartFile(GpuTextBasedPartitionReader.scala:362)
    at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.$anonfun$readToTable$1(GpuTextBasedPartitionReader.scala:467)
    at com.nvidia.spark.rapids.GpuMetric.ns(GpuExec.scala:180)
    at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.readToTable(GpuTextBasedPartitionReader.scala:467)
    at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.$anonfun$readBatch$1(GpuTextBasedPartitionReader.scala:393)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
    at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.readBatch(GpuTextBasedPartitionReader.scala:391)
    at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.next(GpuTextBasedPartitionReader.scala:612)
    at com.nvidia.spark.rapids.PartitionReaderWithBytesRead.next(dataSourceUtil.scala:62)
    at com.nvidia.spark.rapids.ColumnarPartitionReaderWithPartitionValues.next(ColumnarPartitionReaderWithPartitionValues.scala:36)
    at com.nvidia.spark.rapids.PartitionReaderIterator.hasNext(PartitionReaderIterator.scala:44)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:274)
    ... 23 more

24/07/09 08:02:48 ERROR TaskSetManager: Task 0 in stage 2.0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 8) (spark-haoyang executor driver): org.apache.spark.sql.execution.QueryExecutionException: Encountered error while reading file file:///home/haoyangl/%5Babc%5D.json/part-00000-2707a4f7-da03-40b0-a78c-8f21b2f463a5-c000.json. Details:
    at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotReadFilesError(QueryExecutionErrors.scala:713)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:283)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
    at org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:474)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:288)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
    at com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:287)
    at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:257)
    at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:304)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.FileNotFoundException: File file:/home/haoyangl/%5Babc%5D.json/part-00000-2707a4f7-da03-40b0-a78c-8f21b2f463a5-c000.json does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:779)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1100)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:769)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:462)
    at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.estimatedHostBufferSize$lzycompute(GpuTextBasedPartitionReader.scala:346)
    at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.estimatedHostBufferSize(GpuTextBasedPartitionReader.scala:342)
    at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.$anonfun$readPartFile$1(GpuTextBasedPartitionReader.scala:368)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
    at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.readPartFile(GpuTextBasedPartitionReader.scala:362)
    at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.$anonfun$readToTable$1(GpuTextBasedPartitionReader.scala:467)
    at com.nvidia.spark.rapids.GpuMetric.ns(GpuExec.scala:180)
    at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.readToTable(GpuTextBasedPartitionReader.scala:467)
    at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.$anonfun$readBatch$1(GpuTextBasedPartitionReader.scala:393)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
    at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.readBatch(GpuTextBasedPartitionReader.scala:391)
    at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.next(GpuTextBasedPartitionReader.scala:612)
    at com.nvidia.spark.rapids.PartitionReaderWithBytesRead.next(dataSourceUtil.scala:62)
    at com.nvidia.spark.rapids.ColumnarPartitionReaderWithPartitionValues.next(ColumnarPartitionReaderWithPartitionValues.scala:36)
    at com.nvidia.spark.rapids.PartitionReaderIterator.hasNext(PartitionReaderIterator.scala:44)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:274)
    ... 23 more

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
  at scala.Option.foreach(Option.scala:407)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2228)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2249)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2268)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:506)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:459)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
  at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3868)
  at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2863)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3858)
  at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:510)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3856)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3856)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2863)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:3084)
  at org.apache.spark.sql.Dataset.getRows(Dataset.scala:288)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:327)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:808)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:767)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:776)
  ... 47 elided
Caused by: org.apache.spark.sql.execution.QueryExecutionException: Encountered error while reading file file:///home/haoyangl/%5Babc%5D.json/part-00000-2707a4f7-da03-40b0-a78c-8f21b2f463a5-c000.json. Details:
  at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotReadFilesError(QueryExecutionErrors.scala:713)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:283)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
  at org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:474)
  at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
  at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:288)
  at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
  at com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:287)
  at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:257)
  at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:304)
  at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:136)
  at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.FileNotFoundException: File file:/home/haoyangl/%5Babc%5D.json/part-00000-2707a4f7-da03-40b0-a78c-8f21b2f463a5-c000.json does not exist
  at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:779)
  at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1100)
  at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:769)
  at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:462)
  at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.estimatedHostBufferSize$lzycompute(GpuTextBasedPartitionReader.scala:346)
  at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.estimatedHostBufferSize(GpuTextBasedPartitionReader.scala:342)
  at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.$anonfun$readPartFile$1(GpuTextBasedPartitionReader.scala:368)
  at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
  at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.readPartFile(GpuTextBasedPartitionReader.scala:362)
  at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.$anonfun$readToTable$1(GpuTextBasedPartitionReader.scala:467)
  at com.nvidia.spark.rapids.GpuMetric.ns(GpuExec.scala:180)
  at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.readToTable(GpuTextBasedPartitionReader.scala:467)
  at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.$anonfun$readBatch$1(GpuTextBasedPartitionReader.scala:393)
  at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
  at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.readBatch(GpuTextBasedPartitionReader.scala:391)
  at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.next(GpuTextBasedPartitionReader.scala:612)
  at com.nvidia.spark.rapids.PartitionReaderWithBytesRead.next(dataSourceUtil.scala:62)
  at com.nvidia.spark.rapids.ColumnarPartitionReaderWithPartitionValues.next(ColumnarPartitionReaderWithPartitionValues.scala:36)
  at com.nvidia.spark.rapids.PartitionReaderIterator.hasNext(PartitionReaderIterator.scala:44)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:274)
  ... 23 more

scala> spark.conf.set("spark.rapids.sql.enabled", false)

scala> dfRead.show()
+---+-----+
|age| name|
+---+-----+
| 28|Alice|
| 23|Cathy|
| 34|  Bob|
+---+-----+
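
Note that the FileNotFoundException above points at file:/home/haoyangl/%5Babc%5D.json/..., the URL-encoded form of the directory that actually exists on disk as [abc].json. A minimal sketch of that encoded-vs-literal mismatch, assuming only standard java.net.URI and Hadoop Path decoding (this illustrates the symptom, not the plugin's actual code path or the fix):

import java.net.URI
import org.apache.hadoop.fs.Path

// The path string as it appears in the stack trace (part-file name shortened for illustration).
val encoded = "file:/home/haoyangl/%5Babc%5D.json/part-00000.json"

// Going through a URI decodes %5B/%5D back to literal brackets, which is the
// name the directory really has on disk.
val decoded = new Path(new URI(encoded))
println(decoded)  // file:/home/haoyangl/[abc].json/part-00000.json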

Expected behavior

Read files with escaped glob metacharacters successfully.

mattahrens commented 1 month ago

Need to test with Parquet and ORC to make sure the issue does not exist with those data sources.

thirtiseven commented 1 month ago

> test with Parquet and ORC to make sure the issue does not exist with those data sources.

Verified by testing that this issue does not exist with Parquet and ORC.
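
For reference, a sketch of that verification, assuming the same session and the same escaped-path convention as the JSON reproduction above (paths are illustrative):

val df = Seq(("Alice", 28), ("Bob", 34), ("Cathy", 23)).toDF("name", "age")

// Parquet round trip through a directory name containing glob metacharacters.
df.write.mode("overwrite").parquet("""[abc].parquet""")
spark.read.parquet("""\[abc\].parquet""").show()

// ORC round trip through the same kind of path.
df.write.mode("overwrite").orc("""[abc].orc""")
spark.read.orc("""\[abc\].orc""").show()

Per the comment above, both reads complete successfully with the plugin enabled.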