NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[BUG] Integration tests FAILED on EMR cluster #8910

Open NvTimLiu opened 1 year ago

NvTimLiu commented 1 year ago

Describe the bug
Python integration tests failed on the latest EMR 6.12.0 cluster (spark-rapids v23.06.0 jar built for EMR). FAILED files:

 csv_test.py
 datasourcev2_read_test.py
 json_test.py
 mortgage_test.py
 orc_test.py
 parquet_test.py
 row-based_udf_test.py
 udf_test.py

FAILED test cases: consoleText2.txt


 =========================== short test summary info ============================
 FAILED integration_tests/src/main/python/csv_test.py::test_csv_scan_with_hidden_metadata_fallback[file_path][INJECT_OOM, ALLOW_NON_GPU(ANY)] - pyspark.errors.exceptions.captured.AnalysisException: [UNRESOLVED_COLUMN.WI...
 FAILED integration_tests/src/main/python/csv_test.py::test_csv_scan_with_hidden_metadata_fallback[file_name][ALLOW_NON_GPU(ANY)] - pyspark.errors.exceptions.captured.AnalysisException: [UNRESOLVED_COLUMN.WI...
 FAILED integration_tests/src/main/python/csv_test.py::test_csv_scan_with_hidden_metadata_fallback[file_size][ALLOW_NON_GPU(ANY)] - pyspark.errors.exceptions.captured.AnalysisException: [UNRESOLVED_COLUMN.WI...
 FAILED integration_tests/src/main/python/csv_test.py::test_csv_scan_with_hidden_metadata_fallback[file_modification_time][ALLOW_NON_GPU(ANY)] - pyspark.errors.exceptions.captured.AnalysisException: [UNRESOLVED_COLUMN.WI...
 FAILED integration_tests/src/main/python/csv_test.py::test_csv_datetime_parsing_fallback_cpu_fallback[date.csv-schema0][INJECT_OOM, ALLOW_NON_GPU(FileSourceScanExec)] - pyspark.errors.exceptions.captured.IllegalArgumentException: Part of the pl...
 FAILED integration_tests/src/main/python/csv_test.py::test_csv_datetime_parsing_fallback_cpu_fallback[date.csv-schema1][INJECT_OOM, ALLOW_NON_GPU(FileSourceScanExec)] - pyspark.errors.exceptions.captured.IllegalArgumentException: Part of the pl...
 FAILED integration_tests/src/main/python/csv_test.py::test_csv_datetime_parsing_fallback_cpu_fallback[ts.csv-schema2][INJECT_OOM, ALLOW_NON_GPU(FileSourceScanExec)] - pyspark.errors.exceptions.captured.IllegalArgumentException: Part of the pl...
 = 7 failed, 740 passed, 19223 deselected, 48 xfailed, 8 xpassed, 60 warnings in 577.02s (0:09:37) =

 =========================== short test summary info ============================
 integration_tests/src/main/python/datasourcev2_read_test.py::test_read_int[INJECT_OOM] FAILED [ 20%]
 integration_tests/src/main/python/datasourcev2_read_test.py::test_read_strings FAILED [ 40%]
 integration_tests/src/main/python/datasourcev2_read_test.py::test_read_all_types FAILED [ 60%]
 integration_tests/src/main/python/datasourcev2_read_test.py::test_read_all_types_count FAILED [ 80%]
 integration_tests/src/main/python/datasourcev2_read_test.py::test_read_arrow_off[INJECT_OOM] FAILED [100%]

 =========================== short test summary info ============================
 FAILED integration_tests/src/main/python/json_test.py::test_json_read_valid_dates[LEGACY-true-read_json_df-schema0-dates.json][INJECT_OOM, APPROXIMATE_FLOAT, ALLOW_NON_GPU(FileSourceScanExec)] - pyspark.errors.exceptions.captured.IllegalArgumentException: Part of the pl...
 FAILED integration_tests/src/main/python/json_test.py::test_json_read_valid_dates[LEGACY-false-read_json_df-schema0-dates.json][INJECT_OOM, APPROXIMATE_FLOAT, ALLOW_NON_GPU(FileSourceScanExec)] - pyspark.errors.exceptions.captured.IllegalArgumentException: Part of the pl...
 FAILED integration_tests/src/main/python/json_test.py::test_json_read_invalid_dates[LEGACY-true-read_json_df-schema0-dates_invalid.json][APPROXIMATE_FLOAT, ALLOW_NON_GPU(FileSourceScanExec)] - pyspark.errors.exceptions.captured.IllegalArgumentException: Part of the pl...
 FAILED integration_tests/src/main/python/json_test.py::test_json_read_invalid_dates[LEGACY-false-read_json_df-schema0-dates_invalid.json][INJECT_OOM, APPROXIMATE_FLOAT, ALLOW_NON_GPU(FileSourceScanExec)] - pyspark.errors.exceptions.captured.IllegalArgumentException: Part of the pl...
 FAILED integration_tests/src/main/python/json_test.py::test_json_read_valid_timestamps[LEGACY-true-read_json_df-schema0-timestamps.json][INJECT_OOM, APPROXIMATE_FLOAT, ALLOW_NON_GPU(FileSourceScanExec)] - pyspark.errors.exceptions.captured.IllegalArgumentException: Part of the pl...
 FAILED integration_tests/src/main/python/json_test.py::test_json_read_valid_timestamps[LEGACY-false-read_json_df-schema0-timestamps.json][APPROXIMATE_FLOAT, ALLOW_NON_GPU(FileSourceScanExec)] - pyspark.errors.exceptions.captured.IllegalArgumentException: Part of the pl...
 FAILED integration_tests/src/main/python/json_test.py::test_json_datetime_parsing_fallback_cpu_fallback[dates.json-schema0][ALLOW_NON_GPU(FileSourceScanExec)] - pyspark.errors.exceptions.captured.IllegalArgumentException: Part of the pl...
 FAILED integration_tests/src/main/python/json_test.py::test_json_datetime_parsing_fallback_cpu_fallback[dates.json-schema1][INJECT_OOM, ALLOW_NON_GPU(FileSourceScanExec)] - pyspark.errors.exceptions.captured.IllegalArgumentException: Part of the pl...
 FAILED integration_tests/src/main/python/json_test.py::test_json_datetime_parsing_fallback_cpu_fallback[timestamps.json-schema2][INJECT_OOM, ALLOW_NON_GPU(FileSourceScanExec)] - pyspark.errors.exceptions.captured.IllegalArgumentException: Part of the pl...
 = 9 failed, 1143 passed, 12 skipped, 18609 deselected, 93 xfailed, 160 xpassed, 60 warnings in 628.47s (0:10:28) =

 =========================== short test summary info ============================
 FAILED integration_tests/src/main/python/mortgage_test.py::test_mortgage[IGNORE_ORDER, INCOMPAT, APPROXIMATE_FLOAT, ALLOW_NON_GPU(ANY), LIMIT(100000)] - TypeError: 'JavaPackage' object is not callable
 =============== 1 failed, 20025 deselected, 60 warnings in 6.76s ===============

 =========================== short test summary info ============================
 FAILED integration_tests/src/main/python/orc_test.py::test_orc_scan_with_hidden_metadata_fallback[file_path][INJECT_OOM, ALLOW_NON_GPU(ANY)] - pyspark.errors.exceptions.captured.AnalysisException: [UNRESOLVED_COLUMN.WI...
 FAILED integration_tests/src/main/python/orc_test.py::test_orc_scan_with_hidden_metadata_fallback[file_name][INJECT_OOM, ALLOW_NON_GPU(ANY)] - pyspark.errors.exceptions.captured.AnalysisException: [UNRESOLVED_COLUMN.WI...
 FAILED integration_tests/src/main/python/orc_test.py::test_orc_scan_with_hidden_metadata_fallback[file_size][INJECT_OOM, ALLOW_NON_GPU(ANY)] - pyspark.errors.exceptions.captured.AnalysisException: [UNRESOLVED_COLUMN.WI...
 FAILED integration_tests/src/main/python/orc_test.py::test_orc_scan_with_hidden_metadata_fallback[file_modification_time][ALLOW_NON_GPU(ANY)] - pyspark.errors.exceptions.captured.AnalysisException: [UNRESOLVED_COLUMN.WI...
 FAILED integration_tests/src/main/python/orc_test.py::test_orc_read_count - py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.s...
 = 5 failed, 1636 passed, 2 skipped, 18339 deselected, 44 xfailed, 60 warnings in 3363.75s (0:56:03) =

 =========================== short test summary info ============================
 FAILED integration_tests/src/main/python/parquet_test.py::test_parquet_read_nano_as_longs_true[ALLOW_NON_GPU(FileSourceScanExec, ColumnarToRowExec)] - py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.s...
 FAILED integration_tests/src/main/python/parquet_test.py::test_parquet_scan_with_hidden_metadata_fallback[file_path][INJECT_OOM, ALLOW_NON_GPU(ANY)] - pyspark.errors.exceptions.captured.AnalysisException: [UNRESOLVED_COLUMN.WI...
 FAILED integration_tests/src/main/python/parquet_test.py::test_parquet_scan_with_hidden_metadata_fallback[file_name][ALLOW_NON_GPU(ANY)] - pyspark.errors.exceptions.captured.AnalysisException: [UNRESOLVED_COLUMN.WI...
 FAILED integration_tests/src/main/python/parquet_test.py::test_parquet_scan_with_hidden_metadata_fallback[file_size][ALLOW_NON_GPU(ANY)] - pyspark.errors.exceptions.captured.AnalysisException: [UNRESOLVED_COLUMN.WI...
 FAILED integration_tests/src/main/python/parquet_test.py::test_parquet_scan_with_hidden_metadata_fallback[file_modification_time][ALLOW_NON_GPU(ANY)] - pyspark.errors.exceptions.captured.AnalysisException: [UNRESOLVED_COLUMN.WI...
 FAILED integration_tests/src/main/python/parquet_test.py::test_parquet_read_count - py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.s...
 = 6 failed, 2529 passed, 23 skipped, 17116 deselected, 308 xfailed, 44 xpassed, 60 warnings in 5339.85s (1:28:59) =

 =========================== short test summary info ============================
 FAILED integration_tests/src/main/python/row-based_udf_test.py::test_hive_empty_simple_udf[INJECT_OOM] - pyspark.errors.exceptions.captured.AnalysisException: [CANNOT_LOAD_FUNCTION...
 FAILED integration_tests/src/main/python/row-based_udf_test.py::test_hive_empty_generic_udf - pyspark.errors.exceptions.captured.AnalysisException: [CANNOT_LOAD_FUNCTION...
 =============== 2 failed, 20024 deselected, 60 warnings in 9.52s ===============

 **Steps/Code to reproduce bug**
1. Create a cluster on AWS EMR following [getting-started-aws-emr.md](https://github.com/NVIDIA/spark-rapids/blob/branch-23.08/docs/get-started/getting-started-aws-emr.md).
2. Run the spark-rapids integration tests on the EMR cluster.
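For reference, the failing files can be re-run individually through the repo's launcher script. A minimal sketch of building that invocation, assuming the `run_pyspark_from_build.sh` script and `TEST_PARALLEL` variable from the integration_tests README; the `-k` filter usage here is an illustration, not the exact CI command:

```python
import os

def build_test_command(test_files):
    """Build the launcher command plus environment for a subset of tests."""
    env = dict(os.environ)
    env["TEST_PARALLEL"] = "0"  # run serially on the cluster for clearer logs
    cmd = ["./integration_tests/run_pyspark_from_build.sh"]
    for f in test_files:
        cmd += ["-k", f]  # pytest -k expression per failing file (assumed usage)
    return cmd, env

cmd, env = build_test_command(["csv_test", "orc_test"])
```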

**Environment details (please complete the following information)**
 - AWS EMR 6.12.0 YARN cluster

andygrove commented 1 year ago

Some more details on the failures:

Column name resolution error with metadata fields

Affects test_[csv|orc|parquet]_scan_with_hidden_metadata_fallback

pyspark.errors.exceptions.captured.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `_metadata`.`file_path` cannot be resolved. 

Did you mean one of the following? [`_c0`].; line 1 pos 0; 'Project [_c0#16965, '_metadata.file_path]

Possibly related to changes in Spark 3.5.0 in https://github.com/apache/spark/commit/3baf7f7b7106f3fd30257b793ff4908d0f1ec427
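The integration tests gate behavior like this on the runtime Spark version (the `is_spark_XYZ_or_later()` helpers in `spark_session.py`). A standalone sketch of that kind of gate; the helper below is illustrative, not the repo's actual code, and EMR's patched Spark builds can carry backports that a plain version check cannot see, which is likely part of this bug:

```python
# Illustrative version gate, modeled loosely on the is_spark_XYZ_or_later()
# helpers the integration tests use; not the repo's actual implementation.
def version_at_least(version: str, minimum: tuple) -> bool:
    """True if a 'major.minor.patch' version string is >= the minimum tuple."""
    parts = tuple(int(p) for p in version.split(".")[:3])
    return parts >= minimum

# EMR reports a stock-looking version but ships patched builds, so backported
# behavior can slip past a gate keyed only on the reported version string.
gate_340 = version_at_least("3.4.0", (3, 4, 0))
gate_332 = version_at_least("3.3.2", (3, 4, 0))
```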

Fallback to BatchScanExec instead of FileSourceScanExec

Affects a number of fallback tests, such as test_csv_datetime_parsing_fallback_cpu_fallback

The test expects FileSourceScanExec but finds the v2 BatchScanExec
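The assertion mechanics can be sketched in plain Python: the fallback check scans the string form of the captured plan for the exec class the CPU is expected to fall back to, so a test pinned to `FileSourceScanExec` cannot match a plan that took the v2 path. Illustrative stand-in only; the real logic lives in the integration tests' `asserts.py`:

```python
import re

# Illustrative stand-in for the fallback assertion: look for an expected
# exec class name in the string form of the captured plan.
def find_fallback_exec(plan: str, candidates):
    """Return the first candidate exec name present in the plan text, or None."""
    for exec_name in candidates:
        if re.search(r"\b" + re.escape(exec_name) + r"\b", plan):
            return exec_name
    return None

# A v2 plan as seen on EMR (abbreviated): the v1 exec name never appears.
cpu_plan = "*(1) Project [...]\n+- BatchScan csv file:/tmp/date.csv [...]"
pinned = find_fallback_exec(cpu_plan, ["FileSourceScanExec"])
relaxed = find_fallback_exec(cpu_plan, ["FileSourceScanExec", "BatchScan"])
```

Accepting either scan exec in the candidate list would make the check robust to the v2 code path.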

Test assertion failure in test_parquet_read_count

java.lang.AssertionError: assertion failed: Could not find GpuFileGpuScan parquet .* ReadSchema: struct<> in the Spark plan
 E                   GpuColumnarToRow false
 E                   +- GpuHashAggregate(keys=[], functions=[gpucount(1, false)], output=[count(1)#115634L])
 E                      +- GpuShuffleCoalesce 1073741824
 E                         +- GpuColumnarExchange gpusinglepartitioning$(), ENSURE_REQUIREMENTS, [plan_id=213897]
 E                            +- GpuHashAggregate(keys=[], functions=[partial_gpucount(1, false)], output=[count#115637L])
 E                               +- GpuBatchScan parquet hdfs://ip-172-31-0-176.us-west-2.compute.internal:8020/tmp/pyspark_tests/ip-172-31-8-237-main-840-848679351/PARQUET_DATA[] GpuParquetScan DataFilters: [], Format: gpuparquet, Location: InMemoryFileIndex(1 paths)[hdfs://ip-172-31-0-176.us-west-2.compute.internal:8020/tmp/pyspark_tes..., PartitionFilters: [], ReadSchema: struct<>, PushedFilters: [] RuntimeFilters: []
 E

ArrowColumnarDataSourceV2 not found

Affects test_read_* tests

org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find the data source: com.nvidia.spark.rapids.tests.datasourcev2.parquet.ArrowColumnarDataSourceV2. Please find packages at `https://spark.apache.org/third-party-projects.html`.
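DATA_SOURCE_NOT_FOUND here usually means the jar carrying the test data source class never reached the driver/executor classpath. A sketch of threading extra jars into the submit arguments; the jar path below is a placeholder assumption, not the real artifact location:

```python
def with_test_jars(base_args, jars):
    """Append a --jars flag listing the extra test jars."""
    return list(base_args) + ["--jars", ",".join(jars)]

args = with_test_jars(
    ["spark-submit"],
    ["/path/to/rapids-4-spark-integration-tests.jar"],  # placeholder path
)
```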

UDF loading issues

pyspark.errors.exceptions.captured.AnalysisException: [CANNOT_LOAD_FUNCTION_CLASS] Cannot load class com.nvidia.spark.rapids.tests.udf.hive.EmptyHiveSimpleUDF when registering the function `emptysimple`, please make sure it is on the classpath.
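CANNOT_LOAD_FUNCTION_CLASS points at the same kind of classpath gap: the Hive test UDF classes must be visible when the function is registered. A sketch of the registration statement involved; the exact statement the test issues may differ, so treat this as illustrative:

```python
def register_udf_sql(name: str, clazz: str) -> str:
    """Build a Hive-compatible CREATE TEMPORARY FUNCTION statement."""
    return "CREATE TEMPORARY FUNCTION {} AS '{}'".format(name, clazz)

stmt = register_udf_sql(
    "emptysimple",
    "com.nvidia.spark.rapids.tests.udf.hive.EmptyHiveSimpleUDF",
)
```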