NVIDIA / spark-rapids-benchmarks

Spark RAPIDS Benchmarks – benchmark sets and utilities for the RAPIDS Accelerator for Apache Spark
Apache License 2.0

[BUG] Got error "cannot resolve 'd_date' given input columns" when run nds_maintenance.py #161

Open johnnyzhon opened 1 year ago

johnnyzhon commented 1 year ago
  1. Generate data via `nds_gen_data.py`.

  2. Convert the raw data to Delta (`--delta_unmanaged --compression gzip`):

     ```bash
     /spark/bin/spark-submit --master spark://172.25.0.2:7077 --deploy-mode client --driver-memory 20G \
       --conf spark.scheduler.minRegisteredResourcesRatio=1.0 \
       --conf spark.dynamicAllocation.enabled=false \
       --conf spark.rapids.sql.enabled=false \
       --conf spark.sql.files.maxPartitionBytes=2g \
       --conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED \
       --conf spark.sql.legacy.charVarcharAsString=true \
       --conf spark.sql.warehouse.dir=/usr/workspace/spark-warehouse \
       --jars /usr/workspace/deploy/temp/spark-rapids-benchmarks/nds/jvm_listener/target/nds-benchmark-listener-1.0-SNAPSHOT.jar \
       --packages io.delta:delta-core_2.12:2.0.1 \
       --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
       --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
       /usr/workspace/deploy/temp/spark-rapids-benchmarks/nds/nds_transcode.py \
       /usr/workspace/logs/temp/nds_raw/ \
       /usr/workspace/logs/temp/nds_dataset/ \
       /usr/workspace/logs/nds/nds_metric-convert-1689321388 \
       --output_mode overwrite --output_format delta --delta_unmanaged --compression gzip
     ```

  3. Execute the Power Run successfully.

  4. Run `nds_maintenance.py`:

     ```bash
     /spark/bin/spark-submit --master spark://172.25.0.2:7077 --deploy-mode client --driver-memory 20G \
       --conf spark.rapids.sql.concurrentGpuTasks=2 \
       --conf spark.plugins=com.nvidia.spark.SQLPlugin \
       --conf spark.rapids.sql.enabled=true \
       --conf spark.sql.files.maxPartitionBytes=2g \
       --conf spark.rapids.memory.pinnedPool.size=8g \
       --conf spark.rapids.sql.explain=NOT_ON_GPU \
       --conf spark.rapids.sql.incompatibleOps.enabled=true \
       --conf spark.rapids.sql.variableFloatAgg.enabled=true \
       --conf spark.rapids.sql.csv.read.date.enabled=true \
       --conf spark.rapids.sql.csvTimestamps.enabled=false \
       --conf spark.rapids.sql.csv.read.integer.enabled=true \
       --conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED \
       --conf spark.sql.legacy.charVarcharAsString=true \
       --conf spark.sql.warehouse.dir=/usr/workspace/spark-warehouse \
       --jars /usr/workspace/deploy/plugin/rapids-4-spark_2.12-23.08.0-SNAPSHOT-cuda11.jar,/usr/workspace/deploy/temp/spark-rapids-benchmarks/nds/jvm_listener/target/nds-benchmark-listener-1.0-SNAPSHOT.jar \
       --packages io.delta:delta-core_2.12:2.0.1 \
       --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
       --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
       /usr/workspace/deploy/temp/spark-rapids-benchmarks/nds/nds_maintenance.py \
       /usr/workspace/spark-warehouse \
       /usr/workspace/logs/temp/nds_raw_update \
       /usr/workspace/deploy/temp/spark-rapids-benchmarks/nds/data_maintenance \
       /usr/workspace/logs/nds/nds_metric-maintenance-1689322475 \
       --warehouse_type delta --delta_unmanaged
     ```

  5. Got the following error:

```
23/07/14 08:15:36 INFO DeltaLog: Updated snapshot to Snapshot(path=file:/usr/workspace/spark-warehouse/store_sales/_delta_log, version=0, metadata=Metadata(490a9613-2b6d-4420-a7b8-3fffb641a66e,null,null,Format(parquet,Map()),{"type":"struct","fields":[]},List(),Map(),Some(1689322536203)), logSegment=LogSegment(file:/usr/workspace/spark-warehouse/store_sales/_delta_log,0,WrappedArray(DeprecatedRawLocalFileStatus{path=file:/usr/workspace/spark-warehouse/store_sales/_delta_log/00000000000000000000.json; isDirectory=false; length=626; replication=1; blocksize=33554432; modification_time=1689322536216; access_time=1689322536208; owner=; group=; permission=rw-rw-rw-; isSymlink=false; hasAcl=false; isEncrypted=false; isErasureCoded=false}),List(),None,1689322536216), checksumOpt=None)
23/07/14 08:15:36 INFO OptimisticTransaction: [tableId=0b3a44c8,txnId=ab1a7b54] Committed delta #0 to file:/usr/workspace/spark-warehouse/store_sales/_delta_log
23/07/14 08:15:36 INFO CreateDeltaTableCommand: Table is path-based table: false. Update catalog with mode: Create
====== Registering for table store_sales ======
Time taken: 802 millis for table store_sales
====== Run LF_CR ======
TaskFailureListener is registered.
23/07/14 08:15:37 WARN GpuOverrides: ! cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec
```

```
23/07/14 08:15:37 INFO GpuOverrides: Plan conversion to the GPU took 0.67 ms
23/07/14 08:15:37 INFO GpuOverrides: GPU plan transition optimization took 0.09 ms
23/07/14 08:15:37 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
ERROR BEGIN
cannot resolve 'd_date' given input columns: [s_catalog_returns.cret_call_center_id, s_catalog_returns.cret_catalog_page_id, s_catalog_returns.cret_item_id, s_catalog_returns.cret_line_number, s_catalog_returns.cret_merchant_credit, s_catalog_returns.cret_order_id, s_catalog_returns.cret_reason_id, s_catalog_returns.cret_refund_customer_id, s_catalog_returns.cret_refunded_cash, s_catalog_returns.cret_return_amt, s_catalog_returns.cret_return_customer_id, s_catalog_returns.cret_return_date, s_catalog_returns.cret_return_fee, s_catalog_returns.cret_return_qty, s_catalog_returns.cret_return_ship_cost, s_catalog_returns.cret_return_tax, s_catalog_returns.cret_return_time, s_catalog_returns.cret_reversed_charge, s_catalog_returns.cret_shipmode_id, s_catalog_returns.cret_warehouse_id]; line 33 pos 38;
'CreateViewCommand crv, SELECT d_date_sk cr_returned_date_sk
 ,t_time_sk cr_returned_time_sk
 ,i_item_sk cr_item_sk
 ,c1.c_customer_sk cr_refunded_customer_sk
 ,c1.c_current_cdemo_sk cr_refunded_cdemo_sk
 ,c1.c_current_hdemo_sk cr_refunded_hdemo_sk
 ,c1.c_current_addr_sk cr_refunded_addr_sk
 ,c2.c_customer_sk cr_returning_customer_sk
 ,c2.c_current_cdemo_sk cr_returning_cdemo_sk
 ,c2.c_current_hdemo_sk cr_returning_hdemo_sk
 ,c2.c_current_addr_sk cr_returing_addr_sk
 ,cc_call_center_sk cr_call_center_sk
 ,cp_catalog_page_sk CR_CATALOG_PAGE_SK
 ,sm_ship_mode_sk CR_SHIP_MODE_SK
 ,w_warehouse_sk CR_WAREHOUSE_SK
 ,r_reason_sk cr_reason_sk
 ,cret_order_id cr_order_number
 ,cret_return_qty cr_return_quantity
 ,cret_return_amt cr_return_amt
 ,cret_return_tax cr_return_tax
 ,cret_return_amt + cret_return_tax AS cr_return_amt_inc_tax
 ,cret_return_fee cr_fee
 ,cret_return_ship_cost cr_return_ship_cost
 ,cret_refunded_cash cr_refunded_cash
 ,cret_reversed_charge cr_reversed_charge
 ,cret_merchant_credit cr_merchant_credit
 ,cret_return_amt+cret_return_tax+cret_return_fee -cret_refunded_cash-cret_reversed_charge-cret_merchant_credit cr_net_loss
FROM s_catalog_returns
LEFT OUTER JOIN date_dim ON (cast(cret_return_date as date) = d_date)
LEFT OUTER JOIN time_dim ON ((CAST(substr(cret_return_time,1,2) AS integer)*3600 +CAST(substr(cret_return_time,4,2) AS integer)*60 +CAST(substr(cret_return_time,7,2) AS integer)) = t_time)
LEFT OUTER JOIN item ON (cret_item_id = i_item_id)
LEFT OUTER JOIN customer c1 ON (cret_return_customer_id = c1.c_customer_id)
LEFT OUTER JOIN customer c2 ON (cret_refund_customer_id = c2.c_customer_id)
LEFT OUTER JOIN reason ON (cret_reason_id = r_reason_id)
LEFT OUTER JOIN call_center ON (cret_call_center_id = cc_call_center_id)
LEFT OUTER JOIN catalog_page ON (cret_catalog_page_id = cp_catalog_page_id)
LEFT OUTER JOIN ship_mode ON (cret_shipmode_id = sm_ship_mode_id)
LEFT OUTER JOIN warehouse ON (cret_warehouse_id = w_warehouse_id)
WHERE i_rec_end_date IS NULL AND cc_rec_end_date IS NULL, false, false, LocalTempView, false
+- 'Project ['d_date_sk AS cr_returned_date_sk#6675, 't_time_sk AS cr_returned_time_sk#6676, 'i_item_sk AS cr_item_sk#6677, 'c1.c_customer_sk AS cr_refunded_customer_sk#6678, 'c1.c_current_cdemo_sk AS cr_refunded_cdemo_sk#6679, 'c1.c_current_hdemo_sk AS cr_refunded_hdemo_sk#6680, 'c1.c_current_addr_sk AS cr_refunded_addr_sk#6681, 'c2.c_customer_sk AS cr_returning_customer_sk#6682, 'c2.c_current_cdemo_sk AS cr_returning_cdemo_sk#6683, 'c2.c_current_hdemo_sk AS cr_returning_hdemo_sk#6684, 'c2.c_current_addr_sk AS cr_returing_addr_sk#6685, 'cc_call_center_sk AS cr_call_center_sk#6686, 'cp_catalog_page_sk AS CR_CATALOG_PAGE_SK#6687, 'sm_ship_mode_sk AS CR_SHIP_MODE_SK#6688, 'w_warehouse_sk AS CR_WAREHOUSE_SK#6689, 'r_reason_sk AS cr_reason_sk#6690, 'cret_order_id AS cr_order_number#6691, 'cret_return_qty AS cr_return_quantity#6692, 'cret_return_amt AS cr_return_amt#6693, 'cret_return_tax AS cr_return_tax#6694, ('cret_return_amt + 'cret_return_tax) AS cr_return_amt_inc_tax#6695, 'cret_return_fee AS cr_fee#6696, 'cret_return_ship_cost AS cr_return_ship_cost#6697, 'cret_refunded_cash AS cr_refunded_cash#6698, ... 3 more fields]
+- 'Filter (isnull('i_rec_end_date) AND isnull('cc_rec_end_date))
+- 'Join LeftOuter, ('cret_warehouse_id = 'w_warehouse_id)
:- 'Join LeftOuter, ('cret_shipmode_id = 'sm_ship_mode_id)
: :- 'Join LeftOuter, ('cret_catalog_page_id = 'cp_catalog_page_id)
: : :- 'Join LeftOuter, ('cret_call_center_id = 'cc_call_center_id)
: : : :- 'Join LeftOuter, ('cret_reason_id = 'r_reason_id)
: : : : :- 'Join LeftOuter, ('cret_refund_customer_id = 'c2.c_customer_id)
: : : : : :- 'Join LeftOuter, ('cret_return_customer_id = 'c1.c_customer_id)
: : : : : : :- 'Join LeftOuter, ('cret_item_id = 'i_item_id)
: : : : : : : :- 'Join LeftOuter, ((((cast('substr('cret_return_time, 1, 2) as int) * 3600) + (cast('substr('cret_return_time, 4, 2) as int) * 60)) + cast('substr('cret_return_time, 7, 2) as int)) = 't_time)
: : : : : : : : :- 'Join LeftOuter, (cast(cret_return_date#150 as date) = 'd_date)
: : : : : : : : : :- SubqueryAlias s_catalog_returns
: : : : : : : : : : +- View (s_catalog_returns, [cret_call_center_id#144,cret_order_id#145,cret_line_number#146,cret_item_id#147,cret_return_customer_id#148,cret_refund_customer_id#149,cret_return_date#150,cret_return_time#151,cret_return_qty#152,cret_return_amt#153,cret_return_tax#154,cret_return_fee#155,cret_return_ship_cost#156,cret_refunded_cash#157,cret_reversed_charge#158,cret_merchant_credit#159,cret_reason_id#160,cret_shipmode_id#161,cret_catalog_page_id#162,cret_warehouse_id#163])
: : : : : : : : : : +- Relation [cret_call_center_id#144,cret_order_id#145,cret_line_number#146,cret_item_id#147,cret_return_customer_id#148,cret_refund_customer_id#149,cret_return_date#150,cret_return_time#151,cret_return_qty#152,cret_return_amt#153,cret_return_tax#154,cret_return_fee#155,cret_return_ship_cost#156,cret_refunded_cash#157,cret_reversed_charge#158,cret_merchant_credit#159,cret_reason_id#160,cret_shipmode_id#161,cret_catalog_page_id#162,cret_warehouse_id#163] csv
: : : : : : : : : +- SubqueryAlias spark_catalog.default.date_dim
: : : : : : : : : +- Relation default.date_dim[] parquet
: : : : : : : : +- SubqueryAlias spark_catalog.default.time_dim
: : : : : : : : +- Relation default.time_dim[] parquet
: : : : : : : +- SubqueryAlias spark_catalog.default.item
: : : : : : : +- Relation default.item[] parquet
: : : : : : +- SubqueryAlias c1
: : : : : : +- SubqueryAlias spark_catalog.default.customer
: : : : : : +- Relation default.customer[] parquet
: : : : : +- SubqueryAlias c2
: : : : : +- SubqueryAlias spark_catalog.default.customer
: : : : : +- Relation default.customer[] parquet
: : : : +- SubqueryAlias spark_catalog.default.reason
: : : : +- Relation default.reason[] parquet
: : : +- SubqueryAlias spark_catalog.default.call_center
: : : +- Relation default.call_center[] parquet
: : +- SubqueryAlias spark_catalog.default.catalog_page
: : +- Relation default.catalog_page[] parquet
: +- SubqueryAlias spark_catalog.default.ship_mode
: +- Relation default.ship_mode[] parquet
+- SubqueryAlias spark_catalog.default.warehouse
+- Relation default.warehouse[] parquet
```

```
File "/usr/workspace/deploy/temp/spark-rapids-benchmarks/nds/PysparkBenchReport.py", line 87, in report_on
    fn(*args)
File "/usr/workspace/deploy/temp/spark-rapids-benchmarks/nds/nds_maintenance.py", line 202, in run_dm_query
    spark.sql(q)
File "/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 723, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
File "/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
ERROR END
Time taken: [312] millis for LF_CR
====== Run LF_CS ======
TaskFailureListener is registered.
```

wjxiz1992 commented 1 year ago

Thanks for reporting this.

According to the NDS benchmark requirements, we use managed tables for the Delta case, where the user should leverage the Hive Metastore to store the table information.

As for this unmanaged-table case, it looks like the table date_dim is not registered. I will try to reproduce this locally for further diagnosis.