NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[FEA] Improve the driver log not-supported messages for Hive table writes #10045

Open viadea opened 11 months ago

viadea commented 11 months ago

If I write to a Hive Parquet table when spark.sql.hive.convertMetastoreParquet=false, the InsertIntoHiveTable will fall back to the CPU because we do not have a GPU version of InsertIntoHiveTable yet. The driver log message is:

!Exec <DataWritingCommandExec> cannot run on GPU because not all data writing commands can be replaced
  !Output <InsertIntoHiveTable> cannot run on GPU because unsupported output-format found: Some(org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat), only org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat is currently supported; writing Hive delimited text tables has been disabled, to enable this, set spark.rapids.sql.format.hive.text.write.enabled to true; unsupported serde found: Some(org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe), only org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe is currently supported
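
For reference, a minimal sketch of how to reproduce this fallback, assuming a Hive-enabled Spark session and an existing Hive Parquet table (`hive_parquet_tab` is a placeholder table name used only for illustration):

```scala
import org.apache.spark.sql.SparkSession

// Assumes a Hive-enabled Spark session; `hive_parquet_tab` is a placeholder
// for an existing Hive Parquet table.
val spark = SparkSession.builder()
  .appName("hive-parquet-write-fallback")
  .enableHiveSupport()
  .getOrCreate()

// With conversion disabled, the insert is planned as InsertIntoHiveTable,
// which the plugin cannot run on the GPU, so it falls back to the CPU.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
spark.range(100).createOrReplaceTempView("src")
spark.sql("INSERT OVERWRITE TABLE hive_parquet_tab SELECT id FROM src")
```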

I think the message is not clear enough for the user to know how to avoid the CPU fallback for InsertIntoHiveTable. I would suggest we change the message to something like:

!Output <InsertIntoHiveTable> cannot run on GPU because unsupported output-format found: Some(org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat). Please enable builtin parquet reader/writer by setting spark.sql.hive.convertMetastoreParquet=true.
mattahrens commented 11 months ago

Is the issue that the current output should not recommend setting spark.rapids.sql.format.hive.text.write.enabled to true because it is already true? Instead, you want it to recommend spark.sql.hive.convertMetastoreParquet=true. I just want to confirm that is sufficient for the desired output.

viadea commented 11 months ago

@mattahrens Currently Spark RAPIDS can only support the Hive table write when spark.sql.hive.convertMetastoreParquet=true, which is the default. So if a Spark user disables spark.sql.hive.convertMetastoreParquet, the original driver log message above will show up, and from that message the user cannot tell which parameter to turn on to avoid the CPU fallback.

That is why I suggest we mention spark.sql.hive.convertMetastoreParquet=true in the driver log, so the user knows to enable this parameter to avoid the CPU fallback.
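
As a side note, a user can confirm the effective value of the config at runtime; a minimal sketch (reusing the `spark` session from the earlier snippet):

```scala
// spark.sql.hive.convertMetastoreParquet defaults to true; if a user has
// disabled it, the write is planned as InsertIntoHiveTable and runs on the CPU.
val converting = spark.conf.get("spark.sql.hive.convertMetastoreParquet", "true")
println(s"spark.sql.hive.convertMetastoreParquet = $converting")
```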

revans2 commented 11 months ago

There are multiple different config settings that go into this on the Spark side.

spark.sql.hive.convertMetastoreParquet and spark.sql.hive.convertInsertingPartitionedTable are a few of them. Spark can even throw an exception telling the user to set spark.sql.hive.convertMetastoreParquet to false as a workaround for potential errors in how Spark tries to determine the write schema. I don't think we want to tell the user to turn any of these configs on if someone decided that they should be off.

In addition to that, we would need to replicate the logic in https://github.com/apache/spark/blob/5430c700ba64b07cf0c32b906a3328df8a7bef71/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L164-L168 to be able to tell the user which config is the correct one to turn on. It might be good to just explain that we cannot support this at this time and leave it at that.
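
For context, the linked Spark logic is essentially a convertibility check on the table's serde combined with the relevant convertMetastore* configs. A rough, simplified paraphrase (not a verbatim copy of Spark's code) looks like this:

```scala
import java.util.Locale

// Simplified paraphrase of the check Spark uses to decide whether a Hive
// table plan can be converted to a native datasource plan; the real logic
// in HiveStrategies.scala also handles ORC and partitioned inserts.
def isConvertible(serde: Option[String],
                  convertMetastoreParquet: Boolean,
                  convertMetastoreOrc: Boolean): Boolean = {
  val s = serde.getOrElse("").toLowerCase(Locale.ROOT)
  (s.contains("parquet") && convertMetastoreParquet) ||
    (s.contains("orc") && convertMetastoreOrc)
}
```

To point the user at the right config, the plugin would have to reproduce this kind of check (plus the spark.sql.hive.convertInsertingPartitionedTable handling), which is what makes the suggestion non-trivial.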

viadea commented 11 months ago

I would suggest making this warning message more straightforward. Currently it mentions:

  1. unsupported output-format found: Some(org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat)
  2. writing Hive delimited text tables has been disabled, to enable this, set spark.rapids.sql.format.hive.text.write.enabled to true;
  3. unsupported serde found: Some(org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe), only org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe is currently supported

It might be confusing to the customer which parameter needs to be turned on. For example, a user might blindly enable spark.rapids.sql.format.hive.text.write.enabled=true, but that parameter is actually unrelated here.

Based on my experience, I think there are only the following possibilities:

  a. It is a Hive Parquet table but the user disabled spark.sql.hive.convertMetastoreParquet or some other Spark parameter, as @revans2 mentioned;
  b. It is a Hive Parquet table but the user customized Spark so that the write somehow could not be translated into a Spark native Parquet write;
  c. It is a Hive text table (which means setting spark.rapids.sql.format.hive.text.write.enabled=true is the right solution).

I am not sure if our plugin could first detect the Hive format (based on the serde) and then show the message accordingly for possibilities a, b, and c? (I know b is hard, but at least distinguishing a from c should be doable?)
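
Just to illustrate the idea, a rough sketch of branching on the serde to pick the message might look like this (hypothetical helper and message wording, not the plugin's actual tagging API):

```scala
// Hypothetical sketch only: the method name and message text are illustrative,
// not the plugin's real GpuOverrides code.
def hiveWriteNotSupportedMessage(serde: Option[String]): String = serde match {
  case Some(s) if s.contains("ParquetHiveSerDe") =>
    // Cases (a)/(b): a Hive Parquet table that was not converted to a native write.
    "InsertIntoHiveTable on a Parquet table is not supported on the GPU; " +
      "check spark.sql.hive.convertMetastoreParquet and related configs so " +
      "Spark can plan a native Parquet write instead."
  case Some(s) if s.contains("LazySimpleSerDe") =>
    // Case (c): a Hive delimited text table.
    "Writing Hive delimited text tables is disabled; set " +
      "spark.rapids.sql.format.hive.text.write.enabled=true to enable it."
  case other =>
    s"Unsupported Hive serde for GPU write: $other"
}
```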