viadea opened 11 months ago
Is the issue that the current output should not be recommending setting spark.rapids.sql.format.hive.text.write.enabled to true because it is already true? Instead, you want it to recommend this: spark.sql.hive.convertMetastoreParquet=true. Just want to confirm that this is sufficient for the desired output.
@mattahrens Currently Spark RAPIDS can only support Hive table writes when spark.sql.hive.convertMetastoreParquet=true, which is also the default. So if a Spark user disables spark.sql.hive.convertMetastoreParquet, the original driver log message above shows up, and the user has no way to know which parameter to turn on to avoid the CPU fallback.
That is why I suggest we mention spark.sql.hive.convertMetastoreParquet=true in the driver log, to make sure the user enables this parameter and avoids the CPU fallback.
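For context, here is a minimal repro sketch of what I mean (run in a spark-shell with Hive support enabled; the table name is just an example):

```scala
// Writing to a Hive parquet table with the conversion disabled makes Spark
// plan an InsertIntoHiveTable, which the plugin cannot run on the GPU yet.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
spark.sql("CREATE TABLE IF NOT EXISTS demo_parquet (id INT) STORED AS PARQUET")
spark.sql("INSERT INTO demo_parquet VALUES (1)")  // falls back to CPU

// With the default (true), the same insert is converted into a native
// datasource parquet write, which can stay on the GPU.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")
spark.sql("INSERT INTO demo_parquet VALUES (2)")
```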
There are multiple different config settings that go into this on the Spark side; spark.sql.hive.convertMetastoreParquet and spark.sql.hive.convertInsertingPartitionedTable are a few of them. Spark can even throw an exception telling the user to set spark.sql.hive.convertMetastoreParquet to false as a workaround for potential errors in how Spark tries to determine the write schema. I don't think we want to tell the user to turn any of these configs on if someone decided that they should be off.
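For reference, these are the kinds of settings involved (the defaults noted in the comments are my assumption for recent Spark 3.x releases, not something I checked against every version):

```scala
// Spark-side configs that control whether a Hive parquet write is converted
// into a native datasource write (assumed defaults in comments).
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")          // assumed default: true
spark.conf.set("spark.sql.hive.convertInsertingPartitionedTable", "true") // assumed default: true
```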
In addition to that, we would need to replicate the logic in https://github.com/apache/spark/blob/5430c700ba64b07cf0c32b906a3328df8a7bef71/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L164-L168 to be able to tell the user which config is the correct one to turn on. It might be good to just explain that we cannot support this at this time and leave it at that.
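Paraphrasing that logic (a sketch from memory, not a verbatim copy of the linked code): which session conf matters depends on the table's serde, which is exactly why a single hard-coded suggestion in our message could be wrong:

```scala
// Standalone paraphrase of the conversion check in Spark's HiveStrategies
// (not verbatim): the relevant session conf depends on the table's serde.
def isConvertible(serde: String,
                  convertMetastoreParquet: Boolean,
                  convertMetastoreOrc: Boolean): Boolean = {
  val s = serde.toLowerCase(java.util.Locale.ROOT)
  (s.contains("parquet") && convertMetastoreParquet) ||
    (s.contains("orc") && convertMetastoreOrc)
}
```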
I would suggest making this warning message more straightforward. Currently it mentions:
It might be confusing to the customer which parameter needs to be turned on. For example, a user might blindly enable spark.rapids.sql.format.hive.text.write.enabled=true when that parameter is actually unrelated.
Based on my experience, I think there are only the following possibilities:
a. It is a Hive parquet table, but the user disabled spark.sql.hive.convertMetastoreParquet or some other Spark parameter, as @revans2 mentioned;
b. It is a Hive parquet table, but the user customized Spark so that the write somehow could not be translated into a Spark native parquet write;
c. It is a Hive text table (which means setting spark.rapids.sql.format.hive.text.write.enabled=true is the right solution).
I am not sure if our plugin could first detect the Hive format (based on the serde) and then show the message accordingly for possibilities a, b, and c? (I know b is hard, but at least distinguishing a from c should be doable?)
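To illustrate, here is a purely hypothetical sketch (the function name and messages are made up, not existing plugin code) of how the hint could branch on the serde:

```scala
// Hypothetical sketch only -- NOT existing plugin code. It illustrates
// tailoring the fallback hint to the table's serde for cases (a) and (c).
def fallbackHint(serde: Option[String], convertMetastoreParquet: Boolean): String = {
  val s = serde.getOrElse("").toLowerCase(java.util.Locale.ROOT)
  if (s.contains("parquet") && !convertMetastoreParquet) {
    // case (a): Hive parquet table with conversion disabled on the Spark side
    "InsertIntoHiveTable cannot run on the GPU; consider setting " +
      "spark.sql.hive.convertMetastoreParquet=true so the write becomes a " +
      "native parquet write"
  } else if (s.contains("text") || s.contains("lazysimple")) {
    // case (c): Hive text table
    "InsertIntoHiveTable cannot run on the GPU; set " +
      "spark.rapids.sql.format.hive.text.write.enabled=true to enable GPU " +
      "Hive text writes"
  } else {
    // case (b) and anything else we cannot classify
    "InsertIntoHiveTable is not currently supported on the GPU for this table format"
  }
}
```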
If I write to a Hive parquet table when spark.sql.hive.convertMetastoreParquet=false, the InsertIntoHiveTable will fall back to CPU because we do not have a GPU version of InsertIntoHiveTable yet. The driver log message is:
I think the message is not clear enough to let the user know how to avoid the CPU fallback for InsertIntoHiveTable. I would suggest we make the message something like the below: