NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[FEA] Support crc32 function #8576

Open nvliyuan opened 1 year ago

nvliyuan commented 1 year ago

It would be better to support the crc32 function.

Steps to reproduce: create a Hive table:

hive> show create table order_orc4;
OK
CREATE TABLE `order_orc4`(
  `order_id` int,
  `order_date` string,
  `order_customer_id` int,
  `order_status` string)
PARTITIONED BY (
  `date` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://10.19.183.210:9000/user/hive_remote/warehouse/order_orc4'
TBLPROPERTIES (
  'bucketing_version'='2',
  'orc.compression'='SNAPPY',
  'transient_lastDdlTime'='1663645093')

insert some rows and verify:

hive> select * from order_orc4;
OK
1   2022-09-10  1   desc    20220920
2   2022-09-20  2   desc    20220920

run a SQL query with the crc32 function:

spark.sql("select crc32(order_status) from order_orc4").show

driver logs:

! <Crc32> crc32(cast(order_status#34 as binary)) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.Crc32
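The log shows the plugin falling back to the CPU because it has no GPU implementation of org.apache.spark.sql.catalyst.expressions.Crc32. For reference, Spark's CPU path casts the input to binary and computes a standard CRC-32 checksum via java.util.zip.CRC32, returned as a bigint. A minimal sketch of that semantics (the class name Crc32Demo and the sample string "desc" from the table above are illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class Crc32Demo {
    public static void main(String[] args) {
        // Spark's crc32(order_status) is equivalent to feeding the
        // UTF-8 bytes of the string into java.util.zip.CRC32 and
        // reading the checksum as an unsigned 32-bit value (bigint).
        CRC32 crc = new CRC32();
        crc.update("desc".getBytes(StandardCharsets.UTF_8));
        System.out.println(crc.getValue());
    }
}
```

Any GPU implementation would need to match this checksum bit-for-bit so that results agree with the CPU fallback.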
mattahrens commented 1 year ago

What is the expected data size for the strings in the example use case?

nvliyuan commented 1 year ago

In some cases, the strings may reach a length of 200 characters.