NVIDIA / spark-rapids-jni

RAPIDS Accelerator JNI For Apache Spark
Apache License 2.0

[FEA] Custom Hive Text File Parser #767

Open revans2 opened 1 year ago

revans2 commented 1 year ago

Is your feature request related to a problem? Please describe.

In several discussions with CUDF we have come to the conclusion that the CSV parser is not likely to get a lot of love/fixes any time soon unless we do those fixes ourselves. We have some goals to support the Hive Text format in the next release, 23.02, but given the complexity of the CUDF parser I think it is going to be simpler for us to write a custom parser ourselves in the short term, and target it directly at the Hive Text file format, specifically the default settings for the HiveTextFile format. We can discuss other settings that might be common with the HiveTextFile format.

Describe the solution you'd like

I would like to have an API that takes a String column as input (we have already split the data into rows) and a list of columns to keep. It would then return a table of string columns that we would then parse further into smaller parts. The main goal would be to split on the record delimiter, and handle quotes and escapes correctly.

Describe alternatives you've considered

Fix all of the bugs and implement all of the new features in CUDF that are needed to do this.

revans2 commented 1 year ago

This ended up not being needed for the current Hive text file support, which does not use escapes by default.