NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[FEA] - Some unsupported operations on the GPU #4900

Closed eyalhir74 closed 6 months ago

eyalhir74 commented 2 years ago

This epic is to track the following operations for @eyalhir74

The following is the original text for this request.

While running a few of our queries (as part of a huge Spark-to-GPU conversion effort), I see a lot of the issues below. It would be great to get some feedback on what we can do next. Currently the performance we get from the GPU is not justifying the migration to RAPIDS :(

!Expression <Cast> cast(xxx#1105 as string) cannot run on GPU because Cast from ArrayType(StringType,true) to StringType is not supported

!Expression <Cast> cast(xxx#1114 as string) cannot run on GPU because Cast from ArrayType(LongType,true) to StringType is not supported

!Expression <Cast> cast(yyyy#1108 as string) cannot run on GPU because Cast from ArrayType(StructType(StructField(key,StringType,true), StructField(value_pairsList,ArrayType(StructType(StructField(key,StringType,true), StructField(value,DoubleType,true)),true),true)),true) to StringType is not supported

!Expression <Cast> cast(xxx#1109 as string) cannot run on GPU because Cast from ArrayType(StructType(StructField(count,LongType,true), StructField(publisherId,LongType,true)),true) to StringType is not supported

!Expression <Min> min(yyyyy#504) cannot run on GPU because expression Min min(pv_userVisitedPublishers#504) produces an unsupported type ArrayType(StructType(StructField(count,LongType,true), StructField(publisherId,LongType,true)),true); input expression AttributeReference pv_userVisitedPublishers#504 (ArrayType(StructType(StructField(count,LongType,true), StructField(publisherId,LongType,true)),true) is not supported)

!Expression <Min> min(xxx#28) cannot run on GPU because expression Min min(pv_alchemyTaxonomy#28) produces an unsupported type ArrayType(StringType,true); input expression AttributeReference pv_alchemyTaxonomy#28 (ArrayType(StringType,true) is not supported)

#Exec <SortExec> could run on GPU but is going to be removed because replacing sortMergeJoin with shuffleHashJoin

!Exec <HashAggregateExec> cannot run on GPU because not all expressions can be replaced; ArrayTypes or MapTypes in grouping expressions are not supported

!Expression <SortArray> sort_array(InfoList#1123, true) cannot run on GPU because expression SortArray sort_array(InfoList#1123, true) produces an unsupported type ArrayType(StructType(StructField(xxx,LongType,true), StructField(yyyy,LongType,true), StructField(zzz,LongType,true)),true); array expression AttributeReference InfoList#1123 (child StructType(StructField(xxx,LongType,true), StructField(yyy,LongType,true), StructField(zzz,LongType,true)) is not supported)

!NOT_FOUND <ArrayIntersect> array_intersect(sort_array(map_keys(xx1#1134), true), [READ_MORE,EXPLORE_MORE,NEXT_UP]) cannot run on GPU because no GPU enabled version of expression class org.apache.spark.sql.catalyst.expressions.ArrayIntersect could be found

revans2 commented 2 years ago

@eyalhir74 can you include what version of the RAPIDS Accelerator you are using?

revans2 commented 2 years ago

The casting of arrays to strings appears to have been fixed by #4221 which went into the 22.02 release. So that should help with some of your issues if you move to the latest version.
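For context on what that cast produces, Spark renders an array as a bracketed, comma-separated string. A minimal Python sketch of the idea (illustration only, not plugin code; exact null handling and quoting rules vary by Spark version):

```python
def cast_array_to_string(arr):
    """Rough analogue of Spark's cast(array<T> as string) rendering.

    Illustration only: nulls become empty slots here; real Spark
    behavior differs across versions.
    """
    return "[" + ", ".join("" if x is None else str(x) for x in arr) + "]"

print(cast_array_to_string([1, 2, 3]))  # "[1, 2, 3]"
```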

For the other things I cannot promise anything on when they would be available. I'll talk it over with management and then file follow-on issues for each of these things if we don't already have one explicitly for it.

We are still working with CUDF on the best way to implement comparison and sorting of arrays. A min aggregation on an array of structs falls into that and is probably a few releases away before we can support this.
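The comparison semantics in question are lexicographic: Spark orders arrays (and the structs inside them) element by element, with a shorter prefix sorting first. A hypothetical Python analogue, using tuples to stand in for the `struct<count,publisherId>` elements:

```python
# Each row is an array<struct<count:long, publisherId:long>>;
# tuples stand in for structs. Python's list/tuple comparison is
# lexicographic, matching the ordering Spark needs for min().
rows = [
    [(2, 10)],
    [(1, 10), (1, 20)],
    [(1, 10)],
]

# [(1, 10)] sorts first: it is a prefix of [(1, 10), (1, 20)]
print(min(rows))  # [(1, 10)]
```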

SortArray on structs is not something that we have done, but with the work we have done to support general sorting on structs without arrays or maps in them, I think we could make it work. It will likely require a few changes to CUDF.

ArrayIntersect is new and not something that we have done yet. It is an interesting use case because it includes a small scalar array as the right-hand side of the operation. I think we could hack something together for your specific use case based on existing CUDF functionality, but we probably want to do this right instead, and would need to work with CUDF to implement the functionality.
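As a reference point for the semantics being discussed: as I understand Spark's `array_intersect`, it returns the elements present in both arrays, deduplicated, in the order of the left array. A CPU-side Python sketch (not the CUDF implementation):

```python
def array_intersect(a, b):
    """Sketch of Spark's array_intersect semantics (as understood here):
    elements present in both arrays, deduplicated, preserving the
    order of the first (left) array."""
    b_set = set(b)
    seen = set()
    out = []
    for x in a:
        if x in b_set and x not in seen:
            out.append(x)
            seen.add(x)
    return out

result = array_intersect(
    ["READ_MORE", "OTHER", "EXPLORE_MORE", "READ_MORE"],
    ["READ_MORE", "EXPLORE_MORE", "NEXT_UP"],
)
print(result)  # ['READ_MORE', 'EXPLORE_MORE']
```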

eyalhir74 commented 2 years ago

@revans2 I'm using this version: RAPIDS Accelerator 21.12.0 using cudf 21.12.0

Would it help push this forward if you knew the scope of the potential Spark+GPU work we're looking at?

I'm still working on figuring it out and trying to build a synthetic case, but from what I can see now the major problem, performance-wise, is this:

We have this schema, with 1,500,000,000 rows (only part of the data needs to be processed), where arr can be filled with 1K-5K elements:

|-- arr: array (nullable = true)
|    |-- element: long (containsNull = true)
|-- index: integer (nullable = true)

The performance bottleneck comes from this operation:

select arr, count(1) from tbl group by 1
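To illustrate what that query asks of the engine (not how the plugin implements it): grouping by an array-typed key means every whole array must be hashed and compared as a single group key. A miniature CPU-side Python analogue:

```python
from collections import Counter

# Hypothetical miniature of: select arr, count(1) from tbl group by 1
# Lists are converted to tuples so they can serve as hashable group keys.
tbl = [[1, 2, 3], [4, 5], [1, 2, 3]]
counts = Counter(tuple(arr) for arr in tbl)

print(counts[(1, 2, 3)])  # 2
print(counts[(4, 5)])     # 1
```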

I believe that should this issue somehow be resolved either through rapids/cudf or some sql hack, we'd be able to move forward with this huge migration project.

revans2 commented 2 years ago

@eyalhir74 That is very very helpful information. What data types are in the array? Are they just basic types, like ints or longs, or are they structs like in other parts of the query you have posted? ArrayType(StructType(StructField(xxx,LongType,true), StructField(yyyy,LongType,true), StructField(zzz,LongType,true)),true)

eyalhir74 commented 2 years ago

@revans2 the problematic array that I posted above, which repeats in a lot of our queries, is an array of longs. It also seems that when an array of structs, array of strings, or other complex structure appears only in the select clause, it doesn't affect performance too much.

The above example is part of a very complex query where we also do an explode on the 5K elements. The explode part runs on the V100 about 7-9 times faster than the equivalent sub-query running on the CPU.
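For readers unfamiliar with the operation: SQL's explode emits one output row per array element, duplicating the other columns. A tiny Python sketch of the semantics:

```python
# Sketch of SQL explode: one output row per array element,
# with the sibling column ("key" here) repeated for each element.
rows = [("a", [1, 2]), ("b", [3])]
exploded = [(key, elem) for key, arr in rows for elem in arr]

print(exploded)  # [('a', 1), ('a', 2), ('b', 3)]
```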

We even have the following complex structure, for example, in another query. However, the struct's elements are only accessed in the select clause, and the query does show a nice speedup on the GPU.

|-- userDatamapPairsList: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- key: string (nullable = true)
|    |    |-- value_pairsList: array (nullable = true)
|    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |-- key: string (nullable = true)
|    |    |    |    |-- value: double (nullable = true)

[UPDATE] - there are literally tens of different queries on this huge data set. Here's another structure on which we do a group by :(

|-- features: map (nullable = true)
|    |-- key: string
|    |-- value: array (valueContainsNull = true)
|    |    |-- element: struct (containsNull = true)
|    |    |    |-- info: string (nullable = true)
|    |    |    |-- type: string (nullable = true)
|    |    |    |-- scope: string (nullable = true)
|    |    |    |-- state: string (nullable = true)
|    |    |    |-- value: string (nullable = true)

viadea commented 2 years ago

@eyalhir74 Your migration project is very interesting to us. Do you mind connecting with me (hazhu@nvidia.com) to discuss further? This will help our product management team decide on the feature requests you need.

eyalhir74 commented 2 years ago

@viadea I've sent you an intro email.

revans2 commented 2 years ago

@eyalhir74 grouping by a map is not allowed in Spark; I get errors like the following:

org.apache.spark.sql.AnalysisException: expression m cannot be used as a grouping expression because its data type map<string,bigint> is not an orderable data type.;

Are you sure that the grouping key is a map? Or are you converting the map to a sorted list of structs with array_sort(map_entries(...))?
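The workaround mentioned here canonicalizes the map so it becomes orderable: `map_entries` turns it into an array of key/value structs, and `array_sort` gives that array a deterministic order. A Python sketch of the idea (tuples standing in for structs, illustration only):

```python
def map_to_sorted_entries(m):
    """Rough analogue of array_sort(map_entries(m)): turn an unordered
    map into a canonical, orderable list of (key, value) structs, so it
    can be used as a group-by key."""
    return sorted(m.items())

# Two maps with the same contents canonicalize identically,
# regardless of insertion order.
print(map_to_sorted_entries({"b": 2, "a": 1}))  # [('a', 1), ('b', 2)]
```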

revans2 commented 2 years ago

@eyalhir74 I filed a bunch of follow-on issues (or found existing requests) for the functionality.

I think I have something for all of the issues you have filed. If you agree you can switch over to tracking those individually and close this one, or we can switch this over to an epic where we can track them in one place.

eyalhir74 commented 2 years ago

@revans2 I'd be happy with one epic for all of them, but your call :)

jlowe commented 6 months ago

Closing as all issues linked in the description have been resolved.