viadea opened this issue 2 years ago
We can start to put in some support for UDTs, but a UDT is a Java class that provides ways to translate to/from other standard SQL types. We can read in the standard SQL types, but it is going to take some work to understand exactly when and where the translations to/from the Java type happen in Spark, and to make sure we can plumb all of that through. That is the main reason we have not added any support for UDTs yet.
How is this used? Typically someone adds support for a UDT because they want to interact with the data as a Java class instead of as a SQL struct. That means we are not likely to be able to do much with it once it is read in, except send it to the CPU for more processing.
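For reference, here is a minimal sketch (in a spark-shell session, where `spark` is predefined; the data and column names are made up) of what a VectorUDT column looks like from the public API, and where the translation back to the Java class happens:

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import spark.implicits._

// The "features" column is typed as VectorUDT. Spark translates between the
// Java Vector objects and their internal SQL struct representation whenever
// rows are encoded or decoded.
val df = Seq(
  (0, Vectors.dense(1.0, 2.0, 3.0)),
  (1, Vectors.sparse(3, Array(0, 2), Array(4.0, 5.0)))
).toDF("id", "features")

df.printSchema()  // features: vector (nullable = true)

// collect() hands back Java objects: deserialization from the SQL struct to
// org.apache.spark.ml.linalg.Vector happens inside Spark at this point.
val firstVector: Vector = df.collect().head.getAs[Vector]("features")
```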
It should be a use case from the ML side. Currently our two supported ML cases are XGBoost and PCA, and both of them use VectorUDT in their CPU versions. The entry point for this could be VectorAssembler (merging multiple columns into one) or a customized UDF (casting an ArrayType column to VectorUDT), like:
```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.udf

// Pack an ArrayType(DoubleType) column into a dense VectorUDT column.
val convertToVector = udf((array: Seq[Double]) => {
  Vectors.dense(array.toArray)
})
```
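A usage sketch for the UDF above, assuming a DataFrame `df` with an `array<double>` column named `arr` (both names are hypothetical):

```scala
import org.apache.spark.sql.functions.col

// The resulting "features" column is typed as VectorUDT.
val withFeatures = df.withColumn("features", convertToVector(col("arr")))
```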
For the first case, XGBoost has added support for multiple columns as input. For the second case, PCA also supports ArrayType columns directly.
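For the VectorAssembler entry point mentioned above, a minimal sketch (the input column names are made up for illustration):

```scala
import org.apache.spark.ml.feature.VectorAssembler

// Merge several numeric columns into a single VectorUDT column named "features".
val assembler = new VectorAssembler()
  .setInputCols(Array("f0", "f1", "f2"))
  .setOutputCol("features")

val assembled = assembler.transform(inputDf)  // inputDf is assumed to have columns f0, f1, f2
```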
Do you have an example of a full workflow/query that you want to have optimized? Adding in support for UDTs is possibly a lot of work, and it would be nice to know which areas we should concentrate on first. Looking at VectorUDT, there are two implementations: one for sparse and another for dense vectors. Each row could be one or the other depending on the data in it. So reading that data out of parquet should not be too difficult, but how is it going to be used? Are we going to have to support user-defined functions that take user-defined types? I am just concerned that this is the first layer of an onion: we can add in what you are asking for, but I don't think it is going to help in terms of performance until we do a lot of follow-on work too.
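To illustrate the mixed dense/sparse point, here is a small sketch (spark-shell; the path and names are illustrative) of round-tripping such a column through parquet:

```scala
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._

// The same VectorUDT column can hold a dense vector in one row and a sparse
// vector in the next; parquet stores the underlying SQL struct, and Spark
// reattaches the UDT when the file is read back.
val vecDf = Seq(
  (0, Vectors.dense(1.0, 0.0, 3.0)),
  (1, Vectors.sparse(3, Array(1), Array(2.0)))
).toDF("id", "features")

vecDf.write.mode("overwrite").parquet("/tmp/vector_udt_example")
spark.read.parquet("/tmp/vector_udt_example").printSchema()  // features: vector
```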
@revans2 I could share more details offline. Here are some of the operators needed on the vector type, based on the logs:
All of those except CollectLimitExec look to be doable. We do not currently support CollectLimitExec because it is a performance optimization in Spark that we just have not felt the need to support. If you have a real use case that is not just `show`, then we should talk about filing an issue to add support for it generally too.
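For context, a terminal limit is what gets planned as CollectLimitExec; a quick way to see it in spark-shell (the exact plan text may vary by Spark version):

```scala
// df.show() and df.take(n) effectively go through the same code path.
spark.range(100).limit(5).explain()
// == Physical Plan ==
// CollectLimit 5
// +- Range (0, 100, step=1, ...)
```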
I wish we could support the data type org.apache.spark.mllib.linalg.VectorUDT.
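For reference, the older mllib vector type can be converted to the newer ml one with MLUtils; a sketch, assuming a DataFrame `inputDf` with a vector column named `features`:

```scala
import org.apache.spark.mllib.util.MLUtils

// Rewrites org.apache.spark.mllib.linalg.VectorUDT columns to the
// org.apache.spark.ml.linalg equivalent; other columns are left untouched.
val converted = MLUtils.convertVectorColumnsToML(inputDf, "features")
```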
Mini repro:
Unsupported messages: