facebookincubator / velox

A C++ vectorized database acceleration library aimed at optimizing query engines and data processing systems.
https://velox-lib.io/
Apache License 2.0

Java API for Velox plan #10763

Closed · hn5092 closed this issue 4 days ago

hn5092 commented 1 month ago

Description

Today, invoking Velox from Java requires going through Substrait, i.e. constructing a Substrait execution plan. The usage and learning cost of this is very high: it requires a deep understanding of SQL as well as profound knowledge of Substrait. That is a lot to master and very inconvenient for developers, and the complexity makes bugs expensive to maintain. I therefore propose building a Java SDK.

Goals:

  1. Scan a big list

    case class Data(A: Int, b: String)
    velox.values(Array[Data]).agg("b", "count(a)").collect
  2. Scan a formatted file

    velox.scan.parquet("filePath", schema, conf).agg("b", "count(a)").collect
  3. Generate a stage execution plan

    velox.scan.parquet("filePath", schema, conf).select("a", "b").planJson

What needs to be done:

  1. JNI framework:

C++: a. A Native class; objects implementing it can easily interface with a Java class. b. The framework's access to Java objects is also simple. c. High performance.

Java: a. A Native class, an abstract base class for all native-backed objects, inside which safety checks can be done. Objects implementing it get default behaviors (for example, rejecting calls after the underlying native object is released) that prevent JVM crashes; see the sketch after this list.

  2. Function resolution

  3. Transfer of intermediate results

  4. Velox Vector Java API
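To make the Java-side Native class idea concrete, here is a minimal sketch of what such an abstract base class could look like (all class and method names are hypothetical, not an existing Velox API): it holds the address of the underlying C++ object and validates liveness before every native call, so a use-after-close surfaces as a Java exception rather than a JVM crash.

```java
// Hypothetical sketch of the Java-side Native base class described above.
// All names are assumptions; nothing like this exists in Velox today.
public abstract class Native implements AutoCloseable {
  private long handle; // address of the underlying C++ object

  protected Native(long handle) {
    if (handle == 0) {
      throw new IllegalArgumentException("null native handle");
    }
    this.handle = handle;
  }

  // Subclasses call this before every JNI call; a released object then
  // fails with a Java exception instead of crashing inside native code.
  protected final long ensureOpen() {
    if (handle == 0) {
      throw new IllegalStateException(getClass().getName() + " is already closed");
    }
    return handle;
  }

  // Frees the C++ object; implemented per subclass via JNI.
  protected abstract void release(long handle);

  @Override
  public final synchronized void close() {
    if (handle != 0) {
      release(handle);
      handle = 0;
    }
  }
}
```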

hn5092 commented 1 month ago

Our engine, Starry DB, has already implemented a similar set of APIs, from which we have reaped huge benefits. Two of us completed the development and launch of this Spark native engine in half a year. If possible, we would like to share all of this with everyone.

zhztheplayer commented 4 weeks ago

This could be very helpful for JVM frontends.

> case class Data(A: Int, b: String)

Curious: did you mean a Java + Scala SDK, since this is Scala syntax?

kgpai commented 4 weeks ago

cc: @pedroerp @mbasmanova

pedroerp commented 4 weeks ago

We had discussions in the past about creating Java/JNI bindings, but I don't think anyone has actually done the work. Agreed it would be useful as more Java engines try to leverage Velox.

The one complication is that Velox doesn't really have a single public API (rather, there are multiple APIs working at different layers), so I suppose the first discussion would be to agree on the exact APIs we would like to expose. It would also be nice if we made them (at least mostly) compatible across C++ and PyVelox.

aditi-pandit commented 4 weeks ago

+1 for the Velox API discussion. At IBM we have lots of engineers writing functions, and having a Java API would be super useful.

assignUser commented 4 weeks ago

I think the API topic (in general, not specifically for Java) should be discussed in the next monthly? @pedroerp I currently don't see anything planned; when will the next one be?

mbasmanova commented 3 weeks ago

> +1 for the Velox API discussion. At IBM we have lots of engineers writing functions, and having a Java API would be super useful.

@aditi-pandit My understanding of the proposal is to provide a Java API for PlanNode.h, not for writing functions. @hn5092 Would you clarify?

pedroerp commented 3 weeks ago

> I think the API topic (in general, not specifically for Java) should be discussed in the next monthly? @pedroerp I currently don't see anything planned; when will the next one be?

True, it would be a good topic to discuss in that forum. @sanumandla has the schedule; I think that event expired, but we should probably bring it back to life.

mbasmanova commented 3 weeks ago

CC: @FelixYBW @rui-mo

FelixYBW commented 3 weeks ago

The most urgent Java API use case from Gluten is a wrapper to access the RowVector. One major performance issue in Gluten today is UDFs. A simple UDF is easy to port with the help of an LLM today, but as soon as a UDF uses a Java package like org.apache.http.client.utils.URIBuilder, porting it means searching for or porting the whole package.

If Velox had a Java API to access the RowVector data, we would still need to port row-based UDFs to columnar ones, but it would be much easier. We could also improve performance by removing the R2C/C2R (row-to-columnar/columnar-to-row) conversion or similar logic.
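To illustrate what porting a row-based UDF to a columnar one could look like on top of such a wrapper (every interface below is hypothetical; no such Java API exists in Velox yet): the UDF body loops over a Java view of one column instead of receiving one row at a time, so no R2C/C2R copy is needed.

```java
// Hypothetical Java view over one column of a Velox RowVector; these
// interfaces are assumptions used only to sketch a columnar UDF.
interface IntColumn {
  int size();
  boolean isNull(int row);
  int get(int row);
}

interface IntColumnWriter {
  void append(int value);
  void appendNull();
}

final class PlusOneUdf {
  // Row-based UDF logic applied column-at-a-time: one call per batch
  // instead of per row, and no row <-> columnar conversion.
  static void eval(IntColumn input, IntColumnWriter output) {
    for (int row = 0; row < input.size(); row++) {
      if (input.isNull(row)) {
        output.appendNull();
      } else {
        output.append(input.get(row) + 1);
      }
    }
  }
}
```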

hn5092 commented 3 weeks ago

> The most urgent Java API use case from Gluten is a wrapper to access the RowVector. One major performance issue in Gluten today is UDFs. A simple UDF is easy to port with the help of an LLM today, but as soon as a UDF uses a Java package like org.apache.http.client.utils.URIBuilder, porting it means searching for or porting the whole package.
>
> If Velox had a Java API to access the RowVector data, we would still need to port row-based UDFs to columnar ones, but it would be much easier. We could also improve performance by removing the R2C/C2R (row-to-columnar/columnar-to-row) conversion or similar logic.

> > +1 for the Velox API discussion. At IBM we have lots of engineers writing functions, and having a Java API would be super useful.
>
> @aditi-pandit My understanding of the proposal is to provide a Java API for PlanNode.h, not for writing functions. @hn5092 Would you clarify?

The Java API for PlanNode is certainly very important and fundamental, and it is a set of APIs we have already implemented. However, in my imagination Velox could do much more. For example, by passing a Java list to it, it could perform calculations directly, without needing an engine like Spark; it would act like a high-performance collection framework. Or, similar to DuckDB, it could directly read Parquet files and then perform computations, since many Java programs really need this functionality and the current Java/Scala lambda expressions cannot provide such efficient operations.

hn5092 commented 3 weeks ago

> The most urgent Java API use case from Gluten is a wrapper to access the RowVector. One major performance issue in Gluten today is UDFs. A simple UDF is easy to port with the help of an LLM today, but as soon as a UDF uses a Java package like org.apache.http.client.utils.URIBuilder, porting it means searching for or porting the whole package.
>
> If Velox had a Java API to access the RowVector data, we would still need to port row-based UDFs to columnar ones, but it would be much easier. We could also improve performance by removing the R2C/C2R (row-to-columnar/columnar-to-row) conversion or similar logic.

Our design entails a mapping between the entire vector batch and the underlying C++ Velox vector, with direct data reads and writes via Unsafe. This API suite currently supports all Spark types and has proven quite competitive in performance when tested against Spark's own vector benchmark. (From what I know, the performance of the dictionary type is not very good in open-source vector batch implementations such as Arrow.)
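As a rough illustration of the direct read/write technique described above (the buffer layout and ownership model here are assumptions for the sketch, not the Starry DB or Velox implementation): once the JNI layer hands Java the base address of a flat int32 buffer owned by the C++ vector, elements can be read with sun.misc.Unsafe without any copy or per-element JNI call.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Sketch of zero-copy reads from a native column buffer. Assumes the JNI
// layer has returned the address of a flat int32 values buffer owned by
// the C++ vector; the layout is an assumption for illustration only.
final class UnsafeIntColumn {
  private static final Unsafe UNSAFE = loadUnsafe();

  private final long valuesAddress; // base address of the int32 buffer
  private final int size;

  UnsafeIntColumn(long valuesAddress, int size) {
    this.valuesAddress = valuesAddress;
    this.size = size;
  }

  int get(int row) {
    if (row < 0 || row >= size) {
      throw new IndexOutOfBoundsException("row " + row);
    }
    // Direct off-heap read: no JNI call, no copy.
    return UNSAFE.getInt(valuesAddress + (long) row * Integer.BYTES);
  }

  private static Unsafe loadUnsafe() {
    try {
      Field f = Unsafe.class.getDeclaredField("theUnsafe");
      f.setAccessible(true);
      return (Unsafe) f.get(null);
    } catch (ReflectiveOperationException e) {
      throw new RuntimeException("sun.misc.Unsafe unavailable", e);
    }
  }
}
```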

We have replaced Spark's ParquetFormat vector batch with our vector implementation to address the issue of Velox Parquet not supporting many types. This replacement only involved modifications to the batch code.

In terms of UDFs, we can develop a vector eval on the Spark side to carry out row-wise evaluation on the batch. Then, we can insert data into the Velox batch using our batch API.


zhztheplayer commented 3 weeks ago

@hn5092 Thank you for sharing. Are you working with Gluten or with your own solution?

> We have replaced Spark's ParquetFormat vector batch with our vector implementation to address the issue of Velox Parquet not supporting many types.

This seems interesting. IIUC it's a pure Java implementation of ParquetFormat backed by the Velox in-memory data format, is that correct? Did you try comparing this ParquetFormat with Velox's native Parquet reader/writer in terms of performance?

Furthermore, I think we are basically discussing both the Velox query plan and the Velox data format in this topic. Perhaps the two can be split. My personal feeling is that implementing a plan API would help Java + Velox development in any case. For the data format, perhaps we should figure out up front the relationship between the new Java Velox format and the existing Velox + Arrow solution. Sharing data between Java and C++ may require a well-designed Velox C ABI, which could functionally overlap a little with the Velox Arrow bridge. Do you have some thoughts on this part?
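For context on the existing Velox + Arrow path mentioned above: Velox's C++ Arrow bridge can export a vector into the Arrow C data interface structs, and Arrow's Java arrow-c-data module can then wrap those structs without copying. A rough sketch of the Java side, assuming a hypothetical JNI entry point that asks the C++ side to fill the structs:

```java
import org.apache.arrow.c.ArrowArray;
import org.apache.arrow.c.ArrowSchema;
import org.apache.arrow.c.Data;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.FieldVector;

public final class VeloxArrowImport {
  // Hypothetical JNI entry point: the C++ side would use Velox's Arrow
  // bridge to fill the two C data interface structs at these addresses.
  private static native void exportVeloxVector(long arrayAddress, long schemaAddress);

  public static FieldVector importVector(BufferAllocator allocator) {
    try (ArrowArray array = ArrowArray.allocateNew(allocator);
         ArrowSchema schema = ArrowSchema.allocateNew(allocator)) {
      exportVeloxVector(array.memoryAddress(), schema.memoryAddress());
      // Wraps the C structs as a Java FieldVector; the import moves the
      // buffers, so they are shared rather than copied.
      return Data.importVector(allocator, array, schema, null);
    }
  }
}
```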

FelixYBW commented 3 weeks ago

> In terms of UDFs, we can develop a vector eval on the Spark side to carry out row-wise evaluation on the batch. Then, we can insert data into the Velox batch using our batch API.

We already have vector-based UDFs in Gluten. The current solution is to convert to Arrow format, then use Arrow's Java API to access the data. This part could be upstreamed to Spark, but it would not be very useful there, because in Spark only the Parquet scan outputs the columnar format.

The initial plan in Velox was that we should have data-level consistency, so we could get one data format with two wrappers (Arrow's RecordBatch and Velox's RowVector). But it doesn't work well yet.

@pedroerp should we add a Java wrapper for the Velox data format, or should we leverage Arrow's Java wrapper?

FelixYBW commented 3 weeks ago

> Furthermore, I think we are basically discussing both the Velox query plan and the Velox data format in this topic.

The data format one is more important to Gluten. Gluten doesn't need a Java wrapper for the Velox query plan.

FelixYBW commented 3 weeks ago

@hn5092 you may want to move this topic to a discussion.

hn5092 commented 3 weeks ago

> @hn5092 Thank you for sharing. Are you working with Gluten or with your own solution?
>
> > We have replaced Spark's ParquetFormat vector batch with our vector implementation to address the issue of Velox Parquet not supporting many types.
>
> This seems interesting. IIUC it's a pure Java implementation of ParquetFormat backed by the Velox in-memory data format, is that correct? Did you try comparing this ParquetFormat with Velox's native Parquet reader/writer in terms of performance?
>
> Furthermore, I think we are basically discussing both the Velox query plan and the Velox data format in this topic. Perhaps the two can be split. My personal feeling is that implementing a plan API would help Java + Velox development in any case. For the data format, perhaps we should figure out up front the relationship between the new Java Velox format and the existing Velox + Arrow solution. Sharing data between Java and C++ may require a well-designed Velox C ABI, which could functionally overlap a little with the Velox Arrow bridge. Do you have some thoughts on this part?

  1. Performance-wise, Velox's Parquet is definitely better than Spark's, but in terms of compatibility, data types, and filesystems (such as HDFS), Spark's native Parquet is somewhat superior. I was compelled to choose this solution initially (because Velox does not support timestamp types). Our current approach can also directly support other formats, such as ORC. At the same time, we have a columnar distributed in-memory cache (caching Velox batches), so we haven't considered Velox's Parquet for the time being. If possible, I might later port over Impala's Parquet, which I consider the world's fastest and most compatible.

  2. We gave up on Arrow. We previously tried using Gluten and also attempted to implement our API on Arrow, but the performance was not satisfactory. Based on our past experience using Impala as the backend (we had also written a set of Java APIs for the Impala row batch), we decided to write our own Velox batch Java API. Of course, our implementation is still some way from being contributable; constant and dictionary types are not yet supported (currently, when we encounter such a type, we just flatten it).

Additionally, I haven't yet seen any essential benefit that Arrow provides for a Java API. If there are advantages I am not aware of, please let me know.

aditi-pandit commented 3 weeks ago

@mbasmanova: The IBM use cases that I'm familiar with are close to what BinWei described above. The existing functions (scalar or aggregate) are written in Java, and we want to wrap the row-by-row logic (though some functions could be columnar as well) in a Velox operator that can be invoked through SQL.

A remote UDF could be used for this as well, though a Java API would be more performant.

This issue has been revised and rewritten a few times, but when I commented, that was my line of thought.

The Velox PlanNode API also came up in conversation a few times when we discussed PlanValidation of native plans on the coordinator side.

hn5092 commented 2 weeks ago

I hope this matter won't be put on hold. In the next few weeks, I'll take the time to submit a draft of the Velox Vector Java API for everyone to review. Please take a look and provide your feedback. Once that is complete, we will proceed with Phase 2, the development of the Velox PlanNode Java API.

pedroerp commented 4 days ago

> The Java API for PlanNode is certainly very important and fundamental, and it is a set of APIs we have already implemented. However, in my imagination Velox could do much more. For example, by passing a Java list to it, it could perform calculations directly, without needing an engine like Spark; it would act like a high-performance collection framework. Or, similar to DuckDB, it could directly read Parquet files and then perform computations, since many Java programs really need this functionality and the current Java/Scala lambda expressions cannot provide such efficient operations.

I'm +1 for adding bindings to the main APIs we need, but we need to be careful about the goals of the project. Velox is not meant to be a user-facing data processing library by itself; it is not meant to be an alternative to things like Pandas, NumPy, DuckDB, and similar. The idea is that such projects could be built by leveraging Velox.

So, in my opinion, if the goal is to provide a very thin wrapper exposing Velox's main API in Java, this would be part of Velox (like PyVelox, although that is still a prototype). If the plan is to provide an API for end users (data scientists and such) so that they could use Velox's power to build data pipelines (reading Parquet files, aggregating data, similar to Pandas), I still think that's a very useful and compelling thing to build, but it sounds like a different project altogether.

pedroerp commented 4 days ago

Now, regarding the actual APIs to be exposed. In my view, the main value Velox provides is a way to execute query plans very efficiently. In this sense, here is what an external API should comprise, in this order:

  1. The query plan representation itself (plan nodes).
  2. A way to execute it (tasks).

It happens that many plan nodes take expressions, so we will also need:

  3. A way to represent expressions (expression nodes). One could argue that this is already part of 1.

Optionally, we could expose a way to evaluate expressions as a standalone API, independent of query plans:

  4. Expression evaluation (exprSet).

Similarly, some of this API will rely on the actual data representation, so we will probably need to expose an API for vectors (similar to Arrow):

  5. Data representation API (vectors, types).

I think it even makes sense to tackle things in this order (although 1 is arguably the hardest to expose), since Velox's main value is executing plans, not representing data (the latter could also be done with something like Arrow).
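To make this layering concrete, here is a hypothetical sketch of what the corresponding Java API surface might look like (every name below is an assumption, not an existing or planned Velox API; the numbers match the items above):

```java
// Hypothetical Java API surface mirroring the plan/task/expression/vector
// layering discussed in this comment. All names are illustrative only.

// 1. Query plan representation (plan nodes).
interface PlanNode {
  String toJson();
}

// 3. Expressions, here represented as SQL-like strings resolved by the builder.
interface PlanBuilder {
  PlanBuilder tableScan(String table, String schema);
  PlanBuilder filter(String expression);
  PlanBuilder aggregate(String... aggregates);
  PlanNode build();
}

// 2. A way to execute a plan (tasks).
interface Task extends AutoCloseable {
  RowVector next(); // returns null once the plan is fully consumed

  @Override
  void close();
}

// 4. Standalone expression evaluation (exprSet-like), independent of plans.
interface ExpressionEvaluator {
  RowVector eval(String expression, RowVector input);
}

// 5. Data representation API (vectors, types).
interface RowVector {
  int size();
}
```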

@mbasmanova @xiaoxmeng @oerling @Yuhta @majetideepak , thoughts?