apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0
547 stars 105 forks source link

Planning to publish Roadmap? #19

Open okue opened 3 months ago

okue commented 3 months ago

Hi, I would like to inquire about a roadmap for Comet.

I am currently exploring the option of replacing Spark workloads with native engines, and I'm keen to understand the future plans and scope of this plugin. Any information on the roadmap would be greatly appreciated.

Thank you!

sunchao commented 3 months ago

Thanks @okue . Yes, will add a roadmap into doc soon. We used to have it internally.

alamb commented 3 months ago

Maybe anyone else who is interested in using / contributing to comet could use this ticket to explain their use case / any features they are interested in helping with

milenkovicm commented 3 months ago

I'd like to put a suggestion, based on my experience a lot of production spark workloads have some kind of UDF, good chunk of those UDFs are very simple and can be expressed as SQL expressions.

I assume that in case UDF is used, comet will fall back to classic spark execution, which might not be optimal (I might be wrong, apologise if I am, I did not do my homework to check comet code in depth). My suggestion is to consider adding functionality like https://nvidia.github.io/spark-rapids/docs/additional-functionality/udf-to-catalyst-expressions.html which can speed up UDF in comet case as well.

I believe there is nothing GPU specific in that code, and it can be reused, just not sure what would be the best approach.

Maybe @andygrove would be able to help

dbtsai commented 3 months ago

Five years ago, we were delving into bytecode analysis for similar purposes at Apple, as detailed in this video: https://www.youtube.com/watch?v=FWg8iF34RJw. Our implementation differed from Nvidia's, and we encountered many edge cases that didn't work. Consequently, we halted further investment in this area. Nonetheless, we remain optimistic that we can refine the process to reliably handle simpler use-cases, prompting us to consider revisiting this approach.

milenkovicm commented 3 months ago

Five years ago, we were delving into bytecode analysis for similar purposes at Apple, as detailed in this video: https://www.youtube.com/watch?v=FWg8iF34RJw. Our implementation differed from Nvidia's, and we encountered many edge cases that didn't work. Consequently, we halted further investment in this area. Nonetheless, we remain optimistic that we can refine the process to reliably handle simpler use-cases, prompting us to consider revisiting this approach.

Thanks for sharing @dbtsai!

From my experience some projects mostly have "simple" udfs others may have mostly complex functions. It'll be hard to get it right for all of them. Simple functions usually deal with data validation and they may be called all over the place, so speeding them up may make sense

sunchao commented 3 months ago

To add on what @dbtsai said, this is the original PR that @aokolnychyi opened. Some of the discussions in the PR are still worthwhile to read.

Also cc @jlowe @andygrove @abellina @tgravescs @winningsix as we have went through exact the same topic lately.

alamb commented 3 months ago

Another perhaps interesting read from the literature is:

"Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask" - https://www.vldb.org/pvldb/vol11/p2209-kersten.pdf

The findings of the paper is that Vectorized evaluation (aka what we use in DataFusion) and compiled evaluation offer similar performance for most analytic style workloads (though each is better at certain cases than the otehr)

The reason I think JIT / compiled engines are less common than vectorized engines is the software engineering challenges -- that they are often harder to maintain and debug when something goes wrong compared to "normal" vectoized code.

milenkovicm commented 3 months ago

Another perhaps interesting read from the literature is:

"Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask" - https://www.vldb.org/pvldb/vol11/p2209-kersten.pdf

The findings of the paper is that Vectorized evaluation (aka what we use in DataFusion) and compiled evaluation offer similar performance for most analytic style workloads (though each is better at certain cases than the otehr)

The reason I think JIT / compiled engines are less common than vectorized engines is the software engineering challenges -- that they are often harder to maintain and debug when something goes wrong compared to "normal" vectoized code.

I believe authors of "Photon: A Fast Query Engine for Lakehouse Systems" came with similar conclusion.

senordeveloper commented 2 months ago

Is there a plan to integrate native Iceberg support? Like https://github.com/apache/iceberg-rust

sunchao commented 2 months ago

@senordeveloper Yes, we do have plan to integrate with it, but probably sometime in future given the project is still new. It will require some work given our current Parquet reader implementation is a hybrid one with IO & decompression done at the JVM side while decoding at the native side. We have plan to move to a fully native implementation like the one in arrow-rs. After that, it should be much easier to integrate with iceberg-rs or delta-rs.