bartosz25 / spark-scala-playground

Sample processing code using Spark 2.1+ and Scala

Spark SQL Codegen: Why #17

Closed cozos closed 4 years ago

cozos commented 4 years ago

Hi,

Just read this: https://www.waitingforcode.com/apache-spark-sql/who-when-how-what-apache-spark-sql-code-generation/read

Great article. I was wondering if you could cover the why of code generation. Specifically, how does code generation improve performance?

Thanks

bartosz25 commented 4 years ago

Hi @cozos

Thanks for the suggestion, it's a great idea! I've just added it to my backlog. I have some other work to finish first, but I hope to have something ready by the end of October at the latest.

In the meantime, I will keep the issue open and close it with a link to the new article.

Bartosz.

cozos commented 4 years ago

That's great to hear!

I've read these articles:

https://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html

https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html

And from what I can tell, the gist of why code generation improves performance is:

Anyway, it'd be great to hear your thoughts on the matter.

Cheers

bartosz25 commented 4 years ago

Hi @cozos

The 3 points you gave here are good. I published a post about the "why" that looks at them in more depth. And since overall it's all about JIT optimizations, I'll publish a follow-up post next week about the different optimizations the JIT can apply to speed up code execution.

You can find the post here: https://www.waitingforcode.com/apache-spark-sql/why-code-generation-apache-spark-sql/read

If that's fine with you, I will close the issue.

Best regards, Bartosz.

cozos commented 4 years ago

Looks amazing, thank you very much!