bartosz25 / spark-scala-playground

Sample processing code using Spark 2.1+ and Scala

Why does Spark generate Java code and not Scala code? #18

Open igreenfield opened 4 years ago

bartosz25 commented 4 years ago

Thank you @igreenfield for such an amazing question! I was looking for the reasons in the documentation and old PRs but didn't find any information about that. I've just posted a question on the Spark users mailing list. You can follow the conversation at https://mail-archives.apache.org/mod_mbox/spark-user/201911.mbox/browser or, if not, I'll keep you up to date on this issue.

Cheers, Bartosz.

igreenfield commented 4 years ago

@bartosz25 I was looking into the code generation phase and I think that if the generated code were Scala, it would be easier to reduce the number of code lines, so many of the compilation failures caused by methods growing beyond 64KB would disappear.

bartosz25 commented 4 years ago

Hi @igreenfield ,

I've got some answers from the mailing list:

Long story short, it's all about the compilation performance :)

Regarding your point about the 64KB limitation, AFAIK Spark has protections against too long methods. First, it's able to split a too long function into multiple methods (spark.sql.codegen.methodSplitThreshold). Second, it's also able to deactivate codegen entirely to respect the JVM's max method length limit (spark.sql.codegen.hugeMethodLimit).
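For illustration, here is a minimal sketch of setting both knobs on a session; the default values shown are assumptions based on my reading of SQLConf, so double-check them against your Spark version:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("codegen-limits")
  .master("local[*]")
  // Split a generated function into smaller helper methods once its
  // source size exceeds this threshold (assumed default: 1024).
  .config("spark.sql.codegen.methodSplitThreshold", "1024")
  // Fall back to non-codegen execution when a compiled method's bytecode
  // would exceed this limit (assumed default: 65535, the JVM's cap).
  .config("spark.sql.codegen.hugeMethodLimit", "65535")
  .getOrCreate()
```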

Did you already have some issues with a "too long" generated method that made your pipeline fail? I've never experienced that, so I'm really curious to learn new things and maybe help you overcome the issue by reworking the code.

igreenfield commented 4 years ago

Hi @bartosz25, first of all, thanks for the help!!

  1. The compilation performance concern could be addressed by using a compile server.
  2. Yes, I hit the 64KB limit all the time. My use case is very complex: we are migrating a SQL engine to Spark, and in most cases it's the processNext method that grows too large (a way to inspect those generated methods is sketched below).
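
For reference, a minimal way to dump the generated Java, including processNext(), is the debug helper that ships with Spark; the wide query below is just a made-up way to inflate the generated code:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.debug._ // brings debugCodegen() into scope

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// A deliberately wide projection, just to grow the generated code.
val df = (1 to 100).foldLeft(Seq(1L).toDF("c0")) {
  case (acc, i) => acc.withColumn(s"c$i", acc("c0") + i)
}

// Prints the Java source generated for each WholeStageCodegen subtree,
// including the processNext() method that can approach the 64KB limit.
df.debugCodegen()
```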

we can schedule a call and I can explain in more details.

Another thing, one of the answers:

Also, for low-level code we can't use (due to perf concerns) any of the edges Scala has over Java, e.g. we can't use the Scala collections library, functional programming, map/flatMap. So using Scala doesn't really buy anything, even if there were no compilation speed concerns.

I think the ability to return more than one object from a function could make the difference in splitting the huge methods into smaller ones.
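
To illustrate that point, a split-out helper in Scala can hand back several intermediate values at once as a tuple, which Java can only do with extra holder classes; all names here are hypothetical, made up for the example:

```scala
// Hypothetical helper extracted from a huge generated method: it computes
// several intermediate values and returns them together as a tuple.
def evalStep(value: Long): (Boolean, Long, Long) = {
  val isNull  = value == -1L
  val shifted = value << 2
  val masked  = value & 0xffL
  (isNull, shifted, masked)
}

// The caller destructures the tuple, so no mutable fields or holder
// classes are needed to pass the intermediate state around.
val (isNull, shifted, masked) = evalStep(42L)
```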

bartosz25 commented 4 years ago

Re @igreenfield

At the moment I don't have much time, so I won't be able to help you. Sorry for that; late January should be better. In the meantime, maybe you can take a look at my series about Apache Spark customization. I cover there how to alter logical and physical plans, how to add a new parser, and so forth. Maybe with that you can write your own code generation which will be much shorter than the code you've just shown me. The articles were published here: https://www.waitingforcode.com/tags/spark-sql-customization
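
As a taste of what those articles cover, here's a minimal sketch of the extensions API used to plug in custom rules (assuming Spark 2.2+; MyNoOpRule is a hypothetical placeholder rule):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule that leaves the plan untouched; a real rule would
// rewrite the LogicalPlan here (e.g. to simplify expressions).
case class MyNoOpRule(session: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

val spark = SparkSession.builder()
  .master("local[*]")
  // withExtensions lets you inject optimizer rules, planner strategies,
  // and custom parsers before the session is created.
  .withExtensions(_.injectOptimizerRule(session => MyNoOpRule(session)))
  .getOrCreate()
```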

Anyway, I doubt the Spark community will agree to switch code generation to Scala because of a single request. But you can always give it a try and ask directly on the mailing list: https://spark.apache.org/community.html

Cheers, Bartosz.

igreenfield commented 4 years ago

Hi @bartosz25, thanks! I will be in touch with you in late January.