exacaster / lighter

REST API for Apache Spark on K8S or YARN
MIT License

Running lighter locally #682

Closed jmilkiewicz closed 1 year ago

jmilkiewicz commented 1 year ago

Looking at the source code, I see there is a LocalBackend. I am wondering: is it possible to run Lighter locally (as a regular Java process on localhost) or as a Docker container, without YARN, and then simply create a session and submit Python statements against it?
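To make the question concrete, here is a hedged sketch of what that client side could look like, assuming Lighter exposes a Livy-style session API under `http://localhost:8080/lighter/api` (the base URL, port, and exact endpoint paths are assumptions; check your deployment). The helpers only build the requests, they do not send them:

```python
import json

# Assumed base URL; Lighter's port and API prefix may differ in your setup.
BASE_URL = "http://localhost:8080/lighter/api"

def create_session_request(name: str) -> tuple[str, dict]:
    """Build the (url, payload) pair for creating a session."""
    return f"{BASE_URL}/sessions", {"name": name}

def submit_statement_request(session_id: str, code: str) -> tuple[str, dict]:
    """Build the (url, payload) pair for submitting a Python statement."""
    return f"{BASE_URL}/sessions/{session_id}/statements", {"code": code}

url, payload = create_session_request("playground")
print(url)                  # http://localhost:8080/lighter/api/sessions
print(json.dumps(payload))  # {"name": "playground"}
```

With a real server, the pairs would be POSTed with any HTTP client (e.g. `requests.post(url, json=payload)`).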

If it is not possible without YARN, do you have a recipe/instructions for running YARN locally so that it works smoothly with Lighter?

I would like to use this kind of setup mostly as a playground for myself.

pdambrauskas commented 1 year ago

LocalBackend is meant for playing with/testing Lighter in a dev environment; if you want, you can use it. This backend implementation is used by default if you start Lighter without any custom configuration.

If you use it for Python sessions, you might need to build your own Docker image with Python included, since we do not include Python in the images we push to our registry.
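A custom image along those lines might look like the sketch below. The base image name/tag and the package manager are assumptions (the published image may use a different registry path or base OS), so treat this as a starting point rather than a working recipe:

```dockerfile
# Hypothetical base image — check the project's registry for the real name/tag.
FROM ghcr.io/exacaster/lighter:latest
USER root
# Assumes a Debian/Ubuntu base; adjust for alpine (apk) if needed.
RUN apt-get update \
    && apt-get install -y --no-install-recommends python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*
```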

In case you run Lighter locally, it should work fine as long as your local machine has Python installed.

jmilkiewicz commented 1 year ago

Hey, thanks a lot. I managed to start the app locally via gradlew (not sure why it does not want to start from IntelliJ). Given that it is running locally: do you have some working submitParams I can use to start a session? For now, on session creation, I see this error:

Spark home not found; set it explicitly or use the SPARK_HOME environment variable.

I downloaded Spark and set SPARK_HOME, but I still can not create a session - it always fails, with no errors in the logs. I am trying to debug org.apache.spark.launcher.SparkLauncher to find out what's happening.
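For reference, SPARK_HOME must be visible to the JVM process that launches Lighter, not just to the interactive shell where Spark was unpacked. A minimal sketch (the `/opt/spark` path is a placeholder for wherever the Spark distribution lives):

```shell
# Assumed install path — point SPARK_HOME at your unpacked Spark distribution.
export SPARK_HOME=/opt/spark
export PATH="$SPARK_HOME/bin:$PATH"

# Then start Lighter from the same shell so it inherits the variable,
# e.g. via gradlew (exact task depends on the project setup):
# ./gradlew run
```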

pdambrauskas commented 1 year ago

We have tried it with batch applications and it seemed to work (https://github.com/exacaster/lighter/blob/master/dev/README.md)

jmilkiewicz commented 1 year ago

Yes, these minimal params work for batch but not for a session :(. I want to debug further to find the reason... The worst part is that I see nothing in the logs (standard out).

jmilkiewicz commented 1 year ago

OK... it seems I found the issue. In the local setup, no env variables are passed to shell_wrapper.py... For YARN and K8s these are set via spark.kubernetes.driverEnv.* or spark.yarn.appMasterEnv.*.
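For comparison, on YARN or K8s those variables ride along in the Spark configuration via the prefixes mentioned above. A sketch of what such conf entries look like (the variable name and value are placeholders, and the surrounding submitParams shape may differ):

```json
{
  "conf": {
    "spark.kubernetes.driverEnv.MY_VAR": "my_value",
    "spark.yarn.appMasterEnv.MY_VAR": "my_value"
  }
}
```

The local launcher has no equivalent mechanism, which is why shell_wrapper.py starts without these variables.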

pdambrauskas commented 1 year ago

Oh, I see. It would not be a problem to set PY_GATEWAY_PORT & PY_GATEWAY_HOST globally, but I'm not sure what we could do with LIGHTER_SESSION_ID. We'll try to think of something to make sessions work in local mode and will let you know. Thanks for debugging this.

jmilkiewicz commented 1 year ago

TBH, I have already solved it on my branch... I can share the code later if you want. I have simply added:

pdambrauskas commented 1 year ago

It would be great if you could create a Pull Request with your changes. Or at least drop a link to your branch, we'll try to incorporate your ideas :)

jmilkiewicz commented 1 year ago

I would definitely like to help and support you guys with that... The scenario I am trying to implement is running Python code in a Lighter session. Unfortunately, it is not just "calculate the number pi" code, but code that imports some external libraries. These external libraries need to be fed some values via environment variables, so the whole idea is that sooner or later I will need to dig deeper into it. Basically, I am talking about 2 scenarios:

pdambrauskas commented 1 year ago

I've created a Pull Request to support setting env variables on session creation.

Regarding setting env variables on statement creation: the environment should not be mutated after you create the session. You should avoid doing that.

Keeping that in mind, if you really need to do it, os.environ['MY_VAR'] = 'my_value' should work.
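As a concrete illustration of that workaround, a statement submitted to the session could mutate the driver's environment directly (discouraged, per the caveat above):

```python
import os

# Mutating the environment after session creation is discouraged,
# but a statement can still set a variable in the driver process:
os.environ["MY_VAR"] = "my_value"

# Subsequent statements in the same session will see the value:
print(os.environ["MY_VAR"])  # my_value
```

Note this only affects the Python process executing the statements; it does not propagate to already-running executors.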

jmilkiewicz commented 1 year ago

Perfect, that sounds really good. To give you more context: I am planning to use Lighter sessions to execute workflows of tasks, not just individual tasks. Given that, I need to keep some kind of context, i.e. the fact that a statement belongs to a given workflow. Context is tracked via env variables, so I will not be able to avoid setting env variables at the statement level.