exacaster / lighter

REST API for Apache Spark on K8S or YARN
MIT License

Session creation is slow #607

Closed julienlau closed 1 year ago

julienlau commented 1 year ago

Hi,

When running a small job through Lighter, I observed that Lighter takes a long time to schedule it. I only use sessions.

If I run spark-pi with args=1, the single Spark stage takes 0.1ms. If I run it directly through a spark-submit batch on my k8s cluster, it takes 23s to complete.

However, if I have to create a session with Lighter, it takes ~90s before the session reaches state="STARTED" or IDLE.

Is it due to Micronaut scheduling things every minute? Something else? Is it tunable?

In addition, "create a session, wait for status = STARTED/IDLE, then submit a statement" takes ~90s, which is noticeably slower than "create a session, submit a statement to the new session before it has started, then wait for the statement state to be available", which takes ~75s.

Thanks and regards

julienlau commented 1 year ago

[image]

pdambrauskas commented 1 year ago

Lighter uses scheduled functions to launch and track sessions:

Currently these values are hardcoded.

julienlau commented 1 year ago

Hi,

Thanks for confirming this! Regarding the 2-minute status check, would you consider it very harmful to decrease it to 1 minute?

I found a good hack to bypass this:

Regards

Minutis commented 1 year ago

If I recall correctly, the 2-minute status check covers all jobs: not only sessions but also batch jobs. It was increased from 1 minute to 2 minutes because checking all jobs can take a considerable amount of time when many of them are running at a given moment, and in that case the job check would run into issues. So simply changing the default might not be the best approach here. @pdambrauskas any thoughts?
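To put rough numbers on this trade-off, here is a minimal sketch of how a fixed-rate status check adds latency. It is not Lighter code: it assumes ticks fire at exact multiples of the interval and that jobs are checked one after another, and the function names and the per-job check cost are hypothetical.

```python
import math


def polling_delay(ready_at: float, interval: float) -> float:
    """Seconds between an event at `ready_at` and the next fixed-rate tick.

    Assumption: ticks fire at 0, interval, 2 * interval, ...
    """
    next_tick = math.ceil(ready_at / interval) * interval
    return next_tick - ready_at


def mean_polling_delay(interval: float, samples: int = 1000) -> float:
    """Average delay when the event time is spread evenly over one interval."""
    step = interval / samples
    total = sum(polling_delay((i + 0.5) * step, interval) for i in range(samples))
    return total / samples


def check_cycle_fits(job_count: int, seconds_per_job: float, interval: float) -> bool:
    """True if one full status pass finishes before the next tick.

    Assumption: jobs are checked sequentially at a fixed cost each.
    """
    return job_count * seconds_per_job <= interval


if __name__ == "__main__":
    for interval in (120.0, 60.0):
        print(interval, round(mean_polling_delay(interval), 1))
    # Hypothetical load: 200 running jobs at 0.5 s per status check.
    print(check_cycle_fits(200, 0.5, 120.0), check_cycle_fits(200, 0.5, 60.0))
```

Under these assumptions, halving the interval halves the average wait for a state change to be noticed, but it also halves the time available to finish one pass over all running jobs.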

pdambrauskas commented 1 year ago

> any thoughts?

I think it would make sense to make these intervals configurable.

> Since statements are stored in DB, I expect this is OK. Don't you think there is a risk of a statement being lost?

In some cases, when a session fails to start, your statements could be stuck in "waiting" status. So you need to keep this in mind if you choose this approach.
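One way a client could guard against a statement stuck in "waiting" is to poll with a deadline and give up early if the session itself has failed. The sketch below is only an illustration: the two fetcher callables stand in for whatever HTTP calls the client makes against Lighter, and the set of failed session states is an assumption, not taken from Lighter's API.

```python
import time
from typing import Callable, FrozenSet

# Assumption: names of session states that mean "will never start".
FAILED_SESSION_STATES: FrozenSet[str] = frozenset({"error", "dead", "killed"})


class SessionFailed(Exception):
    """The session reached a failed state while the statement was waiting."""


class StatementTimeout(Exception):
    """The statement was still waiting when the deadline passed."""


def wait_for_statement(
    get_session_state: Callable[[], str],    # hypothetical: returns session state
    get_statement_state: Callable[[], str],  # hypothetical: returns statement state
    timeout_s: float = 300.0,
    poll_s: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the statement leaves "waiting"; return its new state."""
    deadline = clock() + timeout_s
    while True:
        statement_state = get_statement_state().lower()
        if statement_state != "waiting":
            return statement_state
        # The statement is still queued: make sure the session can still start.
        session_state = get_session_state().lower()
        if session_state in FAILED_SESSION_STATES:
            raise SessionFailed(f"session state is {session_state!r}")
        if clock() >= deadline:
            raise StatementTimeout(f"still waiting after {timeout_s} s")
        sleep(poll_s)
```

The `clock` and `sleep` parameters exist only so the loop can be exercised without real waiting; a real client would pass functions that query the session and statement endpoints.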