@milesgranger: Any ideas on whether it's possible for us to assert the correct number of Spark workers?
Hrmm, I'm almost inclined to think that this is kinda on Spark for losing and not recovering workers. But otherwise, I'd think one could check with sc.getConf().get("spark.executor.instances") (the config setting we looked into), though I don't know if that value is dynamic?
Edit: actually that, plus another way explained here: https://stackoverflow.com/questions/38660907/how-to-get-the-number-of-workersexecutors-in-pyspark
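For reference, a quick sketch of both approaches, assuming an active SparkContext `sc`; the second one comes from the Stack Overflow thread above and relies on a private attribute, so treat it as illustrative rather than a supported API:

```python
# Sketch only, assuming an active SparkContext `sc`.

# Static: the requested number of executors from the submitted config;
# this does not reflect executors that died later.
requested = sc.getConf().get("spark.executor.instances")

# Dynamic-ish: count entries in the JVM-side executor memory status map
# (includes the driver, hence the `- 1`). Relies on the private `sc._jsc`.
active = sc._jsc.sc().getExecutorMemoryStatus().size() - 1
```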
I can poke around at this tomorrow if that'd be helpful.
Hrmm, I'm almost inclined to think that this is kinda on Spark for losing and not recovering workers.
I'd generally side with you, but I'm not entirely sure whether it's on us (our cluster configuration/setup) that those workers drop without healing.
I can poke around at this tomorrow if that'd be helpful.
If it's quick to do, that would be nice!
From what I see, when the Spark worker exits, the Dask nanny/worker is also getting killed, so the VM shuts itself down. Is that what other folks are seeing?
From what I see, when the Spark worker exits, the Dask nanny/worker is also getting killed, so the VM shuts itself down. Is that what other folks are seeing?
10.0.27.248 is interesting. It starts and seems perfectly happy, and then the scheduler kills it due to a heartbeat timeout, even though, according to the (Spark) scheduler, it's processing tasks fine less than 10 seconds before it's killed. Maybe the answer, at least for this case, is to bump or remove the Dask heartbeat timeout for the TPC-H/Spark clusters?
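For context, the scheduler-side timeout in question is `distributed.scheduler.worker-ttl` (how long the Dask scheduler waits for a heartbeat before removing a worker). A rough sketch of bumping or disabling it; the values are illustrative, and it would need to be set in the scheduler's environment before the cluster starts:

```python
import dask

# Illustrative only: relax the heartbeat timeout so busy Spark executors
# co-located with Dask workers aren't removed prematurely...
dask.config.set({"distributed.scheduler.worker-ttl": "30 minutes"})

# ...or disable the timeout entirely for the TPC-H/Spark clusters.
dask.config.set({"distributed.scheduler.worker-ttl": None})
```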
Checking back in on counting the number of executors: indeed, conf.get("spark.executor.instances") will not change over the course of the cluster. However, with the Spark REST API I've confirmed that /applications/[app-id]/executors works well to get the current count of active executors.
Only need to have port 4040 opened up by default, or at least behind a flag, as it's closed right now. I opened it in the linked PR along with the other required ports, but I'd guess platform has a better way to deal with this. @ntabris
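A minimal sketch of querying that endpoint, assuming the driver UI port (4040) is reachable; the helper name is hypothetical, and the application id can be read off the running SparkContext via sc.applicationId:

```python
import requests


def get_active_executor_count(driver_host: str, app_id: str, port: int = 4040) -> int:
    """Count currently active executors via the Spark monitoring REST API.

    Hits /applications/[app-id]/executors on the driver UI port; the entry
    with id "driver" is excluded so only executors are counted.
    """
    url = f"http://{driver_host}:{port}/api/v1/applications/{app_id}/executors"
    executors = requests.get(url, timeout=5).json()
    return sum(1 for e in executors if e.get("id") != "driver")
```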
For now I have this implementation here for suggestions: https://github.com/coiled/benchmarks/pull/1450
@hendrikmakait, I was curious if https://github.com/coiled/benchmarks/pull/1450 closes this, now that it tracks/warns about changes in the executor count. But maybe we still want an assert somewhere?
But maybe we still want an assert somewhere?
I was asking myself the same question. Maybe an assertion makes more sense? Otherwise, I fear we won't really notice the warnings.
In this run, the PySpark cluster lost three workers over time without us ever detecting, let alone healing, this. Things went wrong for PySpark and likely skewed test results afterward.
At the very least, we should be able to assert that a PySpark cluster still has the requested number of workers when beginning a new query.
Cluster: https://cloud.coiled.io/clusters/410048/
CI Run: https://github.com/coiled/benchmarks/actions/runs/8231277279/job/22506263406#step:8:745
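A rough sketch of the kind of per-query guard described above, reusing the hypothetical get_active_executor_count() helper from earlier in the thread; the expected count would come from the benchmark's cluster spec:

```python
def assert_executor_count(driver_host: str, app_id: str, expected: int) -> None:
    """Fail fast before a query if the PySpark cluster has lost executors."""
    active = get_active_executor_count(driver_host, app_id)
    assert active == expected, (
        f"PySpark cluster degraded: {active} active executors, expected {expected}"
    )
```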