running two instances of spark-perf ?

msifalakis commented 8 years ago

I have been running 1 instance of the spark-perf benchmark using the core tests and the MLlib tests on a small-ish cluster (only 4 nodes, yet quite poweful ones -- 64GB ram, 8 cores each), using scale-factor=1. The benchmark is occupying just 4 executors (1 core each).

Now I ve tried to launch a second scaled-down (0.1) configuration of the benchmark suite using only the core tests.. at the same time. Although there are both memory and executor/cores available, the benchmark fails to start! (or more precisely it fails to engage workers! .. giving me the following message "Spark is still running on some slaves ... sleeping for 10 seconds"). That is even though I have set the USE_CLUSTER_SPARK = True, and RESTART_SPARK_CLUSTER = False -- so I guess it tries to use my existing cluster

On the other hand if I start a spark-shell or start another appl it seems to get admitted just fine!

Any ideas of what this means ? .. Given the very spartan information about what the benchmark does/uses it is rather difficult to know which direction to start looking at.

TIA

Manolis.

JoshRosen commented 8 years ago

Hi @msifalakis,

My hunch is that this is a longstanding bug.

It wouldn't surprise me if nobody has tried running two instances of spark-perf at the same time on the same set of machines or on machines which run other non-spark-perf clusters. Whenever I've run this benchmark, I've done it on a dedicated set of EC2 machines which only run the spark-perf cluster.

If you'd like to try to fix this yourself, here's a few starting points:

The log message that you saw comes from ensure_spark_stopped_on_slaves, which seems to search for any executor backend, not just ones that belong to the spark-perf cluster: https://github.com/databricks/spark-perf/blob/79f8cfa6494e99a63f7cd4502aea4660b72ff6da/lib/sparkperf/cluster.py#L46
In your case, this message probably came from https://github.com/databricks/spark-perf/blame/79f8cfa6494e99a63f7cd4502aea4660b72ff6da/lib/sparkperf/testsuites.py#L79. I believe that the intent behind that call was to ensure that executors from one test driver were cleaned up before beginning a new test, hopefully ensuring that subsequent tests are able to obtain the full number of executors.

The right fix is probably to figure out how to only monitor shutdown of executors associated with the previous test run, but this could be tricky to do.

msifalakis commented 8 years ago

Hello Josh

Thanks for the pointers! Mostly helpful. I will have a look at them.

Meanwhile there is another relevant question I have, to which you may be able to provide some quick pointers. Having looked at different parts of the benchmark, i have only seen reported completion times (well not exactly "the only" but mostly related to algorithm completion/accuracy). Are there any options or places where one can dig in to find/enable system metrics such as mem I/O, disk I/O, network bandwidth utilisation, CPU occupancy, for individual tests/applications?

thanks again for the pointers and any further suggestions

Manolis.

From: Josh Rosen notifications@github.com To: databricks/spark-perf spark-perf@noreply.github.com Cc: msifalakis emm@zurich.ibm.com Date: 30/11/2015 21:12 Subject: Re: [spark-perf] running two instances of spark-perf ? (#91)

Hi @msifalakis, My hunch is that this is a longstanding bug. It wouldn't surprise me if nobody has tried running two instances of spark-perf at the same time on the same set of machines or on machines which run other non-spark-perf clusters. Whenever I've run this benchmark, I've done it on a dedicated set of EC2 machines which only run the spark-perf cluster. If you'd like to try to fix this yourself, here's a few starting points: The log message that you saw comes from ensure_spark_stopped_on_slaves, which seems to search for any executor backend, not just ones that belong to the spark-perf cluster: https://github.com/databricks/spark-perf/blob/79f8cfa6494e99a63f7cd4502aea4660b72ff6da/lib/sparkperf/cluster.py#L46

In your case, this message probably came from https://github.com/databricks/spark-perf/blame/79f8cfa6494e99a63f7cd4502aea4660b72ff6da/lib/sparkperf/testsuites.py#L79 . I believe that the intent behind that call was to ensure that executors from one test driver were cleaned up before beginning a new test, hopefully ensuring that subsequent tests are able to obtain the full number of executors. The right fix is probably to figure out how to only monitor shutdown of executors associated with the previous test run, but this could be tricky to do. ? Reply to this email directly or view it on GitHub.

JoshRosen commented 8 years ago

spark-perf itself does not contain support for collection of compute resource utilization metrics (memory, CPU, I/O). spark-ec2 clusters are launched with Ganglia installed, so it should be possible to pull metrics from there.

databricks / spark-perf

running two instances of spark-perf ? #91