Memory issues on algo-1

I noticed that the algo-1 host was using a lot more memory than the other nodes. At some point, algo-1 ran out of memory and eventually crashed.

[Screenshot from 2022-10-10 14-13-01]

I figured that algo-1 was the driver, so I tried to dedicate one instance to the driver by reducing the number of executors from 12 to 11, since the job runs on 12 nodes.

/opt/ml/processing/input/conf/configuration.json (12 nodes - 1 for the driver = 11 executors; the note is kept outside the file because JSON does not allow comments):

[{
    "Classification": "spark-defaults",
    "Properties": { "spark.executor.instances": "11" }
}]
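To avoid hard-coding the executor count when the cluster size changes, the same file could be generated from the node count. This is only a sketch: `build_spark_defaults` is a hypothetical helper name, and it assumes exactly one node should be left free for the driver.

```python
import json


def build_spark_defaults(node_count: int) -> list:
    """Build a spark-defaults configuration that leaves one node for the driver.

    Hypothetical helper: assumes one executor per remaining node.
    """
    if node_count < 2:
        raise ValueError("need at least 2 nodes: one for the driver, one for an executor")
    return [{
        "Classification": "spark-defaults",
        "Properties": {"spark.executor.instances": str(node_count - 1)},
    }]


config = build_spark_defaults(12)
# Serialize in the same shape as configuration.json above.
config_text = json.dumps(config, indent=4)
print(config_text)
```

Note that this only sets the number of executors requested; it does not by itself control which node each executor lands on.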

Sometimes it works as intended: the driver does some work at the beginning and then just checks on the worker nodes. It's a bit of a waste, since that node doesn't do anything else, but it works.

[Screenshot from 2022-10-10 15-46-39]

However, sometimes Spark leaves one node idle and puts both the driver and an executor on another node, as before. That defeats the purpose of my configuration. (Here we had 10 nodes instead of 12.)

[Screenshot from 2022-10-17 16-13-48]
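To tell the two outcomes apart without reading memory graphs, one could compare the driver's host with the hosts that executors report. The sketch below is a diagnostic only, not a fix: `classify_layout` is a hypothetical helper, and it assumes the host lists have already been collected some other way (for example, from the Spark UI).

```python
def classify_layout(all_hosts, driver_host, executor_hosts):
    """Report whether the driver shares its node and which nodes are idle.

    Hypothetical helper: the inputs are plain hostname lists gathered elsewhere.
    """
    executors = set(executor_hosts)
    # A node is idle if it runs neither the driver nor any executor.
    idle = sorted(set(all_hosts) - executors - {driver_host})
    return {
        "driver_shares_node": driver_host in executors,
        "idle_hosts": idle,
    }


# Illustrative 10-node cluster, matching the last run described above.
hosts = [f"algo-{i}" for i in range(1, 11)]

# Intended layout: algo-1 runs only the driver, executors on algo-2..algo-10.
intended = classify_layout(hosts, "algo-1", hosts[1:])

# Unwanted layout: an executor also lands on algo-1 and algo-10 stays idle.
unwanted = classify_layout(hosts, "algo-1", hosts[:9])

print(intended)
print(unwanted)
```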

Do you have any idea how this issue could be solved?

aws / sagemaker-spark-container

Memory issues on algo-1 #100