autonomousvision / tuplan_garage

[CoRL'23] Parting with Misconceptions about Learning-based Vehicle Motion Planning

Ray Workers Failing Due to Memory Pressure (OOM) in sim_plancnn.sh Script #50

Closed JuanchobananoAalto closed 2 months ago

JuanchobananoAalto commented 2 months ago

I am encountering errors when running the sim_plancnn.sh script. The errors indicate that Ray workers are dying unexpectedly due to a system error, most likely caused by memory pressure (OOM).

Error Messages:

(raylet) A worker died or was killed while executing a task by an unexpected system error...

These messages are followed by details about the specific worker IDs, PIDs, and exit types, which suggest a system error. Additionally, the log mentions:

(raylet) 14 Workers (tasks / actors) killed due to memory pressure (OOM)...

This confirms that the worker failures are related to memory limitations.

Potential Causes:

Steps Taken:

Request:

mh0797 commented 2 months ago

Hi @JuanchobananoAalto, thanks for the great description of your issue. If you run the simulation with limited memory and hit OOM errors, you can reduce the number of workers by adding the following line to the script: worker.threads_per_node=X, where X is the number of workers that run in parallel. By default, Ray sets this to the number of CPUs on your system, so choosing a lower value reduces the memory demand.
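As a rough way to pick X, the sketch below caps the worker count by both the CPU count and a memory budget. The per-worker memory figure is an assumption you would have to measure on your own machine; it is not a number from nuPlan or Ray. The helper names `suggest_worker_count` and `hydra_override` are hypothetical and only illustrate the arithmetic.

```python
import os
from typing import Optional


def suggest_worker_count(
    available_gib: float,
    per_worker_gib: float,
    cpu_count: Optional[int] = None,
) -> int:
    """Return a worker count that fits both the CPU count and the memory budget.

    per_worker_gib is an ASSUMED peak memory footprint of one simulation
    worker; measure it yourself, it is not provided by nuPlan or Ray.
    """
    if per_worker_gib <= 0:
        raise ValueError("per_worker_gib must be positive")
    # Fall back to the machine's CPU count, mirroring the default described above.
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    # How many workers fit into the available memory.
    by_memory = int(available_gib // per_worker_gib)
    # Always run at least one worker, never more than there are CPUs.
    return max(1, min(cpus, by_memory))


def hydra_override(workers: int) -> str:
    """Format the override line mentioned above for the simulation script."""
    return f"worker.threads_per_node={workers}"


if __name__ == "__main__":
    # Example: 32 GiB free, an assumed 4 GiB per worker, 16 CPUs.
    print(hydra_override(suggest_worker_count(32, 4, cpu_count=16)))
```

If workers still get killed with the suggested value, lower it further; the estimate only helps as a starting point.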

Ray does indeed create logs for each worker. You can find them in the experiment folder, e.g. ~/nuplan/exp/exp/sim_plancnn/2024.07.03.08.02.00/logs. Check the output of the main job to identify the worker that died, then open the corresponding log. However, if the worker simply ran out of memory, the log may contain little information.
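To narrow down which log files are worth opening, a small script can scan the logs folder for memory-related messages. This is only a sketch: the function name `find_logs_mentioning` is hypothetical, the search phrases are guesses based on the raylet message quoted in this issue, and the layout of the files inside the logs folder is not assumed (every file is scanned).

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def find_logs_mentioning(log_dir, keywords: Iterable[str]) -> List[Tuple[Path, str]]:
    """Return (file, first matching line) for each file containing a keyword.

    Matching is case-insensitive. The keywords are the caller's guess at what
    an OOM-related log line looks like; they are not an official Ray format.
    """
    root = Path(log_dir)
    if not root.is_dir():
        return []
    needles = [k.lower() for k in keywords]
    hits: List[Tuple[Path, str]] = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # Ignore undecodable bytes so binary files do not abort the scan.
        with path.open("r", errors="ignore") as handle:
            for line in handle:
                lowered = line.lower()
                if any(n in lowered for n in needles):
                    hits.append((path, line.strip()))
                    break  # one hit per file is enough to flag it
    return hits


if __name__ == "__main__":
    # Example folder taken from the comment above; adjust to your experiment.
    logs = Path("~/nuplan/exp/exp/sim_plancnn/2024.07.03.08.02.00/logs").expanduser()
    for path, line in find_logs_mentioning(logs, ["memory pressure", "out of memory"]):
        print(f"{path}: {line}")
```

An empty result does not rule out OOM, since a worker killed by the system may not get the chance to write anything.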

I am closing this for now. Let me know if you have any further questions.