Closed JuanchobananoAalto closed 2 months ago
Hi @JuanchobananoAalto,
Thanks for the great description of your issue.
If you run the simulation with limited memory and face OOM issues, you can reduce the number of workers.
This can be done by adding the following line to the script:
worker.threads_per_node=X
where X is the number of workers that run in parallel. By default, Ray sets this to the number of CPUs on your system. You can set it to a lower value to reduce the memory demand.
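As a rough way to pick X, you can bound it by both the CPU count and the memory you have available. The sketch below is not part of nuPlan or Ray; the function name `suggest_worker_count` and the per-worker memory figure are assumptions you would need to replace with numbers measured on your own machine.

```python
import os


def suggest_worker_count(available_gib, per_worker_gib, cpu_count=None):
    """Return a worker count limited by CPU count and by a memory budget.

    available_gib:  memory you are willing to give the simulation (assumed input).
    per_worker_gib: estimated peak memory of one worker (assumed, measure it yourself).
    cpu_count:      optional override; defaults to the CPUs reported by the OS.
    """
    if per_worker_gib <= 0:
        raise ValueError("per_worker_gib must be positive")
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    # How many workers fit into the memory budget.
    by_memory = int(available_gib // per_worker_gib)
    # Never go below one worker, never above the CPU count.
    return max(1, min(cpus, by_memory))


if __name__ == "__main__":
    # Hypothetical numbers: 32 GiB available, about 6 GiB per worker, 16 CPUs.
    x = suggest_worker_count(available_gib=32, per_worker_gib=6, cpu_count=16)
    print(f"worker.threads_per_node={x}")
```

The printed line can then be added to the script as described above. If workers still die, lower the value further, since the per-worker estimate is only a guess.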
Ray does indeed create logs for each worker. You can find them in the experiment folder, e.g. ~/nuplan/exp/exp/sim_plancnn/2024.07.03.08.02.00/logs. You need to check the output of the main job to find the worker that died so that you can check the corresponding log. However, if the worker simply ran out of memory, the log may contain limited information.
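If there are many worker logs, a small script can narrow down which ones to read first. This is only a sketch: the function name `find_suspect_logs` and the keyword list are assumptions, not something Ray or nuPlan provides, and the exact wording of OOM messages in your logs may differ.

```python
from pathlib import Path

# Assumed keywords; adjust them to the messages you see in the main job output.
OOM_HINTS = ("out of memory", "oom", "memory pressure")


def find_suspect_logs(log_dir, hints=OOM_HINTS):
    """Map each log file under log_dir to its lines containing any hint."""
    root = Path(log_dir).expanduser()
    if not root.is_dir():
        return {}
    matches = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # errors="replace" avoids crashing on non-UTF-8 bytes in a log.
        text = path.read_text(errors="replace")
        hits = [
            line.strip()
            for line in text.splitlines()
            if any(hint in line.lower() for hint in hints)
        ]
        if hits:
            matches[str(path)] = hits
    return matches
```

Call it with the logs folder of your experiment and print the returned dictionary. The match is a plain case-insensitive substring search, so expect some false positives.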
I am closing this for now. Let me know if you have any further questions.
I am encountering errors when running the sim_plancnn.sh script. The errors indicate that Ray workers are dying unexpectedly due to a system error, most likely caused by memory pressure (OOM).
Error Messages:
These messages are followed by details about the specific worker IDs, PIDs, and exit types, which suggest a system error. Additionally, the log mentions:
This confirms that the worker failures are related to memory limitations.
Potential Causes:
Steps Taken:
Request:
How can I modify the sim_plancnn.sh script to reduce memory usage or adjust Ray worker configurations to prevent OOM errors?