Closed jank324 closed 7 months ago
This looks like a classical "beam and element on different devices" error.
We can
The public GitHub-hosted runners don't have GPUs. It seems that this is planned for the future, though.
Automatically running GPU nodes would obviously be the coolest, but maybe the pragmatic approach to avoid these problems in the future (for now) would be to have a PR template with tasks and make one of them something like "Run pytest on GPU node just before merge"?
... have a PR template with tasks and make one of them something like "Run pytest on GPU node just before merge"
That sounds like a reasonable short-term solution!
Okay ... I will add it as part of this fix
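For the "run pytest on a GPU node" task discussed above, one possible shape is a small helper that lists the devices available on the current machine, so the same tests run on CPU-only CI and on a GPU node. This is only a sketch: the names `devices_to_test` and `check_same_device_op` are hypothetical and not part of Cheetah, and the check uses plain PyTorch tensors rather than Cheetah beams and elements.

```python
import importlib.util


def devices_to_test():
    """Return the device names a test suite should cover on this machine.

    Hypothetical helper, not part of Cheetah.
    """
    devices = ["cpu"]
    # Only ask torch about CUDA if torch is installed at all.
    if importlib.util.find_spec("torch") is not None:
        import torch

        if torch.cuda.is_available():
            devices.append("cuda")
    return devices


def check_same_device_op(device):
    """Create both operands on `device` and confirm the result stays there.

    Requires torch. Stands in for "beam and element on the same device".
    """
    import torch

    a = torch.ones(3, device=device)
    b = torch.full((3,), 2.0, device=device)
    return (a * b).device.type == device
```

With pytest, the list could be passed to `pytest.mark.parametrize("device", devices_to_test())`, so the CUDA case is simply absent on runners without a GPU and present on a GPU node.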
I have a similar problem. Do you have any recommendations for me? I deployed my program on the HPC. After running for 10 minutes, it shows the error below, and I have no idea how to fix it. The error is as follows.
Process ForkServerProcess-2:
Process ForkServerProcess-1:
Process ForkServerProcess-3:
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/ma310272/anaconda3/envs/pybamm_env/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/ma310272/anaconda3/envs/pybamm_env/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ma310272/anaconda3/envs/pybamm_env/lib/python3.11/site-packages/stable_baselines3/common/vec_env/subproc_vec_env.py", line 33, in _worker
    cmd, data = remote.recv()
                ^^^^^^^^^^^^^
  File "/home/ma310272/anaconda3/envs/pybamm_env/lib/python3.11/multiprocessing/connection.py", line 249, in recv
    buf = self._recv_bytes()
          ^^^^^^^^^^^^^^^^^^
  File "/home/ma310272/anaconda3/envs/pybamm_env/lib/python3.11/multiprocessing/connection.py", line 413, in _recv_bytes
    buf = self._recv(4)
          ^^^^^^^^^^^^^
  File "/home/ma310272/anaconda3/envs/pybamm_env/lib/python3.11/multiprocessing/connection.py", line 378, in _recv
    chunk = read(handle, remaining)
            ^^^^^^^^^^^^^^^^^^^^^^^
ConnectionResetError: [Errno 104] Connection reset by peer
  File "/home/ma310272/anaconda3/envs/pybamm_env/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/ma310272/anaconda3/envs/pybamm_env/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ma310272/anaconda3/envs/pybamm_env/lib/python3.11/site-packages/stable_baselines3/common/vec_env/subproc_vec_env.py", line 33, in _worker
    cmd, data = remote.recv()
                ^^^^^^^^^^^^^
  File "/home/ma310272/anaconda3/envs/pybamm_env/lib/python3.11/multiprocessing/connection.py", line 249, in recv
    buf = self._recv_bytes()
          ^^^^^^^^^^^^^^^^^^
  File "/home/ma310272/anaconda3/envs/pybamm_env/lib/python3.11/multiprocessing/connection.py", line 413, in _recv_bytes
    buf = self._recv(4)
          ^^^^^^^^^^^^^
  File "/home/ma310272/anaconda3/envs/pybamm_env/lib/python3.11/multiprocessing/connection.py", line 378, in _recv
    chunk = read(handle, remaining)
            ^^^^^^^^^^^^^^^^^^^^^^^
ConnectionResetError: [Errno 104] Connection reset by peer
  File "/home/ma310272/anaconda3/envs/pybamm_env/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/ma310272/anaconda3/envs/pybamm_env/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ma310272/anaconda3/envs/pybamm_env/lib/python3.11/site-packages/stable_baselines3/common/vec_env/subproc_vec_env.py", line 33, in _worker
    cmd, data = remote.recv()
                ^^^^^^^^^^^^^
  File "/home/ma310272/anaconda3/envs/pybamm_env/lib/python3.11/multiprocessing/connection.py", line 249, in recv
    buf = self._recv_bytes()
          ^^^^^^^^^^^^^^^^^^
  File "/home/ma310272/anaconda3/envs/pybamm_env/lib/python3.11/multiprocessing/connection.py", line 413, in _recv_bytes
    buf = self._recv(4)
          ^^^^^^^^^^^^^
  File "/home/ma310272/anaconda3/envs/pybamm_env/lib/python3.11/multiprocessing/connection.py", line 378, in _recv
    chunk = read(handle, remaining)
            ^^^^^^^^^^^^^^^^^^^^^^^
ConnectionResetError: [Errno 104] Connection reset by peer
./wandb_server.sh: line 12: 142015 Terminated
Are you using Cheetah? This doesn't look related to Cheetah, but rather to how you are using Stable Baselines3's vectorised environments.
I have fixed it. The problem was the way I deployed the job: running the script directly with ./wandb.sh leads to this error, while submitting it with sbatch wandb.sh works.
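If the training script should only ever run as a scheduled job, a guard at the top of the Python entry point can catch an accidental direct launch early. The sketch below assumes the cluster uses Slurm (suggested by `sbatch`) and relies on `SLURM_JOB_ID`, which Slurm sets inside a job's environment. The function names are hypothetical, and why the directly launched process was terminated on this particular HPC is not known from the thread.

```python
import os
import sys


def running_under_slurm(env=None):
    """Return True if a Slurm job ID is present in the environment."""
    env = os.environ if env is None else env
    return bool(env.get("SLURM_JOB_ID"))


def warn_if_not_scheduled(env=None, stream=sys.stderr):
    """Print a warning when the script was not started through Slurm.

    Returns True if a warning was printed, False otherwise.
    """
    if running_under_slurm(env):
        return False
    print(
        "warning: no SLURM_JOB_ID found; this script seems to have been "
        "started directly instead of via sbatch",
        file=stream,
    )
    return True
```

Calling `warn_if_not_scheduled()` before creating the vectorised environments would make the difference between the two launch methods visible in the log.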
I've just tried using the new Cheetah version on a cluster node with GPUs and it crashed (dump below). We haven't really tested the scenario of GPUs being present. We absolutely should. I don't know if there is any way we could even integrate this in the GitHub Actions.