Closed limengzhaolihai closed 8 months ago
The first error may be a cuda error but I confirmed that cuda is available , is it because of this tpch.yaml configuration setting is wrong, the second error looks like an operation using multiprocess communication with the error as file not found, if you can answer that I would be very grateful!
results = [conn.recv() for conn in self.conns] I've pinpointed what seems to be the problem with this code
Hi @limengzhaolihai, thanks for the heads up. The first error can be fixed by changing the device
in the yaml file, e.g. to device: 'cuda:0'
or device: 'cpu'
. I believe the second error occurs because the main process terminates before the subprocesses do. Does this problem still occur once you change devices?
Edit: also, I've just made some updates to the repo. I encourage you to pull the new changes and also read the README to ensure you're installing everything as recommended.
Thank you very much for your answer, I will continue to make attempts based on your suggestions and may get back to you slightly later.
Hi author, problem solved, thank you very much!
Can you please tell me some details about your training, what is the problem with these following errors during training and is there any way to fix them?
Traceback (most recent call last): File "/content/drive/MyDrive/spark-sched-sim-main/./train.py", line 7, in
make_trainer(cfg).train()
File "/content/drive/MyDrive/spark-sched-sim-main/trainers/trainer.py", line 105, in train
self.agent.actor.to(self.device, non_blocking=True)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1160, in to
return self._apply(convert)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
[Previous line repeated 1 more time]
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 833, in _apply
param_applied = fn(param)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1158, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with
TORCH_USE_CUDA_DSA
to enable device-side assertions.Traceback (most recent call last): File "", line 1, in
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 110, in setstate
self._semlock = _multiprocessing.SemLock._rebuild(state)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 110, in setstate
self._semlock = _multiprocessing.SemLock._rebuild( state)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 110, in setstate
self._semlock = _multiprocessing.SemLock._rebuild(state)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 110, in setstate
self._semlock = _multiprocessing.SemLock._rebuild( state)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 110, in setstate
self._semlock = _multiprocessing.SemLock._rebuild(state)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 110, in setstate
self._semlock = _multiprocessing.SemLock._rebuild( state)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 110, in setstate
self._semlock = _multiprocessing.SemLock._rebuild(state)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 110, in setstate
self._semlock = _multiprocessing.SemLock._rebuild( state)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 110, in setstate
self._semlock = _multiprocessing.SemLock._rebuild(state)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 110, in setstate
self._semlock = _multiprocessing.SemLock._rebuild( state)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 110, in setstate
self._semlock = _multiprocessing.SemLock._rebuild(state)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
File "", line 1, in
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 110, in setstate
self._semlock = _multiprocessing.SemLock._rebuild( state)
FileNotFoundError: [Errno 2] No such file or directory
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 110, in setstate
self._semlock = _multiprocessing.SemLock._rebuild(state)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 110, in setstate
self._semlock = _multiprocessing.SemLock._rebuild( state)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
File "", line 1, in
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 110, in setstate
self._semlock = _multiprocessing.SemLock._rebuild(state)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
FileNotFoundError: [Errno 2] No such file or directory
File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 110, in setstate
self._semlock = _multiprocessing.SemLock._rebuild(state)
FileNotFoundError: [Errno 2] No such file or director