RuntimeError: DataLoader worker (pid 6370) is killed by signal: Killed.

I am trying out your excellent work on Ubuntu 22.04 running under WSL2. The training task is cartpole_swingup_sparse.

The problem appears once training reaches this point:

| train | F: 172000 | S: 86000 | E: 172 | L: 1000 | R: 0.0000 | BS: 86000 | FPS: 10.8101 | T: 4:48:13

The program is interrupted by the system for this reason:

Error executing job with overrides: ['task=cartpole_swingup_sparse']
Traceback (most recent call last):
  File "train.py", line 310, in <module>
    main()
  File "/home/snoplx/anaconda3/envs/mimex-dmc/lib/python3.8/site-packages/hydra/main.py", line 49, in decorated_main
    _run_hydra(
  File "/home/snoplx/anaconda3/envs/mimex-dmc/lib/python3.8/site-packages/hydra/_internal/utils.py", line 367, in _run_hydra
    run_and_report(
  File "/home/snoplx/anaconda3/envs/mimex-dmc/lib/python3.8/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
    raise ex
  File "/home/snoplx/anaconda3/envs/mimex-dmc/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/home/snoplx/anaconda3/envs/mimex-dmc/lib/python3.8/site-packages/hydra/_internal/utils.py", line 368, in <lambda>
    lambda: hydra.run(
  File "/home/snoplx/anaconda3/envs/mimex-dmc/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 110, in run
    _ = ret.return_value
  File "/home/snoplx/anaconda3/envs/mimex-dmc/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/home/snoplx/anaconda3/envs/mimex-dmc/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "train.py", line 305, in main
    workspace.train()
  File "/home/snoplx/Projects/mimex/mimex-dmc/train.py", line 245, in train
    metrics = self.agent.update(self.replay_iter, self.global_step)
  File "/home/snoplx/Projects/mimex/mimex-dmc/drqv2.py", line 388, in update
    obs = self.aug(obs.float())
  File "/home/snoplx/anaconda3/envs/mimex-dmc/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/snoplx/anaconda3/envs/mimex-dmc/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/snoplx/Projects/mimex/mimex-dmc/drqv2.py", line 42, in forward
    shift = torch.randint(0,
  File "/home/snoplx/anaconda3/envs/mimex-dmc/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 67, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 6370) is killed by signal: Killed.

From what I found online, this error usually points to a hardware resource limit, but I cannot tell which resource is running out. Is it possible to extract more useful information from the error message?
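One way to narrow this down is to log available memory while training runs. This assumes the kill comes from the Linux out-of-memory killer, which is only a guess: the traceback shows just that the worker received SIGKILL. Below is a minimal sketch that reads `/proc/meminfo` using only the standard library. The function names are hypothetical and not part of mimex.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Field: value"
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def available_fraction(info):
    """Fraction of RAM the kernel reports as still available."""
    return info["MemAvailable"] / info["MemTotal"]


def watch_memory(interval_s=10.0, samples=3, path="/proc/meminfo"):
    """Print available RAM and free swap a few times (Linux only)."""
    for _ in range(samples):
        with open(path) as f:
            info = parse_meminfo(f.read())
        print(
            f"available: {info['MemAvailable'] / 1024:.0f} MiB "
            f"({available_fraction(info):.1%}), "
            f"swap free: {info.get('SwapFree', 0) / 1024:.0f} MiB"
        )
        time.sleep(interval_s)
```

Calling `watch_memory()` in a second terminal (with a larger `samples` value) while training runs would show whether available memory falls toward zero before the crash. If it does, RAM is the likely culprit, and `dmesg` would normally contain an out-of-memory entry naming the killed pid.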

Another question: R is still 0 at E: 172. Is this normal? I am also looking for the num_train_frames setting in the config file. Are you using the default value?
