ToruOwO / mimex

PyTorch implementation for all methods and environments in the paper "MIMEx: Intrinsic Rewards from Masked Input Modeling"

RuntimeError: DataLoader worker (pid 6370) is killed by signal: Killed. #2

Closed · 0uroboro5 closed this issue 1 month ago

0uroboro5 commented 1 month ago

I am trying out this great work of yours on WSL2-based Ubuntu 22.04. The training task is cartpole_swingup_sparse.

The problem occurs when my training reaches this point:

| train | F: 172000 | S: 86000 | E: 172 | L: 1000 | R: 0.0000 | BS: 86000 | FPS: 10.8101 | T: 4:48:13

The program is then killed by the system with the following error:

Error executing job with overrides: ['task=cartpole_swingup_sparse']
Traceback (most recent call last):
  File "train.py", line 310, in <module>
    main()
  File "/home/snoplx/anaconda3/envs/mimex-dmc/lib/python3.8/site-packages/hydra/main.py", line 49, in decorated_main
    _run_hydra(
  File "/home/snoplx/anaconda3/envs/mimex-dmc/lib/python3.8/site-packages/hydra/_internal/utils.py", line 367, in _run_hydra
    run_and_report(
  File "/home/snoplx/anaconda3/envs/mimex-dmc/lib/python3.8/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
    raise ex
  File "/home/snoplx/anaconda3/envs/mimex-dmc/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/home/snoplx/anaconda3/envs/mimex-dmc/lib/python3.8/site-packages/hydra/_internal/utils.py", line 368, in <lambda>
    lambda: hydra.run(
  File "/home/snoplx/anaconda3/envs/mimex-dmc/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 110, in run
    _ = ret.return_value
  File "/home/snoplx/anaconda3/envs/mimex-dmc/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/home/snoplx/anaconda3/envs/mimex-dmc/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "train.py", line 305, in main
    workspace.train()
  File "/home/snoplx/Projects/mimex/mimex-dmc/train.py", line 245, in train
    metrics = self.agent.update(self.replay_iter, self.global_step)
  File "/home/snoplx/Projects/mimex/mimex-dmc/drqv2.py", line 388, in update
    obs = self.aug(obs.float())
  File "/home/snoplx/anaconda3/envs/mimex-dmc/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/snoplx/anaconda3/envs/mimex-dmc/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/snoplx/Projects/mimex/mimex-dmc/drqv2.py", line 42, in forward
    shift = torch.randint(0,
  File "/home/snoplx/anaconda3/envs/mimex-dmc/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 67, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 6370) is killed by signal: Killed.

After searching the web, my guess is that this is a hardware resource limitation, but I can't tell exactly which resource is running out. Is it possible to extract more useful information from the error message?
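
For reference, a DataLoader worker killed by signal: Killed (SIGKILL) on Linux/WSL2 is usually the kernel OOM killer reclaiming RAM rather than a bug in the training code: the in-memory replay buffer plus the per-worker batch copies can exceed the memory WSL2 allots to its VM. One way to confirm, assuming the kernel log is readable inside the WSL2 instance, is to look for an OOM entry that names the worker PID (6370 here):

sudo dmesg -T | grep -iE "out of memory|killed process"

If such an entry shows up, the usual remedies are to lower memory use (smaller batch size, fewer replay-buffer workers, smaller replay buffer) or to raise the memory limit in the Windows-side .wslconfig file.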

Another question: R is still 0.0000 at E: 172; is this normal? Also, I'm trying to find the num_train_frames setting in the config file; are you using the default value?

ToruOwO commented 1 month ago

Thanks for the interest!

Unfortunately, I have never run into this error before, so I don't have much of a clue either. I'd suggest trying a smaller batch size to see if it helps.
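
If the cause is memory pressure, a quick way to test is to override the relevant keys from the command line via Hydra. The key names below follow the DrQ-v2 codebase that mimex-dmc builds on (batch_size, replay_buffer_num_workers) and are an assumption, so adjust them to whatever the repo's config actually defines:

python train.py task=cartpole_swingup_sparse batch_size=128 replay_buffer_num_workers=2

Halving both roughly halves the peak RAM taken by the replay-buffer DataLoader workers, which is where "killed by signal: Killed" errors tend to originate.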

For policy training, I don't recall exactly, but it's possible that the reward stays low even after long training since the task reward is very sparse. Since the default config does not use our MIMEx exploration module, I'd suggest sweeping a few exploration configs (by changing this setting to this); see the sketch below.
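
A minimal sweep sketch using Hydra's multirun mode; the flag name expl_coef below is only a placeholder for the actual exploration setting linked above, so substitute the real key from the config:

python train.py -m task=cartpole_swingup_sparse expl_coef=0.1,0.5,1.0

Hydra's -m/--multirun launches one job per comma-separated value, which makes it easy to compare a few intrinsic-reward scales on the sparse task.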

Hope this helps!