facebookresearch / ReAgent

A platform for Reasoning systems (Reinforcement Learning, Contextual Bandits, etc.)
https://reagent.ai
BSD 3-Clause "New" or "Revised" License

How to configure to run the examples on the CPU? #111

Closed ArmenLevoni closed 4 years ago

ArmenLevoni commented 5 years ago

I followed the instructions at https://github.com/facebookresearch/Horizon/blob/master/docs/installation.md to run the Docker image on a Mac. However, when I run the example, I get the following error:


root@cb58ca621d80:~/Horizon/Horizon# python ml/rl/test/gym/run_gym.py -p ml/rl/test/gym/discrete_dqn_cartpole_v0.json
INFO:__main__:Running gym with params
INFO:__main__:{'env': 'CartPole-v0', 'model_type': 'pytorch_discrete_dqn', 'max_replay_memory_size': 10000, 'use_gpu': False, 'rl': {'gamma': 0.99, 'target_update_rate': 0.1, 'reward_burnin': 1, 'maxq_learning': 1, 'epsilon': 0.05, 'temperature': 0.35, 'softmax_policy': 0}, 'rainbow': {'double_q_learning': False, 'dueling_architecture': False}, 'training': {'layers': [-1, 128, 64, -1], 'activations': ['relu', 'relu', 'linear'], 'minibatch_size': 1024, 'learning_rate': 0.001, 'optimizer': 'ADAM', 'lr_decay': 0.999, 'use_noisy_linear_layers': False}, 'run_details': {'num_episodes': 200, 'max_steps': 200, 'train_every_ts': 1, 'train_after_ts': 1, 'test_every_ts': 2000, 'test_after_ts': 1, 'num_train_batches': 1, 'avg_over_num_episodes': 100, 'offline_train_epochs': 30}}
INFO:ml.rl.training.rl_trainer_pytorch:CUDA availability: False
INFO:ml.rl.training.rl_trainer_pytorch:NOT Using GPU: GPU not requested or not available.
Traceback (most recent call last):
  File "ml/rl/test/gym/run_gym.py", line 850, in <module>
    main(args[1:])
  File "ml/rl/test/gym/run_gym.py", line 564, in main
    args.path_to_pickled_transitions,
  File "ml/rl/test/gym/run_gym.py", line 632, in run_gym
    path_to_pickled_transitions=path_to_pickled_transitions,
  File "ml/rl/test/gym/run_gym.py", line 155, in train
    stop_training_after_solved,
  File "ml/rl/test/gym/run_gym.py", line 428, in train_gym_online_rl
    trainer.train(samples)
  File "/home/Horizon/Horizon/ml/rl/training/dqn_trainer.py", line 150, in train
    loss.backward()
  File "/home/miniconda/lib/python3.6/site-packages/torch/tensor.py", line 107, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/miniconda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA error: CUDA driver version is insufficient for CUDA runtime version

How can I configure the example to run on the CPU?

MisterTea commented 5 years ago

@ArmenLevoni This looks like a problem with pytorch, where even though we aren't using GPU, it still fails because CUDA is too old. One fix is to update your CUDA drivers or uninstall CUDA (use the non-CUDA docker). But pytorch also shouldn't be checking the CUDA driver when gpu mode is turned off, so you could verify that the pytorch examples fail and submit a bug to them if you want to chase it down: https://github.com/pytorch/examples/blob/master/mnist/main.py
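The check suggested above (does PyTorch on its own fail on a CPU-only backward pass?) can be sketched in a few lines without the full MNIST example. This is a minimal diagnostic, not part of Horizon: the function name diagnose_cpu_backward and the report keys are made up for illustration, and it assumes a PyTorch version that accepts the device and requires_grad keyword arguments on torch.randn.

```python
import importlib


def diagnose_cpu_backward():
    """Report whether a tiny CPU-only backward pass works in this environment.

    Hypothetical helper for illustration; the dict keys are not a Horizon or
    PyTorch convention.
    """
    report = {
        "torch_importable": False,
        "torch_version": None,
        "cuda_available": None,
        "cpu_backward_ok": None,
        "error": None,
    }
    try:
        torch = importlib.import_module("torch")
    except Exception as exc:  # a broken install can raise more than ImportError
        report["error"] = repr(exc)
        return report

    report["torch_importable"] = True
    report["torch_version"] = torch.__version__
    try:
        report["cuda_available"] = torch.cuda.is_available()
        # Everything is created explicitly on the CPU device.
        device = torch.device("cpu")
        x = torch.randn(8, 4, device=device)
        w = torch.randn(4, 1, device=device, requires_grad=True)
        loss = (x @ w).pow(2).mean()
        loss.backward()  # the call that failed in the traceback above
        report["cpu_backward_ok"] = w.grad is not None
    except RuntimeError as exc:
        report["cpu_backward_ok"] = False
        report["error"] = repr(exc)
    return report


if __name__ == "__main__":
    for key, value in diagnose_cpu_backward().items():
        print(key, "=", value)
```

If cpu_backward_ok comes back False with a CUDA error while cuda_available is False, the problem reproduces without Horizon and is likely worth reporting to PyTorch.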

MisterTea commented 5 years ago

Whoops, clicked the wrong button. Feel free to reply or close if the issue is done, thanks.

ArmenLevoni commented 5 years ago

Thanks, @MisterTea. I tried updating the CUDA drivers and built both Docker images (docker/cpu.Dockerfile and cuda.Dockerfile); the results are the same. I am running these on macOS Mojave. The PyTorch example does not work:

from torchvision import datasets, transforms
Traceback (most recent call last):
  File "", line 1, in
ModuleNotFoundError: No module named 'torchvision'

ArmenLevoni commented 5 years ago

Actually, running conda install torchvision also updates pytorch to:

pytorch-1.0.1 |py3.6_cuda9.0.176_cudnn7.4.2_2 320.5 MB pytorch

After that, the gym example starts working. Most probably pytorch-nightly (from requirements.txt) is broken.

Later on, with pytorch-1.0.1, it fails at this stage:

~/Horizon/Horizon# python ml/rl/workflow/dqn_workflow.py -p ml/rl/workflow/sample_configs/discrete_action/dqn_example.json

with this error:

  File "/home/miniconda/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 11, in <module>
    from torch._six import queue
ImportError: cannot import name 'queue'

In https://github.com/facebookresearch/Horizon/blob/master/requirements.txt, can pytorch-nightly be replaced with a pinned version that the provided examples work with?
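One way to catch floating dependencies like this before they break the examples is a small check over the requirements file. The sketch below is an illustration only: find_unpinned is a hypothetical helper, the sample requirements text is invented rather than copied from the Horizon repository, and the rule (flag anything without an exact == pin, or with "nightly" in its name) is a deliberately crude assumption.

```python
def find_unpinned(requirements_text):
    """Return requirement lines that are not pinned to an exact version.

    Hypothetical helper: flags entries lacking '==' and any entry whose
    text mentions 'nightly', since nightly builds change from day to day.
    """
    flagged = []
    for raw in requirements_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        # Skip blank lines and pip options such as "-r other.txt".
        if not line or line.startswith("-"):
            continue
        if "==" not in line or "nightly" in line.lower():
            flagged.append(line)
    return flagged


if __name__ == "__main__":
    # Invented sample content, not the real requirements.txt.
    sample = "numpy==1.15.4\npytorch-nightly\ngym>=0.10  # RL environments\n"
    print(find_unpinned(sample))
```

Running a check like this in continuous integration would surface the pytorch-nightly entry as a moving target, though it does not say which pinned version the examples actually need.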

MisterTea commented 5 years ago

Let me see if I can keep pytorch up to date in our integration tests so we can catch these errors...