MillionIntegrals / vel-miniworld

An example project using vel to train reinforcement learning agents on existing community Gym environments. A work-in-progress repository.
MIT License

Recurrent policy, MazeS2 env #1

maximecb opened 5 years ago

maximecb commented 5 years ago

Hi @MillionIntegrals. I was wondering, is the default model used by vel recurrent? If not, is there an example with a recurrent model?

I'm trying to train something on the MiniWorld-MazeS2-v0 env, which should be trivial, but I can't get it to converge using the pytorch-a2c-ppo-acktr RL code. I'd like to try it with vel.

MillionIntegrals commented 5 years ago

By default, policies are simple feedforward networks, but RNN policies are supported; I'm actively expanding that area at the moment with more bells and whistles. Generally, though, I'm finding RNN policies harder to get to converge.

One example of an LSTM policy I have is here: https://github.com/MillionIntegrals/vel-miniworld/blob/master/examples-configs/ppo/ppo_minigrid_doorkey_6x6_lstm.yaml

An additional example config you can check is this one for A2C: https://github.com/MillionIntegrals/vel/blob/master/examples-configs/rl/atari/a2c/pong_a2c_lstm.yaml

To make it work on MiniWorld-MazeS2-v0, you most likely need the following bits in the config:

model:
  name: vel.rl.models.policy_gradient_rnn_model
  backbone:
    name: vel.rl.models.backbone.nature_cnn_lstm
    input_width: 84
    input_height: 84
    input_channels: 3  # or 12, must be  3 * frame_history

In the reinforcer section you also need to add:

  shuffle_transitions: off  # Required for RNN policies
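
That flag lives at the top level of the reinforcer section, alongside the other reinforcer settings, roughly like this:

reinforcer:
  name: vel.rl.reinforcers.on_policy_iteration_reinforcer
  # ... algo, env_roller and the other reinforcer settings ...
  shuffle_transitions: off  # Required for RNN policies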

I have had some success training RNN policies with no frame_history, but they generally converge faster with it. It may be considered a bit of cheating, though, I guess ;)
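
Roughly, the frame-stacked variant would look something like the sketch below; I'm writing the frame_history bit from memory here, so treat the exact key and its placement as an assumption and double-check it against the miniworld env wrapper:

env:
  name: vel_miniworld.env.miniworld
  envname: 'MiniWorld-MazeS2-v0'
  frame_history: 4  # assumed key: stack the last 4 RGB frames

model:
  backbone:
    input_channels: 12  # 3 * frame_history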

Let me know if you manage to make it work; if you have any problems, I'll try to test some example config.

maximecb commented 5 years ago

I tried the following config but ran into a strange error:

name: 'ppo_miniworld_maze_lstm'
multiprocessing: 'forkserver'  # Needed for OpenGL to properly initialize

env:
  name: vel_miniworld.env.miniworld
  envname: 'MiniWorld-MazeS2-v0'

vec_env:
  name: vel.rl.vecenv.subproc

model:
  name: vel.rl.models.policy_gradient_rnn_model

  backbone:
    name: vel_miniworld.model.minigrid_obs_lstm
    hidden_layers: [256]
    lstm_dim: 128
    activation: 'tanh'
    normalization: 'layer'

    input_width: 80
    input_height: 60
    input_channels: 3  # or 12, must be  3 * frame_history

reinforcer:
  name: vel.rl.reinforcers.on_policy_iteration_reinforcer

  algo:
    name: vel.rl.algo.policy_gradient.ppo

    entropy_coefficient: 0.01
    value_coefficient: 0.5

    max_grad_norm: 0.5 # Gradient clipping parameter

    cliprange:
      name: vel.schedules.linear
      initial_value: 0.1
      final_value: 0.0

  env_roller:
    name: vel.rl.env_roller.vec.step_env_roller
    gae_lambda: 0.95 # Generalized Advantage Estimator Lambda parameter
    number_of_steps: 128 # How many environment steps go into a single batch

  parallel_envs: 4 # How many environments to run in parallel
  batch_size: 128 # How many samples can go into the model once
  experience_replay: 4 # How many times to replay the experience

  discount_factor: 0.99 # Discount factor for the rewards

  shuffle_transitions: off  # Required for RNN policies

optimizer:
  name: vel.optimizers.adam
  lr: 2.5e-4
  epsilon: 1.0e-5

scheduler:
  name: vel.scheduler.linear_batch_scaler

commands:
  train:
    name: vel.rl.commands.rl_train_command
    total_frames: 1.0e6
    batches_per_epoch: 10

  record:
    name: vel.rl.commands.record_movie_command
    takes: 10
    videoname: 'ppo_miniworld_maze_lstm_vid_{:04}.avi'
#    frame_history: 4
    sample_args:
      argmax_sampling: true

The error produced is:

/pytorch/aten/src/THC/THCTensorScatterGather.cu:176: void THCudaTensor_scatterFillKernel(TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, Real, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [5,0,0], thread: [511,0,0] Assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` failed.
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/maxime/Desktop/vel/vel/launcher.py", line 70, in <module>
    main()
  File "/home/maxime/Desktop/vel/vel/launcher.py", line 64, in main
    model_config.run_command(args.command, args.varargs)
  File "/home/maxime/Desktop/vel/vel/internals/model_config.py", line 119, in run_command
    return command_descriptor.run(*varargs)
  File "/home/maxime/Desktop/vel/vel/rl/commands/rl_train_command.py", line 88, in run
    reinforcer.train_epoch(epoch_info)
  File "/home/maxime/Desktop/vel/vel/rl/reinforcers/on_policy_iteration_reinforcer.py", line 84, in train_epoch
    self.train_batch(batch_info)
  File "/home/maxime/Desktop/vel/vel/rl/reinforcers/on_policy_iteration_reinforcer.py", line 102, in train_batch
    rollout = self.env_roller.rollout(batch_info, self.model)
  File "/home/maxime/.local/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 46, in decorate_no_grad
    return func(*args, **kwargs)
  File "/home/maxime/Desktop/vel/vel/rl/env_roller/vec/step_env_roller.py", line 59, in rollout
    step = model.step(self.last_observation, state=self.hidden_state)
  File "/home/maxime/Desktop/vel/vel/rl/models/policy_gradient_rnn_model.py", line 97, in step
    action_pd_params, value_output, new_state = self(observations, state)
  File "/home/maxime/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/maxime/Desktop/vel/vel/rl/models/policy_gradient_rnn_model.py", line 88, in forward
    base_output, new_state = self.backbone(input_data, state=state)
  File "/home/maxime/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/maxime/Desktop/vel-miniworld/vel_miniworld/model/minigrid_obs_lstm.py", line 97, in forward
    fc_output = self.model(flat_observation)
  File "/home/maxime/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/maxime/.local/lib/python3.6/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/home/maxime/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/maxime/.local/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 55, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/maxime/.local/lib/python3.6/site-packages/torch/nn/functional.py", line 1024, in linear
    return torch.addmm(bias, input, weight.t())
RuntimeError: cublas runtime error : resource allocation failed at /pytorch/aten/src/THC/THCGeneral.cpp:333
/usr/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))
/usr/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))
/usr/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))
/usr/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))

I'm not exactly sure what's happening. It seems to be an index out of bounds in some CUDA kernel or something.

MillionIntegrals commented 5 years ago

The model you used was still the one for MiniGrid; it had a different observation space, and something didn't work out somewhere.

I managed to run it after I changed the model backbone section to this:

  backbone:
    name: vel.rl.models.backbone.nature_cnn_lstm

    hidden_units: 512

    input_width: 80
    input_height: 60
    input_channels: 3  # or 12, must be  3 * frame_history

I'll let you know if it converges for me or not.

MillionIntegrals commented 5 years ago

I've tried playing around a bit with the parameters, but the policy still manages to solve the environment only about 50% of the time. I'll try to take a closer look at it later, but for now I wasn't able to solve it reliably.

maximecb commented 5 years ago

I was able to get around a 60-70% success rate with the ikostrikov code. I don't really understand why this problem is so hard. Possibly because it requires very effective memory, and training RNNs with RL doesn't work that well?