astooke / rlpyt

Reinforcement Learning in PyTorch
MIT License

multi-node training? #134

Closed sharr6 closed 4 years ago

sharr6 commented 4 years ago

Hi, I am trying to make a few modifications to support multi-node training (some env steps are very slow, so adding sampler machines could significantly reduce overall time).

I added a zmq module and split the runner into an "actor" and a "learner" on two machines. With the R2D1 algo, 1 actor (sampler) -> 1 learner works fine, but with 2 actors -> 1 learner the model fails to learn.

Does it have something to do with the RNN state in the R2D1 algo? I've looked into the paper; it says there are two RNN-state strategies, "stored state" and "burn-in". Is there anywhere in the config file to switch between those strategies in the R2D1 algo?

astooke commented 4 years ago

OK yes zmq sounds like a good way to do the communication!

Without seeing your code it's impossible for me to say what's going on. Can you share any of it?

The best setting in R2D2 was to store the RNN state and then still do a burn-in of 40 timesteps (in rlpyt called "warmup") before the 80 training timesteps, so I would recommend keeping that.
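
For reference, here is a minimal sketch of what that recommended setting could look like when constructing the algorithm. The parameter names (warmup_T, batch_T, store_rnn_state_interval) are the ones mentioned in this thread; please double-check them against the actual R2D1 signature and defaults in the repo.

from rlpyt.algos.dqn.r2d1 import R2D1

# Sketch of the R2D2-style "stored state + burn-in" configuration:
# replay stored RNN states, burn in for 40 steps, then train on 80 steps.
algo = R2D1(
    store_rnn_state_interval=40,  # keep RNN states in the replay buffer
    warmup_T=40,                  # burn-in ("warmup") timesteps
    batch_T=80,                   # training timesteps per sequence
)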

sharr6 commented 4 years ago

@astooke, my mistake, the strategies would be "zero start" and "burn-in", as the paper says:

"The zero start state strategy’s appeal lies in its simplicity, and it allows independent decorrelated sampling of relatively short sequences, which is important for robust optimization of a neural network.On the other hand, it forces the RNN to learn to recover meaningful predictions from an atypical initial recurrent state (‘initial recurrent state mismatch’), which may limit its ability to fully rely on its recurrent state and learn to exploit long temporal correlations. The second strategy on the other hand avoids the problem of finding a suitable initial state, but creates a number of practical, computational, and algorithmic issues due to varying and potentially environment-dependent sequence length, and higher variance of network updates because of the highly correlated nature of states in a trajectory when compared to training on randomly sampled batches of experience tuples."

I guess I am stuck on the "independent decorrelated sampling" part, so the question would be: can I switch to the "zero start" strategy in this awesome lib? :)

astooke commented 4 years ago

yes, you can switch to zero start by setting warmup_T=0 in the input to the r2d1 algo. :)
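
For example, a minimal sketch of the zero-start configuration (same caveat as above: the keyword names come from this thread, so check them against the R2D1 constructor):

from rlpyt.algos.dqn.r2d1 import R2D1

# Zero-start sketch: no burn-in, so each replayed sequence starts
# from a zeroed RNN state.
algo = R2D1(
    warmup_T=0,   # disable the burn-in / warmup phase
    batch_T=80,   # training timesteps per sequence, unchanged
)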

sharr6 commented 4 years ago

I've tried warmup_T=0 & store_rnn_state_interval=0, but the problem remains. I suspect it's caused by time-sequence corruption. Do the "non_sequence" and "sequence" replays refer to time correlation between batches? And is the R2D1 algo able to use the non-sequence replay buffer?

sharr6 commented 4 years ago

@astooke, after some experiments, I found the problem is caused by the time sequence between batches being corrupted in AsyncPrioritizedSequenceReplayFrameBuffer. A few changes to memory_copier reproduce it (shuffling the buffered sequences causes the model to fail to learn):

import random

import torch

from rlpyt.utils.logging import logger


def memory_copier(sample_buffer, samples_to_buffer, replay_buffer, ctrl):
    torch.set_num_threads(1)
    lis = []
    while True:
        ctrl.sample_ready.acquire()
        if ctrl.quit.value:
            break
        # Original behavior: copy each batch straight into the replay buffer.
        # replay_buffer.append_samples(samples_to_buffer(sample_buffer))
        # Modification: collect batches, then append them in shuffled order.
        lis.append(sample_buffer)
        if len(lis) > 10:
            random.shuffle(lis)  # breaks the time ordering between batches
            for li in lis:
                replay_buffer.append_samples(samples_to_buffer(li))
            lis = []
        ctrl.sample_copied.release()
    logger.log("Memory copier shutting down.")

Could you give some guidance on how to fix it?

astooke commented 4 years ago

Hi, R2D1 needs a sequence replay buffer because it is a recurrent agent, so it must train on sequences. So the samples must be added to the replay buffer in the same order they were experienced, not in shuffled order.
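
In terms of the snippet above, that means keeping the behavior of the commented-out line: copy each batch out of the shared buffer and append it immediately, in arrival order, without buffering and shuffling. A minimal sketch of that loop body:

while True:
    ctrl.sample_ready.acquire()
    if ctrl.quit.value:
        break
    # Copy out of the shared sample buffer and append right away,
    # preserving the time ordering inside the sequence replay buffer.
    replay_buffer.append_samples(samples_to_buffer(sample_buffer))
    ctrl.sample_copied.release()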

sharr6 commented 4 years ago

I see. Is it possible to keep the sequences intact within a batch while breaking the correlation between batches?

astooke commented 4 years ago

Hi, I'm not sure exactly what you mean. Do you want to replay a single, full-length episode for one minibatch? This would be an easy modification to how the replay samples from the buffer.

Or, if you're looking to decorrelate the samples within a minibatch, this is done by the batch_B parameter, which sets the number of different trajectories to sample into the minibatch. For example, batch_T=10, batch_B=5 gives 5 different trajectories of 10 timesteps each.
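
A minimal sketch of that setting (again using the keyword names discussed in this thread; check them against the R2D1 constructor):

from rlpyt.algos.dqn.r2d1 import R2D1

# 5 independent trajectories of 10 timesteps each per minibatch:
# sequences stay intact within each trajectory, while sampling
# different trajectories decorrelates the minibatch.
algo = R2D1(
    batch_T=10,  # timesteps per sampled sequence
    batch_B=5,   # number of different trajectories per minibatch
)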

sharr6 commented 4 years ago

OK, thanks a lot!