Ashminator closed this issue 2 years ago.
Yes, train_dataset uses an iterator that randomly samples from the full replay buffer (i.e. from all data collected so far). Plan2Explore tends to explore very well in the unsupervised setting, but I think it's not entirely clear how to best combine it with task rewards; it often just explores too much and thus gets worse task performance. But you can try exploring without rewards and use --expl_until to switch to a greedy task policy later. Hope that helps; unfortunately I won't be able to provide more detailed help than this :)
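If it helps, here's a minimal sketch of that schedule. The names (select_policy, expl_policy, task_policy) are made up for illustration and are not the actual DreamerV2 API; the real code drives this from the expl_until config value:

```python
# Sketch of switching from an exploration policy to a greedy task policy
# once a step budget is exhausted (mirrors the idea behind --expl_until).
# All names here are illustrative, not the actual DreamerV2 API.

def select_policy(step, expl_until, expl_policy, task_policy):
    """Use the exploration policy until `expl_until` steps, then go greedy."""
    if step < expl_until:
        return expl_policy
    return task_policy

# With expl_until=1000, step 500 explores and step 2000 exploits.
assert select_policy(500, 1000, 'plan2explore', 'greedy') == 'plan2explore'
assert select_policy(2000, 1000, 'plan2explore', 'greedy') == 'greedy'
```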
Thanks a lot Danijar! Really swift response, super helpful. I'll try implementing that, or expl_every may even be a shout. I also think it's worth switching out the prefill stage with the exploratory agent, for both the default and Plan2Explore cases, and seeing the results.
Hi Danijar, I'm currently doing a project where I'm running DreamerV2 with some of the alternative exploration agents. I have two questions:
First, regarding these lines in the training script:

print('Create agent.')
train_dataset = iter(train_replay.dataset(**config.dataset))
And this line in the for loop that iterates over the batches:
for _ in range(config.train_steps):
  mets = train_agent(next(train_dataset))
I just wanted to sanity check with you that the next(train_dataset) batch is pulled from the entire buffer in train_replay._complete_eps, and that this buffer keeps being updated, since I don't see train_dataset being re-created after its initialisation. I also wanted to confirm that if expl_behavior is set to something other than greedy, the training episodes are collected by the exploratory agent, and that the data it collects is sampled in subsequent batches of next(train_dataset). Possibly a silly question, but in case I was missing something I tried the following modification:
for _ in range(config.train_steps):
  train_dataset = iter(train_replay.dataset(**config.dataset))
  mets = train_agent(next(train_dataset))
Here train_dataset is re-initialised on every training step, and I got worse results than with the default behaviour.
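A toy version of the buffer makes the sanity check concrete: an iterator created once can still see episodes added later, because it reads the underlying buffer at each next() call. This is only an illustration with made-up names, not the real tf.data pipeline:

```python
import random

# Toy replay buffer: a generator created once still samples from
# episodes added later, because it reads the buffer at each next() call.
# Illustrative only; the real DreamerV2 pipeline uses tf.data.

def replay_dataset(buffer, rng):
    while True:
        yield rng.choice(buffer)

buffer = ['ep0']
it = iter(replay_dataset(buffer, random.Random(0)))
next(it)                  # can only return 'ep0' at this point
buffer.append('ep1')      # new episode collected later
samples = {next(it) for _ in range(50)}
# 'ep1' shows up even though the iterator was created before it existed.
assert 'ep1' in samples
```

Re-creating the iterator every step, as in the modification above, is therefore unnecessary for seeing fresh data.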
As for Plan2Explore:

a) Are there any steps needed for Plan2Explore to work properly, other than setting expl_behavior: Plan2Explore in configs.yaml (which is what I currently have)?

b) Is it expected that it takes more than a few million steps for Plan2Explore to perform as well as default Dreamer? Here's a graph of the situation:
Note: I accidentally had action_repeat set to 4 in both of these games, so divide by 4 to get the true number of steps on the x-axis.
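As a sketch of that correction (assuming, per the note above, that the logged x-axis over-counts by the action_repeat factor; the helper name is made up, not part of the DreamerV2 codebase):

```python
# Toy helper mirroring the note above: with action_repeat=4, divide the
# logged x-axis value by 4 to recover the true number of steps.
def corrected_steps(logged_steps, action_repeat=4):
    return logged_steps // action_repeat

assert corrected_steps(4_000_000) == 1_000_000
```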
Thanks in advance!