Eclectic-Sheep / sheeprl

Distributed Reinforcement Learning accelerated by Lightning Fabric
https://eclecticsheep.ai
Apache License 2.0

dreamer v3 resuming problem #273

Closed: Disastorm closed this issue 1 month ago

Disastorm commented 2 months ago

I noticed something when resuming. When I initially stopped my training the envs were getting around 1k-2k rewards, but now after resuming they are only getting around 700. Did they lose some training or something?

[screenshot]

Disastorm commented 2 months ago

It takes a few hours to get back to where I was before. [screenshot]

belerico commented 2 months ago

Hi @Disastorm, when you resume you should set learning_starts or per_rank_pretrain_steps accordingly. There can be a case where you saved a checkpoint at step N and then stopped the training at step N+M. In that case you have an old checkpoint, but the buffer kept being updated since it's memory-mapped, so when you resume you have trajectories coming from a policy newer than the checkpointed one. You have to train your agent a bit to overcome this. We never found a beautiful solution for this, but we're open to suggestions.
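To make the mismatch concrete, here's a minimal sketch of the timeline (illustrative numbers only, not from any particular run):

```python
# Why a memory-mapped buffer can hold "future" data relative to the checkpoint.
checkpoint_step = 100_000   # step N at which the checkpoint was written
stop_step = 120_000         # step N + M at which the run was interrupted

# The checkpoint stores the policy as it was at step N, but the memory-mapped
# buffer file keeps being written until the process stops at step N + M.
steps_from_newer_policy = stop_step - checkpoint_step
print(f"{steps_from_newer_policy} buffer steps were collected by a policy newer than the checkpointed one")  # -> 20000
```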

Disastorm commented 2 months ago

Is it memory-mapped? The buffer checkpoint is set to False? Anyway, if I adjust learning_starts, is that basically similar to the steps it would do before training in the previous version? What would I set learning_starts to? Also, FYI, I stopped the training at around 215k while the checkpoint was at 200k. I wasn't aware anything carried over aside from the checkpoint file itself when choosing resume_from. It did create a new memory map, so I don't think it re-used the old one?

belerico commented 2 months ago

Sorry, I haven't made myself clear in the previous post.

The behaviour that I mentioned happens if you checkpoint the buffer and the buffer is memory-mapped.

If you checkpoint your buffer without memory mapping, nothing happens: you can safely resume your training because the buffer is saved within the checkpoint.

If you don't save your buffer in the checkpoint, then the buffer is pre-filled with the policy (agent) from the checkpoint, so as to recreate some sort of history of the buffer before the experiment was stopped. That's what's happening: your agent will pre-fill the buffer for at most the old learning_starts steps before resuming training.
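To make the three cases explicit, here's a rough sketch of the resume logic (illustrative only, not the actual sheeprl code):

```python
# Sketch of the three resume scenarios, assuming the two flags buffer.checkpoint
# and buffer.memmap discussed in this thread.
def resume_behaviour(checkpoint_buffer: bool, memmap: bool) -> str:
    if checkpoint_buffer and memmap:
        # Only the path to the memory-mapped files is stored; the files may
        # contain data collected after the checkpoint was written.
        return "reuse memmap files on disk (may hold future-policy trajectories)"
    if checkpoint_buffer and not memmap:
        # The whole buffer is serialized inside the checkpoint file.
        return "restore the buffer exactly as it was at checkpoint time"
    # Buffer not saved at all: re-collect data with the checkpointed policy.
    return "pre-fill for up to the old learning_starts steps before training"

for ckpt in (True, False):
    for mm in (True, False):
        print(f"checkpoint={ckpt}, memmap={mm} -> {resume_behaviour(ckpt, mm)}")
```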

Right now there's no way to pre-fill the buffer after resuming from a checkpoint. Am I right @michele-milesi?

Disastorm commented 2 months ago

So there are two methods of checkpointing the buffer? The default seems to just have checkpoint disabled. But you are saying you can enable either a memory-mapped version or a version that is stored inside the checkpoint? And without the buffer, I know it previously used to do around 65k steps to pre-fill the buffer automatically, but I guess it doesn't do that now, so should I try something like: if the checkpoint was at 500k, set learning_starts to around 565k?

belerico commented 2 months ago

And without the buffer, I know it previously used to do around 65k steps to pre-fill the buffer automatically, but I guess it doesn't do that now, so should I try something like: if the checkpoint was at 500k, set learning_starts to around 565k?

This is already done. Suppose you start an experiment with learning_starts=65k and decide not to save the buffer in the checkpoint. Furthermore, suppose you have a checkpoint at 500k. When you resume from that checkpoint, your old hydra config (the one used to launch the stopped experiment) is read and merged with the new one (the one with which you're resuming). This means that if you had learning_starts=65k, your agent will do 65k pre-fill steps of the buffer, so your resumed agent will start to train at 565k. If you want to increase learning_starts, as a workaround you should change that value in the hydra config.yaml of your stopped experiment.
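If it helps, here's a back-of-the-envelope sketch of that arithmetic (a simplification of the Hydra merge, with made-up numbers):

```python
# Resuming WITHOUT the buffer in the checkpoint: the old config is merged with
# the new overrides, and the old learning_starts drives the pre-fill.
old_config = {"learning_starts": 65_000}   # config.yaml of the stopped experiment
new_overrides = {}                          # overrides passed when resuming
config = {**old_config, **new_overrides}    # merged config used by the resumed run

checkpoint_step = 500_000
resume_training_at = checkpoint_step + config["learning_starts"]
print(resume_training_at)  # 565000: training restarts after 65k pre-fill steps
```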

But you are saying you can enable either a memory-mapped version or a version that is stored inside the checkpoint?

Yes, with buffer.memmap=True, which is the default in the configs/exp/dreamer_v3.yaml config. In this case only the reference to the file is saved inside the checkpoint, so that when you resume we read that reference and the buffer is ready to go (with the issue that I explained two posts above).
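Here's a minimal NumPy illustration of what "only the reference is saved" means (plain numpy.memmap, not the sheeprl buffer API; the path is just an example):

```python
import os
import tempfile
import numpy as np

# The observations live in a file on disk; a "checkpoint" only has to remember
# the path, dtype and shape to reopen it later.
path = os.path.join(tempfile.gettempdir(), "buffer_obs.memmap")
obs = np.memmap(path, dtype=np.uint8, mode="w+", shape=(1_000, 3, 64, 64))
obs[0] = 255            # writes go straight to the file on disk
obs.flush()

checkpoint = {"buffer_path": path, "dtype": "uint8", "shape": (1_000, 3, 64, 64)}

# On resume, the buffer is "ready to go" by reopening the referenced file:
restored = np.memmap(checkpoint["buffer_path"], dtype=checkpoint["dtype"],
                     mode="r+", shape=checkpoint["shape"])
assert int(restored[0].max()) == 255
```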

michele-milesi commented 2 months ago

Hi there,

Right now there's no way to pre-fill the buffer after resuming from a checkpoint. Am I right @michele-milesi?

So far, the buffer is pre-filled only when you do not save the buffer in the checkpoint (whether it is memory-mapped or not).

@belerico we should add a config option for choosing whether or not to pre-fill the buffer when resuming from a checkpoint. It is something useful that we have run into in our own experiments.

Disastorm commented 2 months ago

I set learning_starts to 65k but it's still not pre-filling, or at least it doesn't seem like it is, because it's slow from the beginning (it used to be that the pre-filling was super fast). *Edit: oh, let me try with memmap=false.

Disastorm commented 2 months ago

OK, I think it worked now that I set memmap=false.

Disastorm commented 2 months ago

Actually it's weird: it started out good, then got worse?

[screenshot]

I originally stopped the training at around 2k rewards, so the early rewards here are actually correct. Is it possible I'm doing too much pre-filling or something like that? Also, the pre-filling is so slow compared to how it was before: it used to be that 65k pre-fill steps would be done in a few minutes, but now it takes over 30 minutes. Does the ratio affect pre-filling too? Or maybe this isn't even pre-filling? I think it is, though, because it's a little faster than regular training. Or is it possible it pre-filled before this with no logging, and this is the beginning of the real training, which for some reason starts out fine and then drops massively for no reason?

belerico commented 2 months ago

Hi @Disastorm, can you specify exactly what you have done? Maybe share your config for the first training and the one you used for resuming. Thanks!

Disastorm commented 2 months ago

I used the default dreamer_v3 large config and changed replay_ratio to 0.2 for the initial training. For resuming I just added resume_from under the checkpoint config.

That resulted in my initial post: basically it lost a whole bunch of training. My current model drops from 2k rewards down to around 700, so it loses multiple hours of training.

Then I set learning_starts to 66k and it still didn't help. Then I set memmap=false and I got the image you see in my previous comment.

Your previous version of DreamerV3 had no problem resuming at all: it worked perfectly, it would do the 65k pre-fill steps and resume exactly where it left off. I have no idea what your new DreamerV3 is doing, but I have not yet been able to get it to resume properly. If I can't get it to resume, I might just revert to your old Dreamer and use that.

Disastorm commented 2 months ago

I think I remember you had some issue with Windows before that you fixed (possibly related to memmap, or resuming). Is it possible your new DreamerV3 has another issue with Windows?

belerico commented 2 months ago

I've prepared here a branch where you can decide how many pre-fill steps to perform after resuming from the checkpoint. You can specify algo.learning_starts=N. Bear in mind that N is divided by num_envs * world_size, i.e. converted into policy steps. This means that if you have M parallel envs and K processes (multi-GPU, for example), then learning_starts = learning_starts // (M * K). We will fix this so the user can specify learning_starts directly in policy steps.
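To spell out the conversion with illustrative numbers:

```python
# learning_starts given on the CLI is interpreted in environment steps and then
# divided by num_envs * world_size to get policy steps (loop iterations).
learning_starts = 65_536   # N env steps requested
num_envs = 4               # M parallel environments
world_size = 1             # K processes (e.g. GPUs)

prefill_policy_steps = learning_starts // (num_envs * world_size)
print(prefill_policy_steps)  # 16384 iterations of the collection loop
```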

Please remember that:

- One thing that we can add is the possibility to save the memory-mapped buffer in the checkpoint by loading chunks into memory and saving them in the checkpoint file: this would be super slow and would definitely hurt disk usage, in particular if you're working with images and a large buffer (see the sketch below).
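A hypothetical sketch of that chunked save, just to show why it would be slow and heavy on disk (not implemented; the function and names are made up):

```python
import numpy as np

def buffer_to_checkpoint(buffer_array, chunk: int = 10_000) -> np.ndarray:
    """Copy an on-disk (memory-mapped) buffer into RAM chunk by chunk, so it
    could then be embedded in the checkpoint payload, duplicating all the data
    on disk and touching every byte of the buffer."""
    parts = [np.array(buffer_array[i:i + chunk])
             for i in range(0, len(buffer_array), chunk)]
    return np.concatenate(parts, axis=0)

# Tiny stand-in for a real memory-mapped image buffer:
demo = np.zeros((25, 3, 64, 64), dtype=np.uint8)
assert buffer_to_checkpoint(demo, chunk=10).shape == demo.shape
```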

cc @michele-milesi

Disastorm commented 2 months ago

Have the definitions of buffer.checkpoint and buffer.memmap changed? Thinking back to this issue: https://github.com/Eclectic-Sheep/sheeprl/issues/188

buffer.checkpoint basically meant each run was going to use the same memmap files from one of the previous runs (the actual same files in the same folder from the older run), so that it didn't create new memmap files for each run. The only thing in the checkpoint was just the path to the files. I don't actually know what buffer.memmap did, as I just always had it on.

But it sounds like you are saying that buffer.checkpoint now stores the buffer in the checkpoint file, and buffer.memmap does what buffer.checkpoint used to do, or something like that?

I can go ahead and try your branch too, but I'm just wondering: why is learning_starts divided by num_envs?

belerico commented 2 months ago

Have the definitions of buffer.checkpoint and buffer.memmap changed? Thinking back to this issue: https://github.com/Eclectic-Sheep/sheeprl/issues/188

Nothing has changed there that I know of.

buffer.checkpoint basically meant each run was going to use the same memmap files from one of the previous runs (the actual same files in the same folder from the older run), so that it didn't create new memmap files for each run. The only thing in the checkpoint was just the path to the files.

That's exactly what's happening now.

But it sounds like you are saying that buffer.checkpoint now stores the buffer in the checkpoint file, and buffer.memmap does what buffer.checkpoint used to do, or something like that?

That's not what I'm saying. What I wrote to you are the different scenarios that you could encounter.

I can go ahead and try your branch too, but I'm just wondering: why is learning_starts divided by num_envs?

Because we need to convert those steps into policy steps.

Disastorm commented 2 months ago

I see, you're right, your descriptions are indeed the same, thanks. The steps that are printed out during training, are those env steps or policy steps?

belerico commented 2 months ago

Those are policy steps

Disastorm commented 2 months ago

It looks like in your branch learning_starts is already in policy steps: I just tested with 100k and it started trying to learn after 100k policy steps. However, I got this error:

...
File "H:\aiWorkspace\gymRetro\rl\sheeprl\sheeprl\sheeprl\data\buffers.py", line 686, in <listcomp>
    b.sample(
  File "H:\aiWorkspace\gymRetro\rl\sheeprl\sheeprl\sheeprl\data\buffers.py", line 463, in sample
    return self._get_samples(
  File "H:\aiWorkspace\gymRetro\rl\sheeprl\sheeprl\sheeprl\data\buffers.py", line 497, in _get_samples
    flattened_v = np.take(np.reshape(v, (-1, *v.shape[2:])), flattened_idxes, axis=0)
  File "<__array_function__ internals>", line 180, in take
  File "C:\Users\Disastorm\MiniConda3\envs\sheeprl\lib\site-packages\numpy\core\fromnumeric.py", line 190, in take
    return _wrapfunc(a, 'take', indices, axis=axis, out=out, mode=mode)
  File "C:\Users\Disastorm\MiniConda3\envs\sheeprl\lib\site-packages\numpy\core\fromnumeric.py", line 57, in _wrapfunc
    return bound(*args, **kwds)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 73.2 GiB for an array with shape (6400320, 3, 64, 64) and data type uint8

belerico commented 2 months ago

It looks like in your branch learning_starts is already in policy steps: I just tested with 100k and it started trying to learn after 100k policy steps.

That only happens if you're using 1 parallel env and 1 process (1 GPU, for example), as you can see here.

However, I got this error:

numpy.core._exceptions._ArrayMemoryError: Unable to allocate 73.2 GiB for an array with shape (6400320, 3, 64, 64) and data type uint8

If you don't memmap your buffer and you don't have 80GB of RAM on your PC, how is it possible to allocate that amount of RAM to hold all the images? We pre-allocate everything in the buffer because sooner or later you need to have that amount of data residing in RAM.
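Just to show where that number comes from (a quick check, not sheeprl code):

```python
import numpy as np

# uint8 image observations pre-allocated without memory mapping:
shape = (6_400_320, 3, 64, 64)                       # buffer slots x C x H x W
n_bytes = np.prod(shape, dtype=np.int64) * np.dtype(np.uint8).itemsize
print(f"{n_bytes / 2**30:.1f} GiB")                  # -> 73.2 GiB of RAM
```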

Right now I'm a little bit lost about what your issue is here...

Disastorm commented 2 months ago

*Edit: I just re-verified the below.

Sorry, I didn't mention that I set memmap back to true. These are the buffer settings from the same run that gave me the above error: [screenshot]

That only happens if you're using 1 parallel env and 1 process (1 GPU, for example), as you can see here.

Not commenting on the code, but in terms of testing that wasn't the case. I have 4 environments with 100k learning_starts; it started from step 525k and it appeared to fail with the above error at around 625k: [screenshot]

You can also see that reward_env0 through reward_env3 are there.

learning_starts: [screenshot]

belerico commented 2 months ago

Yeah, you're right about learning_starts: I got confused! Those are transformed into policy steps as in the link that I shared.

Acknowledging that, have you solved your issue with resuming? What is your issue now? If it's a memory issue, I suggest you open another issue.

Disastorm commented 2 months ago

Yes, I get that memory issue you saw before; however, this only happened in my last attempt, which used your branch. I have never seen this error before on the main branch, although the main branch's learning_starts doesn't seem to work right either, so I don't know if the error is related to your branch specifically or just to the learning_starts functionality.

belerico commented 2 months ago

I've spotted the memory error. The problem is related to the replay ratio. Since the replay ratio is the number of gradient steps per policy step (i.e. replay_ratio=0.5 means 1 gradient step every 2 policy steps), when we resume from the checkpoint the replay ratio loads its state. When you then set learning_starts to something > 0, the first time learning starts the Ratio class wants to keep maintaining the ratio: that's why you see it's slower when training for the first time after resuming, and that's also the cause of the memory error, since we sample all the needed trajectories at once and loop through them. The memory error should now be fixed in the branch. Can you confirm this?

Disastorm commented 2 months ago

Looks like it got past the memory error. I see my GPU VRAM usage went up and my GPU is processing stuff, although I haven't seen a policy_step reward log after the pre-fill yet, even though it's been almost 30 minutes, which is very strange. Do you have any ideas about this? Pretty sure the policy step logs should be showing up within 5, or at most 10, minutes normally with the ratio I'm using. Is it possible the logging broke, or there is some kind of infinite loop, or it has perhaps reverted back to a 1.0 ratio or something?

Disastorm commented 2 months ago

Still no logs at all, and no further checkpoints have been created either, so I think something's wrong with the training portion even though it passed the memory error. I'll keep it running for a total of an hour before I cancel it.

belerico commented 2 months ago

Looks like it got past the memory error. I see my GPU VRAM usage went up and my GPU is processing stuff, although I haven't seen a policy_step reward log after the pre-fill yet, even though it's been almost 30 minutes, which is very strange. Do you have any ideas about this?

As I told you in the previous answer, the slowdown you see is due to the replay ratio. Suppose, for example, the following:

When you resume your training, the Ratio class knows that 4000 policy steps have already been done so far. Now, you want to do 2048 pre-fill steps, and to maintain the replay ratio at 0.5 the Ratio class will return a number of training steps equal to (6048 - 4000) * 0.5 = 2048 * 0.5 = 1024. This means that the first time you resume, to maintain the replay ratio, you will do 1024 training steps in a row. That's why you see a slowdown. Does this answer your doubts?
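Here's the same computation as a toy snippet (not the actual Ratio class, just the idea):

```python
class ToyRatio:
    """Keeps gradient_steps / policy_steps at a fixed ratio by returning how
    many gradient steps are 'owed' since the last call."""
    def __init__(self, ratio: float, policy_steps_done: int = 0):
        self.ratio = ratio
        self.prev_policy_steps = policy_steps_done   # restored from the checkpoint

    def __call__(self, policy_steps_now: int) -> int:
        owed = int((policy_steps_now - self.prev_policy_steps) * self.ratio)
        self.prev_policy_steps = policy_steps_now
        return owed

# Resume from a checkpoint taken at 4000 policy steps with replay_ratio=0.5,
# then do 2048 pre-fill policy steps before the first training call:
ratio = ToyRatio(0.5, policy_steps_done=4000)
print(ratio(4000 + 2048))  # -> 1024 gradient steps all at once on that iteration
```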

Disastorm commented 2 months ago

Sorry, I don't really understand the details. Is there a way I can get it to train at the normal speed instead of 10 or 20 times slower, which I guess is what it might be doing?

belerico commented 2 months ago

From this branch, can you try setting learning_starts=0 after resuming?

Disastorm commented 2 months ago

You mean before resuming, right? Or do you want me to resume with a learning_starts value, cancel, and then resume again with it set to 0, or something like that?

If I set it to 0, is it going to pre-fill at all, though?

belerico commented 2 months ago

This is an experiment I've done with the new branch:

[screenshot]

I've run a training with the following config:

python sheeprl.py exp=dreamer_v3 \
  env=gym env.id=CartPole-v1 \
  env.num_envs=4 \
  fabric.accelerator=gpu \
  fabric.precision=16-mixed \
  algo=dreamer_v3_S \
  algo.learning_starts=1024 \
  algo.cnn_keys.encoder=\[\] \
  algo.mlp_keys.encoder=\["vector"\] \
  algo.cnn_keys.decoder=\[\] \
  algo.mlp_keys.decoder=\["vector"\] \
  algo.per_rank_sequence_length=64 \
  algo.replay_ratio=0.5 \
  algo.world_model.decoupled_rssm=False \
  algo.world_model.learnable_initial_recurrent_state=False

Then I stopped the training and resumed it with learning_starts=0, depicted in blue in the figure above.

Then I stopped the training again and resumed it once more with learning_starts=0, depicted in red in the figure above.

As you can see the training resumed perfectly!

Or do you want me to resume with a learning_starts value, cancel, and then resume again with it set to 0, or something like that?

Yes, I want you to start a training with learning_starts=N, cancel it, and then resume it with learning_starts=0.

Disastorm commented 2 months ago

I'll try it, but I'm just wondering how the pre-fill works: does it automatically detect some amount to pre-fill even if you have it set to 0? You are using checkpoint=false, right?

belerico commented 2 months ago

@Disastorm, please join this Google Meet.

Disastorm commented 2 months ago

@belerico I'll try out the stuff you mentioned in the meeting tomorrow; sorry, I don't have the time right now.

Disastorm commented 2 months ago

@belerico So resuming with checkpoint: true and memmap: true does work, as you said. However, when checkpoint is disabled and pre-filling is attempted, my attempts have always seemed abnormally slow, so I've stopped trying that alternative. I'm just going to stick with checkpoint: true and memmap: true.

belerico commented 2 months ago

@michele-milesi we should decide what to do when we don't checkpoint the buffer and we need to pre-fill it. The simplest thing is to disable the replay ratio. Another solution is to dilute the pre-fill steps over the course of the agent's training. What do you think?

Disastorm commented 2 months ago

I don't really know about that, so I can't really help. I guess forcing the replay ratio to 1 could be an OK solution, but it could be annoying if someone wants a different ratio, and they'll need to reconfigure how often they save a checkpoint and whatnot since the steps are going to be slower.

michele-milesi commented 2 months ago

@michele-milesi we should decide what to do when we don't checkpoint the buffer and we need to pre-fill it. The simplest thing is to disable the replay ratio. Another solution is to dilute the pre-fill steps over the course of the agent's training. What do you think?

@belerico, what if we pre-filled the dataset with the number of policy steps played at the time of the checkpoint and continued training as if nothing had happened (during the pre-fill phase, we would not increase the policy steps)?

For example, suppose the experiment was interrupted at policy step 100_000 and the buffer was not saved in the checkpoint. When resuming the experiment, the agent plays 100_000 (policy) steps to pre-fill the dataset; when the pre-fill is done, training resumes from step 100_000.

This way we would (more or less) have the same situation we had at the moment of the checkpoint, and the ratio would not be affected by the pre-fill. Otherwise, as you say, we could dilute the pre-fill steps over the course of the agent's training.
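A rough sketch of this proposal (hypothetical, not implemented; the callables are stand-ins for the env/agent interaction):

```python
def resume_prefill(collect_step, buffer_add, checkpoint_policy_steps: int) -> int:
    """Replay `checkpoint_policy_steps` interactions with the checkpointed
    policy to rebuild the buffer, WITHOUT advancing the global policy-step
    counter, so the replay ratio sees no gap once training resumes."""
    global_policy_step = checkpoint_policy_steps   # restored from the checkpoint
    for _ in range(checkpoint_policy_steps):
        buffer_add(collect_step())                  # fill the buffer...
        # ...but do NOT increment global_policy_step here
    return global_policy_step                       # training resumes from here

# Tiny self-contained check with stand-ins:
filled = []
resumed_at = resume_prefill(collect_step=lambda: "transition",
                            buffer_add=filled.append,
                            checkpoint_policy_steps=5)
print(resumed_at, len(filled))  # 5 5 -> same step count, buffer re-filled
```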