It takes a few hours to get back to where I was before.
Hi @Disastorm, when you resume you should set the `learning_starts` or the `per_rank_pretrain_steps` accordingly.
There could be a case where you saved a checkpoint at step N and then stopped the training at step N+M. In this case you have an old checkpoint together with the buffer, which is kept on disk since it is memory mapped; therefore, when you resume, you have trajectories coming from a future policy. You should spend some time training your agent a little to overcome this issue.
We never found a beautiful solution for this, but we're open to suggestions
Is it memory mapped? The buffer checkpoint is set to False. Anyway, if I adjust `learning_starts`, is that basically similar to the steps it would do before training in the previous version? What would I set `learning_starts` to? Also, FYI, I stopped the training at around 215k while the checkpoint was at 200k. I wasn't aware there was anything that carried over aside from the checkpoint file itself when choosing `resume_from`. It did create a new memory map, so I don't think it re-used the old one?
Sorry, I haven't made myself clear in the previous post.
The behaviour that I mentioned happens if you checkpoint the buffer and the buffer is memory mapped.
If you checkpoint your buffer without memory mapping, nothing happens: you can safely resume your training because the buffer is saved within the checkpoint.
If you don't save your buffer in the checkpoint, then the buffer is pre-filled with the policy (agent) from the checkpoint, so as to recreate some sort of buffer history from before the experiment was stopped, and that's what's happening: your agent will pre-fill the buffer for at most the old `learning_starts` steps before resuming training.
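As a rough sketch of that pre-fill-on-resume behaviour (illustrative code only, not the actual sheeprl loop; a random policy stands in for the restored agent):

```python
# Illustrative sketch, not the actual sheeprl resume code: when the buffer is
# not stored in the checkpoint, the restored agent replays `learning_starts`
# environment steps into a fresh buffer before any gradient step is taken.
import gymnasium as gym

env = gym.make("CartPole-v1")
old_learning_starts = 1_000          # value read from the stopped run's config
buffer = []                          # stand-in for the (empty) replay buffer

obs, _ = env.reset(seed=0)
for _ in range(old_learning_starts):
    action = env.action_space.sample()   # the resumed agent would act here
    next_obs, reward, terminated, truncated, _ = env.step(action)
    buffer.append((obs, action, reward, terminated or truncated))
    obs = next_obs
    if terminated or truncated:
        obs, _ = env.reset()
# only after this pre-fill does the resumed training actually start
```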
Right now there's no way to pre-fill the buffer after resuming from a checkpoint. Am I right @michele-milesi?
So there are two methods of checkpointing the buffer? The default seems to just have the checkpoint disabled. But you are saying you can enable either a memory-mapped version or a version that is stored inside the checkpoint? And without the buffer, I know it previously used to do about 65k steps to pre-fill the buffer automatically, but I guess it doesn't do that now, so should I try something like, for example: if the checkpoint was at 500k then I'd set `learning_starts` to around 565k?
> And without the buffer, I know it previously used to do about 65k steps to pre-fill the buffer automatically, but I guess it doesn't do that now, so should I try something like, for example: if the checkpoint was at 500k then I'd set `learning_starts` to around 565k?
This is already done. Suppose that you start an experiment with `learning_starts=65k` and you decide not to save the buffer in the checkpoint. Furthermore, suppose that you have a checkpoint at 500k. When you resume from that checkpoint, the old hydra config (the one used to launch the experiment that was stopped) is read and merged with the new one (the one you're resuming the training with); this means that if you had `learning_starts=65k`, then your agent will do 65k pre-fill steps of the buffer, so your resumed agent will start to train at 565k. If you want to increase `learning_starts`, as a workaround you should increase that value in the hydra `config.yaml` of the stopped experiment.
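A minimal sketch of that merge behaviour using OmegaConf directly (keys and values here are illustrative, not the exact sheeprl config layout):

```python
# Illustrative only: the old run's config is merged with the resume config,
# so values like learning_starts carry over unless you edit the old config.yaml.
from omegaconf import OmegaConf

old_cfg = OmegaConf.create({"algo": {"learning_starts": 65_000}})      # stopped run
resume_cfg = OmegaConf.create({"checkpoint": {"resume_from": "last.ckpt"}})

merged = OmegaConf.merge(old_cfg, resume_cfg)
print(merged.algo.learning_starts)  # 65000 -> 65k pre-fill steps on resume
```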
> But you are saying you can enable either a memory-mapped version or a version that is stored inside the checkpoint?
Yes, with `buffer.memmap=True`, which is the default in the `configs/exp/dreamer_v3.yaml` config. In this case only the reference to the file is saved inside the checkpoint, so that when you resume we read that reference and the buffer is ready to go (with the issue that I explained two posts above).
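Conceptually it's something like this (illustrative sketch, not sheeprl's actual checkpoint layout):

```python
# Illustrative only: with buffer.memmap=True the transitions live on disk and
# the checkpoint keeps just enough information to reopen the same files.
import numpy as np

shape, dtype = (1_000, 3, 64, 64), np.uint8
obs = np.memmap("obs.dat", dtype=dtype, mode="w+", shape=shape)   # on-disk buffer

checkpoint = {"buffer": {"memmap_path": "obs.dat",                # reference only
                         "dtype": "uint8", "shape": shape}}

# on resume the buffer is "ready to go" by reopening the memory-mapped file
resumed_obs = np.memmap(checkpoint["buffer"]["memmap_path"],
                        dtype=dtype, mode="r+", shape=shape)
```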
Hi there,
> Right now there's no way to pre-fill the buffer after resuming from a checkpoint. Am I right @michele-milesi?
So far, the buffer is pre-filled only when you do not save the buffer in the checkpoint (with or without memory mapping).
@belerico we should add a config option for choosing whether or not to pre-fill the buffer when resuming from a checkpoint. It is something useful that we have found in our own experiments.
I set `learning_starts` to 65k but it's still not pre-filling, or at least it doesn't seem like it is, because it's slow from the beginning (it used to be that when it did the pre-filling it was super fast). *Edit: oh, let me try with memmap false.
OK, I think it worked now when I set memmap to false.
Actually, it's weird: it started out good and then got worse?
I originally stopped the training at around 2k rewards, so the early rewards here are actually correct. Is it possible I'm doing too much pre-filling or something like that? Also, the pre-filling is so slow compared to how it was before: it used to be that 65k steps of pre-filling would be done in a few minutes, but now it's over 30 minutes. Does the ratio affect pre-filling too? Or maybe this isn't even pre-filling? I think it is, though, because it's a little faster than regular training. Or is it possible it pre-filled before this with no logging, and this is the beginning of the real training, which for some reason starts out fine and then drops massively for no reason?
Hi @Disastorm, can you specify exactly what you have done? Maybe share your config for the first training and the one used for resuming. Thanks.
I used the default DreamerV3 large config and changed `replay_ratio` to 0.2 for the initial training. For resuming I just added `resume_from` in the checkpoint config.
That resulted in my initial post: basically it lost a whole bunch of training. My current model drops from 2k rewards down to 700 or so, so it loses multiple hours of training.
Then I set `learning_starts` to 66k and it still didn't help. Then I set memmap to false and got the image you see in my previous comment.
Your previous version of DreamerV3 had no problem resuming at all: it worked perfectly, it would do the pre-fill with 65k steps and resume exactly where it left off. I have no idea what the new DreamerV3 is doing, but I have not yet been able to get it to resume properly. If I can't get it to resume, I might just revert to your old Dreamer and use that.
I think I remember you had some issue with Windows before that you fixed (possibly related to memmap or resuming); is it possible the new DreamerV3 has another issue with Windows?
I've prepared a branch here where you can decide how many pre-fill steps to perform after resuming from the checkpoint. You can specify `algo.learning_starts=N`. Bear in mind that **N is divided by `num_envs * world_size`, i.e. it is converted to policy steps**. This means that if you have M parallel envs and K processes (multi-GPU, for example) then `learning_starts = learning_starts // (M * K)`. We will fix this so that the user can specify `learning_starts` in policy steps.
Please remember that:

- If `buffer.memmap=True` and `buffer.checkpoint=True` and you stop your training after saving a checkpoint, then you will find in the buffer some trajectories coming from a future agent (the one that you were training and stopped). That's because the buffer is kept on disk and right now we don't have a way to know where the last position in the buffer was pointing. Here we could also memmap the buffer metadata so that we can retrieve the last saved position and delete all the "future" trajectories (see the sketch after this list).
- If `buffer.checkpoint=False`, then you need to do some pre-fill with your resumed agent. The problem with not saving the buffer is that an off-policy agent learns from both new and old trajectories, but you will end up with just new trajectories, and I don't know what happens there from a learning point of view. In this case we could add both trajectories sampled from the env and some sampled from the resumed agent. Moreover, if we had a prioritized buffer, we could sample newer trajectories with higher probability than older ones.
- If `buffer.checkpoint=True` and `buffer.memmap=False`, then you're safe because your buffer is saved in the checkpoint.

One thing that we could add is the possibility to save the buffer in the checkpoint by loading chunks into memory and saving them in the checkpoint file: this would be super slow and would definitely hurt disk usage, in particular if you're working with images and a large buffer.
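A sketch of that buffer-metadata idea (a suggestion only; nothing like this exists in sheeprl):

```python
# Suggestion sketch, not existing sheeprl code: persist the buffer's write
# position in a small memory-mapped array so that, on resume, entries written
# after the checkpointed position can be recognised as "future" data.
import numpy as np

capacity = 10_000
data = np.memmap("buffer_data.dat", dtype=np.uint8, mode="w+",
                 shape=(capacity, 3, 64, 64))
meta = np.memmap("buffer_meta.dat", dtype=np.int64, mode="w+", shape=(1,))

def insert(pos: int, transition: np.ndarray) -> None:
    data[pos % capacity] = transition
    meta[0] = pos                      # always reflects the latest write
    meta.flush()

def resume(ckpt_pos: int) -> int:
    last_pos = int(meta[0])
    # everything written between ckpt_pos and last_pos comes from the policy
    # trained after the checkpoint; restarting the write/sample position from
    # ckpt_pos excludes it (wrap-around handling omitted for brevity)
    return min(ckpt_pos, last_pos)
```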
cc @michele-milesi
Have the definitions of `buffer.checkpoint` and `buffer.memmap` changed? I think back during this issue https://github.com/Eclectic-Sheep/sheeprl/issues/188, `buffer.checkpoint` basically meant each run was going to use the same memmap files as one of the previous runs (the actual same files in the same folder from the older run), so that it didn't create new memmap files for each run. The only thing in the checkpoint was just the path to the files. I don't actually know what `buffer.memmap` did, as I just always had it on.
But it sounds like you are saying that `buffer.checkpoint` now stores the buffer in the checkpoint file, and `buffer.memmap` does what `buffer.checkpoint` used to do, or something like that?
I can go ahead and try your branch too, but I'm just wondering: why is `learning_starts` divided by the number of envs?
> Have the definitions of `buffer.checkpoint` and `buffer.memmap` changed? I think back during this issue https://github.com/Eclectic-Sheep/sheeprl/issues/188...
Nothing changed from there that I know of
> `buffer.checkpoint` basically meant each run was going to use the same memmap files as one of the previous runs, the actual same files in the same folder from the older run, so that it didn't create new memmap files for each run. The only thing in the checkpoint was just the path to the files.
This is just what's happening now
> But it sounds like you are saying that `buffer.checkpoint` now stores the buffer in the checkpoint file, and `buffer.memmap` does what `buffer.checkpoint` used to do, or something like that?
This is not what I'm saying. What I wrote above are the different scenarios that you could encounter.
> I can go ahead and try your branch too, but I'm just wondering: why is `learning_starts` divided by the number of envs?
Because we need to convert those steps into policy-steps
I see, you are right, your descriptions are actually the same, thanks. The steps that print out while training, are those env steps or policy steps?
Those are policy steps
It looks like in your branch `learning_starts` is already in policy steps. I just tested with 100k and it started trying to learn after 100k policy steps. However, it got this error:
...
File "H:\aiWorkspace\gymRetro\rl\sheeprl\sheeprl\sheeprl\data\buffers.py", line 686, in <listcomp>
b.sample(
File "H:\aiWorkspace\gymRetro\rl\sheeprl\sheeprl\sheeprl\data\buffers.py", line 463, in sample
return self._get_samples(
File "H:\aiWorkspace\gymRetro\rl\sheeprl\sheeprl\sheeprl\data\buffers.py", line 497, in _get_samples
flattened_v = np.take(np.reshape(v, (-1, *v.shape[2:])), flattened_idxes, axis=0)
File "<__array_function__ internals>", line 180, in take
File "C:\Users\Disastorm\MiniConda3\envs\sheeprl\lib\site-packages\numpy\core\fromnumeric.py", line 190, in take
return _wrapfunc(a, 'take', indices, axis=axis, out=out, mode=mode)
File "C:\Users\Disastorm\MiniConda3\envs\sheeprl\lib\site-packages\numpy\core\fromnumeric.py", line 57, in _wrapfunc
return bound(*args, **kwds)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 73.2 GiB for an array with shape (6400320, 3, 64, 64) and data type uint8
> It looks like in your branch `learning_starts` is already in policy steps. I just tested with 100k and it started trying to learn after 100k policy steps.
It only happens if you're using 1 parallel env and 1 process (1 GPU for example), as you can see here.
> However, it got this error:
> numpy.core._exceptions._ArrayMemoryError: Unable to allocate 73.2 GiB for an array with shape (6400320, 3, 64, 64) and data type uint8
If you don't memmap your buffer and you don't have 80 GB of RAM on your PC, how is it possible to allocate that amount of RAM to hold all the images? We pre-allocate everything in the buffer because sooner or later you need that amount of data residing in RAM.
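For reference, here's a quick sanity check of that number (plain Python, nothing sheeprl-specific):

```python
# The reported array really is ~73.2 GiB of uint8 data.
shape = (6400320, 3, 64, 64)
n_bytes = 1  # uint8 -> 1 byte per element
for dim in shape:
    n_bytes *= dim
print(n_bytes / 2**30)  # ~73.25 GiB
```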
Right now I'm a little bit lost about what your issue is here...
*Edit: just re-verified the below.
Sorry, I didn't mention that I set memmap back to true. These are the buffer settings from the same run that gave me the error above:
> It only happens if you're using 1 parallel env and 1 process (1 GPU for example), as you can see here.
Not commenting on the code, but in terms of testing, that wasn't the case. I have 4 environments with 100k `learning_starts`; it started from step 525k and appeared to fail with the above error at around 625k:
You can also see that reward_env0 through reward_env3 are there.
Yeah, you're right about the learning starts: I got confused! Those are transformed to policy steps as in the link that I shared.
That acknowledged, have you solved your issue with resuming? What is your issue now? If it's about a memory problem, I suggest you open another issue.
Yes, I get that memory issue you saw before; however, it only happened on my last attempt, which used your branch. I have never seen this error on the main branch, although the main branch's `learning_starts` doesn't seem to work right either, so I don't know if the error is related to your branch specifically or just to the `learning_starts` functionality.
I've spotted the memory error. The problem is related to the replay ratio: since the replay ratio is the number of gradient steps per policy step (i.e. `replay_ratio=0.5` means 1 gradient step every 2 policy steps), when we resume from the checkpoint the replay ratio loads its state, and when you set `learning_starts` to something > 0, the first time learning starts the Ratio class wants to keep maintaining the ratio. That's why you see it's slower when training for the first time after resuming, and that's also the cause of the memory error, since we sample all the needed trajectories at once and loop through them. The memory error should now be fixed on the branch. Can you confirm this?
Looks like it got past the memory error; I see my GPU VRAM usage went up and my GPU is processing stuff, although I haven't seen a policy_step reward log after the pre-fill yet, even though it's been almost 30 minutes, which is very strange. Do you have any ideas about this? I'm pretty sure the policy step logs should normally show up within 5 or at most 10 minutes with the ratio I'm using. Is it possible the logging broke, or there is some kind of infinite loop, or it has perhaps reverted back to a 1.0 ratio or something?
Still no logs at all, and no further checkpoints have been created either, so I think something's wrong with the training portion even though it passed the memory error. I'll keep it on for a total of an hour before I cancel it.
> Looks like it got past the memory error; I see my GPU VRAM usage went up and my GPU is processing stuff, although I haven't seen a policy_step reward log after the pre-fill yet, even though it's been almost 30 minutes, which is very strange. Do you have any ideas about this?
As I told you in the previous answer, the slowdown you see is due to the replay ratio. Suppose, for example, that when you resume your training the Ratio class knows that 4000 policy steps have already been done so far. Now you want to do 2048 pre-fill steps; to maintain the replay ratio at 0.5, the Ratio class will return a number of training steps equal to (6048 - 4000) * 0.5 = 2048 * 0.5 = 1024. This means that the first time you resume, to maintain the replay ratio, you will do 1024 training steps. That's why you see a slowdown. Does this answer your doubts?
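A rough sketch of that bookkeeping (hypothetical class, not sheeprl's actual Ratio implementation):

```python
# Hypothetical sketch of a replay-ratio tracker: after resuming, the first
# call "owes" all the gradient steps needed to keep the ratio, hence the burst.
class RatioSketch:
    def __init__(self, ratio: float, grad_steps: int = 0):
        self.ratio = ratio            # gradient steps per policy step
        self.grad_steps = grad_steps  # restored from the checkpoint state

    def __call__(self, total_policy_steps: int) -> int:
        # grant whatever is owed so that grad_steps ~= ratio * policy_steps
        owed = max(int(total_policy_steps * self.ratio) - self.grad_steps, 0)
        self.grad_steps += owed
        return owed

# state restored from a run stopped at 4000 policy steps (2000 gradient steps)
ratio = RatioSketch(0.5, grad_steps=2000)
# after 2048 pre-fill steps the counter is at 6048 policy steps:
print(ratio(4000 + 2048))  # (6048 - 4000) * 0.5 = 1024 gradient steps at once
```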
Sorry, I don't really understand the details. Is there a way I can get it to train at the normal speed instead of 10 or 20 times slower, which I guess it might be doing?
From this branch, can you try to set `learning_starts=0` after resuming?
You mean before resuming, right? Or do you want me to resume with a `learning_starts` value, cancel, and then resume again with it set to 0, or something like that?
If I set it to 0, is it going to pre-fill at all, though?
This is an experiment I've done with the new branch.
I've run a training with the following config:
python sheeprl.py exp=dreamer_v3 \
env=gym env.id=CartPole-v1 \
env.num_envs=4 \
fabric.accelerator=gpu \
fabric.precision=16-mixed \
algo=dreamer_v3_S \
algo.learning_starts=1024 \
algo.cnn_keys.encoder=\[\] \
algo.mlp_keys.encoder=\["vector"\] \
algo.cnn_keys.decoder=\[\] \
algo.mlp_keys.decoder=\["vector"\] \
algo.per_rank_sequence_length=64 \
algo.replay_ratio=0.5 \
algo.world_model.decoupled_rssm=False \
algo.world_model.learnable_initial_recurrent_state=False
Then I stopped the training and resumed it with `learning_starts=0`, depicted in blue in the figure above.
Then I stopped the training again and resumed it again with `learning_starts=0`, depicted in red in the figure above.
As you can see the training resumed perfectly!
> Or do you want me to resume with a `learning_starts` value, cancel, and then resume again with it set to 0, or something like that?
Yes, I want you to start a training with `learning_starts=N`, cancel it, and then resume it with `learning_starts=0`.
I'll try it, but I'm just wondering how the pre-fill works: does it automatically detect some amount to pre-fill even if you have it set to 0? You are using checkpoint false, right?
@Disastorm, please join this Google Meet.
@belerico I'll try out the stuff you mentioned in the meeting tomorrow; sorry, I don't have the time right now.
@belerico So resuming with `checkpoint: true` and `memmap: true` does work as you said, although when checkpointing is disabled and pre-filling is attempted, my attempts have always seemed abnormally slow, so I've stopped trying that alternative. I'm just going to stick with `checkpoint: true` and `memmap: true`.
@michele-milesi we should decide what to do when we don't checkpoint the buffer and we need to pre-fill it. The simplest thing is to disable the replay ratio. Another solution is to dilute the pre-fill steps over the course of the agent's training. What do you think?
I don't really know about that, so I can't really help. I guess forcing the replay ratio to 1 could be an OK solution, but it could be annoying if someone wants a different ratio, and they'll need to reconfigure how often they save a checkpoint and whatnot, since the steps are going to be slower.
> @michele-milesi we should decide what to do when we don't checkpoint the buffer and we need to pre-fill it. The simplest thing is to disable the replay ratio. Another solution is to dilute the pre-fill steps over the course of the agent's training. What do you think?
@belerico, what if we pre-filled the dataset with the number of policy steps played at the time of the checkpoint and continued training as if nothing had happened? (During the pre-fill phase, we do not increase the policy steps.)
For example, the experiment was interrupted at (policy) step `100_000` and the buffer was not saved into the checkpoint. When resuming the experiment, the agent plays `100_000` (policy) steps to pre-fill the dataset. When the pre-fill is done, the training is resumed from step `100_000`.
This way we would (more or less) have the same situation we had at the moment of the checkpoint, and the ratio would not be affected by the pre-fill. Otherwise, as you say, we could dilute the pre-fill steps over the course of the agent's training.
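A small numeric illustration of why that would keep the ratio untouched (assumed behaviour, not implemented in sheeprl yet):

```python
# Assumed behaviour, not implemented in sheeprl: if the pre-fill does not
# advance the policy-step counter, the ratio bookkeeping sees no gap to fill.
replay_ratio = 0.5
ckpt_policy_steps = 100_000
grad_steps_at_ckpt = int(ckpt_policy_steps * replay_ratio)   # 50_000

# the agent replays 100_000 env interactions, but the counter stays put
policy_steps_after_prefill = ckpt_policy_steps

owed = int(policy_steps_after_prefill * replay_ratio) - grad_steps_at_ckpt
print(owed)  # 0 -> no burst of gradient steps, training resumes normally
```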
I noticed something when resuming. When I initially stopped my training, the envs were getting around 1k-2k rewards, and now after resuming they are only getting around 700. Did they lose some training or something?