facebookresearch / hanabi_SAD

Simplified Action Decoder for Deep Multi-Agent Reinforcement Learning

Out of memory error #14

Open hnekoeiq opened 3 years ago

hnekoeiq commented 3 years ago

Hello, thanks a lot for updating the installation process. I am able to build it successfully. However, when I run dev.sh, my job gets killed after some epochs. I made sure that I have 2 GPUs, added cuda:0 and cuda:1 to the script, and set export OMP_NUM_THREADS=1 as well. But the log still looks like this:

==========
beginning of epoch:  27
available: 254.165 GB, used: 101.650 GB, free: 177.793 GB
EPOCH: 27
Speed: train: 2190.9, act: 8996.2, buffer_add: 173.0, buffer_size: 129310
Total Time: 0H 10M 52S, 652s
Total Sample: train: 1.434M, act: 5.836M
@@@Time
    sync and updating : 0 MS, 1.36%
    sample data       : 0 MS, 0.34%
    forward & backward: 52 MS, 89.19%
    update model      : 5 MS, 9.05%
    updating priority : 0 MS, 0.07%
@@@total time per iter: 58.39 ms
[27] Time spent = 23.37 s
27:grad_norm [ 400]: avg:   3.4430, min:   0.8571[  92], max:  15.7686[ 371]
27:loss      [ 400]: avg:   0.8180, min:   0.2553[  92], max:   1.2099[ 314]
27:rl_loss   [ 400]: avg:   0.1122, min:   0.0801[ 311], max:   0.1505[   6]
epoch 27, eval score: 1.3670, perfect: 0.00, model saved: True
==========
beginning of epoch:  28
available: 252.752 GB, used: 103.027 GB, free: 176.281 GB
Killed

It seems there is a memory leak somewhere, because the free memory drops quickly after every epoch. Any idea what could be the reason? Thanks in advance for your help.

akileshbadrinaaraayanan commented 3 years ago

Hi @hengyuan-hu, I am encountering similar issues. It would be very helpful if you could look into it.

hengyuan-hu commented 3 years ago

Can you show me the log of 2 or 3 epochs? Is the size of the replay buffer, shown in the log as buffer_size: 129310, still increasing?

The memory consumption increases rapidly at the beginning because the replay buffer is filling up. Once the replay buffer is full, new data evicts old data and the memory consumption stays roughly unchanged. From our experiments, a full replay buffer with the default size, 2^17, takes up to 200-250 GB.
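For a rough sense of scale, here is my own back-of-the-envelope arithmetic from the numbers above (not a measurement from the repo):

buffer_size = 2 ** 17       # default replay buffer size
total_bytes = 225e9         # midpoint of the reported 200-250 GB range
print(total_bytes / buffer_size / 1e6)  # roughly 1.7 MB per stored trajectory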

hnekoeiq commented 3 years ago

Here is another log. As you can see, the replay buffer size has reached 2^17 and stays fixed, but my job still got killed.

==========
beginning of epoch:  38
available: 292.605 GB, used: 78.168 GB, free: 219.034 GB
EPOCH: 38
Speed: train: 2375.2, act: 9374.6, buffer_add: 139.7, buffer_size: 131074
Total Time: 0H 13M 50S, 830s
Total Sample: train: 1.997M, act: 7.522M
@@@Time
    sync and updating : 0 MS, 1.49%
    sample data       : 0 MS, 0.38%
    forward & backward: 48 MS, 90.96%
    update model      : 3 MS, 7.11%
    updating priority : 0 MS, 0.05%
@@@total time per iter: 53.86 ms
[38] Time spent = 21.56 s
38:grad_norm [ 400]: avg:   4.2888, min:   1.4623[ 314], max:  13.4655[ 372]
38:loss      [ 400]: avg:   0.9588, min:   0.6525[ 267], max:   1.2350[ 112]
38:rl_loss   [ 400]: avg:   0.0884, min:   0.0645[ 186], max:   0.1215[ 329]
worst perf idx inside save is  4
epoch 38, eval score: 2.4590, perfect: 0.00, model saved: True
==========
beginning of epoch:  39
available: 287.828 GB, used: 82.030 GB, free: 214.398 GB
EPOCH: 39
Speed: train: 2382.0, act: 9304.8, buffer_add: 138.8, buffer_size: 131079
Total Time: 0H 14M 11S, 851s
Total Sample: train: 2.048M, act: 7.722M
@@@Time
    sync and updating : 0 MS, 1.44%
    sample data       : 0 MS, 0.39%
    forward & backward: 48 MS, 89.56%
    update model      : 4 MS, 8.56%
    updating priority : 0 MS, 0.05%
@@@total time per iter: 53.71 ms
[39] Time spent = 21.50 s
39:grad_norm [ 400]: avg:   4.6096, min:   1.3671[  52], max:  21.7430[  64]
39:loss      [ 400]: avg:   1.0028, min:   0.6306[ 165], max:   1.4594[ 389]
39:rl_loss   [ 400]: avg:   0.0895, min:   0.0631[ 286], max:   0.1158[  37]
worst perf idx inside save is  1
epoch 39, eval score: 2.7510, perfect: 0.00, model saved: True
==========
beginning of epoch:  40
available: 287.670 GB, used: 81.676 GB, free: 214.120 GB
EPOCH: 40
Speed: train: 2390.4, act: 9386.0, buffer_add: 141.7, buffer_size: 131081
Total Time: 0H 14M 33S, 873s
Total Sample: train: 2.099M, act: 7.923M
@@@Time
    sync and updating : 0 MS, 1.42%
    sample data       : 0 MS, 0.35%
    forward & backward: 48 MS, 90.98%
    update model      : 3 MS, 7.20%
    updating priority : 0 MS, 0.05%
@@@total time per iter: 53.52 ms
[40] Time spent = 21.42 s
40:grad_norm [ 400]: avg:   3.7216, min:   1.2324[ 251], max:  15.1563[ 212]
40:loss      [ 400]: avg:   1.0099, min:   0.6576[ 108], max:   1.4499[   0]
40:rl_loss   [ 400]: avg:   0.0899, min:   0.0626[  54], max:   0.1197[ 196]
worst perf idx inside save is  0
epoch 40, eval score: 2.9620, perfect: 0.00, model saved: True
==========
beginning of epoch:  41
available: 287.680 GB, used: 81.702 GB, free: 213.952 GB
EPOCH: 41
Speed: train: 2396.6, act: 9345.0, buffer_add: 140.9, buffer_size: 131080
Total Time: 0H 14M 54S, 894s
Total Sample: train: 2.15M, act: 8.123M
@@@Time
    sync and updating : 0 MS, 1.49%
    sample data       : 0 MS, 0.35%
    forward & backward: 48 MS, 90.68%
    update model      : 3 MS, 7.44%
    updating priority : 0 MS, 0.05%
@@@total time per iter: 53.38 ms
[41] Time spent = 21.36 s
41:grad_norm [ 400]: avg:   3.7675, min:   1.3444[ 343], max:  14.8790[ 310]
41:loss      [ 400]: avg:   1.0476, min:   0.5923[ 203], max:   1.4457[ 247]
41:rl_loss   [ 400]: avg:   0.0916, min:   0.0669[ 142], max:   0.1297[ 185]
worst perf idx inside save is  2
epoch 41, eval score: 2.9120, perfect: 0.00, model saved: True
==========
beginning of epoch:  42
available: 287.150 GB, used: 82.268 GB, free: 213.250 GB

The replay buffer is completely full at the beginning of epoch 32 (available: 292.916 GB, used: 76.393 GB, free: 220.252 GB), but the memory consumption still increases after that.

hengyuan-hu commented 3 years ago

This is similar to what we observed. In our experience, once the replay buffer is full, the memory consumption fluctuates only slightly. It should not grow indefinitely, though; we normally train for 3 days and the memory consumption stays relatively flat after the first several hours.

You may either try to allocate more memory to the job (I guess it has not taken over the entire 280 GB but rather hit a pre-set limit?) or reduce the size of the replay buffer to see if you can then train for longer. We have trained models using a replay buffer of size 100000 (about 25% smaller), or even 2^16; it should work fine.

There could be many reasons for the memory increase. For example, the replay buffer implementation allocates an extra 25% of storage for fast parallel writes; the Python code also creates copies of the state_dict to sync models between C++ and Python, and those copies may not get freed right away...
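To illustrate the second point, here is a minimal sketch of how one might drop such a temporary copy promptly. sync_weights and runner.update_model are illustrative names, not the repo's actual API:

import gc
from typing import Iterable

import torch

def sync_weights(agent: torch.nn.Module, runners: Iterable) -> None:
    # Make one CPU copy of the weights, hand it to the actors,
    # then delete it explicitly so it is not kept alive until the next epoch.
    state = {k: v.detach().cpu() for k, v in agent.state_dict().items()}
    for runner in runners:
        runner.update_model(state)  # assumed actor-side method
    del state
    gc.collect()  # encourage prompt release of the temporary copy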

akileshbadrinaaraayanan commented 3 years ago

Hi @hengyuan-hu, thanks a lot for getting back and sharing your observations. It is indeed hard to find machines with ~250 GB of memory, especially in an academic setting. Even after reducing the replay buffer size to 2^16, we find that memory consumption still continues to increase (although slowly) after the replay buffer is full.

Is there any principled way to fix this memory leak without reducing the size of the replay buffer?

hengyuan-hu commented 3 years ago

Do you still run out of memory even after halving the size of replay buffer?

I am looking deeper into this problem now. It might take some time to figure out and fix.

akileshbadrinaaraayanan commented 3 years ago

Hi @hengyuan-hu: Yes, we still run out of memory even after halving the replay buffer size, except in cases where we have ~250 GB of RAM. Some of our experimental settings may end up needing a larger replay buffer, so I am not sure reducing its size is a good solution. From my observations of the training log for the IQL 2-player setting with replay buffer size 2^16, the used memory starts at about 37 GB, increases to 69 GB by the time the replay buffer is full, and continues to increase to 95 GB at about 2000 epochs. Does this suggest that there are other factors causing a considerable increase in used memory?

hengyuan-hu commented 3 years ago

Yes, I just narrowed down the main cause: it is the evaluation code that causes the memory increase at the end of each epoch. As a quick fix, you can disable the evaluation here, make it less frequent (increase the epoch length), or evaluate on fewer games, e.g. 100...
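A minimal sketch of those quick fixes, assuming a training loop structured like the one in selfplay.py; run_epoch and run_eval are placeholders, not the repo's actual functions:

from typing import Callable

def train(
    num_epoch: int,
    run_epoch: Callable[[int], None],
    run_eval: Callable[[int], None],
    eval_every: int = 10,      # evaluate every 10 epochs instead of every epoch
    eval_num_game: int = 100,  # evaluate on fewer games, e.g. 100
) -> None:
    for epoch in range(num_epoch):
        run_epoch(epoch)
        if (epoch + 1) % eval_every == 0:
            run_eval(eval_num_game)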

I am still figuring out exactly which part of eval causes the leak. It seems less obvious...

hengyuan-hu commented 3 years ago

I pushed a fix to master. I couldn't find any likely source of a memory leak, and neither could Valgrind. I suspect the cause is fragmentation rather than a leak, i.e. we create/free large chunks of memory from time to time and they are not pinned, so the program keeps claiming more memory because previously allocated regions are partially occupied by small data. The pushed fix removes one of the most dominant cases of this problem.

This fix will not completely solve everything, as we still re-create the vector of environments and actors at evaluation time. But with this fix, memory grows at a more acceptable rate of <2.5 GB per 500 epochs/evaluations.

The most effective way to save memory is to encode the data in a memory-efficient way before saving it to the replay buffer and decode it at prefetch time, before training. You may want to consider that option as well.
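For example, since the Hanabi observations are largely 0/1 vectors, one could bit-pack them before storing and unpack them at prefetch time. This is only a sketch of the idea with illustrative names, not the repo's replay buffer API:

import numpy as np

def encode_obs(obs: np.ndarray) -> np.ndarray:
    # Pack a float32 0/1 vector into bits: 32x smaller in the buffer.
    return np.packbits(obs.astype(np.uint8))

def decode_obs(packed: np.ndarray, obs_dim: int) -> np.ndarray:
    # Unpack back to float32 right before training.
    return np.unpackbits(packed, count=obs_dim).astype(np.float32)

obs = np.random.randint(0, 2, size=512).astype(np.float32)
packed = encode_obs(obs)
assert np.array_equal(decode_obs(packed, obs.size), obs)
print(obs.nbytes, "->", packed.nbytes)  # 2048 -> 64 bytes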

akileshbadrinaaraayanan commented 3 years ago

Hi @hengyuan-hu, thanks for providing a fix. How were you using Valgrind to find the cause of the memory leak? In my case, with Valgrind version 3.15.0, it always ends with Illegal instruction (core dumped) before the Python script actually starts executing.

akileshbadrinaaraayanan commented 3 years ago

Hi @hengyuan-hu, thanks again for providing the fix, but just letting you know that there is little to no difference in memory consumption even with the latest commit on master.

hengyuan-hu commented 3 years ago

I use the following command for Valgrind:

valgrind --leak-check=yes python selfplay.py \
       --save_dir exps/run2 \
       --method iql \
       --num_thread 10 \
       --num_game_per_thread 1 \
       --sad 1 \
       --act_base_eps 0.1 \
       --act_eps_alpha 7 \
       --lr 6.25e-05 \
       --eps 1.5e-05 \
       --grad_clip 5 \
       --gamma 0.999 \
       --seed 1 \
       --batchsize 128 \
       --burn_in_frames 500 \
       --replay_buffer_size 2000 \
       --epoch_len 40 \
       --priority_exponent 0.9 \
       --priority_weight 0.6 \
       --train_bomb 0 \
       --eval_bomb 0 \
       --num_player 2 \
       --rnn_hid_dim 512 \
       --act_device cuda:1 \
       --shuffle_color 1

Ok I will keep investigating.

akileshbadrinaaraayanan commented 3 years ago

Hi @hengyuan-hu, any new findings on the memory leak from your investigation? Thanks a lot!

akileshbadrinaaraayanan commented 3 years ago

Btw, just noticed a one-line bug here: https://github.com/facebookresearch/hanabi_SAD/blob/master/pyhanabi/eval.py#L23, assert agents is None or runnes is None should be assert agents is None or runners is None.

hengyuan-hu commented 3 years ago

Thanks for the catch. I still don't know the reason yet. I have been playing with some restructuring to reduce the number of unnecessary tensor creations, but have not found a solution.

akileshbadrinaaraayanan commented 3 years ago

Hi @hengyuan-hu ,

Did you find a fix for this? I must admit that it really gets in the way of running many experiments :(

hengyuan-hu commented 3 years ago

Have you tried disabling evaluation altogether, i.e. commenting out the code between https://github.com/facebookresearch/hanabi_SAD/blob/502d6a7a52028511704c944dffe1945194e10c3a/pyhanabi/selfplay.py#L254 and https://github.com/facebookresearch/hanabi_SAD/blob/502d6a7a52028511704c944dffe1945194e10c3a/pyhanabi/selfplay.py#L280?

I tried it and the memory usage seems much more stable. My log looks like this:

available: 487.312 GB, used: 13.205 GB, free: 71.491 GB
available: 458.186 GB, used: 42.330 GB, free: 42.365 GB
available: 444.263 GB, used: 56.253 GB, free: 28.442 GB
available: 442.553 GB, used: 57.964 GB, free: 26.732 GB
available: 439.811 GB, used: 60.706 GB, free: 23.990 GB
available: 437.909 GB, used: 62.607 GB, free: 22.088 GB
available: 435.967 GB, used: 64.550 GB, free: 20.146 GB
available: 434.478 GB, used: 66.039 GB, free: 18.657 GB
available: 433.859 GB, used: 66.658 GB, free: 18.038 GB
available: 430.868 GB, used: 69.648 GB, free: 15.047 GB
available: 427.315 GB, used: 73.202 GB, free: 11.493 GB
available: 426.282 GB, used: 74.234 GB, free: 10.461 GB
available: 425.921 GB, used: 74.596 GB, free: 10.099 GB
available: 425.675 GB, used: 74.842 GB, free: 9.853 GB
available: 425.534 GB, used: 74.983 GB, free: 9.712 GB
available: 425.296 GB, used: 75.220 GB, free: 9.475 GB
available: 425.184 GB, used: 75.333 GB, free: 9.362 GB
available: 425.143 GB, used: 75.374 GB, free: 9.321 GB
available: 425.053 GB, used: 75.464 GB, free: 9.231 GB
available: 424.898 GB, used: 75.618 GB, free: 9.076 GB
available: 424.827 GB, used: 75.689 GB, free: 9.005 GB
available: 424.831 GB, used: 75.685 GB, free: 9.009 GB
available: 424.790 GB, used: 75.727 GB, free: 8.968 GB
available: 424.838 GB, used: 75.679 GB, free: 9.016 GB
available: 424.841 GB, used: 75.675 GB, free: 9.019 GB
available: 424.835 GB, used: 75.681 GB, free: 9.013 GB

There are 1000 gradient steps between each line.

I don't have a better fix for the evaluation for now, but I would suggest disabling evaluation at training time, saving models every 1K gradient steps, and using a separate script just to evaluate all the saved models.
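For reference, a rough sketch of that offline evaluation script. The load-and-evaluate step is left as a callable because the exact utilities (e.g. in eval.py) may differ, and the glob pattern should match however your checkpoints are named:

import glob
import os
from typing import Callable, Tuple

def eval_all_checkpoints(
    save_dir: str,
    evaluate_ckpt: Callable[[str], Tuple[float, float]],
) -> None:
    # evaluate_ckpt loads one weight file and returns (mean score, perfect rate).
    for path in sorted(glob.glob(os.path.join(save_dir, "model*.pthw"))):
        score, perfect = evaluate_ckpt(path)
        print(f"{os.path.basename(path)}: score {score:.3f}, perfect {perfect:.2f}")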

akileshbadrinaaraayanan commented 3 years ago

That's a good suggestion, let me try that out :)