hnekoeiq opened this issue 4 years ago
Hi @hengyuan-hu, I am encountering similar issues. It would be very helpful if you could look into them.
Can you show me the log of 2 or 3 epochs? Is the size of the replay buffer, as shown in the log (buffer_size: 129310), still increasing?
The memory consumption will increase rapidly at the beginning because the replay buffer is being filled. Once the replay buffer is full, new data evicts old data and the memory consumption stays roughly unchanged. From our experiments, a full replay buffer with the default size, 2^17, will take roughly 200-250 GB.
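For a rough sense of scale, here is a back-of-the-envelope sketch based only on the numbers above (an illustration, not a measurement of the actual buffer):

# Rough per-entry cost implied by a full 2^17-entry buffer using 200-250 GB.
buffer_size = 2 ** 17
for total_gb in (200, 250):
    per_entry_mb = total_gb * 1024 / buffer_size  # GB -> MB per stored episode
    print(f"{total_gb} GB / {buffer_size} entries ~= {per_entry_mb:.2f} MB per entry")

In other words, each buffered episode accounts for roughly 1.5-2 MB of host memory.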
Here is another log. As you can see, the replay buffer size has reached 2^17 and stays fixed, but my job still got killed.
==========
beginning of epoch: 38
available: 292.605 GB, used: 78.168 GB, free: 219.034 GB
EPOCH: 38
Speed: train: 2375.2, act: 9374.6, buffer_add: 139.7, buffer_size: 131074
Total Time: 0H 13M 50S, 830s
Total Sample: train: 1.997M, act: 7.522M
@@@Time
sync and updating : 0 MS, 1.49%
sample data : 0 MS, 0.38%
forward & backward: 48 MS, 90.96%
update model : 3 MS, 7.11%
updating priority : 0 MS, 0.05%
@@@total time per iter: 53.86 ms
[38] Time spent = 21.56 s
38:grad_norm [ 400]: avg: 4.2888, min: 1.4623[ 314], max: 13.4655[ 372]
38:loss [ 400]: avg: 0.9588, min: 0.6525[ 267], max: 1.2350[ 112]
38:rl_loss [ 400]: avg: 0.0884, min: 0.0645[ 186], max: 0.1215[ 329]
worst perf idx inside save is 4
epoch 38, eval score: 2.4590, perfect: 0.00, model saved: True
==========
beginning of epoch: 39
available: 287.828 GB, used: 82.030 GB, free: 214.398 GB
EPOCH: 39
Speed: train: 2382.0, act: 9304.8, buffer_add: 138.8, buffer_size: 131079
Total Time: 0H 14M 11S, 851s
Total Sample: train: 2.048M, act: 7.722M
@@@Time
sync and updating : 0 MS, 1.44%
sample data : 0 MS, 0.39%
forward & backward: 48 MS, 89.56%
update model : 4 MS, 8.56%
updating priority : 0 MS, 0.05%
@@@total time per iter: 53.71 ms
[39] Time spent = 21.50 s
39:grad_norm [ 400]: avg: 4.6096, min: 1.3671[ 52], max: 21.7430[ 64]
39:loss [ 400]: avg: 1.0028, min: 0.6306[ 165], max: 1.4594[ 389]
39:rl_loss [ 400]: avg: 0.0895, min: 0.0631[ 286], max: 0.1158[ 37]
worst perf idx inside save is 1
epoch 39, eval score: 2.7510, perfect: 0.00, model saved: True
==========
beginning of epoch: 40
available: 287.670 GB, used: 81.676 GB, free: 214.120 GB
EPOCH: 40
Speed: train: 2390.4, act: 9386.0, buffer_add: 141.7, buffer_size: 131081
Total Time: 0H 14M 33S, 873s
Total Sample: train: 2.099M, act: 7.923M
@@@Time
sync and updating : 0 MS, 1.42%
sample data : 0 MS, 0.35%
forward & backward: 48 MS, 90.98%
update model : 3 MS, 7.20%
updating priority : 0 MS, 0.05%
@@@total time per iter: 53.52 ms
[40] Time spent = 21.42 s
40:grad_norm [ 400]: avg: 3.7216, min: 1.2324[ 251], max: 15.1563[ 212]
40:loss [ 400]: avg: 1.0099, min: 0.6576[ 108], max: 1.4499[ 0]
40:rl_loss [ 400]: avg: 0.0899, min: 0.0626[ 54], max: 0.1197[ 196]
worst perf idx inside save is 0
epoch 40, eval score: 2.9620, perfect: 0.00, model saved: True
==========
beginning of epoch: 41
available: 287.680 GB, used: 81.702 GB, free: 213.952 GB
EPOCH: 41
Speed: train: 2396.6, act: 9345.0, buffer_add: 140.9, buffer_size: 131080
Total Time: 0H 14M 54S, 894s
Total Sample: train: 2.15M, act: 8.123M
@@@Time
sync and updating : 0 MS, 1.49%
sample data : 0 MS, 0.35%
forward & backward: 48 MS, 90.68%
update model : 3 MS, 7.44%
updating priority : 0 MS, 0.05%
@@@total time per iter: 53.38 ms
[41] Time spent = 21.36 s
41:grad_norm [ 400]: avg: 3.7675, min: 1.3444[ 343], max: 14.8790[ 310]
41:loss [ 400]: avg: 1.0476, min: 0.5923[ 203], max: 1.4457[ 247]
41:rl_loss [ 400]: avg: 0.0916, min: 0.0669[ 142], max: 0.1297[ 185]
worst perf idx inside save is 2
epoch 41, eval score: 2.9120, perfect: 0.00, model saved: True
==========
beginning of epoch: 42
available: 287.150 GB, used: 82.268 GB, free: 213.250 GB
The replay buffer is completely full at the beginning of epoch 32 (available: 292.916 GB, used: 76.393 GB, free: 220.252 GB), but the memory consumption still increases after that.
This is similar to what we observed. In our experience, once the replay buffer is full, memory consumption fluctuates by a small amount. It should not grow indefinitely, though; we normally train for 3 days and the memory consumption stays relatively flat after the first several hours.
You may either try to allocate more memory to the job (I guess it has not taken over the entire 280 GB of memory but rather hit a pre-set limit?) or reduce the size of the replay buffer to see if you can then train for longer. We have trained models using a replay buffer of size 100000 (roughly 25% smaller), or even 2^16. It should work fine.
There could be many reasons for the memory increase. For example, the replay buffer implementation allocates an extra 25% of storage for fast parallel writes; the Python code also creates copies of the state_dict to sync models between C++ and Python, and these may not get freed right away...
Hi @hengyuan-hu, thanks a lot for getting back and sharing your observations. It is indeed hard to find machines with ~250 GB of memory, especially in an academic setting. Even after reducing the replay buffer size to 2^16, we find that memory consumption still continues to increase (although slowly) after the replay buffer is full.
Is there any principled way to fix this memory leak without having to reduce the size of the replay buffer?
Do you still run out of memory even after halving the size of the replay buffer?
I am looking deeper into this problem now. It might take some time to figure out and fix.
Hi @hengyuan-hu: Yes, we still run out of memory even after halving the replay buffer size, except in cases where we have ~250 GB of RAM. Some of our experimental settings may end up needing a larger replay buffer, so I am not sure reducing its size is a good solution. From my observations of the training log for the IQL_2p setting with replay buffer size 2^16, the used memory at the beginning of the experiment is about 37 GB; it increases to 69 GB by the time the replay buffer is full, and then keeps growing to 95 GB at about 2000 epochs. Does this suggest that there are other factors causing a considerable increase in the used memory?
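For anyone tracking the same numbers, here is a small sketch of how such memory lines can be produced with psutil (whether the repo's own logging uses psutil is an assumption):

# Log host memory in the same "available / used / free" format as the training log.
import psutil

def print_mem():
    gb = 1024 ** 3
    mem = psutil.virtual_memory()
    print(f"available: {mem.available / gb:.3f} GB, "
          f"used: {mem.used / gb:.3f} GB, "
          f"free: {mem.free / gb:.3f} GB")

print_mem()  # call at the beginning of each epoch to watch for growth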
Yes, I just narrowed down the main cause. It is the evaluation code that causes the memory increase at the end of each epoch. For a quick fix, you can disable the evaluation here, make it less frequent (increase the epoch length), or evaluate on fewer games, like 100...
I am still figuring out exactly what in eval causes the leak. It seems less obvious...
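As a rough illustration of that quick fix (a sketch only; the names below are placeholders, not the actual symbols in selfplay.py), the evaluation call can be gated so it runs only every few epochs and on fewer games:

# Minimal sketch: run the expensive evaluation only every `eval_frequency` epochs.
# `train_one_epoch` and `evaluate` are stand-ins for the real training/eval calls.

def train_one_epoch():
    """Placeholder for the per-epoch training loop."""
    pass

def evaluate(num_game=100):
    """Placeholder for the evaluation call; fewer games also lowers peak memory."""
    return 0.0

num_epoch = 100
eval_frequency = 10  # evaluate 10x less often than once per epoch

for epoch in range(num_epoch):
    train_one_epoch()
    if (epoch + 1) % eval_frequency == 0:
        score = evaluate(num_game=100)
        print(f"epoch {epoch}, eval score: {score:.4f}")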
I pushed a fix to master. I couldn't find potential places for a memory leak and neither could Valgrind. I suspect the cause is fragmentation rather than leaking, i.e. we create/free large chunks of memory from time to time and they are not pinned, which means the program just keeps claiming more memory because previously allocated memory is partially occupied by small data. The fix removes one of the most dominant cases of this problem.
This fix will not completely solve everything, as we still re-create the vector of environments & actors at evaluation time. But with this fix, the memory grows at a more acceptable rate of <2.5 GB per 500 epochs/evaluations.
The most effective way to save memory is to encode the data in a memory-efficient way before saving it to the replay buffer and decode it at prefetch time before training. You may want to consider that option as well.
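As a hedged sketch of that idea (not the actual hanabi_SAD storage format; the dtypes and the observation length below are made up for illustration), binary observation features can be bit-packed before going into the replay buffer and expanded back to float32 at prefetch time:

import numpy as np
import torch

def encode_obs(obs: torch.Tensor) -> np.ndarray:
    # Store a 0/1 float observation as packed bits: 8x smaller than uint8,
    # 32x smaller than float32.
    return np.packbits(obs.numpy().astype(np.uint8), axis=-1)

def decode_obs(packed: np.ndarray, original_dim: int) -> torch.Tensor:
    # Expand back to float32 at prefetch time, just before training.
    bits = np.unpackbits(packed, axis=-1)[..., :original_dim]
    return torch.from_numpy(bits).float()

# Example with a made-up binary observation vector of length 838.
obs = (torch.rand(838) > 0.5).float()
packed = encode_obs(obs)
restored = decode_obs(packed, obs.shape[-1])
assert torch.equal(obs, restored)
print(f"float32: {obs.numel() * 4} bytes -> packed: {packed.nbytes} bytes")

The decode step can live in the prefetch/sampling path, so training still sees float32 tensors while the buffer itself holds the compact encoding.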
Hi @hengyuan-hu, thanks for providing a fix. How were you using Valgrind to find the cause of the memory leak? In my case, I have Valgrind version 3.15.0, but it always ends up with Illegal instruction (core dumped) before the Python script actually starts executing.
Hi @hengyuan-hu, thanks again for providing the fix, but just letting you know that there is little to no difference in memory consumption even with the latest commit on master.
I use the following command for Valgrind:
valgrind --leak-check=yes python selfplay.py \
--save_dir exps/run2 \
--method iql \
--num_thread 10 \
--num_game_per_thread 1 \
--sad 1 \
--act_base_eps 0.1 \
--act_eps_alpha 7 \
--lr 6.25e-05 \
--eps 1.5e-05 \
--grad_clip 5 \
--gamma 0.999 \
--seed 1 \
--batchsize 128 \
--burn_in_frames 500 \
--replay_buffer_size 2000 \
--epoch_len 40 \
--priority_exponent 0.9 \
--priority_weight 0.6 \
--train_bomb 0 \
--eval_bomb 0 \
--num_player 2 \
--rnn_hid_dim 512 \
--act_device cuda:1 \
--shuffle_color 1
Ok I will keep investigating.
Hi @hengyuan-hu, any new findings on the memory leak from your investigation? Thanks a lot!
Btw, just noticed a one-line bug here: https://github.com/facebookresearch/hanabi_SAD/blob/master/pyhanabi/eval.py#L23. The line assert agents is None or runnes is None should be assert agents is None or runners is None.
Thanks for the catch. I still don't know the reason yet. I have been playing with some restructuring to reduce the number of unnecessary tensor creations, but have not found a solution.
Hi @hengyuan-hu,
Did you find a fix for this? I must admit that it really gets in the way of running many experiments :(
Have you tried disabling evaluation altogether, i.e. commenting out the code between https://github.com/facebookresearch/hanabi_SAD/blob/502d6a7a52028511704c944dffe1945194e10c3a/pyhanabi/selfplay.py#L254 and https://github.com/facebookresearch/hanabi_SAD/blob/502d6a7a52028511704c944dffe1945194e10c3a/pyhanabi/selfplay.py#L280?
I tried it, and the memory usage seems much more stable. My log looks like this:
available: 487.312 GB, used: 13.205 GB, free: 71.491 GB
available: 458.186 GB, used: 42.330 GB, free: 42.365 GB
available: 444.263 GB, used: 56.253 GB, free: 28.442 GB
available: 442.553 GB, used: 57.964 GB, free: 26.732 GB
available: 439.811 GB, used: 60.706 GB, free: 23.990 GB
available: 437.909 GB, used: 62.607 GB, free: 22.088 GB
available: 435.967 GB, used: 64.550 GB, free: 20.146 GB
available: 434.478 GB, used: 66.039 GB, free: 18.657 GB
available: 433.859 GB, used: 66.658 GB, free: 18.038 GB
available: 430.868 GB, used: 69.648 GB, free: 15.047 GB
available: 427.315 GB, used: 73.202 GB, free: 11.493 GB
available: 426.282 GB, used: 74.234 GB, free: 10.461 GB
available: 425.921 GB, used: 74.596 GB, free: 10.099 GB
available: 425.675 GB, used: 74.842 GB, free: 9.853 GB
available: 425.534 GB, used: 74.983 GB, free: 9.712 GB
available: 425.296 GB, used: 75.220 GB, free: 9.475 GB
available: 425.184 GB, used: 75.333 GB, free: 9.362 GB
available: 425.143 GB, used: 75.374 GB, free: 9.321 GB
available: 425.053 GB, used: 75.464 GB, free: 9.231 GB
available: 424.898 GB, used: 75.618 GB, free: 9.076 GB
available: 424.827 GB, used: 75.689 GB, free: 9.005 GB
available: 424.831 GB, used: 75.685 GB, free: 9.009 GB
available: 424.790 GB, used: 75.727 GB, free: 8.968 GB
available: 424.838 GB, used: 75.679 GB, free: 9.016 GB
available: 424.841 GB, used: 75.675 GB, free: 9.019 GB
available: 424.835 GB, used: 75.681 GB, free: 9.013 GB
There are 1000 gradient steps between each line.
I don't have a better fix for the evaluation for now, but I would suggest disabling evaluation at training time, saving models every 1K gradient steps, and having another script just to evaluate all the models.
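A rough sketch of that workflow (the checkpoint glob pattern and the load/evaluate helpers are placeholders, not the repo's actual APIs):

# Sketch: evaluate saved checkpoints offline, in a separate process from training.
import glob
import os

def load_agent(path):
    """Stand-in: construct the agent and load weights from `path`."""
    return None

def evaluate(agent, num_game=1000):
    """Stand-in: play `num_game` games with `agent` and return the mean score."""
    return 0.0

save_dir = "exps/run2"
for ckpt in sorted(glob.glob(os.path.join(save_dir, "model*.pthw"))):
    agent = load_agent(ckpt)
    score = evaluate(agent, num_game=1000)
    print(f"{ckpt}: mean score {score:.4f}")

Running evaluation in its own process keeps its allocations out of the long-lived training job, which is the point of the suggestion.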
That's a good suggestion, let me try that out :)
Hello, thanks a lot for updating the installation process. I am able to build it successfully. However, when I run dev.sh, my job gets killed after some epochs. I made sure that I have 2 GPUs, added cuda:0 and cuda:1 to the script, and added export OMP_NUM_THREADS=1 as well. But the log still looks like this:
It seems there is a memory leak somewhere, because free memory shrinks quickly after every epoch. Any idea what could be the reason? Thanks in advance for your help.