codertimo / BERT-pytorch

Google AI 2018 BERT pytorch implementation
Apache License 2.0
6.11k stars 1.29k forks

Very low GPU usage when training on 8 GPUs in a single machine #44

Closed mapingshuo closed 5 years ago

mapingshuo commented 5 years ago

Hi, I am currently pretraining BERT on my own data. I use the alpha0.0.1a5 branch (the newest version).
I found that only about 20% of each GPU is in use.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:3F:00.0 Off |                    0 |
| N/A   40C    P0    58W / 300W |  10296MiB / 16152MiB |     32%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:40:00.0 Off |                    0 |
| N/A   37C    P0    55W / 300W |   2742MiB / 16152MiB |     23%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:41:00.0 Off |                    0 |
| N/A   40C    P0    58W / 300W |   2742MiB / 16152MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:42:00.0 Off |                    0 |
| N/A   47C    P0    61W / 300W |   2742MiB / 16152MiB |     24%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:62:00.0 Off |                    0 |
| N/A   36C    P0    98W / 300W |   2742MiB / 16152MiB |     17%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:63:00.0 Off |                    0 |
| N/A   38C    P0    88W / 300W |   2736MiB / 16152MiB |     23%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:64:00.0 Off |                    0 |
| N/A   48C    P0    80W / 300W |   2736MiB / 16152MiB |     25%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:65:00.0 Off |                    0 |
| N/A   46C    P0    71W / 300W |   2736MiB / 16152MiB |     24%      Default |
+-------------------------------+----------------------+----------------------+

I am not familiar with PyTorch. Does anyone know why?

codertimo commented 5 years ago

I think you should increase the batch_size to get full GPU usage. The -b or --batch_size option lets you change the batch_size. :) It seems the 2742MiB on each GPU is the data, and 10296 - 2742 = 7554MiB would be the main model size, so I guess you can make your batch_size about 3 times larger.
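
For illustration, a minimal sketch (not the repo's actual trainer code; the dataset below is a dummy stand-in) of where that batch_size value ends up and why it is the main knob for GPU utilization:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in for the pretraining dataset, just to show the knob.
dummy_dataset = TensorDataset(torch.randint(0, 30000, (1024, 128)))

# The value passed via -b / --batch_size ends up here; a larger batch gives each
# GPU more work per step, until the model plus one batch no longer fits in memory.
train_loader = DataLoader(dummy_dataset, batch_size=128, shuffle=True, num_workers=4)

for (batch,) in train_loader:
    print(batch.shape)  # torch.Size([128, 128]) for the full batches
    break
```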

jiqiujia commented 5 years ago

Wow, 8 V100s... I think the problem is with this line: device = torch.device("cuda:0" if cuda_condition else "cpu"). Only one GPU is used in this situation.

codertimo commented 5 years ago

@mapingshuo BTW... what a beautiful experimental device.. V100 x8...

codertimo commented 5 years ago

@jiqiujia Well, DataParallel automatically distributes the batch to each GPU, as you can see from the memory usage on GPUs 1-7.
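
For context, a minimal sketch (not the repo's trainer code; the linear layer is a stand-in for the BERT model) of why "cuda:0" in that line is only the primary device once the model is wrapped in nn.DataParallel:

```python
import torch
import torch.nn as nn

# "cuda:0" is just the primary/output device; DataParallel still replicates the
# module onto every visible GPU and splits each input batch across them.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = nn.Linear(256, 256)            # stand-in for the BERT model
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)     # scatter inputs, gather outputs on cuda:0
model = model.to(device)

x = torch.randn(64, 256, device=device)
y = model(x)                           # each GPU runs forward() on a slice of the batch
print(y.shape)                         # torch.Size([64, 256])
```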

mapingshuo commented 5 years ago

@codertimo I changed the batch_size from 64 to 128, which helps a little. When I tried a bigger batch_size (192), I got an Out Of Memory error. I'm trying to figure out the OOM problem now :)

codertimo commented 5 years ago

Hah @mapingshuo, sounds like you're adding a new GPU device.

mapingshuo commented 5 years ago

1 GPU is quicker than 8 GPUs in parallel, funny.

mapingshuo commented 5 years ago

Oh, I found my vocab was too big (300,000), which gave me 160 million parameters. Then I tokenized my original text and set min_freq to 5 when generating the vocabulary. Now the vocab_size is 30,000 and the parameter count is 22 million. Training is much quicker, because I could increase the batch_size from 128 to 1024 (rough arithmetic below).

Maybe you should warn people not to generate the vocabulary from raw, untokenized text, haha.

Still, only 10% ~ 20% of the GPU is in use. Confused.
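
For anyone hitting the same blow-up: the parameter count is dominated by the two vocabulary-sized matrices (the token embedding and the masked-LM output projection). A rough back-of-the-envelope check, assuming this repo's default hidden size of 256 (an assumption; substitute your own hidden-size setting), with a purely illustrative helper:

```python
# Rough estimate of the vocabulary-dependent parameters in a BERT-style model.
def vocab_params(vocab_size: int, hidden: int = 256) -> int:
    # token embedding (vocab x hidden) + masked-LM output projection (hidden x vocab)
    return 2 * vocab_size * hidden

print(f"{vocab_params(300_000) / 1e6:.1f}M")  # ~153.6M, most of the reported 160M
print(f"{vocab_params(30_000) / 1e6:.1f}M")   # ~15.4M; the transformer layers make up the rest of ~22M
```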

codertimo commented 5 years ago

@mapingshuo Maybe it's time to increase the batch_size as much as the GPU memory allows.

wq343580510 commented 5 years ago

Try moving the criterion into the forward function in bert.py and returning just the masked-language-model loss; the full encoder output is not needed back on the main GPU (see the sketch below). The speed will more than double, and GPU utilization will reach around 70%-90%. [screenshot]

I learned the trick from a newbie colleague.
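
For later readers, a minimal sketch of that idea (hypothetical module and class names, not the actual bert.py code): compute the loss inside forward() so that, under nn.DataParallel, each replica sends back only a scalar loss instead of the full vocabulary-sized logits:

```python
import torch
import torch.nn as nn

class MaskedLMWithLoss(nn.Module):
    """Wrap the LM head and the criterion so forward() returns only the loss."""

    def __init__(self, encoder: nn.Module, hidden: int, vocab_size: int):
        super().__init__()
        self.encoder = encoder                       # stand-in for the BERT encoder
        self.mlm_head = nn.Linear(hidden, vocab_size)
        self.criterion = nn.NLLLoss(ignore_index=0)  # 0 assumed to be the padding index

    def forward(self, tokens: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        hidden_states = self.encoder(tokens)                              # (batch, seq, hidden)
        log_probs = torch.log_softmax(self.mlm_head(hidden_states), dim=-1)
        # NLLLoss wants (batch, vocab, seq); only this scalar leaves each replica.
        return self.criterion(log_probs.transpose(1, 2), labels)

# Toy usage: an embedding layer stands in for the real encoder.
vocab_size, hidden = 30_000, 256
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MaskedLMWithLoss(nn.Embedding(vocab_size, hidden), hidden, vocab_size).to(device)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)   # each replica computes its own loss

tokens = torch.randint(0, vocab_size, (8, 64), device=device)
labels = torch.randint(0, vocab_size, (8, 64), device=device)
loss = model(tokens, labels)
loss.mean().backward()               # .mean() also covers the multi-GPU case, where
                                     # DataParallel gathers one loss value per replica
```

The gather step is the bottleneck this trick removes: with a 30,000-300,000 word vocabulary, the logits tensor copied back to GPU 0 is huge, while a scalar loss is essentially free to transfer.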

harsha-sharechat-account commented 4 years ago

1 GPU is quicker than 8 GPUs in parallel, funny.

Hi, can you please explain how to overcome that? It is a problem I have been facing as well.

mapingshuo commented 4 years ago

1 GPU is quicker than 8 GPUs in parallel, funny.

Hi, can you please explain how to overcome that? It is a problem I have been facing as well.

Sorry, I forgot...

Huntersxsx commented 4 years ago

Try moving the criterion into the forward function in bert.py and returning just the masked-language-model loss; the full encoder output is not needed back on the main GPU. The speed will more than double, and GPU utilization will reach around 70%-90%. [screenshot]

I learned the trick from a newbie colleague.

Sorry, I ran into an OOM problem, but I don't understand exactly what you meant. Is there something different between your screenshot and the author's original code?