LeapLabTHU / MLLA

Official repository of MLLA

about training speed #5

Closed · LQchen1 closed this issue 2 weeks ago

LQchen1 commented 3 weeks ago

Thank you for your excellent work. I am trying to reproduce your code: I ran MLLA-T on 8 A100 cards and found that it takes about 2 hours to run one epoch. May I ask if this training speed is normal? It currently uses only 24GB of GPU memory, and even if I increase the batch size, I think this time is still unacceptable. (screenshot attached)

tian-qing001 commented 3 weeks ago

Hi @LQchen1, thanks for your interest in our work.

I believe something is going wrong. When I train MLLA-T on 8 RTX 3090 GPUs, it takes about 10 minutes to finish one epoch, so one epoch of MLLA-T on 8 A100s should take less than 10 minutes.

[2024-06-11 01:26:48 mlla_tiny] (main.py 185): INFO Start training
[2024-06-11 01:27:42 mlla_tiny] (main.py 296): INFO Train: [1/300][100/1251]    eta 0:10:18 lr 0.000005 time 0.4640 (0.5371)    loss 6.9216 (6.9214)    grad_norm 0.4345 (0.4588)   mem 14932MB
[2024-06-11 01:28:29 mlla_tiny] (main.py 296): INFO Train: [1/300][200/1251]    eta 0:08:50 lr 0.000009 time 0.5026 (0.5040)    loss 6.9210 (6.9169)    grad_norm 0.4001 (0.4394)   mem 14932MB
[2024-06-11 01:29:17 mlla_tiny] (main.py 296): INFO Train: [1/300][300/1251]    eta 0:07:50 lr 0.000013 time 0.4652 (0.4941)    loss 6.9226 (6.9148)    grad_norm 0.4002 (0.4274)   mem 14932MB
[2024-06-11 01:30:04 mlla_tiny] (main.py 296): INFO Train: [1/300][400/1251]    eta 0:06:56 lr 0.000017 time 0.4729 (0.4890)    loss 6.9135 (6.9124)    grad_norm 0.3630 (0.4166)   mem 14932MB
[2024-06-11 01:30:51 mlla_tiny] (main.py 296): INFO Train: [1/300][500/1251]    eta 0:06:05 lr 0.000021 time 0.4781 (0.4857)    loss 6.9038 (6.9103)    grad_norm 0.4079 (0.4092)   mem 14932MB
[2024-06-11 01:31:39 mlla_tiny] (main.py 296): INFO Train: [1/300][600/1251]    eta 0:05:15 lr 0.000025 time 0.4626 (0.4837)    loss 6.8826 (6.9072)    grad_norm 0.4651 (0.4137)   mem 14932MB
[2024-06-11 01:32:26 mlla_tiny] (main.py 296): INFO Train: [1/300][700/1251]    eta 0:04:26 lr 0.000029 time 0.4630 (0.4820)    loss 6.8112 (6.9009)    grad_norm 0.6705 (0.4444)   mem 14932MB
[2024-06-11 01:33:13 mlla_tiny] (main.py 296): INFO Train: [1/300][800/1251]    eta 0:03:37 lr 0.000033 time 0.4755 (0.4810)    loss 6.8075 (6.8910)    grad_norm 1.0771 (0.5062)   mem 14932MB
[2024-06-11 01:34:01 mlla_tiny] (main.py 296): INFO Train: [1/300][900/1251]    eta 0:02:48 lr 0.000037 time 0.4675 (0.4801)    loss 6.8490 (6.8783)    grad_norm 1.4148 (0.5976)   mem 14932MB
[2024-06-11 01:34:48 mlla_tiny] (main.py 296): INFO Train: [1/300][1000/1251]   eta 0:02:00 lr 0.000041 time 0.4878 (0.4793)    loss 6.7347 (6.8658)    grad_norm 1.6200 (0.6977)   mem 14932MB
[2024-06-11 01:35:36 mlla_tiny] (main.py 296): INFO Train: [1/300][1100/1251]   eta 0:01:12 lr 0.000045 time 0.4628 (0.4792)    loss 6.7812 (6.8526)    grad_norm 2.6039 (0.7968)   mem 14932MB
[2024-06-11 01:36:23 mlla_tiny] (main.py 296): INFO Train: [1/300][1200/1251]   eta 0:00:24 lr 0.000049 time 0.4653 (0.4788)    loss 6.7859 (6.8408)    grad_norm 2.3491 (0.8946)   mem 14932MB
[2024-06-11 01:36:48 mlla_tiny] (main.py 304): INFO EPOCH 1 training takes 0:09:59
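
As a rough sanity check, the per-iteration times and iteration count in the log above are consistent with each other and with ImageNet-1k (a back-of-the-envelope calculation; the global batch size of 1024 is an assumption based on the common default of 128 images per GPU on 8 GPUs, not something stated in the log):

```python
# Back-of-the-envelope check of the training log above.
# The global batch size of 8 * 128 = 1024 is an assumption, not stated in the log.
iters_per_epoch = 1251      # from the log: [.../1251]
avg_iter_time_s = 0.48      # approximate average of the "time" column
global_batch = 8 * 128      # assumed: 8 GPUs x 128 images each

print(iters_per_epoch * avg_iter_time_s / 60)  # ~10 minutes per epoch, matching "0:09:59"
print(iters_per_epoch * global_batch)          # ~1.28M images, the size of ImageNet-1k
```
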
tian-qing001 commented 3 weeks ago

I think your GPUs may be getting stalled by something.

(screenshot attached)

Additionally, you can turn on --amp to save GPU memory and speed up training.
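
For reference, this is roughly what an --amp flag enables under the hood, sketched with native PyTorch mixed precision (torch.cuda.amp); it is an illustration of the mechanism, not the MLLA training loop itself:

```python
# Minimal sketch of mixed-precision training with torch.cuda.amp.
# Illustrative only; not taken from the MLLA repository.
import torch

def train_one_epoch_amp(model, loader, optimizer, criterion, device="cuda"):
    scaler = torch.cuda.amp.GradScaler()   # scales the loss to avoid fp16 gradient underflow
    model.train()
    for images, targets in loader:
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():    # run the forward pass in mixed precision
            loss = criterion(model(images), targets)
        scaler.scale(loss).backward()      # backward pass on the scaled loss
        scaler.step(optimizer)             # unscale gradients, then take the optimizer step
        scaler.update()                    # adjust the loss scale for the next iteration
```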

LQchen1 commented 3 weeks ago

(screenshot attached)

When I ran other code (VMamba), I didn't find anything wrong with the server. Sorry, I didn't mean to imply that there is something wrong with your code. I want to reinstall the conda environment and try again; to be honest, I didn't install the environment exactly as required by your version. If I get a result, I will report it. Of course, I would be grateful if you could provide a detailed configuration of the environment. Here's part of my environment setup: (screenshot attached)

tian-qing001 commented 3 weeks ago

Hi @LQchen1, it seems that the initial batches are experiencing significant delays, resulting in a long estimated time. Perhaps you can wait until the first epoch is completed to see how long it really takes.

LQchen1 commented 3 weeks ago

@tian-qing001 Thanks for your advice, I will keep it running. I suspect the reason is that the dataset is not stored on the local server but on a shared server, although I did not see such long initial batch times when running other code.

tian-qing001 commented 3 weeks ago

And I noticed that the batch size you used when training VMamba is 4× the one used for MLLA. It might be helpful to increase the batch size, e.g. --amp --batch-size 512.

LQchen1 commented 3 weeks ago

@tian-qing001 Yes, increasing the batch size will reduce the time, but the first epoch of VMamba takes only about 15 minutes, roughly an 8× speedup over the current run, so this is clearly not just a matter of batch size. I will keep the same batch size to troubleshoot further, and use --amp.

tian-qing001 commented 3 weeks ago

@LQchen1 Thank you for trying.

LQchen1 commented 3 weeks ago

@tian-qing001 Hello, I used two A100 cards to test the training speed with a batch size of 512. Here are the results: VMamba: (screenshot attached) MLLA (with --amp): (screenshot attached). I am very sad that I can't get MLLA to train fast. I will try changing the torch version next; is it possible that this is the reason for the slow MLLA training?

tian-qing001 commented 3 weeks ago

Hi @LQchen1. It seems there might be an issue with the data loading process because your dataset isn't stored locally. I don't believe the model could run this slowly with any version of PyTorch. I also noticed you had a similar problem when training VMamba a few days ago; how did you fix that issue?
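
One simple way to test this hypothesis is to time how long the dataloader alone takes to produce batches, separately from the forward/backward pass. A minimal sketch (not from the MLLA code; it assumes a standard PyTorch DataLoader object named loader):

```python
# Rough check of whether data loading is the bottleneck.
# Assumes a standard PyTorch DataLoader named `loader` (hypothetical here).
import time

def profile_loader(loader, num_batches=100):
    waits = []
    end = time.time()
    for i, _batch in enumerate(loader):
        waits.append(time.time() - end)  # time spent waiting for this batch
        end = time.time()
        if i + 1 >= num_batches:
            break
    print(f"mean wait per batch: {sum(waits) / len(waits):.3f}s over {len(waits)} batches")
```

If the mean wait per batch is close to the ~0.5 s per-iteration compute time seen in the log above, data loading from the shared server is likely what is limiting throughput.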

LQchen1 commented 3 weeks ago

@tian-qing001 When data loading was slow at first, I followed the advice from VMamba: if training in VMamba is slow, disable cuDNN acceleration, like this: torch.backends.cudnn.enabled = False, torch.backends.cudnn.benchmark = False, torch.backends.cudnn.deterministic = False. But later I found that the program also ran normally with cuDNN enabled, so in fact I didn't change anything. Maybe I can solve this problem now by transferring the data to local storage. I will try. Thank you for your suggestion.
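
For reference, the cuDNN switches mentioned above written out as valid PyTorch code (whether disabling cuDNN helps is workload-dependent; this just restates the settings from the comment):

```python
# The cuDNN switches referenced in the comment above.
import torch

torch.backends.cudnn.enabled = False        # disable cuDNN entirely
torch.backends.cudnn.benchmark = False      # no autotuning of convolution algorithms
torch.backends.cudnn.deterministic = False  # do not force deterministic kernels
```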