facebookresearch / GDT

We present a framework for training multi-modal deep learning models on unlabelled video data by forcing the network to learn invariances to transformations applied to both the audio and video streams.
Apache License 2.0

VERY SLOW training on audio-video datasets like kinetics400 and UCF101 #7

Open XinyuSun opened 2 years ago

XinyuSun commented 2 years ago

Hi authors! Thank you for open-sourcing the paper and code; it is very helpful. I am trying to pretrain the GDT model on the kinetics400 dataset, but each epoch takes more than a day. I am running on a server with 8 RTX 3090 GPUs with a per-GPU batch size of 16, for a total batch size of 128, a quarter of the paper's original setting. According to the paper, the authors finished pretraining in 3 days with a batch size of 512, so under normal circumstances an epoch should take no more than about 3 hours. I changed the video decoding backend from pyav to decord, which brought a small improvement in training speed. Was the training speed of the released code tested before release? Where should I look for clues to speed up training?
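For reference, this is roughly how I swapped in decord (a minimal sketch, not the repo's actual loader; `VideoClipDataset`, the clip length, and the file list are placeholders):

```python
import decord
import torch
from torch.utils.data import Dataset

class VideoClipDataset(Dataset):
    """Minimal decord-based clip loader (sketch, not the repo's loader)."""

    def __init__(self, video_paths, clip_len=32):
        self.video_paths = video_paths
        self.clip_len = clip_len

    def __len__(self):
        return len(self.video_paths)

    def __getitem__(self, idx):
        # decord seeks and decodes only the requested frames, which is
        # usually faster than demuxing the whole file with pyav.
        # Assumes each video has at least clip_len frames.
        vr = decord.VideoReader(self.video_paths[idx], num_threads=1)
        start = torch.randint(0, max(1, len(vr) - self.clip_len), (1,)).item()
        frames = vr.get_batch(list(range(start, start + self.clip_len))).asnumpy()
        # (T, H, W, C) uint8 -> (C, T, H, W) float in [0, 1]
        return torch.from_numpy(frames).permute(3, 0, 1, 2).float() / 255.0
```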

Some logs below:

Epoch: [0]  [  360/14961]  eta: 13:42:52  lr: 0.01  clips/s: 16.263  loss: 2.7961 (2.8411)  batch_t/s: 1.0088 (1.4428)  time: 2.8681  data: 1.3705  max mem: 20040
Epoch: [0]  [  370/14961]  eta: 13:46:51  lr: 0.01  clips/s: 13.694  loss: 2.7992 (2.8464)  batch_t/s: 1.0067 (1.0740)  time: 4.3781  data: 3.3474  max mem: 20040
Epoch: [0]  [  370/14961]  eta: 13:46:48  lr: 0.01  clips/s: 13.769  loss: 2.7919 (2.8454)  batch_t/s: 1.0110 (1.7200)  time: 4.3779  data: 1.3611  max mem: 20040
Epoch: [0]  [  370/14961]  eta: 13:46:48  lr: 0.01  clips/s: 13.532  loss: 2.7913 (2.8402)  batch_t/s: 1.0089 (1.4563)  time: 4.3786  data: 2.4327  max mem: 20040
Epoch: [0]  [  380/14961]  eta: 13:31:23  lr: 0.01  clips/s: 14.072  loss: 2.7891 (2.8451)  batch_t/s: 1.0196 (1.0736)  time: 2.5644  data: 1.5199  max mem: 20040
Epoch: [0]  [  380/14961]  eta: 13:31:20  lr: 0.01  clips/s: 14.029  loss: 2.7738 (2.8434)  batch_t/s: 1.0512 (1.7027)  time: 2.5646  data: 0.5402  max mem: 20040
Epoch: [0]  [  380/14961]  eta: 13:31:19  lr: 0.01  clips/s: 14.026  loss: 2.7874 (2.8387)  batch_t/s: 1.0548 (1.4459)  time: 2.5643  data: 1.0631  max mem: 20040
Epoch: [0]  [  390/14961]  eta: 13:36:54  lr: 0.01  clips/s: 15.097  loss: 2.7765 (2.8417)  batch_t/s: 1.0534 (1.7432)  time: 2.6929  data: 0.5196  max mem: 20040
Epoch: [0]  [  390/14961]  eta: 13:36:56  lr: 0.01  clips/s: 14.988  loss: 2.7927 (2.8441)  batch_t/s: 1.0630 (1.0732)  time: 2.6932  data: 1.6344  max mem: 20040
Epoch: [0]  [  390/14961]  eta: 13:36:53  lr: 0.01  clips/s: 16.121  loss: 2.7775 (2.8376)  batch_t/s: 1.0481 (1.4640)  time: 2.6923  data: 1.0834  max mem: 20040
Epoch: [0]  [  400/14961]  eta: 13:43:48  lr: 0.01  clips/s: 16.551  loss: 2.7957 (2.8433)  batch_t/s: 1.0546 (1.0725)  time: 4.4575  data: 3.4058  max mem: 20040
Epoch: [0]  [  400/14961]  eta: 13:43:45  lr: 0.01  clips/s: 1.458  loss: 2.7986 (2.8373)  batch_t/s: 1.0390 (1.4786)  time: 4.4577  data: 2.3538  max mem: 20040
Epoch: [0]  [  400/14961]  eta: 13:43:46  lr: 0.01  clips/s: 0.679  loss: 2.7963 (2.8410)  batch_t/s: 1.0598 (1.7822)  time: 4.4580  data: 1.1610  max mem: 20040
Epoch: [0]  [  410/14961]  eta: 13:29:18  lr: 0.01  clips/s: 15.575  loss: 2.7954 (2.8418)  batch_t/s: 1.0273 (1.0715)  time: 2.8114  data: 1.7718  max mem: 20040
Epoch: [0]  [  410/14961]  eta: 13:29:15  lr: 0.01  clips/s: 15.525  loss: 2.7892 (2.8399)  batch_t/s: 1.0306 (1.7639)  time: 2.8114  data: 0.6421  max mem: 20040
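Looking at the `data:` column, 1–3 s of each ~2.6–4.5 s step is spent waiting for data, so the input pipeline rather than the GPUs seems to be the bottleneck. A quick way to confirm this (a generic PyTorch timing sketch, assuming the loader yields a single tensor batch; not code from this repo):

```python
import time
import torch

def profile_loader(loader, model, device="cuda", num_iters=50):
    """Measure data-loading vs. compute time per iteration (generic sketch)."""
    data_t, compute_t = 0.0, 0.0
    it = iter(loader)
    for _ in range(num_iters):
        t0 = time.perf_counter()
        batch = next(it)  # blocks until the workers produce a batch
        t1 = time.perf_counter()
        with torch.no_grad():
            model(batch.to(device, non_blocking=True))
        torch.cuda.synchronize()  # make GPU time visible to the host timer
        t2 = time.perf_counter()
        data_t += t1 - t0
        compute_t += t2 - t1
    print(f"data: {data_t / num_iters:.3f}s  compute: {compute_t / num_iters:.3f}s per batch")
```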

Sincerely yours.

XinyuSun commented 2 years ago

Average GPU utilization is relatively low compared with other video pretraining methods.
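If the GPUs are mostly idle waiting on data, the usual levers are more DataLoader workers, pinned memory, and prefetching. A generic PyTorch sketch (the worker count and prefetch factor are assumptions to tune per machine, not settings from this repo):

```python
from torch.utils.data import DataLoader

def make_loader(dataset, per_gpu_batch=16, workers=8):
    """Loader settings that often help when `data:` time dominates.
    `workers` is an assumption -- tune to your CPU core count."""
    return DataLoader(
        dataset,
        batch_size=per_gpu_batch,      # per-GPU batch size from the issue
        num_workers=workers,           # overlap video decode with GPU compute
        pin_memory=True,               # faster host-to-GPU copies
        persistent_workers=True,       # avoid re-forking workers each epoch
        prefetch_factor=4,             # batches pre-fetched per worker
        shuffle=True,
    )
```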

billhhh commented 2 years ago

Thanks

XinyuSun commented 2 years ago

Hi, the authors only use the audio model during pretraining; for a fair comparison with other SOTAs, they did not use audio for fine-tuning.
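For anyone fine-tuning video-only from a pretrained checkpoint, something like the following should work (a sketch; the "audio_network" key prefix is hypothetical, so inspect the checkpoint's state_dict keys first):

```python
import torch

def load_video_weights(video_model, ckpt_path):
    """Load only the video pathway from a pretrained checkpoint (sketch;
    the "audio_network" prefix is hypothetical -- print state.keys() to
    find the real prefix in the GDT checkpoint)."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state = ckpt.get("model", ckpt)  # checkpoints often nest under "model"
    video_state = {k: v for k, v in state.items()
                   if not k.startswith("audio_network")}
    missing, unexpected = video_model.load_state_dict(video_state, strict=False)
    print(f"missing keys: {len(missing)}  unexpected keys: {len(unexpected)}")
```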