facebookresearch / av_hubert

A self-supervised learning framework for audio-visual speech

Finetuning parameter mismatch between paper and configs #33

Closed timolohrenz closed 2 years ago

timolohrenz commented 2 years ago

Hi, thanks for providing such extensive code and models for avhubert, setting up the finetuning worked like a charm! 🙏

However, I have a few questions about some of the hyperparameters in the provided configs, as I am in the process of reproducing a lipreading baseline with the VOX-pretrained BASE S2S transformer. More specifically, I am using the base_vox_30h.yaml config file:

In the following sections I found some parameters that do not match the values from the paper.

1. Issue
https://github.com/facebookresearch/av_hubert/blob/cd1fd24e71b18f5c1a7203aec6ce4479a61e7e67/avhubert/conf/finetune/base_vox_30h.yaml#L18-L23

Here distributed_world_size is set to 8, meaning that finetuning is done using 8 GPUs. The paper says that the BASE setup is trained on 32 GPUs. Is this only true for pretraining, and can I assume that finetuning was done on 8 GPUs? Does update_freq always stay at [1]?
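
In config terms, this is the part I am asking about (a sketch of just these two fields as I understand them, not a verbatim copy of the linked lines):

```yaml
distributed_training:
  distributed_world_size: 8   # fine-tuning on 8 GPUs?
optimization:
  update_freq: [1]            # does this always stay at [1]?
```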

2. Issue
In the paper, Section A.4, we find this paragraph on finetuning the S2S model:

In S2S, the pre-trained model (i.e., encoder) is frozen for the first N% updates. N is 100 and 50 for 30h and 433h labeled setting respectively. The entire model is trained for 18K/45K steps in the 30h/433h setting. Both models are trained with Adam, with the learning rate being warmed up for the first P% of updates to a peak of 0.001 and linearly decayed. P is tuned among {10, 30, 50}. All hyperparameters are tuned on the validation set.

For the 30h finetuning setup I would assume max_update: 18000, freeze_finetune_updates: 18000, and warmup_steps: [1800, 5400, 9000]. In the config we can find the following values:

https://github.com/facebookresearch/av_hubert/blob/cd1fd24e71b18f5c1a7203aec6ce4479a61e7e67/avhubert/conf/finetune/base_vox_30h.yaml#L56-L60

https://github.com/facebookresearch/av_hubert/blob/cd1fd24e71b18f5c1a7203aec6ce4479a61e7e67/avhubert/conf/finetune/base_vox_30h.yaml#L95

https://github.com/facebookresearch/av_hubert/blob/cd1fd24e71b18f5c1a7203aec6ce4479a61e7e67/avhubert/conf/finetune/base_vox_30h.yaml#L67-L69
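
For clarity, here is how I would collect the paper-derived numbers from above into a config-style sketch (my own reading of Section A.4, using the key names from the finetune configs; these are not the values of the linked base_vox_30h.yaml):

```yaml
# Sketch of the 30h finetuning values implied by Section A.4 (my interpretation)
optimization:
  max_update: 18000                # "trained for 18K steps in the 30h setting"
  lr: [0.001]                      # peak learning rate
model:
  freeze_finetune_updates: 18000   # N = 100%: encoder frozen for all updates
lr_scheduler:
  warmup_steps: 1800               # P tuned among {10, 30, 50}% -> 1800 / 5400 / 9000
```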

So, in order to reproduce your results as closely as possible, should I stick to the paper or to the provided config files?

Thanks a lot in advance!

chevalierNoir commented 2 years ago

Hi,

Thanks for your interest.

  1. We use 32 GPUs for pre-training. For fine-tuning, we use 8 GPUs. Thus the update_freq is always 1 if you use 8 GPUs.
  2. The hyperparameters max_update, freeze_finetune_updates and warmup_steps were tuned in each fine-tuning setup (e.g., Base vs. Large, the amount of pre-trained data, etc). You could just stick to our provided config files to reproduce the results.
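
On point 1: if you have fewer than 8 GPUs available, a rough way to keep the same effective batch size is to increase update_freq so that distributed_world_size * update_freq stays at 8 (a sketch of the idea, not a configuration we specifically benchmarked):

```yaml
# Example: fine-tuning on a 2-GPU machine instead of 8 GPUs
distributed_training:
  distributed_world_size: 2
optimization:
  update_freq: [4]   # 2 GPUs x 4 gradient-accumulation steps = 8-GPU equivalent
```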
timolohrenz commented 2 years ago

Thanks a lot, the training runs are done and I was able to (nearly) reproduce the results. For those interested:

BASE Model, 30h FT data:

LARGE Model, 433h FT data:

Finetuning time is about 17h on a single A100 GPU, even for the LARGE model with 433h of finetuning data. I really appreciate that, with the AV-HuBERT approach and your supplied models, even universities like ours without large-scale GPU clusters are able to do research in the region of SOTA performance.

PussyCat0700 commented 10 months ago

> Hi,
>
> Thanks for your interest.
>
> 1. We use 32 GPUs for pre-training. For fine-tuning, we use 8 GPUs. Thus the update_freq is always 1 if you use 8 GPUs.
> 2. The hyperparameters max_update, freeze_finetune_updates and warmup_steps were tuned in each fine-tuning setup (e.g., Base vs. Large, the amount of pre-trained data, etc). You could just stick to our provided config files to reproduce the results.

Hi, I have a question about max_update: is max_update a per-GPU count, or is it a total that gets divided across all GPUs?

For example, if I set max_update=80k and the total number of GPUs is 8, does that mean 640k updates across all GPUs (i.e., 80k updates per GPU), or is the total still 80k (i.e., only 10k updates per GPU)?