Closed: timolohrenz closed this issue 2 years ago
> Hi,
> Thanks for your interest.
> `update_freq` is always 1 if you use 8 GPUs. `max_update`, `freeze_finetune_updates` and `warmup_steps` were tuned in each fine-tuning setup (e.g., Base vs. Large, the amount of pre-trained data, etc.). You could just stick to our provided config files to reproduce the results.

Thanks a lot, the training runs are done and I was able to (nearly) reproduce the results. For those interested:
BASE Model, 30h FT data:
LARGE Model, 433h FT data:
Finetuning time is about 17h on a single A100 GPU, even for the LARGE model with 433h of finetuning data. I really appreciate that with the AV-HuBERT approach and your supplied models, even universities like ours without large-scale GPU clusters are able to do research in the region of SOTA performance.
Hi,
Thanks for your interest.
- We use 32 GPUs for pre-training. For fine-tuning, we use 8 GPUs. Thus the `update_freq` is always 1 if you use 8 GPUs.
- The hyperparameters `max_update`, `freeze_finetune_updates` and `warmup_steps` were tuned in each fine-tuning setup (e.g., Base vs. Large, the amount of pre-trained data, etc.). You could just stick to our provided config files to reproduce the results.
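For anyone running this on a different number of GPUs: fairseq's effective batch is (number of GPUs) × `update_freq` × per-GPU batch, so `update_freq` is typically raised to compensate for fewer GPUs. A minimal sketch of that arithmetic, assuming the usual fairseq convention (the reference values are only the ones mentioned in this thread, nothing is read from the configs):

```python
def update_freq_for(target_gpus: int, available_gpus: int, reference_update_freq: int = 1) -> int:
    """Gradient-accumulation factor needed to match the reference effective batch.

    Assumes the usual fairseq convention:
        effective_batch = num_gpus * update_freq * per_gpu_batch
    """
    assert (target_gpus * reference_update_freq) % available_gpus == 0
    return target_gpus * reference_update_freq // available_gpus

# Fine-tuning reference from this thread: 8 GPUs with update_freq = 1.
print(update_freq_for(target_gpus=8, available_gpus=8))  # -> 1
print(update_freq_for(target_gpus=8, available_gpus=1))  # -> 8 (single-GPU run)
print(update_freq_for(target_gpus=8, available_gpus=2))  # -> 4
```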
Hi, here is something I want to ask about `max_update`: is it a per-GPU count, or is it a total that is shared across all GPUs?
For example, if I set `max_update=80k` and the total number of GPUs is 8, does that mean 640k updates across all GPUs (i.e., 80k updates per GPU), or is the total number of updates still 80k (so only 10k updates per GPU)?
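A minimal sketch of how these counters relate under fairseq's usual synchronous data-parallel setup, where every GPU contributes a gradient to the same optimizer step, so the update counter is global rather than per GPU (the per-GPU batch size below is a placeholder, not a value from the configs):

```python
def training_totals(max_update: int, num_gpus: int, update_freq: int, per_gpu_batch: int):
    """Rough bookkeeping for synchronous data-parallel training (fairseq-style DDP).

    Every GPU contributes a gradient to every optimizer step, so the number of
    optimizer updates is max_update regardless of the GPU count; only the number
    of samples consumed per update changes with num_gpus and update_freq.
    """
    samples_per_update = num_gpus * update_freq * per_gpu_batch
    return {
        "optimizer_updates_total": max_update,                  # not max_update * num_gpus
        "forward_backward_passes_per_gpu": max_update * update_freq,
        "samples_seen_total": max_update * samples_per_update,
    }

# 80k updates on 8 GPUs: still 80k optimizer steps in total.
print(training_totals(max_update=80_000, num_gpus=8, update_freq=1, per_gpu_batch=8))
```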
Hi, thanks for providing such extensive code and models for avhubert; setting up the finetuning worked like a charm! 🙏
However, I have a few questions about some of the hyperparameters in the provided configs, as I am in the process of reproducing a lipreading baseline for the VOX-pretrained BASE S2S transformer. More specifically, I am using the `base_vox_30h.yaml` config file. In the following sections I list some parameters that do not match the values from the paper.
**1. Issue**

https://github.com/facebookresearch/av_hubert/blob/cd1fd24e71b18f5c1a7203aec6ce4479a61e7e67/avhubert/conf/finetune/base_vox_30h.yaml#L18-L23

Here the `distributed_world_size` is set to 8, meaning that finetuning is done on 8 GPUs. In the paper it is said that the BASE setup is trained on 32 GPUs. Is this only true for pretraining, and can I assume that finetuning has been done on 8 GPUs? Does `update_freq` always stay at `[1]`?

**2. Issue**

In Section A.4 of the paper we find this paragraph on finetuning the S2S model:
For the 30h finetuning setup I would assume `max_update: 18000`, `freeze_finetune_updates: 18000`, and `warmup_steps: [1800, 5400, 9000]` (a small arithmetic sketch follows the config links below). In the config we can find the following values:

https://github.com/facebookresearch/av_hubert/blob/cd1fd24e71b18f5c1a7203aec6ce4479a61e7e67/avhubert/conf/finetune/base_vox_30h.yaml#L56-L60
https://github.com/facebookresearch/av_hubert/blob/cd1fd24e71b18f5c1a7203aec6ce4479a61e7e67/avhubert/conf/finetune/base_vox_30h.yaml#L95
https://github.com/facebookresearch/av_hubert/blob/cd1fd24e71b18f5c1a7203aec6ce4479a61e7e67/avhubert/conf/finetune/base_vox_30h.yaml#L67-L69
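For cross-checking, the assumed numbers above (1800 / 5400 / 9000 out of 18000) correspond to 10%, 30% and 50% of `max_update`. A small helper that reproduces this arithmetic; interpreting the three values as the warmup/hold/decay phases of a tri-stage LR schedule is an assumption here, not something taken from the config:

```python
def steps_from_ratios(max_update: int, ratios=(0.10, 0.30, 0.50)) -> list[int]:
    """Convert fractions of max_update into absolute step counts.

    The ratios 10%/30%/50% are inferred from the numbers quoted in this issue
    (1800, 5400, 9000 for max_update = 18000); whether they map onto the
    warmup/hold/decay phases of a tri-stage LR scheduler is an assumption.
    """
    return [int(round(max_update * r)) for r in ratios]

print(steps_from_ratios(18_000))  # -> [1800, 5400, 9000]
print(steps_from_ratios(45_000))  # -> [4500, 13500, 22500] (hypothetical larger budget)
```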
So, in order to reproduce your results as closely as possible, should I stick to the paper or to the provided config files?
Thanks a lot in advance!