facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

VQ-wav2vec: What is the maximum length of the sequence when training tokenized audio with BERT architectures? #1929

Closed: shamanez closed this issue 4 years ago

shamanez commented 4 years ago

In the usual wav2vec training setup, the maximum audio length is said to be 150000 samples, which yields 935 discrete representations. I am still not clear on the maximum sequence length of the transformer architectures used to train on the tokenized audio with BERT-style tasks.
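For reference, this is roughly how I obtain the discrete tokens (a sketch following the vq-wav2vec example usage; the checkpoint path is a placeholder):

```python
import torch
import fairseq

# Load a pre-trained vq-wav2vec checkpoint (path is a placeholder).
models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task(
    ['/path/to/vq-wav2vec.pt']
)
model = models[0]
model.eval()

# Quantize a 150000-sample (~9.4 s at 16 kHz) waveform into discrete codes.
wav_input_16khz = torch.randn(1, 150000)
z = model.feature_extractor(wav_input_16khz)
_, idxs = model.vector_quantizer.forward_idx(z)
print(idxs.shape)  # roughly (1, 935, 2): timesteps x codebook groups
```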

shamanez commented 4 years ago

I checked the checkpoint; it says max_positions is 2048.

Namespace(activation_dropout=0.0, activation_fn='gelu', adam_betas='(0.9, 0.98)', adam_eps=1e-06, arch='roberta_base', attention_dropout=0.1, best_checkpoint_metric='loss', bpe=None, bucket_cap_mb=25, clip_norm=0.0, cpu=False, criterion='masked_lm', curriculum=0, data='/checkpoint/stes/asr/latent_variables_vae/latents_vae_full_large_050919/190905223551798386820/checkpoint-190909020357-334272527/roberta-libri960', dataset_impl=None, ddp_backend='c10d', device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method='tcp://learnfair5025:55498', distributed_no_spawn=True, distributed_port=55498, distributed_rank=0, distributed_world_size=128, dropout=0.1, encoder_attention_heads=12, encoder_embed_dim=768, encoder_ffn_embed_dim=3072, encoder_layers=12, end_learning_rate=0.0, find_unused_parameters=False, fix_batches_to_gpus=False, force_anneal=None, fp16=True, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, freq_weighted_replacement=False, keep_interval_updates=1, keep_last_epochs=-1, leave_unmasked_prob=0.1, log_format='json', log_interval=500, lr=[0.0005], lr_scheduler='polynomial_decay', mask_multiple_length=10, mask_prob=0.5, mask_whole_words=False, max_epoch=0, ### max_positions=2048, max_sentences=2, max_sentences_valid=2, max_tokens=None, max_tokens_valid=None, max_update=250000, maximize_best_checkpoint_metric=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=-1, no_epoch_checkpoints=True, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, num_workers=4, optimizer='adam', optimizer_overrides='{}', pooler_activation_fn='tanh', pooler_dropout=0.0, power=1.0, preload_codebook='', random_token_prob=0.1, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', sample_break_mode='complete', save_dir='/checkpoint/stes/asr/latent_variables_vae/latents_vae_full_large_050919/190905223551798386820/checkpoint-190909020357-334272527/roberta-libri960/testing/run-sweep-bert_large.fp16.bm_complete.tps2048.roberta_base.adam.b2_0.98.eps1e-06.cl0.0.lr0.0005.wu10000.mask10.mprob0.5.wd0.01.bsz2.uf1.mu250000.s5.ngpu128', save_interval=1, save_interval_updates=25000, seed=5, sentence_avg=False, skip_invalid_size_inputs_valid_test=True, task='masked_lm', tbmf_wrapper=False, tensorboard_logdir='', threshold_loss_scale=None, tokenizer=None, tokens_per_sample=2048, total_num_update=250000, train_subset='train', update_freq=[1], use_bmuf=False, user_dir=None, valid_subset='valid', validate_interval=1, warmup_updates=10000, weight_decay=0.01)
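(For anyone checking the same thing: a quick sketch, assuming an older-style fairseq checkpoint that stores its training arguments under the `args` key; newer checkpoints store a `cfg` object instead. The path is a placeholder.)

```python
import torch

# Inspect the training arguments saved inside the BERT/RoBERTa checkpoint.
ckpt = torch.load('/path/to/bert_checkpoint.pt', map_location='cpu')
args = ckpt['args']  # argparse.Namespace saved at training time
print(args.max_positions, args.tokens_per_sample)  # both 2048 for this model
```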

leo19941227 commented 2 years ago

Hi @shamanez, @david-macleod and @alexeib,

Many thanks for the helpful discussion on how to extract RoBERTa features from vq-wav2vec!

Recently I encountered a `tokens exceeds maximum length: 2862 > 2048` error when extracting features on LibriSpeech. The error was raised at

https://github.com/pytorch/fairseq/blob/0dfd6b624081fc4e1c72fc74ae0cd2de199c334c/fairseq/models/roberta/hub_interface.py#L88-L92

According to the above example, when the waveform has 150000 samples (about 9 seconds), vq-wav2vec creates 935 discrete representations. However, there are utterances in LibriSpeech longer than 27 seconds, so vq-wav2vec will produce sequences of at least 935 * 3 = 2805 timesteps.
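As a rough sanity check of those numbers (assuming 16 kHz audio):

```python
# Rough token-rate estimate for vq-wav2vec, assuming 16 kHz input.
samples_per_token = 150000 / 935               # ~160 samples, i.e. ~10 ms per token
tokens_per_second = 16000 / samples_per_token  # ~100 tokens per second of audio
print(27 * tokens_per_second)                  # ~2693 tokens for a 27 s utterance
print(2862 / tokens_per_second)                # the failing utterance is ~28.7 s long
```

Either way, long utterances end up well above the 2048-token limit.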

Also, from #2197 we know that RoBERTa was pre-trained without cropping the utterances, so I am wondering: is it normal to encounter this error on LibriSpeech? I thought the maximum supported sequence length of the released RoBERTa would be long enough to extract features for all of LibriSpeech, since it was pre-trained on the LibriSpeech 960-hour set.
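In the meantime, the workaround I am considering is to split long token sequences into windows of at most 2048 before calling extract_features, roughly like this (just a sketch; `roberta` is the loaded hub interface and `tokens` a 1-D LongTensor of encoded indices, both hypothetical names):

```python
import torch

def extract_in_windows(roberta, tokens, window=2048):
    """Extract RoBERTa features window by window so no chunk exceeds max_positions."""
    feats = []
    for start in range(0, tokens.size(0), window):
        chunk = tokens[start:start + window].unsqueeze(0)  # (1, <=window)
        feats.append(roberta.extract_features(chunk))      # (1, <=window, hidden_dim)
    return torch.cat(feats, dim=1)                         # (1, T, hidden_dim)
```

Chunking like this loses context across window boundaries, though, which is why I would prefer to know whether the released model is expected to handle full-length utterances directly.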

Please let me know if I am wrong about anything! Thanks!