If I am right, it currently uses the wav2vec_small.pt model to define the encoder architecture, adds the decoder layer, and then copies the parameters from checkpoint_best.pt into that architecture. You need both files for inference.
@spygaurad I thought so at first, but what makes me think something is wrong is that the fine-tuned English models on the repo do not produce that error. Also, if the checkpoint has the parameters, then its state dict should imply the architecture as well.
@phantomcoder1996 Yeah, I don't know why it is implemented that way. I fine-tuned on a different language and I also need both .pt files (the small.pt as well as the finetuned.pt) at inference time.
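A quick way to see what the checkpoint actually stores (a sketch; checkpoint_best.pt stands in for whatever your fine-tuning run produced): the 'model' entry is only an OrderedDict of weight tensors, so the architecture itself still has to be rebuilt from the saved args/cfg, which is where the wav2vec_small.pt path comes back in.
import torch
ckpt = torch.load('checkpoint_best.pt', map_location='cpu')
print(type(ckpt['model']))      # OrderedDict of parameter tensors, no module structure
print(list(ckpt['model'])[:3])  # a few parameter names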
The root cause: the dict structure of a fine-tuned model differs from that of an official model such as wav2vec_small_960h, and at the beginning of inference the fairseq framework automatically loads the training parameters according to your fine-tuned model's dict structure.
Here are the inspection and a temporary solution:
import torch
finetune_path = '/your/finetune/path.pt'
official_path = '/the/official/path/wav2vec_small_960h.pt'
fixed_model_path = '/path/to/fixed_model.pt'
finetune = torch.load(finetune_path, map_location=torch.device('cpu'))
official = torch.load(official_path, map_location=torch.device('cpu'))
print('finetune keys:', finetune.keys(), 'official keys:', official.keys())
# 'args' in the finetuned checkpoint is None
print('finetune args:', finetune['args'], 'official args:', official['args'])
# As a temporary workaround, copy the args from the official model
finetune.pop('cfg')
finetune['args'] = official['args']
torch.save(finetune, fixed_model_path)
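A minimal load test for the patched file (a sketch, reusing fixed_model_path from above; depending on your setup you may also need arg_overrides, e.g. {'data': '/path/to/manifest'}, so the task can find its dictionary):
from fairseq import checkpoint_utils
models, args, task = checkpoint_utils.load_model_ensemble_and_task([fixed_model_path])
print(models[0].__class__.__name__)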
@mychiux413 thank you for the quick solution.
This temporary fix may make the fixed model unable to resume training, because we drop the training configuration to make it standalone, so don't forget to keep a backup. Fixing the inference code would be a better idea.
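For the backup, something as simple as this works (a sketch, using the paths from the snippet above):
import shutil
# keep an untouched copy so training can still be resumed from the original
shutil.copy(finetune_path, finetune_path + '.bak')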
I opened a similar issue in #2828 and have already investigated; it happens because torch cannot read the model .pt file.
So far there is no answer as to why, but based on the name of the function that runs, load_checkpoint_to_cpu, maybe (just maybe) the error occurs because we train on GPU while fairseq tries to open the checkpoint on CPU.
The strange thing is that the model I use comes from an unfinished pre-training run; if I keep training it toward the best pre-trained model, there is no problem. I train on Google Colab, so every 12 hours I have to set up a new virtual machine to continue training on the dataset.
But right now I am preparing the next step, fine-tuning, and whether I use the wav2vec 2.0 base model or the vq-wav2vec model, it cannot load the checkpoint.
I already followed the instructions given by @mychiux413; the problem is that my pre-trained model was created from my own dataset, the args in my model is None, and I cannot start the fine-tuning process at all.
import torch
model_path = 'checkpoint_best.pt'
mymodel = torch.load(model_path, map_location=torch.device('cpu'))
Then:
print('mymodel args: ', mymodel['args'])
mymodel args: None
But 'cfg' is populated:
print('mymodel cfg: ', mymodel['cfg'])
mymodel cfg: {..., 'optimization': {'max_epoch': 0, 'max_update': 400000, 'clip_norm': 25.0, 'lr': [0.0005], ...}, 'checkpoint': {'save_dir': '/content/drive/My Drive/wav2vec_v2_pre_train_model', 'restore_file': 'checkpoint_last.pt', 'no_epoch_checkpoints': True, ...}, 'task': Namespace(_name='audio_pretraining', arch='wav2vec2', data='/content/drive/My Drive/wav_manifest/', ...), 'optimizer': {'_name': 'adam', 'adam_betas': '(0.9,0.98)', 'adam_eps': 1e-06, 'weight_decay': 0.01, 'lr': [0.0005], ...}, 'criterion': Namespace(_name='wav2vec', ...), 'lr_scheduler': Namespace(_name='polynomial_decay', ...), 'model': Namespace(_name='wav2vec2', arch='wav2vec2', encoder_layers=12, encoder_embed_dim=768, encoder_attention_heads=12, conv_feature_layers='[(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512,2,2)] * 2', quantize_targets=True, ...)}
(Abridged: the full dump repeats the complete set of training flags inside each of the 'task', 'criterion', 'lr_scheduler', and 'model' Namespaces.)
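As an aside: since the dump above shows that cfg already carries the complete set of training flags as argparse Namespaces under 'task', 'criterion', 'lr_scheduler', and 'model', an untested shortcut (a sketch, assuming your checkpoint matches the dump; the actual fix used here follows below) would be to reuse the 'model' Namespace as args instead of rebuilding it:
import torch
mymodel = torch.load('checkpoint_best.pt', map_location=torch.device('cpu'))
# cfg['model'] is a flat Namespace holding all the training flags
mymodel['args'] = mymodel['cfg']['model']
torch.save(mymodel, 'fixed_model.pt')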
Do I have to use some option when creating my own pre-trained model to make it fine-tunable?
What should I do?
Please, I need advice...
EDIT: I found the answer; here is what to do if you run into the same problem.
Following @mychiux413, I downloaded the 3.15 GB shared pre-trained model to inspect the content of its 'args', then followed the training pipeline and debugged the start-up process.
I don't know why a pre-trained model created from your own dataset with the command from README.md ends up without 'args' inside it, but it does, and that causes trouble at the next step, the fine-tuning process.
To solve it:
First, record the command used to create your own pre-trained model; for example, mine was:
!python3 /content/repo/fairseq/train.py '/content/drive/My Drive/wav_manifest/' \
--save-dir '/content/drive/My Drive/wav2vec_v2_pre_train_model' --fp16 --num-workers 128 \
--task audio_pretraining --criterion wav2vec --arch wav2vec2 \
--log-keys '["prob_perplexity","code_perplexity","temp"]' --quantize-targets --extractor-mode default \
--conv-feature-layers '[(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512,2,2)] * 2' --final-dim 256 --latent-vars 320 \
--latent-groups 2 --latent-temp '(2,0.5,0.999995)' --infonce --optimizer adam --adam-betas '(0.9,0.98)' \
--adam-eps 1e-06 --lr-scheduler polynomial_decay --total-num-update 400000 --lr 0.0005 --warmup-updates 32000 \
--mask-length 10 --mask-prob 0.65 --mask-selection static --mask-other 0 --encoder-layerdrop 0.05 \
--dropout-input 0.1 --dropout-features 0.1 --feature-grad-mult 0.1 --loss-weights '[0.1, 10]' --conv-pos 128 \
--conv-pos-groups 16 --num-negatives 100 --cross-sample-negatives 0 --max-sample-size 1500000 \
--no-epoch-checkpoints --min-sample-size 2000 --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
--max-tokens 1500000 --max-update 400000 --skip-invalid-size-inputs-valid-test --ddp-backend no_c10d
Then we use Python to reproduce a little of the training start-up in order to rebuild the args that command produces. The script:
import torch, argparse, logging, math, os, random, sys, numpy as np
from fairseq import options
# Recreate argv manually, based on the command used to create the pre-trained model.
sys.argv = ['/content/repo/fairseq/train.py' ,
'/content/drive/My Drive/wav_manifest/',
'--save-dir',
'/home/bram/Documents/coding/speech/traindata/cvdata/ori2/model/wav2vec2l',
'--fp16',
'--num-workers',
'128',
'--task',
'audio_pretraining',
'--criterion',
'wav2vec',
'--arch',
'wav2vec2',
'--log-keys',
'["prob_perplexity","code_perplexity","temp"]',
'--quantize-targets',
'--extractor-mode',
'default',
'--conv-feature-layers',
'[(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512,2,2)] * 2',
'--final-dim',
'256',
'--latent-vars',
'320',
'--latent-groups',
'2',
'--latent-temp',
'(2,0.5,0.999995)',
'--infonce',
'--optimizer',
'adam',
'--adam-betas',
'(0.9,0.98)',
'--adam-eps',
'1e-06',
'--lr-scheduler',
'polynomial_decay',
'--total-num-update',
'400000',
'--lr',
'0.0005',
'--warmup-updates',
'32000',
'--mask-length',
'10',
'--mask-prob',
'0.65',
'--mask-selection',
'static',
'--mask-other',
'0',
'--encoder-layerdrop',
'0.05',
'--dropout-input',
'0.1',
'--dropout-features',
'0.1',
'--feature-grad-mult',
'0.1',
'--loss-weights',
'[0.1, 10]',
'--conv-pos',
'128',
'--conv-pos-groups',
'16',
'--num-negatives',
'100',
'--cross-sample-negatives',
'0',
'--max-sample-size',
'1500000',
'--no-epoch-checkpoints',
'--min-sample-size',
'2000',
'--dropout',
'0.1',
'--attention-dropout',
'0.1',
'--weight-decay',
'0.01',
'--max-tokens',
'1500000',
'--max-update',
'400000',
'--skip-invalid-size-inputs-valid-test',
'--ddp-backend',
'no_c10d'
]
# Replay a few start-up steps, in the order fairseq's train.py runs them.
logging.basicConfig(
format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
datefmt="%Y-%m-%d %H:%M:%S",
level=os.environ.get("LOGLEVEL", "INFO").upper(),
stream=sys.stdout,
)
logger = logging.getLogger("fairseq_cli.train")
parser = options.get_training_parser()
# Parse the args that should exist in the model:
args = options.parse_args_and_arch(parser, modify_parser=None)
# --------
# Note: I checked that these args look the same as those in the 3.15 GB shared wav2vec 2.0 pre-trained model.
# --------
# At this point the args that should exist in our own pre-trained model are in the args variable.
# Now save them into the checkpoint we already created. I will patch checkpoint_best.pt, which I will use for fine-tuning.
# I use the steps shown by @mychiux413 to put the args into the model.
model_path = 'checkpoint_best.pt'
fixed_model = 'fixed_model.pt'
mymodel = torch.load(model_path, map_location=torch.device('cpu'))
mymodel['args'] = args
torch.save(mymodel, fixed_model)
After running the script above, we have a model that contains args, which prevents the earlier error: torch tried to read args from the model and got None. With args present, the error no longer occurs.
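A quick check that the patch took (a sketch):
check = torch.load(fixed_model, map_location=torch.device('cpu'))
print('args is None?', check['args'] is None)  # expect False now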
For the next step, the fine-tuning process, be careful with the apex installation method: do not install apex with a plain pip3 install .; use this command instead:
pip3 install -e "/content/apex" --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext"
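To confirm the build actually produced the CUDA extensions, a quick import test helps (a sketch; fused_layer_norm_cuda is one of the extension modules that --cuda_ext is expected to build):
import apex
import fused_layer_norm_cuda  # raises ImportError if the CUDA extensions did not build
print('apex CUDA extensions OK')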
This should have been fixed some time ago: the args saved in a fine-tuned checkpoint now get updated so they no longer point at the pre-trained wav2vec checkpoint. That might not be the case for checkpoints fine-tuned in the past, though.
I fine-tune the wav2vec small model (wav2vec_small.pt) with some data, and when I try to decode the fine-tuned checkpoint I get this error.
The path it complains about is the path that held wav2vec_small.pt during fine-tuning; it is the path I passed as --w2v-path, as if decoding were trying to read the weights from wav2vec_small.pt without considering the trained weights.
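One workaround people use for such older checkpoints (a sketch; depending on the fairseq version the override key may need to be nested, e.g. {'model': {'w2v_path': ...}}) is to override w2v_path at load time so it points at wherever wav2vec_small.pt currently lives:
from fairseq import checkpoint_utils
models, args, task = checkpoint_utils.load_model_ensemble_and_task(
    ['checkpoint_best.pt'],
    arg_overrides={'w2v_path': '/current/path/to/wav2vec_small.pt'},
)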
What have you tried?
1. Fine-tune wav2vec_small.pt using my data
2. Run the decoding command with checkpoint_best.pt
What's your environment?