facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Cannot successfully run the evaluation script of the speech_text_joint_to_text pre-training code #4462

Closed czy97 closed 2 years ago

czy97 commented 2 years ago

🐛 Bug

The evaluation code for Librispeech ASR pre-training in https://github.com/facebookresearch/fairseq/blob/main/examples/speech_text_joint_to_text/docs/pre-training.md does not seem to be well tested.

To Reproduce

Command:

```bash
python ./fairseq_cli/generate.py \
    $S2T_DATA_PATH \
    --task speech_text_joint_to_text \
    --max-tokens 800000 \
    --max-source-positions 800000 \
    --nbest 1 \
    --results-path $SAVE_PATH \
    --batch-size 512 \
    --path $FINAL_MODEL \
    --gen-subset $SUBSET \
    --config-yaml config.yaml \
    --scoring wer \
    --beam 10 --lenpen 1.0 \
    --user-dir examples/speech_text_joint_to_text
```

  1. The evaluation command for Librispeech ASR pre-training has an error. I think we should add `--user-dir` before `examples/speech_text_joint_to_text`.
  2. After fixing the above issue, I directly evaluated the fine-tuned model provided in https://github.com/facebookresearch/fairseq/blob/main/examples/speech_text_joint_to_text/docs/pre-training.md and got another error (see the checkpoint-inspection sketch after this list): `OSError: Model file not found: /fsx/yuntang/2021/joint_pretraining_ASR/pretrain03/checkpoints/expt10_960h.wd0.01.config.neuu.lr_0.001.elr_1e-06.mu800.0k.uf6.bs200.msp1024.mtp1024.mtt3072.mspch600.0k.mass750.0k.miss64.0k.mst750.0k.dsb3.mask0.3.mr0.1.ssmp0.3.sump0.7.mwd.noscale.gelu.default.all.nb.lpos.dp0.1.bart.ngpu16/checkpoint6.pt`
  3. Then I directly evaluated a fine-tuned model trained by myself and got the following error:

```
Traceback (most recent call last):
  File "./fairseq_cli/generate.py", line 417, in <module>
    cli_main()
  File "./fairseq_cli/generate.py", line 413, in cli_main
    main(args)
  File "./fairseq_cli/generate.py", line 48, in main
    return _main(cfg, h)
  File "./fairseq_cli/generate.py", line 201, in _main
    hypos = task.inference_step(
  File "/tmp/code/examples/speech_text_joint_to_text/tasks/speech_text_joint.py", line 216, in inference_step
    return generator.generate(
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/tmp/code/fairseq/sequence_generator.py", line 191, in generate
    return self._generate(sample, **kwargs)
  File "/tmp/code/fairseq/sequence_generator.py", line 266, in _generate
    encoder_outs = self.model.reorder_encoder_out(encoder_outs, new_order)
  File "/tmp/code/fairseq/sequence_generator.py", line 873, in reorder_encoder_out
    model.encoder.reorder_encoder_out(encoder_outs[i], new_order)
  File "/tmp/code/examples/speech_text_joint_to_text/models/s2t_dualinputtransformer.py", line 377, in reorder_encoder_out
    return self.spch_encoder.reorder_encoder_out(encoder_out, new_order)
  File "/tmp/code/fairseq/models/speech_to_text/s2t_wav_transformer.py", line 485, in reorder_encoder_out
    return self.speech_encoder.reorder_encoder_out(encoder_out, new_order)
  File "/tmp/code/fairseq/models/speech_to_text/s2t_wav_transformer.py", line 381, in reorder_encoder_out
    if len(encoder_out["encoder_out"]) == 0
TypeError: tuple indices must be integers or slices, not str
```
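Regarding error 2: the provided fine-tuned checkpoint presumably stores the author's cluster paths for the pretrained encoder/decoder in its saved config, which `generate.py` then tries to load. A minimal way to check what is baked into a checkpoint (a sketch assuming the usual fairseq layout with a `cfg` or `args` entry; `checkpoint_best.pt` is a placeholder, and the key names come from the `--model-overrides` suggested below):

```bash
python - <<'EOF'
import torch

# A fairseq checkpoint is a plain dict saved with torch.save; load on CPU.
ckpt = torch.load("checkpoint_best.pt", map_location="cpu")
cfg = ckpt.get("cfg") or ckpt.get("args")  # newer checkpoints use "cfg", older ones "args"
# Look for load_pretrained_speech_text_encoder / load_pretrained_speech_text_decoder
# entries pointing at unreachable /fsx/... paths.
print(cfg)
EOF
```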

Environment

yuntang commented 2 years ago

@czy97 Thanks for pointing this out.
One line of input arguments is missing in the readme. Please add `--user-dir examples/speech_text_joint_to_text --load-speech-only --model-overrides {'load_pretrained_speech_text_decoder':'','load_pretrained_speech_text_encoder':''}` to the inference command. I will update the readme soon.

czy97 commented 2 years ago

> @czy97 Thanks for pointing this out. One line of input arguments is missing in the readme. Please add `--user-dir examples/speech_text_joint_to_text --load-speech-only --model-overrides {'load_pretrained_speech_text_decoder':'','load_pretrained_speech_text_encoder':''}` to the inference command. I will update the readme soon.

This raises another error: `generate.py: error: unrecognized arguments: load_pretrained_speech_text_encoder:`.

yuntang commented 2 years ago

Can you show me the detailed log?

czy97 commented 2 years ago

> Can you show me the detailed log?

```
[--model-parallel-size MODEL_PARALLEL_SIZE] [--quantization-config-path QUANTIZATION_CONFIG_PATH] [--profile] [--reset-logging] [--suppress-crashes] [--use-plasma-view] [--plasma-path PLASMA_PATH] [--criterion {adaptive_loss,composite_loss,cross_entropy,ctc,fastspeech2,hubert,label_smoothed_cross_entropy,latency_augmented_label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,label_smoothed_cross_entropy_with_ctc,legacy_masked_lm_loss,masked_lm,model,nat_loss,sentence_prediction,sentence_prediction_adapters,sentence_ranking,tacotron2,speech_to_unit,speech_to_spectrogram,speech_unit_lm_criterion,wav2vec,vocab_parallel_cross_entropy,speech_text_pretrain_cross_entropy,speech_text_pretrain_compound,guided_label_smoothed_cross_entropy_with_accuracy}] [--tokenizer {moses,nltk,space}] [--bpe {byte_bpe,bytes,characters,fastbpe,gpt2,bert,hf_byte_bpe,sentencepiece,subword_nmt}] [--optimizer {adadelta,adafactor,adagrad,adam,adamax,composite,cpu_adam,lamb,nag,sgd}] [--lr-scheduler {cosine,fixed,inverse_sqrt,manual,pass_through,polynomial_decay,reduce_lr_on_plateau,step,tri_stage,triangular}] [--simul-type {hard_aligned,infinite_lookback,waitk,chunkwise,waitk_fixed_pre_decision,hard_aligned_fixed_pre_decision,infinite_lookback_fixed_pre_decision}] [--scoring {bert_score,sacrebleu,bleu,chrf,meteor,wer}] [--task TASK] [--num-workers NUM_WORKERS] [--skip-invalid-size-inputs-valid-test] [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE] [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE] [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE] [--dataset-impl {raw,lazy,cached,mmap,fasta,huffman}] [--data-buffer-size DATA_BUFFER_SIZE] [--train-subset TRAIN_SUBSET] [--valid-subset VALID_SUBSET] [--combine-valid-subsets] [--ignore-unused-valid-subsets] [--validate-interval VALIDATE_INTERVAL] [--validate-interval-updates VALIDATE_INTERVAL_UPDATES] [--validate-after-updates VALIDATE_AFTER_UPDATES] [--fixed-validation-seed FIXED_VALIDATION_SEED] [--disable-validation] [--max-tokens-valid MAX_TOKENS_VALID] [--batch-size-valid BATCH_SIZE_VALID] [--max-valid-steps MAX_VALID_STEPS] [--curriculum CURRICULUM] [--gen-subset GEN_SUBSET] [--num-shards NUM_SHARDS] [--shard-id SHARD_ID] [--grouped-shuffling] [--update-epoch-batch-itr UPDATE_EPOCH_BATCH_ITR] [--update-ordered-indices-seed] [--distributed-world-size DISTRIBUTED_WORLD_SIZE] [--distributed-num-procs DISTRIBUTED_NUM_PROCS] [--distributed-rank DISTRIBUTED_RANK] [--distributed-backend DISTRIBUTED_BACKEND] [--distributed-init-method DISTRIBUTED_INIT_METHOD] [--distributed-port DISTRIBUTED_PORT] [--device-id DEVICE_ID] [--distributed-no-spawn] [--ddp-backend {c10d,fully_sharded,legacy_ddp,no_c10d,pytorch_ddp,slowmo}] [--ddp-comm-hook {none,fp16}] [--bucket-cap-mb BUCKET [--zero-sharding {none,os}] [--no-reshard-after-forward] [--fp32-reduce-scatter] [--cpu-offload] [--use-sharded-state] [--not-fsdp-flatten-parameters] [--path PATH] [--post-process [POST_PROCESS]] [--quiet] [--model-overrides MODEL_OVERRIDES] [--results-path RESULTS_PATH] [--beam BEAM] [--nbest NBEST] [--max-len-a MAX_LEN_A] [--max-len-b MAX_LEN_B] [--min-len MIN_LEN] [--match-source-len] [--unnormalized] [--no-early-stop] [--no-beamable-mm] [--lenpen LENPEN] [--unkpen UNKPEN] [--replace-unk [REPLACE_UNK]] [--sacrebleu] [--score-reference] [--prefix-size PREFIX_SIZE] [--no-repeat-ngram-size NO_REPEAT_NGRAM_SIZE] [--sampling] [--sampling-topk SAMPLING_TOPK] [--sampling-topp SAMPLING_TOPP] [--constraints [{ordered,unordered}]] [--temperature TEMPERATURE]
[--diverse-beam-groups DIVERSE_BEAM_GROUPS] [--diverse-beam-strength DIVERSE_BEAM_STRENGTH] [--diversity-rate DIVERSITY_RATE] [--print-alignment [{hard,soft}]] [--print-step] [--lm-path LM_PATH] [--lm-weight LM_WEIGHT] [--iter-decode-eos-penalty ITER_DECODE_EOS_PENALTY] [--iter-decode-max-iter ITER_DECODE_MAX_ITER] [--iter-decode-force-max-iter] [--iter-decode-with-beam ITER_DECODE_WITH_BEAM] [--iter-decode-with-external-reranker] [--retain-iter-history] [--retain-dropout] [--retain-dropout-modules RETAIN_DROPOUT_MODULES] [--decoding-format {unigram,ensemble,vote,dp,bs}] [--no-seed-provided] [--eos-token EOS_TOKEN] [--save-dir SAVE_DIR] [--restore-file RESTORE_FILE] [--continue-once CONTINUE_ONCE] [--finetune-from-model FINETUNE_FROM_MODEL] [--reset-dataloader] [--reset-lr-scheduler] [--reset-meters] [--reset-optimizer] [--optimizer-overrides OPTIMIZER_OVERRIDES] [--save-interval SAVE_INTERVAL] [--save-interval-updates SAVE_INTERVAL_UPDATES] [--keep-interval-updates KEEP_INTERVAL_UPDATES] [--keep-interval-updates-pattern KEEP_INTERVAL_UPDATES_PATTERN] [--keep-last-epochs KEEP_LAST_EPOCHS] [--keep-best-checkpoints KEEP_BEST_CHECKPOINTS] [--no-save] [--no-epoch-checkpoints] [--no-last-checkpoints] [--no-save-optimizer-state] [--best-checkpoint-metric BEST_CHECKPOINT_METRIC] [--maximize-best-checkpoint-metric] [--patience PATIENCE] [--checkpoint-suffix CHECKPOINT_SUFFIX] [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT] [--load-checkpoint-on-all-dp-ranks] [--write-checkpoints-asynchronously] [--arch ARCH] [--extractor-mode {default,layer_norm}] [--encoder-layers ENCODER_LAYERS] [--encoder-embed-dim ENCODER_EMBED_DIM] [--encoder-ffn-embed-dim ENCODER_FFN_EMBED_DIM] [--encoder-attention-heads ENCODER_ATTENTION_HEADS] [--activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}] [--layer-type {transformer,conformer}] [--dropout DROPOUT] [--attention-dropout ATTENTION_DROPOUT] [--activation-dropout ACTIVATION_DROPOUT] [--encoder-layerdrop ENCODER_LAYERDROP] [--dropout-input DROPOUT_INPUT] [--dropout-features DROPOUT_FEATURES] [--final-dim FINAL_DIM] [--layer-norm-first] [--conv-feature-layers CONV_FEATURE_LAYERS] [--conv-bias] [--logit-temp LOGIT_TEMP] [--quantize-targets] [--quantize-input] [--same-quantizer] [--target-glu] [--feature-grad-mult FEATURE_GRAD_MULT] [--quantizer-depth QUANTIZER_DEPTH] [--quantizer-factor QUANTIZER_FACTOR] [--latent-vars LATENT_VARS] [--latent-groups LATENT_GROUPS] [--latent-dim LATENT_DIM] [--mask-length MASK_LENGTH] [--mask-prob MASK_PROB] [--mask-selection {static,uniform,normal,poisson}] [--mask-other MASK_OTHER] [--no-mask-overlap] [--mask-min-space MASK_MIN_SPACE] [--require-same-masks] [--mask-dropout MASK_DROPOUT] [--mask-channel-length MASK_CHANNEL_LENGTH] [--mask-channel-prob MASK_CHANNEL_PROB] [--mask-channel-before] [--mask-channel-selection {static,uniform,normal,poisson}] [--mask-channel-other MASK_CHANNEL_OTHER] [--no-mask-channel-overlap] [--mask-channel-min-space MASK_CHANNEL_MIN_SPACE] [--num-negatives NUM_NEGATIVES] [--negatives-from-everywhere] [--cross-sample-negatives CROSS_SAMPLE_NEGATIVES] [--codebook-negatives CODEBOOK_NEGATIVES] [--conv-pos CONV_POS] [--conv-pos-groups CONV_POS_GROUPS] [--pos-conv-depth POS_CONV_DEPTH] [--latent-temp LATENT_TEMP] [--max-positions MAX_POSITIONS] [--checkpoint-activations] [--crop-seq-to-multiple CROP_SEQ_TO_MULTIPLE] [--depthwise-conv-kernel-size DEPTHWISE_CONV_KERNEL_SIZE] [--attn-type ATTN_TYPE] [--pos-enc-type POS_ENC_TYPE] [--config-yaml CONFIG_YAML] [--max-source-positions N] [--max-target-positions N]
[--parallel-text-data PARALLEL_TEXT_DATA] [--max-tokens-text N] [--max-positions-text N] [--langpairs S] [--speech-sample-ratio N] [--text-sample-ratio N] [--update-mix-data] [--load-speech-only] [--mask-text-ratio V] [--mask-text-type {random,tail}] [--noise-token NOISE_TOKEN] [--infer-target-lang S] [--force-anneal FORCE_ANNEAL] [--lr-shrink LR_SHRINK] [--warmup-updates WARMUP_UPDATES] [--wer-tokenizer {none,zh,13a,char,intl,ja-mecab}] [--wer-remove-punct] [--wer-char-level] [--wer-lowercase] data
generate.py: error: unrecognized arguments: load_pretrained_speech_text_encoder:
```

yuntang commented 2 years ago

`{'load_pretrained_speech_text_decoder':'','load_pretrained_speech_text_encoder':''}` is the input for `--model-overrides`, so you need to wrap it in quotation marks.
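For context on the error above: without quotes, a bash-like shell applies brace expansion to the `{...,...}` token and splits it on the comma before `generate.py` ever sees it. A plain `echo` illustrates this (shell behavior only, nothing fairseq-specific):

```bash
# Unquoted: brace expansion yields two separate words, and quote removal then
# strips the inner single quotes -- matching the "unrecognized arguments" error.
echo --model-overrides {'load_pretrained_speech_text_decoder':'','load_pretrained_speech_text_encoder':''}
# prints: --model-overrides load_pretrained_speech_text_decoder: load_pretrained_speech_text_encoder:

# Quoted: the dict survives as a single argument.
echo --model-overrides "{'load_pretrained_speech_text_decoder':'','load_pretrained_speech_text_encoder':''}"
# prints: --model-overrides {'load_pretrained_speech_text_decoder':'','load_pretrained_speech_text_encoder':''}
```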

czy97 commented 2 years ago

> `{'load_pretrained_speech_text_decoder':'','load_pretrained_speech_text_encoder':''}` is the input for `--model-overrides`, so you need to wrap it in quotation marks.

It works. Thank you very much for your quick reply.
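For reference, the full command that works here is the reproduce command above plus the extra arguments, with the overrides dict quoted (a sketch; `$S2T_DATA_PATH`, `$SAVE_PATH`, `$FINAL_MODEL`, and `$SUBSET` are placeholders for your own paths):

```bash
python ./fairseq_cli/generate.py \
    $S2T_DATA_PATH \
    --task speech_text_joint_to_text \
    --max-tokens 800000 \
    --max-source-positions 800000 \
    --nbest 1 \
    --results-path $SAVE_PATH \
    --batch-size 512 \
    --path $FINAL_MODEL \
    --gen-subset $SUBSET \
    --config-yaml config.yaml \
    --scoring wer \
    --beam 10 --lenpen 1.0 \
    --user-dir examples/speech_text_joint_to_text \
    --load-speech-only \
    --model-overrides "{'load_pretrained_speech_text_decoder':'','load_pretrained_speech_text_encoder':''}"
```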

czy97 commented 2 years ago

@yuntang Hi Yun, sorry to bother you again. Could you provide the training hyper-parameters used when Librispeech 960h serves as the unlabeled data?

sathyagorla commented 2 years ago

Hi @czy97 @yuntang, can you provide an inference script or command? I want to test my trained model (checkpoint_best.pt) on some audio files. Can you help me with the testing?