facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Inferencing wav2vec2.0 finetuned model doesn't utilize gpu #2534

Open deepspiking opened 4 years ago

deepspiking commented 4 years ago

🐛 Bug

I made my own wav2vec2.0 model and pretrained it with some data. Model training was successful, and inference works with the command given in the wav2vec2.0 README.md. The problem is that it is extremely slow: it took more than 2 hours to process about 6,000 utterances. I think it doesn't utilize the GPU at all.

So I added the --cpu option to compare against running without it, and it turned out to take almost the same time.

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

  1. Run cmd '....' (screenshot of the command attached)

  2. See error: even without the --cpu option, it just consumes GPU memory but doesn't utilize its computing power at all. (screenshot attached)

Environment

alexeib commented 4 years ago

I can't reproduce this. I ran infer.py (on GPU 1) and this is what I see: (screenshot attached)

Are you sure you have the GPU build of PyTorch installed? Can PyTorch see your GPUs?
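
(For reference, a minimal check along these lines can be run in a Python shell; device index 0 below is just an example.)

import torch

print(torch.__version__)              # should be a CUDA build, not a CPU-only build
print(torch.cuda.is_available())      # should be True
print(torch.cuda.device_count())      # number of GPUs visible to PyTorch
print(torch.cuda.get_device_name(0))  # name of the first GPU (index 0 as an example)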

deepspiking commented 3 years ago

Sorry for the late response! For sure, I can see my GPUs in Python with torch, as shown here: (screenshot attached)

Also, you can see the GPU memory allocated by the inference process in the second screenshot I attached above. Still, same problem. (screenshot attached)

The pretraining & finetuning went perfectly, so I'm frustrated that I can't run inference on the GPU. Could you let me know what information I should provide to help tackle this problem? Thanks :)

alexeib commented 3 years ago

Can you modify infer.py and print out some diagnostics? In particular, verify that

use_cuda = torch.cuda.is_available() and not args.cpu

is True

and make sure that the tensors inside the sample after this line:

sample = utils.move_to_cuda(sample) if use_cuda else sample

are in fact on GPU devices.

deepspiking commented 3 years ago

Thank you for reopening the issue! I added these:

print("="*60)
print("use_cuda is " + str(use_cuda))
print("model location is " + str(next(models[0].parameters()).device))
print("sample['id'] location is " + str(sample['id'].device))
print("sample['net_input']['source'] location is " + str(sample['net_input']['source'].device))
print("sample['net_input']['padding_mask'] location is " + str(sample['net_input']['padding_mask'].device))
print("sample['target_lengths'] location is " + str(sample['target_lengths'].device))
print("sample['target'] location is " + str(sample['target'].device))
print("="*60)

just before the inference call:

hypos = task.inference_step(generator, models, sample, prefix_tokens)

I got the following results, which are what I expected.

============================================================
use_cuda is True
model location is cuda:0
sample['id'] location is cuda:0
sample['net_input']['source'] location is cuda:0
sample['net_input']['padding_mask'] location is cuda:0
sample['target_lengths'] location is cuda:0
sample['target'] location is cuda:0
============================================================

And I attached more parameter information below:

INFO:main:Namespace(all_gather_list_size=16384, beam=5, beam_size_token=100, beam_threshold=25.0, bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_suffix='', cpu=False, criterion='ctc', data='./manifest_out_ft_100k', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', decoding_format=None, device_id=0, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=1, distributed_wrapper='DDP', diverse_beam_groups=-1, diverse_beam_strength=0.5, diversity_rate=-1.0, dump_emissions=None, dump_features=None, empty_cache_freq=0, enable_padding=False, fast_stat_sync=False, find_unused_parameters=False, fix_batches_to_gpus=False, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, gen_subset='valid', iter_decode_eos_penalty=0.0, iter_decode_force_max_iter=False, iter_decode_max_iter=10, iter_decode_with_beam=1, iter_decode_with_external_reranker=False, kenlm_model=None, kspmodel=None, labels='ltr', lenpen=1, lexicon=None, lm_weight=0.2, load_emissions=None, localsgd_frequency=3, log_format=None, log_interval=100, lr_scheduler='fixed', lr_shrink=0.1, match_source_len=False, max_len_a=0, max_len_b=200, max_sample_size=None, max_sentences=None, max_tokens=4000000, memory_efficient_bf16=False, memory_efficient_fp16=False, min_len=1, min_loss_scale=0.0001, min_sample_size=None, model_overrides='{}', model_parallel_size=1, nbest=1, no_beamable_mm=False, no_early_stop=False, no_progress_bar=False, no_repeat_ngram_size=0, no_seed_provided=True, normalize=False, nprocs_per_node=8, num_shards=1, num_workers=10, optimizer=None, path='./model_base_ft_100k/checkpoint_best.pt', prefix_size=0, print_alignment=False, print_step=False, profile=False, quantization_config_path=None, quiet=False, remove_bpe='letter', replace_unk=None, required_batch_size_multiple=8, results_path='./manifest_out_ft_100k/res2', retain_dropout=False, retain_dropout_modules=None, retain_iter_history=False, rnnt_decoding_type='greedy', rnnt_len_penalty=-0.5, sacrebleu=False, sample_rate=16000, sampling=False, sampling_topk=-1, sampling_topp=-1.0, score_reference=False, seed=1, shard_id=0, sil_weight=0.0, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, task='audio_pretraining', temperature=1.0, tensorboard_logdir='', threshold_loss_scale=None, tokenizer=None, tpu=False, unit_lm=False, unk_weight=-inf, unkpen=0, unnormalized=False, user_dir=None, w2l_decoder='viterbi', warmup_updates=0, wer_args=None, wfstlm=None, word_score=1.0, zero_infinity=False)

jumon commented 3 years ago

I am having a similar problem to @deepspiking's. When I ran infer.py with LibriSpeech dev-other as below, it finished in about a minute.

python ~/fairseq/examples/speech_recognition/infer.py ./manifest_libri --task audio_pretraining --w2l-decoder viterbi \
--nbest 1 --path ./model/wav2vec_small_100h.pt --gen-subset valid --results-path ./results \
--word-score -1 --sil-weight 0 --criterion ctc --labels ltr --max-tokens 4000000 --post-process letter

However, when I ran infer.py with my own data and fine-tuned model, it was extremely slow. It consumed GPU memory, but didn't seem to utilize the GPU. I am not sure whether it is related, but the data I'm using is NOT English and its vocabulary size is about 3000.

python ~/fairseq/examples/speech_recognition/infer.py ./manifest_myown --task audio_pretraining --w2l-decoder viterbi \
--nbest 1 --path ./model/checkpoint_best.pt --gen-subset valid --results-path ./results \
--word-score -1 --sil-weight 0 --criterion ctc --labels ltr --max-tokens 4000000 --post-process letter

alumae commented 3 years ago

@jumon, what is the size of your output layer (i.e., how many letters do you have in dict.ltr.txt)? I have experienced that Viterbi decoding becomes very slow when the number of letters is large (e.g., around 3000 for a language like Cantonese), while it is fast when the number of letters is small (hundreds). Decoding with KenLM makes it faster.
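
(For illustration, a KenLM-based run would swap the decoder and add LM-related flags roughly as sketched below; the LM and lexicon paths and the weights are placeholders, not values from this thread, and flag behavior may differ across fairseq versions.)

python ~/fairseq/examples/speech_recognition/infer.py ./manifest_myown --task audio_pretraining --w2l-decoder kenlm \
--kenlm-model /path/to/lm.bin --lexicon /path/to/lexicon.lst --lm-weight 2 --word-score -1 --sil-weight 0 \
--nbest 1 --path ./model/checkpoint_best.pt --gen-subset valid --results-path ./results \
--criterion ctc --labels ltr --max-tokens 4000000 --post-process letter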

jumon commented 3 years ago

Thanks, @alumae. The size of my output layer is around 3,000. Now I am not sure what Viterbi decoding means in this context. Is it different from the greedy CTC decoding algorithm in which we just take the argmax at each time step and remove repeated labels and blank tokens?
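
(For reference, the greedy/argmax procedure described above looks roughly like the sketch below. This is a generic illustration, not fairseq's decoder implementation, and blank_id=0 is an assumption.)

import torch

def greedy_ctc_decode(emissions: torch.Tensor, blank_id: int = 0) -> list:
    """Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks.

    emissions: (T, V) tensor of per-frame label scores (T frames, V labels).
    blank_id: index of the CTC blank token (0 is an assumption here).
    """
    best_path = emissions.argmax(dim=-1).tolist()  # argmax at every time step
    decoded = []
    prev = None
    for label in best_path:
        if label != prev and label != blank_id:    # collapse repeats, skip blanks
            decoded.append(label)
        prev = label
    return decoded

# Toy usage: 5 frames, 4 labels (index 0 treated as the blank token)
emissions = torch.randn(5, 4)
print(greedy_ctc_decode(emissions))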