"FileNotFoundError: KenLM binary file not found at : None" thrown when decoding without N-gram LM

aklemen commented 6 months ago

Describe the bug

I am trying to use an external LLM to rescore the beam search results from a Conformer-CTC model.
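For context, the rescoring step could look roughly like the sketch below. It is only an illustration under assumptions: `rescore_nbest`, `lm_score`, and `lm_weight` are hypothetical names (not NeMo APIs), and each beam is assumed to be a `(text, score)` pair where a higher score is better.

```python
from typing import Callable, List, Tuple


def rescore_nbest(
    beams: List[Tuple[str, float]],
    lm_score: Callable[[str], float],
    lm_weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Re-rank (text, beam_score) pairs by beam_score + lm_weight * lm_score(text).

    `lm_score` is a placeholder for whatever external LM returns a
    log-probability-like score for a candidate transcript.
    """
    rescored = [(text, score + lm_weight * lm_score(text)) for text, score in beams]
    # Best (highest combined score) candidate first.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```

This only needs the N-best list from beam search, which is why decoding without an N-gram LM is enough for my use case.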

When I run eval_beamsearch_ngram_ctc.py to get the beam search results without passing an N-gram LM, I get the following error:

Traceback (most recent call last):
  File "/content/NeMo/scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram_ctc.py", line 415, in main
    candidate_wer, candidate_cer = beam_search_eval(
  File "/content/NeMo/scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram_ctc.py", line 196, in beam_search_eval
    _, beams_batch = decoding.ctc_decoder_predictions_tensor(
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/parts/submodules/ctc_decoding.py", line 319, in ctc_decoder_predictions_tensor
    hypotheses_list = self.decoding(
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/parts/submodules/ctc_beam_decoding.py", line 166, in __call__
    return self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nemo/core/classes/common.py", line 1098, in __call__
    outputs = wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/parts/submodules/ctc_beam_decoding.py", line 280, in forward
    hypotheses = self.search_algorithm(prediction_tensor, out_len)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/parts/submodules/ctc_beam_decoding.py", line 314, in default_beam_search
    raise FileNotFoundError(
FileNotFoundError: KenLM binary file not found at : None. Please set a valid path in the decoding config.

Steps/Code to reproduce bug

Install the beam search decoders:

NEMO_PATH=<insert absolute path to NeMo directory>
cd $NEMO_PATH && bash scripts/asr_language_modeling/ngram_lm/install_beamsearch_decoders.sh $NEMO_PATH

Run the beam search with the following config:

python3 $NEMO_PATH/scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram_ctc.py \
nemo_model_file="<nemo CTC ASR model, e.g. stt_en_conformer_ctc_medium.nemo>" \
input_manifest="<manifest json file>" \
preds_output_folder="<output directory>" \
decoding_mode=beamsearch \
decoding_strategy="beam"

Expected behavior

I would not expect this error to be thrown, because BeamSearchDecoderWithLM already handles the case where no N-gram LM path is passed:

        # from  nemo/collections/asr/modules/beam_search_decoder.py
        if lm_path is not None:
            self.scorer = Scorer(alpha, beta, model_path=lm_path, vocabulary=vocab)
        else:
            self.scorer = None

When I removed the following KenLM file path check from nemo/collections/asr/parts/submodules/ctc_beam_decoding.py, decoding worked:

            # Check for filepath
            if self.kenlm_path is None or not os.path.exists(self.kenlm_path):
                raise FileNotFoundError(
                    f"KenLM binary file not found at : {self.kenlm_path}. "
                    f"Please set a valid path in the decoding config."
                )
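A less invasive alternative to deleting the check might be to raise only when a path is given but does not exist. The following is a minimal sketch of that idea, not the actual NeMo fix: `resolve_kenlm_path` is a hypothetical helper, and treating both `None` and an empty string as "no LM requested" is an assumption.

```python
import os
from typing import Optional


def resolve_kenlm_path(kenlm_path: Optional[str]) -> Optional[str]:
    """Return a usable KenLM path, or None to decode without an N-gram LM.

    Hypothetical helper for illustration; not part of NeMo.
    """
    # Assumption: None or "" means the user did not request an LM.
    if not kenlm_path:
        return None
    # A path was given explicitly, so a missing file is a real error.
    if not os.path.exists(kenlm_path):
        raise FileNotFoundError(
            f"KenLM binary file not found at : {kenlm_path}. "
            f"Please set a valid path in the decoding config."
        )
    return kenlm_path
```

With a check like this, the returned value could be passed on as `lm_path`, so the decoder would fall back to `self.scorer = None` when no LM is configured.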

Environment overview

Environment location: Google Colab
Method of NeMo install: python -m pip install git+https://github.com/NVIDIA/NeMo.git@v1.23.0#egg=nemo_toolkit[all]

Environment details

OS version: Ubuntu 22.04.4 LTS
PyTorch version: 2.2.1+cu121
Python version: 3.10

Additional context

GPU: T4

nithinraok commented 6 months ago

Update: We observed that a couple of code changes are required in this script due to recent updates made during the model and transcription refactoring. @karpov-nick is working on a fix.

karpnv commented 6 months ago

There is work in progress in PR https://github.com/NVIDIA/NeMo/pull/8428

aklemen commented 6 months ago

Thank you both!

karpnv commented 5 months ago

You can try decoding without an N-gram LM on the branch karpnv/beamsearch with the following parameters:

python3 ./scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram_ctc.py \
model_path=./am_model.nemo  \
dataset_manifest=./manifest.json  \
preds_output_folder=/tmp   \
ctc_decoding.strategy=flashlight \
ctc_decoding.beam.nemo_kenlm_path="" \
ctc_decoding.beam.beam_size=[4]   \
ctc_decoding.beam.beam_beta=[0.5]

github-actions[bot] commented 4 months ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 4 months ago

This issue was closed because it has been inactive for 7 days since being marked as stale.

NVIDIA / NeMo

"FileNotFoundError: KenLM binary file not found at : None" thrown when decoding without N-gram LM #9067