SlangLab-NU / torgo_inference


Implement unigram model with ctcdecoder #23

macarious opened this issue 11 months ago (status: Open)

macarious commented 11 months ago

The kenLM toolkit was able to train a unigram model using the Europarl dataset. However, there are currently two limitations with a unigram model (as opposed to n-gram models where n > 1):

  1. The unigram model cannot be converted from a .arpa file to a .bin file.
  2. The unigram .arpa file cannot be used with the ctcdecoder. Specifically, an error occurs when a unigram model created by the kenLM toolkit is passed as a parameter to build_ctcdecoder from the pyctcdecode library.

The same runtime error occurs for both issues: "This ngram implementation assumes at least a bigram model."
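For reference, the conversion in the first limitation is kenLM's standard build_binary step; a minimal sketch of invoking it from Python (file names are illustrative):

import subprocess

# kenLM's build_binary converts an ARPA model to its binary format;
# with a 1-gram .arpa it aborts with the same "at least a bigram" error.
subprocess.run(["build_binary", "1gram.arpa", "1gram.bin"], check=True)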

The unigram file (in .arpa format) is saved here on Hugging Face: https://huggingface.co/macarious/europarl_bilingual_kenlm_1-gram/tree/main

The following is the complete traceback from the second limitation:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
kenlm.pyx in kenlm.Model.__init__()

RuntimeError: lm/model.cc:100 in void lm::ngram::detail::GenericModel<Search, VocabularyT>::InitializeFromARPA(int, const char*, const lm::ngram::Config&) [with Search = lm::ngram::detail::HashedSearch<lm::ngram::BackoffValue>; VocabularyT = lm::ngram::ProbingVocabulary] threw FormatLoadException.
This ngram implementation assumes at least a bigram model. Byte: 23

The above exception was the direct cause of the following exception:

OSError                                   Traceback (most recent call last)
<ipython-input-19-e0067b9b9f0d> in <cell line: 1>()
----> 1 wer_score_lm, predictions_lm, references_lm = evaluateModel(processor, model, torgo_test_set, f"/content/{lm_local_path}/{kenlm_model}")
      2 
      3 print(f"WER (trigram): {wer_score_lm}")

1 frames
/usr/local/lib/python3.10/dist-packages/pyctcdecode/decoder.py in build_ctcdecoder(labels, kenlm_model_path, unigrams, alpha, beta, unk_score_offset, lm_score_boundary)
    905         instance of BeamSearchDecoderCTC
    906     """
--> 907     kenlm_model = None if kenlm_model_path is None else kenlm.Model(kenlm_model_path)
    908     if kenlm_model_path is not None and kenlm_model_path.endswith(".arpa"):
    909         logger.info("Using arpa instead of binary LM file, decoder instantiation might be slow.")

kenlm.pyx in kenlm.Model.__init__()

OSError: Cannot read model '/content/kenlm_model_1gram/1gram.arpa' (lm/model.cc:100 in void lm::ngram::detail::GenericModel<Search, VocabularyT>::InitializeFromARPA(int, const char*, const lm::ngram::Config&) [with Search = lm::ngram::detail::HashedSearch<lm::ngram::BackoffValue>; VocabularyT = lm::ngram::ProbingVocabulary] threw FormatLoadException. This ngram implementation assumes at least a bigram model. Byte: 23)
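For completeness, the failure can be reproduced with the kenlm Python bindings alone, independent of pyctcdecode (the path matches the notebook above):

import kenlm

# Raises the FormatLoadException shown in the traceback: kenLM's query
# code assumes a model order of at least 2.
model = kenlm.Model("/content/kenlm_model_1gram/1gram.arpa")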
macarious commented 11 months ago

Experiment 1:

If I create a unigrams.txt file containing a list of unigrams and pass its contents to build_ctcdecoder through the unigrams parameter, without any .arpa or .bin file, the word error rate improves slightly. However, in this scenario, the probabilities estimated by the kenLM toolkit are never taken into account.

Here are the steps:

1. Creating a unigrams.txt file from the .arpa file

snippet from Colab Notebook:

# In a unigram-only model, 1-gram lines have 2 tab-separated fields
# (log-prob, word); in higher-order models they carry a backoff weight
# as a third field.
line_split_parts = 2
if ngram_order > 1:
  line_split_parts = 3

with open(f"{repo_path_local}/{ngram_order}gram.arpa", "r") as read_file, open(f"{repo_path_local}/unigrams.txt", "w") as write_file:
  start_1_gram = False
  for line in read_file:
    line = line.strip()
    if line == "\\1-grams:":
      start_1_gram = True
    elif line == "\\2-grams:":
      break
    if start_1_gram and len(line) > 0:
      parts = line.split("\t")
      if len(parts) == line_split_parts:
        write_file.write(f"{parts[1]}\n")  # the word is the second field
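
For context, the \1-grams: section looks like the following in each case (values are illustrative, not taken from the Europarl model). In a unigram-only model each line has 2 tab-separated fields:

\1-grams:
-2.3487	europe
-3.1416	parliament

whereas in a higher-order model each 1-gram line carries a trailing backoff weight as a third field:

\1-grams:
-2.3487	europe	-0.3010

This is why line_split_parts is 2 for a unigram model and 3 otherwise.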

2. Building a decoder with build_ctcdecoder and the following parameters

snippet from Colab Notebook:

    unigrams = set()

    with open(f"/content/{lm_local_path}/unigrams.txt", "r") as f:
      for line in f:
        line = line.strip()
        unigrams.add(line)

    # Implement language model in the decoder
    decoder = build_ctcdecoder(
        labels=list(sorted_vocab_dict.keys()),
        kenlm_model_path=lm_model_path if ngram_order > 1 else None,
        unigrams=unigrams
    )
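
Once built, the decoder is applied to the acoustic model's frame-level output as usual. A minimal sketch, assuming logits is a hypothetical (time_steps, vocab_size) numpy array of per-frame scores from the Wav2Vec2 model (random values here purely for illustration):

    import numpy as np

    # logits: hypothetical acoustic-model output, shape (time_steps, vocab_size)
    logits = np.random.rand(100, len(sorted_vocab_dict)).astype(np.float32)
    transcription = decoder.decode(logits)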

Results

With the unigrams passed into the decoder, the word error rate improved slightly, from 0.845 to 0.843.
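
For anyone reproducing this, the reported word error rate can be recomputed from the returned predictions and references; a minimal sketch using the jiwer package (evaluateModel above presumably does something equivalent internally):

import jiwer

# predictions_lm / references_lm are the lists returned by evaluateModel
wer_score = jiwer.wer(references_lm, predictions_lm)
print(f"WER: {wer_score:.3f}")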