Open macarious opened 11 months ago
If I create a `unigrams.txt` file which contains a list of unigrams and pass it as a parameter to `build_ctcdecoder` without a `.arpa` or `.bin` file, the word-error-rate improves slightly. However, in this scenario, the probabilities generated by the kenlm toolkit are never taken into account.
Here are the steps:
1. Create a `unigrams` file from the `.arpa` file (snippet from Colab Notebook):
    # Unigram lines normally have 2 tab-separated fields in a 1-gram model
    # (log-prob, word) and 3 in higher-order models (extra back-off weight).
    line_split_parts = 2
    if ngram_order > 1:
        line_split_parts = 3  # was assigned to an unused name, line_split

    with open(f"{repo_path_local}/{ngram_order}gram.arpa", "r") as read_file, open(f"{repo_path_local}/unigrams.txt", "w") as write_file:
        start_1_gram = False
        for line in read_file:
            line = line.strip()
            if line == "\\1-grams:":
                start_1_gram = True
            elif line == "\\2-grams:":
                break
            if start_1_gram and len(line) > 0:
                parts = line.split("\t")
                if len(parts) == line_split_parts:
                    # parts[1] is the word itself
                    write_file.write(f"{parts[1]}\n")
2. Call `build_ctcdecoder` with the following parameters (snippet from Colab Notebook):
    from pyctcdecode import build_ctcdecoder

    # Load the unigram list written in the previous step
    unigrams = set()
    with open(f"/content/{lm_local_path}/unigrams.txt", "r") as f:
        for line in f:
            line = line.strip()
            unigrams.add(line)

    # Implement language model in the decoder
    decoder = build_ctcdecoder(
        labels=list(sorted_vocab_dict.keys()),
        kenlm_model_path=lm_model_path if ngram_order > 1 else None,
        unigrams=unigrams
    )
With the unigrams passed into the decoder, the word-error-rate has improved slightly from 0.845 to 0.843.
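The unigram extraction above relies on an exact field count per line. A slightly more defensive variant, sketched below, takes the second tab-separated field of every line in the `\1-grams:` section regardless of whether a back-off weight follows. The function name `extract_unigrams` and the sample ARPA lines are made up for illustration and are not part of the notebook.

```python
def extract_unigrams(arpa_lines):
    """Collect the word column from the 1-grams section of an ARPA file."""
    unigrams = []
    in_unigrams = False
    for raw in arpa_lines:
        line = raw.strip()
        if line == "\\1-grams:":
            in_unigrams = True
            continue
        # Any other section marker (e.g. \2-grams: or \end\) ends the section
        if in_unigrams and line.startswith("\\"):
            break
        if in_unigrams and line:
            parts = line.split("\t")
            # Fields are: log-prob, word, and optionally a back-off weight
            if len(parts) >= 2:
                unigrams.append(parts[1])
    return unigrams


# Hypothetical sample input, not taken from the Europarl model
SAMPLE_ARPA = [
    "\\data\\",
    "ngram 1=4",
    "ngram 2=1",
    "",
    "\\1-grams:",
    "-1.0\t<unk>\t0",
    "0\t<s>\t-0.3",
    "-0.8\thello\t-0.2",
    "-0.9\tworld",
    "",
    "\\2-grams:",
    "-0.1\thello world",
    "",
    "\\end\\",
]

print(extract_unigrams(SAMPLE_ARPA))
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly instead of the sample list.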
The kenLM toolkit was able to train a unigram model using the Europarl dataset. However, there are currently two limitations with a unigram model (as opposed to n-gram models where n > 1):
1. The `.arpa` file cannot be converted to a `.bin` file.
2. The `.arpa` file cannot be used with the `ctcdecoder`. Specifically, an error occurs when a unigram model created by the kenLM toolkit is passed as a parameter to `build_ctcdecoder` from the `pyctcdecode` library.

The same runtime error occurs for both issues:
This ngram implementation assumes at least a bigram model.
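
Since the error appears whenever a unigram-only model is loaded, one possible guard is to read the model order from the ARPA header and only pass the file as `kenlm_model_path` when the order is above 1, mirroring the `ngram_order > 1` check in the decoder snippet. This is a minimal sketch; `arpa_order` and `kenlm_path_or_none` are hypothetical helper names, not part of pyctcdecode or kenLM.

```python
import re


def arpa_order(arpa_lines):
    """Return the highest n found in the header's 'ngram n=count' lines."""
    order = 0
    in_header = False
    for raw in arpa_lines:
        line = raw.strip()
        if line == "\\data\\":
            in_header = True
            continue
        if in_header:
            match = re.fullmatch(r"ngram (\d+)=(\d+)", line)
            if match:
                order = max(order, int(match.group(1)))
            elif line:
                # First non-empty line that is not a count ends the header
                break
    return order


def kenlm_path_or_none(arpa_lines, lm_model_path):
    """Return the model path only for models of order 2 or higher."""
    return lm_model_path if arpa_order(arpa_lines) > 1 else None


# Hypothetical headers for illustration
UNIGRAM_HEADER = ["\\data\\", "ngram 1=3", "", "\\1-grams:"]
BIGRAM_HEADER = ["\\data\\", "ngram 1=3", "ngram 2=5", "", "\\1-grams:"]

print(arpa_order(UNIGRAM_HEADER), arpa_order(BIGRAM_HEADER))
```

With this check, a unigram file falls back to the unigrams-only decoder instead of raising the runtime error, although its probabilities are still not used.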
The unigram file (in `.arpa` format) is saved here on Hugging Face: https://huggingface.co/macarious/europarl_bilingual_kenlm_1-gram/tree/main

The following is the complete traceback from the second limitation: