LibreTranslate / Locomotive

Toolkit for training/converting LibreTranslate compatible language models 🚂
GNU Affero General Public License v3.0

Vocab file has invalid line with large datasets #10

Closed LynxPDA closed 9 months ago

LynxPDA commented 9 months ago

While training, I encountered an invalid-character error in the vocab file. This occurred when the corpus size exceeded roughly 40 million sentence pairs. I know there was already a similar topic in OpenNMT-py/pull/2041; however, I am still getting the error with OpenNMT-py versions v3.4.0-v3.4.3.
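For context on why such lines fail: SentencePiece writes its vocab as tab-separated `piece<TAB>log_prob` lines, and with very large corpora the learned pieces can contain unusual Unicode whitespace. Splitting on *any* whitespace (`split(None, 1)`) then swallows the piece and the two-value unpack fails. A minimal illustration (the vocab lines here are hypothetical):

```python
# Hypothetical SentencePiece vocab lines: "piece\tlog_prob".
good = "\u2581hello\t-8.25"     # ordinary piece (▁hello)
weird = "\u2028\t-12.5"         # piece that is itself a Unicode whitespace char

# Splitting on any whitespace treats the piece as part of the separator,
# so only one field survives and `w, c = ...` raises ValueError.
print(good.split(None, 1))      # ['▁hello', '-8.25']
print(weird.split(None, 1))     # ['-12.5']  <- the piece is gone
print(weird.split("\t", 1))     # ['\u2028', '-12.5'] - tab split preserves it
```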

To continue training, I used this dirty hack in onmt_tools.py:

import math

from onmt.constants import DefaultTokens


def sp_vocab_to_onmt_vocab(sp_vocab, onmt_vocab):
    print(f"Converting {sp_vocab}")
    with open(sp_vocab, 'r', encoding="utf-8") as fin:
        with open(onmt_vocab, 'wb') as fout:
            OMIT = (DefaultTokens.UNK, DefaultTokens.BOS, DefaultTokens.EOS)
            for line in fin:
                try:
                    w, c = line.rstrip("\n").split(None, 1)
                except Exception as e:
                    print("An error occurred:", e)
                    continue  # skip the bad line instead of reusing stale w/c
                if w in OMIT:
                    continue
                # SentencePiece stores log-probabilities; convert to the
                # pseudo-counts the OpenNMT vocab format expects
                c = math.exp(float(c)) * 1000000
                c = int(c) + 1
                fout.write(f'{w}\t{c}\n'.encode("utf-8"))
    print(f"Wrote {onmt_vocab}")
    print(f"Wrote {onmt_vocab}")

However, I understand that the try/except construct masks the problem rather than fixing it.
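A slightly stricter variant (a sketch, not the merged fix; `convert_vocab_lines` is a hypothetical helper) splits on the tab that SentencePiece actually emits and skips any line that still fails to parse, so a bad line cannot leak stale `w`/`c` values into the next iteration:

```python
import math


def convert_vocab_lines(lines, omit=("<unk>", "<s>", "</s>")):
    """Convert SentencePiece 'piece\\tlog_prob' lines to 'piece\\tcount' lines.

    Hypothetical helper: malformed lines are reported and skipped explicitly
    rather than silently falling through with values from the previous line.
    """
    out = []
    for lineno, line in enumerate(lines, 1):
        parts = line.rstrip("\n").split("\t", 1)  # SentencePiece uses a tab
        if len(parts) != 2:
            print(f"Skipping invalid vocab line {lineno}: {line!r}")
            continue
        w, c = parts
        if w in omit:
            continue
        # Log-probability -> pseudo-count, as in the snippet above
        count = int(math.exp(float(c)) * 1_000_000) + 1
        out.append(f"{w}\t{count}")
    return out
```

Splitting strictly on `"\t"` also keeps pieces that contain other whitespace characters intact, which is exactly the case that trips up `split(None, 1)` on very large vocabs.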

pierotofy commented 9 months ago

Thanks, I've merged a similar approach in https://github.com/LibreTranslate/Locomotive/commit/8f3e4a7e16dfc5808a910fafebd5d510dd73c404 :pray: