LibreTranslate / Locomotive

Toolkit for training/converting LibreTranslate compatible language models 🚂
GNU Affero General Public License v3.0

Vocab file has invalid line with large datasets #10

Closed LynxPDA closed 9 months ago

LynxPDA commented 9 months ago

While training, I encountered an invalid-character error in the vocab file. This occurred when the corpus size exceeded roughly 40 million sentence pairs. I know there was already a similar topic in OpenNMT-py/pull/2041; however, I am still getting the error with OpenNMT-py versions v3.4.0-v3.4.3.
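For context on why such lines fail: SentencePiece writes its vocab as tab-separated `piece<TAB>log_prob` lines, and with very large corpora the learned pieces can contain unusual Unicode whitespace. Splitting on *any* whitespace (`split(None, 1)`) then swallows the piece and the two-value unpack fails. A minimal illustration (the vocab lines here are hypothetical):

```python
# Hypothetical SentencePiece vocab lines: "piece\tlog_prob".
good = "\u2581hello\t-8.25"     # ordinary piece (▁hello)
weird = "\u2028\t-12.5"         # piece that is itself a Unicode whitespace char

# Splitting on any whitespace treats the piece as part of the separator,
# so only one field survives and `w, c = ...` raises ValueError.
print(good.split(None, 1))      # ['▁hello', '-8.25']
print(weird.split(None, 1))     # ['-12.5']  <- the piece is gone
print(weird.split("\t", 1))     # ['\u2028', '-12.5'] - tab split preserves it
```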

To continue training, I used this dirty hack in onmt_tools.py:

import math

from onmt.constants import DefaultTokens


def sp_vocab_to_onmt_vocab(sp_vocab, onmt_vocab):
    print(f"Converting {sp_vocab}")
    with open(sp_vocab, 'r', encoding="utf-8") as fin:
        with open(onmt_vocab, 'wb') as fout:
            OMIT = (DefaultTokens.UNK, DefaultTokens.BOS, DefaultTokens.EOS)
            for line in fin:
                try:
                    w, c = line.rstrip("\n").split(None, 1)
                except Exception as e:
                    print("An error occurred:", e)
                    continue  # skip the bad line instead of reusing stale w/c
                if w in OMIT:
                    continue
                # SentencePiece stores log-probabilities; convert to the
                # pseudo-counts the OpenNMT vocab format expects
                c = math.exp(float(c)) * 1000000
                c = int(c) + 1
                fout.write(f'{w}\t{c}\n'.encode("utf-8"))
    print(f"Wrote {onmt_vocab}")
    print(f"Wrote {onmt_vocab}")

However, I understand that the try/except construct masks the problem rather than fixing it.
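A slightly stricter variant (a sketch, not the merged fix; `convert_vocab_lines` is a hypothetical helper) splits on the tab that SentencePiece actually emits and skips any line that still fails to parse, so a bad line cannot leak stale `w`/`c` values into the next iteration:

```python
import math


def convert_vocab_lines(lines, omit=("<unk>", "<s>", "</s>")):
    """Convert SentencePiece 'piece\\tlog_prob' lines to 'piece\\tcount' lines.

    Hypothetical helper: malformed lines are reported and skipped explicitly
    rather than silently falling through with values from the previous line.
    """
    out = []
    for lineno, line in enumerate(lines, 1):
        parts = line.rstrip("\n").split("\t", 1)  # SentencePiece uses a tab
        if len(parts) != 2:
            print(f"Skipping invalid vocab line {lineno}: {line!r}")
            continue
        w, c = parts
        if w in omit:
            continue
        # Log-probability -> pseudo-count, as in the snippet above
        count = int(math.exp(float(c)) * 1_000_000) + 1
        out.append(f"{w}\t{count}")
    return out
```

Splitting strictly on `"\t"` also keeps pieces that contain other whitespace characters intact, which is exactly the case that trips up `split(None, 1)` on very large vocabs.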

pierotofy commented 9 months ago

Thanks, I've merged a similar approach in https://github.com/LibreTranslate/Locomotive/commit/8f3e4a7e16dfc5808a910fafebd5d510dd73c404 :pray: