During training, I ran into an error that looks like an invalid character in the vocab.
It appeared once the corpus grew beyond approximately 40 million sentence pairs.
I know a similar issue was already discussed in OpenNMT-py/pull/2041.
However, I still get the error with OpenNMT-py versions v3.4.0 through v3.4.3.
To keep training going, I used this dirty hack in onmt_tools.py:
import math

from onmt.constants import DefaultTokens


def sp_vocab_to_onmt_vocab(sp_vocab, onmt_vocab):
    print(f"Converting {sp_vocab}")
    with open(sp_vocab, 'r', encoding="utf-8") as fin:
        with open(onmt_vocab, 'wb') as fout:
            OMIT = (DefaultTokens.UNK, DefaultTokens.BOS, DefaultTokens.EOS)
            for line in fin:
                try:
                    w, c = line.rstrip("\n").split(None, 1)
                except Exception as e:
                    print("An error occurred:", e)
                    # Skip the bad line; otherwise w and c from the
                    # previous iteration would be written again.
                    continue
                if w in OMIT:
                    continue
                # Turn the SentencePiece log-probability into an integer pseudo-count.
                c = math.exp(float(c)) * 1000000
                c = int(c) + 1
                fout.write(f'{w}\t{c}\n'.encode("utf-8"))
    print(f"Wrote {onmt_vocab}")
However, I understand that the try-except construct masks the problem rather than solving it.
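To see which vocab entries actually fail instead of swallowing them, something like the sketch below might help. It rests on an assumption I have not confirmed: that the failing lines are pieces made of whitespace-like characters (for example U+3000), which `split(None, 1)` discards, leaving only one field to unpack. The helper names (`parse_sp_vocab_line`, `find_bad_lines`, `to_pseudo_count`) are hypothetical, and `OMIT` is hard-coded with the strings I assume `DefaultTokens.UNK`, `BOS`, and `EOS` hold, so that the snippet runs without OpenNMT-py installed.

```python
import math

# Assumed to mirror DefaultTokens.UNK / BOS / EOS from onmt.constants;
# hard-coded here so the sketch has no OpenNMT-py dependency.
OMIT = ("<unk>", "<s>", "</s>")


def parse_sp_vocab_line(line):
    """Split one 'piece<TAB>score' line on the LAST tab only.

    Returns (piece, score) or None if the line is malformed.
    Splitting on the tab keeps pieces that consist of whitespace.
    """
    piece, sep, score = line.rstrip("\n").rpartition("\t")
    if not sep or not piece:
        return None
    try:
        return piece, float(score)
    except ValueError:
        return None


def find_bad_lines(lines):
    """Return (line_number, repr(line)) for every line that does not parse."""
    bad = []
    for lineno, line in enumerate(lines, start=1):
        if parse_sp_vocab_line(line) is None:
            bad.append((lineno, repr(line)))
    return bad


def to_pseudo_count(score):
    """Same conversion as in the hack: log-probability to integer pseudo-count."""
    return int(math.exp(score) * 1000000) + 1


if __name__ == "__main__":
    sample = ["\u2581the\t-3.2\n", "\u3000\t-12.5\n", "broken\n"]
    for entry in find_bad_lines(sample):
        print("Malformed line:", entry)
    for line in sample:
        parsed = parse_sp_vocab_line(line)
        if parsed and parsed[0] not in OMIT:
            print(repr(parsed[0]), to_pseudo_count(parsed[1]))
```

If this is used on the real file, opening it with `open(sp_vocab, "r", encoding="utf-8", newline="\n")` may also matter: in the default text mode a bare `\r` inside a piece is treated as a line ending, which would break that entry across two lines.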