OpenNMT / OpenNMT-py

Open Source Neural Machine Translation and (Large) Language Models in PyTorch
https://opennmt.net/
MIT License

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 55: invalid start byte #2579

Closed fkurushin closed 2 months ago

fkurushin commented 3 months ago

I trained a BPE model with Google's SentencePiece `spm_train` tool. When I try to build the vocabulary with the `onmt_build_vocab` tool, this error is raised:

spm_train --input=/data/translator/parallel/zh_train.txt \
    --model_prefix=/data/translator/code/zh \
    --train_extremely_large_corpus=True \
    --minloglevel=1 \
    --num_threads=40 \
    --vocab_size=30000 \
    --model_type=bpe

I did the same for the ru model, and then ran:

onmt_build_vocab -config zh-ru-translator.yaml -n_sample -1
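For context, this exception is Python's generic error for reading binary data as UTF-8 text; it typically means a binary file (for example, the SentencePiece `.model` file) was opened in text mode somewhere. A minimal reproduction, unrelated to OpenNMT itself:

```python
# Byte 0x80 is a UTF-8 continuation byte and can never start a character,
# so decoding it as UTF-8 fails exactly like in the reported traceback.
payload = b"\x80"
try:
    payload.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)  # 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
```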

This is the configuration:

# zh-ru-translator.yaml
## Where the samples will be written
save_data: run/opennmt_data
## Where the vocab(s) will be written
src_vocab: /data/translator/code/zh.vocab
tgt_vocab: /data/translator/code/ru.vocab

# Should match the vocab size for SentencePiece
src_vocab_size: 30000
tgt_vocab_size: 30000

share_vocab: False

# Corpus opts:
data:
    corpus_1:
        path_src: /data/translator/parallel/zh_train.txt
        path_tgt: /data/translator/parallel/ru_train.txt
        weight: 1
        transforms: [bpe, filtertoolong]
    valid:
        path_src: /data/translator/parallel/zh_valid.txt
        path_tgt: /data/translator/parallel/ru_valid.txt
        transforms: [bpe, filtertoolong]

### Transform related opts:
#### Subword
src_subword_model: /data/translator/code/zh.model
tgt_subword_model: /data/translator/code/ru.model
#### Filter
src_seq_length: 150
tgt_seq_length: 150

Has anybody faced this issue before?

PS: I previously tried the OpenNMT BPE tool, but it was too slow for me; it ran for about 2 days without producing any result.

fkurushin commented 2 months ago

Using `python3 OpenNMT-py/tools/spm_to_vocab.py` solved my issue.
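For anyone hitting the same problem: a SentencePiece `.vocab` file is a tab-separated list of `token<TAB>log-probability` pairs, while `onmt_build_vocab` expects `token<TAB>count` lines. A minimal sketch of that conversion (the function name `spm_vocab_to_onmt` and the count scaling are illustrative, not the exact contents of `spm_to_vocab.py`):

```python
import math

def spm_vocab_to_onmt(lines, omit=("<unk>", "<s>", "</s>")):
    """Convert SentencePiece 'token<TAB>log-prob' lines to 'token<TAB>count' lines."""
    for line in lines:
        token, logprob = line.rstrip("\n").split("\t")
        if token in omit:
            continue  # OpenNMT-py adds its own special tokens
        # Turn the log-probability into a positive pseudo-count
        # (the 1e6 scale factor is arbitrary; only relative order matters).
        count = int(math.exp(float(logprob)) * 1_000_000) + 1
        yield f"{token}\t{count}"

# Example with two entries from a hypothetical zh.vocab:
for entry in spm_vocab_to_onmt(["<unk>\t0", "▁的\t-3.5"]):
    print(entry)
```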