HURIMOZ opened this issue 3 weeks ago
Hello,
The issue is probably not the tab but what lies before it in the line.
See the _read_vocab_file function:
https://github.com/eole-nlp/eole/blob/5120fdbd06132cd7d16b9fe65384c2affe95b199/eole/inputters/inputter.py#L87-L113
Judging by your error trace, I would guess that there is a line with "
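For context, the parsing step in that function looks roughly like this (a simplified sketch based on the linked code and the int(...) line in the trace further down; not the verbatim implementation):

# Simplified sketch of the vocab parsing in _read_vocab_file (see link above).
# Each line is expected to be "<token><whitespace><count>".
def read_vocab_file(vocab_path, min_count):
    vocab = []
    with open(vocab_path, encoding="utf-8") as f:
        for line in f:
            token, count = line.strip().split(None, 1)
            # If the token itself contains whitespace, `count` ends up holding
            # the tail of the token plus the real count, and int() raises
            # ValueError: invalid literal for int() with base 10.
            if int(count) >= min_count:
                vocab.append(token)
    return vocab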
Hi François, I did a couple more tests to try to solve this error, but I still can't find it. This is my prepare_wmt_frty_data.sh file:
#!/usr/bin/env bash
if ! command -v subword-nmt &>/dev/null; then
    echo "Please install Subword NMT: pip3 install subword-nmt"
    exit 2
fi

# Set the directory paths
DATA_DIR=/home/ubuntu/TY-EN/eole/recipes/wmt17/data
PROCESSED_DIR=/home/ubuntu/TY-EN/eole/recipes/wmt17/processed_data

# Make sure the output directory exists before linking into it
mkdir -p "$PROCESSED_DIR"

# Create symbolic links for the existing files (-f so reruns don't fail)
ln -sf "$DATA_DIR/src-train.txt" "$PROCESSED_DIR/train.src"
ln -sf "$DATA_DIR/tgt-train.txt" "$PROCESSED_DIR/train.trg"
ln -sf "$DATA_DIR/src-val.txt" "$PROCESSED_DIR/dev.src"
ln -sf "$DATA_DIR/tgt-val.txt" "$PROCESSED_DIR/dev.trg"
ln -sf "$DATA_DIR/src-test.txt" "$PROCESSED_DIR/test.src"

# Learn BPE codes on the concatenated training data
cat "$PROCESSED_DIR/train.src" "$PROCESSED_DIR/train.trg" | subword-nmt learn-bpe -s 32000 > "$PROCESSED_DIR/codes"

# Apply BPE to the files
for LANG in src trg; do
    subword-nmt apply-bpe -c "$PROCESSED_DIR/codes" < "$PROCESSED_DIR/train.$LANG" > "$PROCESSED_DIR/train.$LANG.bpe"
    for SET in dev; do
        subword-nmt apply-bpe -c "$PROCESSED_DIR/codes" < "$PROCESSED_DIR/$SET.$LANG" > "$PROCESSED_DIR/$SET.$LANG.bpe"
    done
done
subword-nmt apply-bpe -c "$PROCESSED_DIR/codes" < "$PROCESSED_DIR/test.src" > "$PROCESSED_DIR/test.src.bpe"

# Filter and shuffle the training data
python3 filter_train.py
paste -d '\t' "$PROCESSED_DIR/train.src.bpe.filter" "$PROCESSED_DIR/train.trg.bpe.filter" | shuf | awk -v FS="\t" '{ print $1 > "'"$PROCESSED_DIR"'/train.src.bpe.shuf" ; print $2 > "'"$PROCESSED_DIR"'/train.trg.bpe.shuf" }'
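As a sanity check, something like the following can scan the prepared files for tabs or other unusual whitespace before the vocab is built (the paths are taken from the script above, and the helper name is just illustrative):

# Illustrative check: report lines in the prepared data that contain
# whitespace other than plain spaces (tabs, no-break spaces, etc.).
import unicodedata

def find_odd_whitespace(path):
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            for ch in line.rstrip("\n"):
                if ch.isspace() and ch != " ":
                    name = unicodedata.name(ch, "UNKNOWN")
                    print(f"{path}:{lineno}: U+{ord(ch):04X} ({name})")
                    break

for fname in ("train.src.bpe.shuf", "train.trg.bpe.shuf"):
    find_odd_whitespace(f"processed_data/{fname}")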
I had to adapt the original code to my dataset since I already have it unzipped and in txt format (src-train.txt, tgt-train.txt, src-val.txt, tgt-val.txt and src-test.txt), so I don't need to download and unzip the files. Yet I'm not sure whether it's my adaptation that is at fault or whether there is another problem, because after building the vocab I now get a different error:
(TY-EN) ubuntu@ip-172-31-2-199:~/TY-EN/eole/recipes/wmt17$ eole build_vocab --config wmt17_enty.yaml --n_sample -1 # --num_threads 4
[2024-08-23 07:13:26,453 INFO] Transforms applied: ['normalize', 'filtertoolong', 'onmt_tokenize']
[2024-08-23 07:13:26,453 INFO] Counter vocab from -1 samples.
[2024-08-23 07:13:26,453 INFO] n_sample=-1: Build vocab on full datasets.
[2024-08-23 07:13:40,524 INFO] * Transform statistics for corpus_1(100.00%):
* SubwordStats: 1828146 -> 189682 tokens
[2024-08-23 07:13:40,556 INFO] Counters src: 21786
[2024-08-23 07:13:40,556 INFO] Counters tgt: 14753
[2024-08-23 07:13:40,563 INFO] Counters after share:32144
(TY-EN) ubuntu@ip-172-31-2-199:~/TY-EN/eole/recipes/wmt17$ eole train --config wmt17_enty.yaml
[2024-08-23 07:13:49,822 INFO] Get special vocabs from Transforms: {'src': [], 'tgt': []}.
Traceback (most recent call last):
File "/home/ubuntu/TY-EN/TY-EN/bin/eole", line 33, in <module>
sys.exit(load_entry_point('eole', 'console_scripts', 'eole')())
File "/home/ubuntu/TY-EN/eole/eole/bin/main.py", line 39, in main
bin_cls.run(args)
File "/home/ubuntu/TY-EN/eole/eole/bin/run/train.py", line 69, in run
train(config)
File "/home/ubuntu/TY-EN/eole/eole/bin/run/train.py", line 56, in train
train_process(config, device_id=0)
File "/home/ubuntu/TY-EN/eole/eole/train_single.py", line 141, in main
checkpoint, vocabs, transforms, config = _init_train(config)
File "/home/ubuntu/TY-EN/eole/eole/train_single.py", line 96, in _init_train
vocabs, transforms = prepare_transforms_vocabs(config, transforms_cls)
File "/home/ubuntu/TY-EN/eole/eole/train_single.py", line 35, in prepare_transforms_vocabs
vocabs = build_vocab(config, specials)
File "/home/ubuntu/TY-EN/eole/eole/inputters/inputter.py", line 31, in build_vocab
src_vocab = _read_vocab_file(config.src_vocab, config.src_words_min_frequency)
File "/home/ubuntu/TY-EN/eole/eole/inputters/inputter.py", line 109, in _read_vocab_file
if int(line.split(None, 1)[1]) >= min_count:
ValueError: invalid literal for int() with base 10: '@@\t12'
Here's my full config:
## IO
save_data: wmt17_en_ty
overwrite: true
seed: 1234
report_every: 100
valid_metrics: ["BLEU"]

### Vocab
src_vocab: data/vocab.shared
tgt_vocab: data/vocab.shared
src_vocab_size: 32000
tgt_vocab_size: 28000
vocab_size_multiple: 8
src_words_min_frequency: 1
tgt_words_min_frequency: 1
share_vocab: true
n_sample: -1

data:
    corpus_1:
        path_src: processed_data/train.src.bpe.shuf
        path_tgt: processed_data/train.trg.bpe.shuf
    valid:
        path_src: processed_data/dev.src.bpe
        path_tgt: processed_data/dev.trg.bpe

transforms: [normalize, onmt_tokenize, filtertoolong]
transforms_configs:
    normalize:
        norm_quote_commas: True
        norm_numbers: True
    onmt_tokenize:
        src_subword_type: bpe
        tgt_subword_type: bpe
    filtertoolong:
        src_seq_length: 512
        tgt_seq_length: 512

training:
    # Model configuration
    model_path: models
    keep_checkpoint: 50
    save_checkpoint_steps: 1000
    average_decay: 0
    train_steps: 100000
    valid_steps: 10000
    # bucket_size:
    bucket_size: 2048
    num_workers: 4
    prefetch_factor: 4
    world_size: 1
    gpu_ranks: [0]
    batch_type: "tokens"
    batch_size: 2048
    valid_batch_size: 1024
    batch_size_multiple: 8
    accum_count: [10]
    accum_steps: [0]
    dropout_steps: [0]
    dropout: [0.2]
    attention_dropout: [0.2]
    compute_dtype: fp16
    optim: "adam"
    learning_rate: 2
    warmup_steps: 4000
    decay_method: "noam"
    adam_beta2: 0.998
    max_grad_norm: 0
    label_smoothing: 0.1
    param_init: 0
    param_init_glorot: true
    normalization: "tokens"

model:
    architecture: "transformer"
    hidden_size: 256
    share_decoder_embeddings: true
    share_embeddings: false
    layers: 6
    heads: 8
    transformer_ff: 256
    # Pretrained embeddings configuration for the source language
    embeddings:
        word_vec_size: 256
        position_encoding_type: "SinusoidalInterleaved"
        #embeddings_type: "word2vec"
        #src_embeddings: data/cc.en.256.txt
The error is the same; it is just failing on another line of the vocab. Please share the part of the vocab file that contains the problematic string mentioned in the trace; that will be more efficient than reproducing the full pipeline.
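Something along these lines should print the offending entries directly (the path is taken from your config; adjust if needed):

# Print vocab lines whose second field is not a plain integer count.
# repr() makes any invisible characters visible.
with open("data/vocab.shared", encoding="utf-8") as f:
    for lineno, line in enumerate(f, 1):
        fields = line.rstrip("\n").split(None, 1)
        if len(fields) == 2 and not fields[1].isdigit():
            print(lineno, repr(line))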
Hi François, are you talking about the generated vocab.shared file? If so, every vocab line contains a tabulation between the token and the frequency.
Yes. Please share an excerpt of this file. If the split operation returns things like '@@\t12', there must be other characters causing this, or an edge case that is poorly handled in the _read_vocab_file function.
Here are the first few lines of the vocab.shared file:
te 114442
i 95107
the 39918
e 37730
mau 32160
ʻe 25304
and 25033
of 23210
to 20439
nō 18887
ʻua 18190
a 18060
o 14556
mai 14260
in 12445
ʻia 12407
tātou 10039
ia 9999
roto 9839
ra 9373
that 9152
atu 8816
tō 8489
rātou 8201
nei 7570
parau 7459
hōʻē 6695
nā 6582
ʻoia 6543
for 6350
ʻo 6183
we 6154
mea 6127
he 6073
tē 5932
tā 5898
ʻoutou 5897
is 5647
taʻata 5619
his 5275
niʻa 5175
be 5167
with 5130
teie 5107
you 5021
reira 4814
it 4667
au 4666
they 4503
noa 4497
tei 4449
rahi 4440
atoʻa 4380
ʻa 4353
tōna 4310
as 4094
our 4088
was 3994
rā 3718
rave 3716
roa 3658
ʻite 3627
ai 3562
my 3552
are 3540
ʻei 3511
ē, 3344
have 3334
ʻohipa 3327
this 3289
will 3256
tāna 3172
ʻoe 3126
on 3092
tahi 3071
īa 3071
iho 2990
ʻaita 2980
hoʻi 2937
not 2936
by 2929
nehenehe 2893
tōʻu 2881
vau 2870
their 2746
your 2687
tiʻa 2686
ma 2658
feiā 2657
muri 2653
from 2607
aʻe 2603
who 2602
all 2499
haere 2484
- 2477
ʻore 2416
us 2394
maitaʻi 2359
I'm wondering if it could be the glottal stop causing the issue. I remember I had to look under the hood of SentencePiece to make it work with the glottal stop (U+02BB).
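For what it's worth, the ʻokina itself can probably be ruled out: U+02BB is classified as a letter, not whitespace, so the whitespace split in _read_vocab_file never breaks on it. A quick check:

# U+02BB (MODIFIER LETTER TURNED COMMA, the ʻokina) is a letter, not
# whitespace, so it cannot trigger the whitespace split in _read_vocab_file.
import unicodedata

ch = "\u02bb"
print(unicodedata.category(ch))        # 'Lm' (modifier letter)
print(ch.isspace())                    # False
print("taʻata\t5619".split(None, 1))   # ['taʻata', '5619'] -- parses fine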
The issue is not in the excerpt you provided. Check the trace: it is failing at the line containing '@@\t12' (so near the end of the file, judging by the frequency).
Hello everyone, I get this error on the train command. It says I have a tab somewhere in my data, but I actually have none (besides the generated vocab.shared file).
What other character could it be?
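One plausible culprit (an assumption, not confirmed from this data) is an invisible whitespace character inside a token, such as a no-break space (U+00A0): Python's str.split(None, ...) treats it as a separator, which reproduces the exact value from the trace:

# A no-break space (U+00A0) hidden inside a token reproduces the error:
line = "foo\u00a0@@\t12"        # token "foo<NBSP>@@" with count 12
print(line.split(None, 1))      # ['foo', '@@\t12'] -- NBSP acts as a separator
int(line.split(None, 1)[1])     # ValueError: invalid literal for int() with base 10: '@@\t12'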