eole-nlp / eole

Open language modeling toolkit based on PyTorch
https://eole-nlp.github.io/eole
MIT License

ValueError: invalid literal for int() with base 10: 'm\t13' #85

Open · HURIMOZ opened this issue 3 weeks ago

HURIMOZ commented 3 weeks ago

Hello everyone, I get this error on the train command. It says I have a tab somewhere in my data, but I actually have none (apart from the generated vocab.shared file).

[2024-08-18 23:44:26,900 INFO] Get special vocabs from Transforms: {'src': [], 'tgt': []}.
Traceback (most recent call last):
  File "/home/ubuntu/TY-EN/TY-EN/bin/eole", line 33, in <module>
    sys.exit(load_entry_point('eole', 'console_scripts', 'eole')())
  File "/home/ubuntu/TY-EN/eole/eole/bin/main.py", line 39, in main
    bin_cls.run(args)
  File "/home/ubuntu/TY-EN/eole/eole/bin/run/train.py", line 69, in run
    train(config)
  File "/home/ubuntu/TY-EN/eole/eole/bin/run/train.py", line 56, in train
    train_process(config, device_id=0)
  File "/home/ubuntu/TY-EN/eole/eole/train_single.py", line 142, in main
    checkpoint, vocabs, transforms, config = _init_train(config)
  File "/home/ubuntu/TY-EN/eole/eole/train_single.py", line 100, in _init_train
    vocabs, transforms = prepare_transforms_vocabs(config, transforms_cls)
  File "/home/ubuntu/TY-EN/eole/eole/train_single.py", line 35, in prepare_transforms_vocabs
    vocabs = build_vocab(config, specials)
  File "/home/ubuntu/TY-EN/eole/eole/inputters/inputter.py", line 31, in build_vocab
    src_vocab = _read_vocab_file(config.src_vocab, config.src_words_min_frequency)
  File "/home/ubuntu/TY-EN/eole/eole/inputters/inputter.py", line 109, in _read_vocab_file
    if int(line.split(None, 1)[1]) >= min_count:
ValueError: invalid literal for int() with base 10: 'm\t13'

What other character could it be?

francoishernandez commented 3 weeks ago

Hello, the issue is probably not the tab itself but what comes before it on that line. See the _read_vocab_file function: https://github.com/eole-nlp/eole/blob/5120fdbd06132cd7d16b9fe65384c2affe95b199/eole/inputters/inputter.py#L87-L113

Judging by your error trace, I would guess that there is a line containing something like " m\t13" (with an additional space before the "m"). If not that, you'll need to provide more details on the exact content of the vocab file which is causing the issue.
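
If it helps to locate the offending line, here is a minimal diagnostic sketch (not part of eole; the vocab path is an assumption, adjust it to yours). It mimics the line.split(None, 1) parsing shown in the trace and prints every line whose count field does not parse as an integer, together with its raw codepoints:

# Minimal diagnostic sketch (hypothetical helper, not part of eole).
# It mimics the parsing shown in the trace, int(line.split(None, 1)[1]),
# and reports every vocab line whose count field does not parse.
vocab_path = "data/vocab.shared"  # assumption: adjust to your own path

with open(vocab_path, encoding="utf-8") as f:
    for lineno, line in enumerate(f, 1):
        fields = line.rstrip("\n").split(None, 1)
        if len(fields) < 2:
            print(f"line {lineno}: only one field: {fields!r}")
            continue
        try:
            int(fields[1])
        except ValueError:
            # Print the raw codepoints so invisible characters show up.
            print(f"line {lineno}: bad count field {fields[1]!r}")
            print("  codepoints:", [hex(ord(c)) for c in line.rstrip("\n")])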

HURIMOZ commented 3 weeks ago

Hi François, I did a couple more tests to try and solve this error, but I still can't find the cause. This is my prepare_wmt_frty_data.sh file:

#!/usr/bin/env bash

if ! command -v subword-nmt &>/dev/null; then
  echo "Please install Subword NMT: pip3 install subword-nmt"
  exit 2
fi

# Set the directory paths
DATA_DIR=/home/ubuntu/TY-EN/eole/recipes/wmt17/data
PROCESSED_DIR=/home/ubuntu/TY-EN/eole/recipes/wmt17/processed_data

# Create symbolic links for the existing files
ln -s $DATA_DIR/src-train.txt $PROCESSED_DIR/train.src
ln -s $DATA_DIR/tgt-train.txt $PROCESSED_DIR/train.trg
ln -s $DATA_DIR/src-val.txt $PROCESSED_DIR/dev.src
ln -s $DATA_DIR/tgt-val.txt $PROCESSED_DIR/dev.trg
ln -s $DATA_DIR/src-test.txt $PROCESSED_DIR/test.src

# Learn BPE codes
cat $PROCESSED_DIR/train.src $PROCESSED_DIR/train.trg | subword-nmt learn-bpe -s 32000 > $PROCESSED_DIR/codes

# Apply BPE to the files
for LANG in src trg; do
  subword-nmt apply-bpe -c $PROCESSED_DIR/codes < $PROCESSED_DIR/train.$LANG > $PROCESSED_DIR/train.$LANG.bpe
  for SET in dev; do
    subword-nmt apply-bpe -c $PROCESSED_DIR/codes < $PROCESSED_DIR/$SET.$LANG > $PROCESSED_DIR/$SET.$LANG.bpe
  done
done
subword-nmt apply-bpe -c $PROCESSED_DIR/codes < $PROCESSED_DIR/test.src > $PROCESSED_DIR/test.src.bpe

# Filter and shuffle the training data
python3 filter_train.py
paste -d '\t' $PROCESSED_DIR/train.src.bpe.filter $PROCESSED_DIR/train.trg.bpe.filter | shuf | awk -v FS="\t" '{ print $1 > "'$PROCESSED_DIR'/train.src.bpe.shuf" ; print $2 > "'$PROCESSED_DIR'/train.trg.bpe.shuf" }'

I had to adapt the original code to my dataset, as I already have it unzipped and in txt format (src-train.txt, tgt-train.txt, src-val.txt, tgt-val.txt and src-test.txt), so I don't need to download and unzip the files. I'm not sure whether it's my adaptation that is at fault or whether there is another problem, because after building the vocab I now get another error:

(TY-EN) ubuntu@ip-172-31-2-199:~/TY-EN/eole/recipes/wmt17$ eole build_vocab --config wmt17_enty.yaml --n_sample -1 # --num_threads 4
[2024-08-23 07:13:26,453 INFO] Transforms applied: ['normalize', 'filtertoolong', 'onmt_tokenize']
[2024-08-23 07:13:26,453 INFO] Counter vocab from -1 samples.
[2024-08-23 07:13:26,453 INFO] n_sample=-1: Build vocab on full datasets.
[2024-08-23 07:13:40,524 INFO] * Transform statistics for corpus_1(100.00%):
                        * SubwordStats: 1828146 -> 189682 tokens

[2024-08-23 07:13:40,556 INFO] Counters src: 21786
[2024-08-23 07:13:40,556 INFO] Counters tgt: 14753
[2024-08-23 07:13:40,563 INFO] Counters after share:32144
(TY-EN) ubuntu@ip-172-31-2-199:~/TY-EN/eole/recipes/wmt17$ eole train --config wmt17_enty.yaml
[2024-08-23 07:13:49,822 INFO] Get special vocabs from Transforms: {'src': [], 'tgt': []}.
Traceback (most recent call last):
  File "/home/ubuntu/TY-EN/TY-EN/bin/eole", line 33, in <module>
    sys.exit(load_entry_point('eole', 'console_scripts', 'eole')())
  File "/home/ubuntu/TY-EN/eole/eole/bin/main.py", line 39, in main
    bin_cls.run(args)
  File "/home/ubuntu/TY-EN/eole/eole/bin/run/train.py", line 69, in run
    train(config)
  File "/home/ubuntu/TY-EN/eole/eole/bin/run/train.py", line 56, in train
    train_process(config, device_id=0)
  File "/home/ubuntu/TY-EN/eole/eole/train_single.py", line 141, in main
    checkpoint, vocabs, transforms, config = _init_train(config)
  File "/home/ubuntu/TY-EN/eole/eole/train_single.py", line 96, in _init_train
    vocabs, transforms = prepare_transforms_vocabs(config, transforms_cls)
  File "/home/ubuntu/TY-EN/eole/eole/train_single.py", line 35, in prepare_transforms_vocabs
    vocabs = build_vocab(config, specials)
  File "/home/ubuntu/TY-EN/eole/eole/inputters/inputter.py", line 31, in build_vocab
    src_vocab = _read_vocab_file(config.src_vocab, config.src_words_min_frequency)
  File "/home/ubuntu/TY-EN/eole/eole/inputters/inputter.py", line 109, in _read_vocab_file
    if int(line.split(None, 1)[1]) >= min_count:
ValueError: invalid literal for int() with base 10: '@@\t12'

Here's my full config:

## IO
save_data: wmt17_en_ty
overwrite: true
seed: 1234
report_every: 100
valid_metrics: ["BLEU"]

### Vocab
src_vocab: data/vocab.shared
tgt_vocab: data/vocab.shared
src_vocab_size: 32000
tgt_vocab_size: 28000
vocab_size_multiple: 8
src_words_min_frequency: 1
tgt_words_min_frequency: 1
share_vocab: true
n_sample: -1

data:
    corpus_1:
        path_src: processed_data/train.src.bpe.shuf
        path_tgt: processed_data/train.trg.bpe.shuf
    valid:
        path_src: processed_data/dev.src.bpe
        path_tgt: processed_data/dev.trg.bpe

transforms: [normalize, onmt_tokenize, filtertoolong]
transforms_configs:
  normalize:
    norm_quote_commas: True
    norm_numbers: True
  onmt_tokenize:
    src_subword_type: bpe
    tgt_subword_type: bpe
  filtertoolong:
    src_seq_length: 512
    tgt_seq_length: 512

training:
    # Model configuration
    model_path: models
    keep_checkpoint: 50
    save_checkpoint_steps: 1000
    average_decay: 0
    train_steps: 100000
    valid_steps: 10000

    # bucket_size: 
    bucket_size: 2048
    num_workers: 4
    prefetch_factor: 4
    world_size: 1
    gpu_ranks: [0]
    batch_type: "tokens"
    batch_size: 2048
    valid_batch_size: 1024
    batch_size_multiple: 8
    accum_count: [10]
    accum_steps: [0]
    dropout_steps: [0]
    dropout: [0.2]
    attention_dropout: [0.2]
    compute_dtype: fp16
    optim: "adam"
    learning_rate: 2
    warmup_steps: 4000
    decay_method: "noam"
    adam_beta2: 0.998
    max_grad_norm: 0
    label_smoothing: 0.1
    param_init: 0
    param_init_glorot: true
    normalization: "tokens"

model:
    architecture: "transformer"
    hidden_size: 256
    share_decoder_embeddings: true
    share_embeddings: false
    layers: 6
    heads: 8
    transformer_ff: 256

# Pretrained embeddings configuration for the source language
    embeddings:
        word_vec_size: 256
        position_encoding_type: "SinusoidalInterleaved"
        #embeddings_type: "word2vec"
        #src_embeddings: data/cc.en.256.txt

francoishernandez commented 3 weeks ago

The error is the same; it's just failing on another line of the vocab. Please share the part of the vocab that contains the problematic string mentioned in the trace; that will be more efficient than reproducing the full pipeline.

HURIMOZ commented 3 weeks ago

Hi François, are you talking about the generated vocab.shared file? If so, every vocab line contains a tab between the token and the frequency.

francoishernandez commented 3 weeks ago

Yes, please share an excerpt of this file. If the split operation returns things like '@@\t12', there must be other characters causing this, or an edge case that is poorly handled in the _read_vocab_file function.
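
For illustration (a hypothetical token, not necessarily yours): if the token itself contains a character that Python treats as whitespace, such as a no-break space (U+00A0), the first split lands inside the token and the "count" field ends up being the tail of the token plus the tab and the number:

# Hypothetical example: a token containing a no-break space (U+00A0)
line = "ta\u00a0@@\t12"      # intended token "ta<NBSP>@@", count 12
print(line.split(None, 1))   # ['ta', '@@\t12'] -- the split broke inside the token
int(line.split(None, 1)[1])  # ValueError: invalid literal for int() with base 10: '@@\t12'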

HURIMOZ commented 3 weeks ago

Here are the first few lines of the vocab.shared file:

te  114442
i   95107
the 39918
e   37730
mau 32160
ʻe  25304
and 25033
of  23210
to  20439
nō  18887
ʻua 18190
a   18060
o   14556
mai 14260
in  12445
ʻia 12407
tātou   10039
ia  9999
roto    9839
ra  9373
that    9152
atu 8816
tō  8489
rātou   8201
nei 7570
parau   7459
hōʻē    6695
nā  6582
ʻoia    6543
for 6350
ʻo  6183
we  6154
mea 6127
he  6073
tē  5932
tā  5898
ʻoutou  5897
is  5647
taʻata  5619
his 5275
niʻa    5175
be  5167
with    5130
teie    5107
you 5021
reira   4814
it  4667
au  4666
they    4503
noa 4497
tei 4449
rahi    4440
atoʻa   4380
ʻa  4353
tōna    4310
as  4094
our 4088
was 3994
rā  3718
rave    3716
roa 3658
ʻite    3627
ai  3562
my  3552
are 3540
ʻei 3511
ē,  3344
have    3334
ʻohipa  3327
this    3289
will    3256
tāna    3172
ʻoe 3126
on  3092
tahi    3071
īa  3071
iho 2990
ʻaita   2980
hoʻi    2937
not 2936
by  2929
nehenehe    2893
tōʻu    2881
vau 2870
their   2746
your    2687
tiʻa    2686
ma  2658
feiā    2657
muri    2653
from    2607
aʻe 2603
who 2602
all 2499
haere   2484
-   2477
ʻore    2416
us  2394
maitaʻi 2359

HURIMOZ commented 3 weeks ago

I'm wondering if it could be the glottal stop causing the issue. I remember I had to look under the hood of SentencePiece to make it work with the glottal stop (U+02BB).

francoishernandez commented 3 weeks ago

The issue is not in the excerpt you provided. Check the trace: it is failing at the line containing '@@\t12', so around the end of the file, judging by the frequency.
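
For what it's worth, the glottal stop itself should be harmless here; a quick sanity check (using the taʻata line from your excerpt):

# U+02BB is not whitespace, so split(None, 1) does not break inside a token
# that contains it; the line from the excerpt parses fine.
print("\u02bb".isspace())                  # False
print("ta\u02bbata\t5619".split(None, 1))  # ['taʻata', '5619']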