HURIMOZ closed this issue 1 month ago.
Hardly possible to help you with so little detail. The BLEU metric is populated in the tensorboard logs like all the other metrics, and if you see the panel in tensorboard, it means it's enabled. So it might be a tensorboard setting issue more than anything. Provide config/screenshots/details/logs (text/tensorboard) if you want further help. Also, it might be good to move this kind of "issue" to "discussions", as they are usage issues rather than code/implementation-related issues (mostly).
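As a quick sanity check, you can also inspect the TensorBoard event files directly to confirm whether a valid/BLEU scalar is being written at all. A minimal sketch (not an Eole tool), assuming the events live under the configured tensorboard_log_dir and that the tag is valid/BLEU; adjust both to your setup:

# Sketch: list the scalar tags written to the TensorBoard event files and,
# if present, print the logged BLEU values. Point LOGDIR at the run folder
# that contains the events.out.tfevents.* file.
from tensorboard.backend.event_processing import event_accumulator

LOGDIR = "logs"  # assumed: tensorboard_log_dir from the config
ea = event_accumulator.EventAccumulator(LOGDIR)
ea.Reload()  # parse every event file found in the directory

scalar_tags = ea.Tags()["scalars"]
print("Scalar tags found:", scalar_tags)

tag = "valid/BLEU"  # assumed tag name; check the printed list if it differs
if tag in scalar_tags:
    for event in ea.Scalars(tag):
        print(f"step={event.step} BLEU={event.value}")
else:
    print(f"No '{tag}' scalars in the event files, so TensorBoard has nothing to plot.")

If no valid/BLEU tag shows up, the problem is on the training/scoring side, not a TensorBoard display setting.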
Hi François, here's my full config:
## IO
overwrite: True
seed: 1234
report_every: 100
valid_metrics: ["BLEU"]
### Vocab
src_vocab: processed_data/spm_src-train.onmt_vocab
tgt_vocab: processed_data/spm_tgt-train.onmt_vocab
#n_sample: -1
data:
corpus_1:
path_src: data/src-train.txt
path_tgt: data/tgt-train.txt
valid:
path_src: data/EN-val.txt
path_tgt: data/TY-val.txt
#transforms: [onmt_tokenize]
transforms: [sentencepiece]
transforms_configs:
#normalize:
#src_lang: en
#tgt_lang: ty
#norm_quote_commas: true
#norm_numbers: true
sentencepiece:
#src_subword_type: sentencepiece
src_subword_model: data/en.wiki.bpe.vs25000.model
#tgt_subword_type: sentencepiece
tgt_subword_model: models/spm_tgt-train.model
#filtertoolong:
#src_seq_length: 512
#tgt_seq_length: 512
dump_samples: true
n_samples: 1000
# Number of candidates for SentencePiece sampling
#subword_nbest: 20
# Smoothing parameter for SentencePiece sampling
#subword_alpha: 0.1
training:
# Model configuration
model_path: models
keep_checkpoint: 40
save_checkpoint_steps: 1000
train_steps: 40000
valid_steps: 500
#train_from: models/step_7000
bucket_size: 1024
num_workers: 4
prefetch_factor: 6
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 1024
valid_batch_size: 1024
batch_size_multiple: 8
accum_count: [10]
accum_steps: [0]
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
compute_dtype: fp16
optim: "adam"
learning_rate: 1.4
average_decay: 0.1
warmup_steps: 4000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_method: "xavier_uniform"
#normalization: "tokens"
#early_stopping: 3
tensorboard: true
tensorboard_log_dir: logs
log_file: logs/eole.log
# Pretrained embeddings configuration for the source language
embeddings_type: word2vec
src_embeddings: data/en.wiki.bpe.vs25000.d300.w2v.txt
#tgt_embeddings:
save_data: processed_data/
position_encoding_type: Rotary
model:
architecture: "transformer"
hidden_size: 300
share_decoder_embeddings: false
share_embeddings: false
layers: 6
heads: 6
transformer_ff: 300
word_vec_size: 300
position_encoding: true
Logs here: eole.log
The BLEU metric is here; it's just constantly 0.0 (the yellow line), as it is in the logs.
Yes indeed. I do, however, get training and validation data in the logs, and TensorBoard displays that data fine. I'm working on convergence right now. It would be great to have the BLEU score as well. I checked the sacrebleu install and it looks good; I have the latest version. Is there another parameter to define in the config file to get BLEU?
This whole BLEU metric stack can be a bit edgy, because it requires actually running full predictions (as opposed to acc/ppl, which work at the "step" level). Having a BLEU of 0.0 here means that the predictions on the valid set are not going well (at least in this scope). Does prediction run properly on a saved checkpoint? If not, you probably still have some config issues. If it does, then there might be something off in the valid set or in the scoring code path (though the latter should be fine, re-tested recently on the updated WMT17 configs). You can investigate a bit further where the predictions are actually performed: https://github.com/eole-nlp/eole/blob/a92857063c49019038da458afca5237bfe6e5e83/eole/trainer.py#L441-L445 https://github.com/eole-nlp/eole/blob/a92857063c49019038da458afca5237bfe6e5e83/eole/utils/scoring_utils.py#L30
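The dynamic scoring essentially boils down to predicting the valid set, detokenizing, and scoring with sacrebleu. A minimal standalone sketch of that last scoring step (file names are placeholders; both files should be detokenized plain text, one sentence per line), which you can also use to score predictions from a saved checkpoint yourself:

# Sketch: compute corpus BLEU for a file of predictions against a reference
# file, outside of training. Adjust the placeholder paths to your setup.
import sacrebleu

with open("predictions.txt", encoding="utf-8") as f:  # hypothetical prediction output
    hypotheses = [line.strip() for line in f]

with open("data/TY-val.txt", encoding="utf-8") as f:  # reference translations (valid target)
    references = [line.strip() for line in f]

assert len(hypotheses) == len(references), "hypothesis/reference count mismatch"

# corpus_bleu takes the hypotheses and a list of reference streams
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")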
Some notes: you can also keep checkpoints around (keep_checkpoint setting) and run some scripts to run various prediction settings and compute scores (BLEU or whatever), e.g. along the lines of the sketch above.
Nope. The prediction on a saved checkpoint outputs nothing. It seems to compute, then reports the time spent computing, and that's it. When I look into the tgt-test.txt file, the translations are not present.
The config you provided is most probably not the config you are running, as it is not valid. (Or you are still using an older version of the code.)
Also, there is most probably something wrong in said config, as the logs show
[2024-10-17 01:15:00,975 INFO] Scoring with: None
which means no transforms are used in the scoring process (probably explaining the 0 BLEU).
When transforms are enabled, the same line in the logs looks something like this:
[2024-10-23 10:33:36,351 INFO] Scoring with: {'sentencepiece': SentencePieceTransform(share_vocab=False, src_subword_model=wmt17_en_de/spm.model, tgt_subword_model=wmt17_en_de/spm.model, src_subword_alpha=0.0, tgt_subword_alpha=0.0, src_subword_vocab=, tgt_subword_vocab=, src_vocab_threshold=0, tgt_vocab_threshold=0, src_subword_nbest=1, tgt_subword_nbest=1)}
Also, the accuracy gets very high very quickly, meaning you have the same issue as before. Please try to get your setup straight. We're running in circles here.
Here is a config adapted from yours that works properly on the WMT17 data. Note: pretrained embeddings are disabled, because I don't have a quick way to test this, but in any case you should always test a simple setup before trying to fine-tune it with more specific features like this (which are very often not really worth the hassle, by the way).
## IO
overwrite: True
seed: 1234
report_every: 100
valid_metrics: ["BLEU"]
### Vocab
src_vocab: wmt17_en_de/vocab.spm.shared
tgt_vocab: wmt17_en_de/vocab.spm.shared
#n_sample: -1
data:
corpus_1:
path_src: wmt17_en_de/train.src.shuf
path_tgt: wmt17_en_de/train.trg.shuf
valid:
path_src: wmt17_en_de/dev.src
path_tgt: wmt17_en_de/dev.trg
transforms: [sentencepiece]
transforms_configs:
#normalize:
#src_lang: en
#tgt_lang: ty
#norm_quote_commas: true
#norm_numbers: true
sentencepiece:
#src_subword_type: sentencepiece
src_subword_model: "wmt17_en_de/spm.model"
#tgt_subword_type: sentencepiece
tgt_subword_model: "wmt17_en_de/spm.model"
#filtertoolong:
#src_seq_length: 512
#tgt_seq_length: 512
# dump_samples: true
# n_samples: 1000
# Number of candidates for SentencePiece sampling
#subword_nbest: 20
# Smoothing parameter for SentencePiece sampling
#subword_alpha: 0.1
training:
# Model configuration
model_path: models
keep_checkpoint: 40
save_checkpoint_steps: 1000
train_steps: 40000
valid_steps: 500
#train_from: models/step_7000
bucket_size: 1024
num_workers: 4
prefetch_factor: 6
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 1024
valid_batch_size: 1024
batch_size_multiple: 1
accum_count: [10]
accum_steps: [0]
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
compute_dtype: fp16
optim: "adam"
self_attn_backend: "pytorch"
learning_rate: 1.4
average_decay: 0.1
warmup_steps: 4000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_method: "xavier_uniform"
#normalization: "tokens"
#early_stopping: 3
tensorboard: true
tensorboard_log_dir: logs
log_file: logs/eole.log
# Pretrained embeddings configuration for the source language
# embeddings_type: word2vec
# src_embeddings: data/en.wiki.bpe.vs25000.d300.w2v.txt
#tgt_embeddings:
save_data: processed_data/
model:
architecture: "transformer"
hidden_size: 300
share_decoder_embeddings: false
share_embeddings: false
layers: 6
heads: 6
transformer_ff: 300
word_vec_size: 300
embeddings:
position_encoding_type: Rotary
# position_encoding: true
Starting logs (BLEU is veeeeery low, but that's expected here):
[2024-10-23 10:53:49,634 INFO] Default transforms (might be overridden downstream): ['sentencepiece'].
[2024-10-23 10:53:49,634 INFO] Missing transforms field for corpus_1 data, set to default: ['sentencepiece'].
[2024-10-23 10:53:49,635 INFO] Missing transforms field for valid data, set to default: ['sentencepiece'].
[2024-10-23 10:53:49,635 INFO] Parsed 2 corpora from -data.
[2024-10-23 10:53:49,635 INFO] Get special vocabs from Transforms: {'src': [], 'tgt': []}.
[2024-10-23 10:53:49,822 INFO] Transforms applied: ['sentencepiece']
[2024-10-23 10:53:49,829 INFO] The first 10 tokens of the vocabs are:['<unk>', '<blank>', '<s>', '</s>', '▁,', '▁.', '▁the', '▁in', '▁die', 's']
[2024-10-23 10:53:49,829 INFO] The decoder start token is: <s>
[2024-10-23 10:53:49,830 INFO] bos_token token is: <s> id: [2]
[2024-10-23 10:53:49,830 INFO] eos_token token is: </s> id: [3]
[2024-10-23 10:53:49,830 INFO] pad_token token is: <blank> id: [1]
[2024-10-23 10:53:49,830 INFO] unk_token token is: <unk> id: [0]
[2024-10-23 10:53:49,830 INFO] Building model...
[2024-10-23 10:53:50,091 INFO] Switching model to float32 for amp/apex_amp
[2024-10-23 10:53:50,091 INFO] Non quantized layer compute is torch.float16
[2024-10-23 10:53:50,288 INFO] EncoderDecoderModel(
(encoder): TransformerEncoder(
(rope): RotaryPosition()
(transformer_layers): ModuleList(
(0-5): 6 x TransformerEncoderLayer(
(input_layernorm): LayerNorm((300,), eps=1e-06, elementwise_affine=True)
(self_attn): SelfMHA(
(linear_keys): Linear(in_features=300, out_features=300, bias=False)
(linear_values): Linear(in_features=300, out_features=300, bias=False)
(linear_query): Linear(in_features=300, out_features=300, bias=False)
(softmax): Softmax(dim=-1)
(dropout): Dropout(p=0.1, inplace=False)
(final_linear): Linear(in_features=300, out_features=300, bias=False)
)
(dropout): Dropout(p=0.1, inplace=False)
(post_attention_layernorm): LayerNorm((300,), eps=1e-06, elementwise_affine=True)
(mlp): MLP(
(gate_up_proj): Linear(in_features=300, out_features=300, bias=False)
(down_proj): Linear(in_features=300, out_features=300, bias=False)
(dropout_1): Dropout(p=0.1, inplace=False)
(dropout_2): Dropout(p=0.1, inplace=False)
)
)
)
(layer_norm): LayerNorm((300,), eps=1e-06, elementwise_affine=True)
)
(decoder): TransformerDecoder(
(rope): RotaryPosition()
(transformer_layers): ModuleList(
(0-5): 6 x TransformerDecoderLayer(
(input_layernorm): LayerNorm((300,), eps=1e-06, elementwise_affine=True)
(self_attn): SelfMHA(
(linear_keys): Linear(in_features=300, out_features=300, bias=False)
(linear_values): Linear(in_features=300, out_features=300, bias=False)
(linear_query): Linear(in_features=300, out_features=300, bias=False)
(softmax): Softmax(dim=-1)
(dropout): Dropout(p=0.1, inplace=False)
(final_linear): Linear(in_features=300, out_features=300, bias=False)
)
(dropout): Dropout(p=0.1, inplace=False)
(post_attention_layernorm): LayerNorm((300,), eps=1e-06, elementwise_affine=True)
(mlp): MLP(
(gate_up_proj): Linear(in_features=300, out_features=300, bias=False)
(down_proj): Linear(in_features=300, out_features=300, bias=False)
(dropout_1): Dropout(p=0.1, inplace=False)
(dropout_2): Dropout(p=0.1, inplace=False)
)
(precontext_layernorm): LayerNorm((300,), eps=1e-06, elementwise_affine=True)
(context_attn): ContextMHA(
(linear_keys): Linear(in_features=300, out_features=300, bias=False)
(linear_values): Linear(in_features=300, out_features=300, bias=False)
(linear_query): Linear(in_features=300, out_features=300, bias=False)
(softmax): Softmax(dim=-1)
(dropout): Dropout(p=0.1, inplace=False)
(final_linear): Linear(in_features=300, out_features=300, bias=False)
)
)
)
(layer_norm): LayerNorm((300,), eps=1e-06, elementwise_affine=True)
)
(src_emb): Embeddings(
(embeddings): Embedding(32032, 300, padding_idx=1)
(dropout): Dropout(p=0.1, inplace=False)
)
(tgt_emb): Embeddings(
(embeddings): Embedding(32032, 300, padding_idx=1)
(dropout): Dropout(p=0.1, inplace=False)
)
(generator): Linear(in_features=300, out_features=32032, bias=True)
)
[2024-10-23 10:53:50,291 INFO] embeddings: 19219200
[2024-10-23 10:53:50,291 INFO] encoder: 3247800
[2024-10-23 10:53:50,292 INFO] decoder: 5411400
[2024-10-23 10:53:50,292 INFO] generator: 9641632
[2024-10-23 10:53:50,292 INFO] other: 0
[2024-10-23 10:53:50,292 INFO] * number of parameters: 37520032
[2024-10-23 10:53:50,292 INFO] Trainable parameters = {'torch.float32': 37520032}
[2024-10-23 10:53:50,292 INFO] Non trainable parameters = {}
[2024-10-23 10:53:50,292 INFO] * src vocab size = 32032
[2024-10-23 10:53:50,292 INFO] * tgt vocab size = 32032
[2024-10-23 10:53:50,645 INFO] Transforms applied: ['sentencepiece']
[2024-10-23 10:53:50,690 INFO] Starting training on GPU: [0]
[2024-10-23 10:53:50,691 INFO] Start training loop and validate every 500 steps...
[2024-10-23 10:53:50,691 INFO] Scoring with: {'sentencepiece': SentencePieceTransform(share_vocab=False, src_subword_model=wmt17_en_de/spm.model, tgt_subword_model=wmt17_en_de/spm.model, src_subword_alpha=0.0, tgt_subword_alpha=0.0, src_subword_vocab=, tgt_subword_vocab=, src_vocab_threshold=0, tgt_vocab_threshold=0, src_subword_nbest=1, tgt_subword_nbest=1)}
[2024-10-23 10:53:52,678 INFO] Weighted corpora loaded so far:
* corpus_1: 1
[2024-10-23 10:53:54,566 INFO] Weighted corpora loaded so far:
* corpus_1: 1
[2024-10-23 10:53:56,482 INFO] Weighted corpora loaded so far:
* corpus_1: 1
[2024-10-23 10:53:58,389 INFO] Weighted corpora loaded so far:
* corpus_1: 1
[2024-10-23 10:54:44,160 INFO] Step 100/40000; acc: 3.0; ppl: 25848.60; xent: 10.16; aux: 0.000; lr: 3.20e-05; sents: 19541; bsz: 547/ 580/20; 10240/10855 tok/s; 53 sec;
[2024-10-23 10:55:29,911 INFO] Step 200/40000; acc: 6.3; ppl: 10318.65; xent: 9.24; aux: 0.000; lr: 6.39e-05; sents: 19637; bsz: 552/ 583/20; 12070/12752 tok/s; 99 sec;
[2024-10-23 10:56:15,758 INFO] Step 300/40000; acc: 7.2; ppl: 3440.75; xent: 8.14; aux: 0.000; lr: 9.59e-05; sents: 19065; bsz: 543/ 574/19; 11843/12516 tok/s; 145 sec;
[2024-10-23 10:57:01,533 INFO] Step 400/40000; acc: 10.6; ppl: 2115.18; xent: 7.66; aux: 0.000; lr: 1.28e-04; sents: 19562; bsz: 541/ 575/20; 11811/12570 tok/s; 191 sec;
[2024-10-23 10:57:46,767 INFO] Step 500/40000; acc: 12.7; ppl: 1506.12; xent: 7.32; aux: 0.000; lr: 1.60e-04; sents: 19435; bsz: 545/ 579/19; 12046/12796 tok/s; 236 sec;
[2024-10-23 10:57:59,765 INFO] valid stats calculation
took: 12.995728731155396 s.
[2024-10-23 10:59:09,389 INFO] The translation of the valid dataset for dynamic scoring
took : 69.62299942970276 s.
[2024-10-23 10:59:09,389 INFO] UPDATING VALIDATION BLEU
That's 100 lines that end in a tokenized period ('.')
It looks like you forgot to detokenize your test data, which may hurt your score.
If you insist your data is detokenized, or don't care, you can suppress this message with the `force` parameter.
[2024-10-23 10:59:11,227 INFO] validation BLEU: 0.03712901977122912
[2024-10-23 10:59:11,229 INFO] Train perplexity: 4952.25
[2024-10-23 10:59:11,230 INFO] Train accuracy: 7.94452
[2024-10-23 10:59:11,230 INFO] Sentences processed: 97240
[2024-10-23 10:59:11,230 INFO] Average bsz: 546/ 578/19
[2024-10-23 10:59:11,230 INFO] Validation perplexity: 1388.77
[2024-10-23 10:59:11,230 INFO] Validation accuracy: 14.8651
Hi François, I updated to the latest Eole NLP code. Thank you for trying to rewrite my config, but that is not my current setup. I'm not using shuf files, or .src, .trg or .shared files. I'm using .onmt_vocab files converted from SentencePiece-generated .vocab files for the vocabs, and .txt files for the dataset. I tried several configurations, stripping the pretrained embeddings; I mimicked your config (except for the .shuf, .src, .trg and .shared files) and rearranged my config file as follows:
## IO
overwrite: True
seed: 1234
report_every: 100
valid_metrics: ["BLEU"]
### Vocab
src_vocab: processed_data/spm_src-train.onmt_vocab
tgt_vocab: processed_data/spm_tgt-train.onmt_vocab
#n_sample: -1
data:
corpus_1:
path_src: data/src-train.txt
path_tgt: data/tgt-train.txt
valid:
path_src: data/EN-val.txt
path_tgt: data/TY-val.txt
#transforms: [onmt_tokenize]
transforms: [sentencepiece]
transforms_configs:
#normalize:
#src_lang: en
#tgt_lang: ty
#norm_quote_commas: true
#norm_numbers: true
sentencepiece:
#src_subword_type: sentencepiece
src_subword_model: "models/en.wiki.bpe.vs25000.model"
#tgt_subword_type: sentencepiece
tgt_subword_model: "models/spm_tgt-train.model"
#filtertoolong:
#src_seq_length: 512
#tgt_seq_length: 512
#dump_samples: true
#n_samples: 1000
# Number of candidates for SentencePiece sampling
#subword_nbest: 20
# Smoothing parameter for SentencePiece sampling
#subword_alpha: 0.1
training:
# Model configuration
model_path: models
keep_checkpoint: 40
save_checkpoint_steps: 1000
train_steps: 40000
valid_steps: 500
#train_from: models/step_7000
bucket_size: 1024
num_workers: 4
prefetch_factor: 6
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 1024
valid_batch_size: 1024
batch_size_multiple: 1
accum_count: [10]
accum_steps: [0]
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
compute_dtype: fp16
optim: "adam"
self_attn_backend: "pytorch"
learning_rate: 1.4
average_decay: 0.1
warmup_steps: 4000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_method: "xavier_uniform"
#normalization: "tokens"
#early_stopping: 3
tensorboard: true
tensorboard_log_dir: logs
log_file: logs/eole.log
# Pretrained embeddings configuration for the source language
#embeddings_type: word2vec
#src_embeddings: data/en.wiki.bpe.vs25000.d300.w2v.txt
#tgt_embeddings:
save_data: processed_data/
model:
architecture: "transformer"
hidden_size: 300
share_decoder_embeddings: false
share_embeddings: false
layers: 6
heads: 6
transformer_ff: 300
word_vec_size: 300
embeddings:
position_encoding_type: Rotary
#position_encoding: true
The zero BLEU score problem is still here, as well as the accuracy quickly shooting up. Also, when you say "no transforms are used in the scoring process", are we supposed to define transforms in the config file just for scoring?
I'm not using shuf files, or .src, .trg or .shared files. I'm using .onmt_vocab files converted from SentencePiece-generated .vocab files for the vocabs, and .txt files for the dataset.
There is no such thing as ".shuf", ".src", ".trg" or ".shared" files. All of these are just text files. ".shuf" means the data was shuffled (https://github.com/eole-nlp/eole/blob/main/recipes/wmt17/prepare_wmt_ende_data.sh), ".src" and ".trg" are placeholders to identify source/target data, and ".shared" means the vocab was learned on both source and target.
The issue is most probably that you have a mismatch between the vocab you generated and the tokenization produced by the model. Though I'm not really sure I follow here; this issue appeared to be solved in the other conversation.
Quick solution that might work: try running build_vocab to create a vocab from your transformed data instead of relying on the SentencePiece one.
Actual recommendation: you need to understand the basic concepts at stake here, else you won't go very far. And, again, run the WMT17 config and build up from that. Once it runs on your setup, you can do something as simple as switching the source/target data to yours and re-running all the scripts. "I'm using sentencepiece" is not an excuse here, as I just provided updated scripts and configs.
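If it helps, here is a rough sketch (just an illustration, not an Eole utility) of how to check for that kind of vocab/tokenization mismatch on the target side. It reuses the paths from the config above and assumes the .onmt_vocab format is one token (optionally followed by a count) per line; adjust as needed:

# Sketch: tokenize the valid target with the SentencePiece model and count how
# many of the produced pieces are missing from the vocab file fed to Eole.
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("models/spm_tgt-train.model")

vocab = set()
with open("processed_data/spm_tgt-train.onmt_vocab", encoding="utf-8") as f:
    for line in f:
        token = line.rstrip("\n").split("\t")[0].split(" ")[0]
        if token:
            vocab.add(token)

missing, total = 0, 0
with open("data/TY-val.txt", encoding="utf-8") as f:
    for line in f:
        pieces = sp.encode_as_pieces(line.strip())
        total += len(pieces)
        missing += sum(piece not in vocab for piece in pieces)

print(f"{missing}/{total} pieces from the valid target are not in the vocab")

A large ratio here means the vocab and the SentencePiece model do not match, so most tokens end up as <unk>, which would explain both the broken predictions and the 0 BLEU.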
Also, when you say "no transforms are used in the scoring process", are we supposed to define transforms in the config file just for scoring?
No, it's supposed to be grabbed from the general settings. That's one of the reasons that led me to believe there is a mismatch between the config you shared and the one that actually ran, according to the provided logs.
Yeah, that's what I think too. I'll triple-check the models and corresponding vocabs, and rebuild the vocabs if necessary. Thanks, François, for your patience.
Yoohoo!
[2024-10-24 08:27:32,565 INFO] Weighted corpora loaded so far:
* corpus_1: 6
[2024-10-24 08:27:32,989 INFO] Step 900/40000; acc: 30.5; ppl: 115.82; xent: 4.75; aux: 0.000; lr: 3.11e-04; sents: 50818; bsz: 455/ 659/51; 9965/14436 tok/s; 426 sec;
[2024-10-24 08:28:18,549 INFO] Step 1000/40000; acc: 32.2; ppl: 100.39; xent: 4.61; aux: 0.000; lr: 3.46e-04; sents: 45290; bsz: 430/ 616/45; 9436/13516 tok/s; 471 sec;
[2024-10-24 08:28:29,387 INFO] valid stats calculation
took: 10.834146738052368 s.
[2024-10-24 08:28:29,390 WARNING] xavier_uniform initialization does not require param_init (0.1)
[2024-10-24 08:28:43,600 INFO] The translation of the valid dataset for dynamic scoring
took : 14.212685823440552 s.
[2024-10-24 08:28:43,600 INFO] UPDATING VALIDATION BLEU
[2024-10-24 08:28:43,853 INFO] validation BLEU: 0.46592065692117035
[2024-10-24 08:28:43,855 INFO] Train perplexity: 421.161
[2024-10-24 08:28:43,856 INFO] Train accuracy: 20.6643
[2024-10-24 08:28:43,856 INFO] Sentences processed: 492886
[2024-10-24 08:28:43,856 INFO] Average bsz: 442/ 630/49
[2024-10-24 08:28:43,856 INFO] Validation perplexity: 558.424
[2024-10-24 08:28:43,856 INFO] Validation accuracy: 19.4927
[2024-10-24 08:28:43,860 INFO] Saving optimizer and weights to step_1000, and symlink to models
[2024-10-24 08:28:44,530 INFO] Saving transforms artifacts, if any, to models
[2024-10-24 08:28:44,532 INFO] Saving config and vocab to models
[2024-10-24 08:29:15,102 INFO] * Transform statistics for corpus_1(25.00%):
* SubwordStats: 434561 -> 473459 tokens
It now works! I had to rebuild the models and onmt_vocabs. Thank you so much François!!
Hi, I'm training a bilingual model with the Transformer. The BLEU score is not working. I'm using TensorBoard to view the performance.
All the metrics above display data except valid/BLEU. How can I go about getting the BLEU score working?