Open randomicity opened 4 months ago
Okay, apparently these are already in pull requests; I will test those.
Okay, I tried the modifications that are not yet committed, and I got this error:
The model has num_kv: 8 and heads: 16, which I understand is the setup for GQA (grouped-query attention).
Post your YAML file (the one used for OpenNMT-py training).
[2024-05-17 15:57:29,656 INFO] Loading checkpoint from Saved_Data/Models/fr_en_step_175000.pt
[2024-05-17 15:57:31,659 INFO] Building model...
[2024-05-17 15:57:31,930 INFO] Switching model to float32 for amp/apex_amp
[2024-05-17 15:57:31,930 INFO] Non quantized layer compute is fp16
[2024-05-17 15:57:32,169 INFO] NMTModel(
(encoder): TransformerEncoder(
(embeddings): Embeddings(
(make_embedding): Sequential(
(emb_luts): Elementwise(
(0): Embedding(32000, 1024, padding_idx=1)
)
)
(dropout): Dropout(p=0.1, inplace=False)
)
(transformer): ModuleList(
(0-5): 6 x TransformerEncoderLayer(
(self_attn): MultiHeadedAttention(
(linear_keys): Linear(in_features=1024, out_features=512, bias=False)
(linear_values): Linear(in_features=1024, out_features=512, bias=False)
(linear_query): Linear(in_features=1024, out_features=1024, bias=False)
(softmax): Softmax(dim=-1)
(dropout): Dropout(p=0.1, inplace=False)
(final_linear): Linear(in_features=1024, out_features=1024, bias=False)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=1024, out_features=4096, bias=False)
(w_2): Linear(in_features=4096, out_features=1024, bias=False)
(dropout_1): Dropout(p=0.1, inplace=False)
(dropout_2): Dropout(p=0.1, inplace=False)
)
(layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
)
(decoder): TransformerDecoder(
(embeddings): Embeddings(
(make_embedding): Sequential(
(emb_luts): Elementwise(
(0): Embedding(32000, 1024, padding_idx=1)
)
)
(dropout): Dropout(p=0.1, inplace=False)
)
(layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
(transformer_layers): ModuleList(
(0-5): 6 x TransformerDecoderLayer(
(self_attn): MultiHeadedAttention(
(linear_keys): Linear(in_features=1024, out_features=512, bias=False)
(linear_values): Linear(in_features=1024, out_features=512, bias=False)
(linear_query): Linear(in_features=1024, out_features=1024, bias=False)
(softmax): Softmax(dim=-1)
(dropout): Dropout(p=0.1, inplace=False)
(final_linear): Linear(in_features=1024, out_features=1024, bias=False)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=1024, out_features=4096, bias=False)
(w_2): Linear(in_features=4096, out_features=1024, bias=False)
(dropout_1): Dropout(p=0.1, inplace=False)
(dropout_2): Dropout(p=0.1, inplace=False)
)
(layer_norm_1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
(context_attn): MultiHeadedAttention(
(linear_keys): Linear(in_features=1024, out_features=512, bias=False)
(linear_values): Linear(in_features=1024, out_features=512, bias=False)
(linear_query): Linear(in_features=1024, out_features=1024, bias=False)
(softmax): Softmax(dim=-1)
(dropout): Dropout(p=0.1, inplace=False)
(final_linear): Linear(in_features=1024, out_features=1024, bias=False)
)
(layer_norm_2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
)
)
)
(generator): Linear(in_features=1024, out_features=32000, bias=True)
)
[2024-05-17 15:57:32,171 INFO] encoder: 101988352
[2024-05-17 15:57:32,171 INFO] decoder: 153675008
[2024-05-17 15:57:32,171 INFO] * number of parameters: 255663360
[2024-05-17 15:57:32,171 INFO] Trainable parameters = {'torch.float32': 255663360, 'torch.float16': 0, 'torch.uint8': 0, 'torch.int8': 0}
[2024-05-17 15:57:32,171 INFO] Non trainable parameters = {'torch.float32': 0, 'torch.float16': 0, 'torch.uint8': 0, 'torch.int8': 0}
[2024-05-17 15:57:32,171 INFO] * src vocab size = 32000
[2024-05-17 15:57:32,171 INFO] * tgt vocab size = 32000
[2024-05-17 15:57:32,829 INFO] Starting training on GPU: [0]
[2024-05-17 15:57:32,830 INFO] Start training loop and validate every 10000 steps...
[2024-05-17 15:57:32,830 INFO] Scoring with: ['sentencepiece', 'filtertoolong']
# Optimization
model_dtype: "fp16"
optim: "adam"
learning_rate: 2.0
warmup_steps: 50000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 5
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"
# Model
encoder_type: transformer
decoder_type: transformer
self_attn_type: scaled-dot-flash
position_encoding: false
parallel_residual: true
shared_layer_norm: true
multiquery: true
num_kv: 8
max_relative_positions: -1
pos_ffn_activation_fn: "relu"
enc_layers: 6
dec_layers: 6
heads: 16
hidden_size: 1024
word_vec_size: 1024
transformer_ff: 4096
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
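As a sanity check (this is my assumption about how GQA sizes the projections), these settings line up with the module sizes printed in the log above, where linear_keys/linear_values have out_features=512 while linear_query has 1024:

```python
# GQA projection sizes implied by the config above:
# queries keep all heads, keys/values are shared across groups.
hidden_size = 1024
heads = 16
num_kv = 8

head_dim = hidden_size // heads          # 1024 / 16 = 64
q_out_features = heads * head_dim        # 16 * 64 = 1024 (linear_query)
kv_out_features = num_kv * head_dim      # 8 * 64 = 512 (linear_keys/values)
print(q_out_features, kv_out_features)   # 1024 512
```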
Yeah, it's unclear between multiquery and num_kv. You need to force multiquery to false in your checkpoint: load it manually, change it, and save the checkpoint again. multiquery=True should only be set for num_kv=1, but I know it's unclear in the docs/examples.
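That checkpoint edit could look something like this (a sketch, assuming the usual OpenNMT-py layout where the training options live under the checkpoint's "opt" key; the path is from this thread):

```python
import torch

def force_multiquery_false(ckpt_path: str) -> None:
    """Flip multiquery off in a saved OpenNMT-py checkpoint and save it back."""
    # weights_only=False because the checkpoint stores an options Namespace,
    # not just tensors.
    checkpoint = torch.load(ckpt_path, map_location="cpu", weights_only=False)
    # With GQA (num_kv > 1) multiquery must be False; True is only for num_kv=1.
    checkpoint["opt"].multiquery = False
    torch.save(checkpoint, ckpt_path)

# force_multiquery_false("Saved_Data/Models/fr_en_step_175000.pt")
```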
I know it's hard to document everything; thank you for your immense work, Vincent. On the OpenNMT-py side, setting num_kv to half the number of heads seems to work fine; I tried some short runs.
Hello,
I'm trying to convert a model with CTranslate2 4.3 that has been trained with OpenNMT-py 3.5.1 but I get this error:
Converting Saved_Data/Models/fr_en_step_195000.pt to ctranslate2 format...
Traceback (most recent call last):
File "/home/username/anaconda3/envs/neu/bin/ct2-opennmt-py-converter", line 8, in <module>
sys.exit(main())
File "/home/username/anaconda3/envs/neu/lib/python3.10/site-packages/ctranslate2/converters/opennmt_py.py", line 355, in main
OpenNMTPyConverter(args.model_path).convert_from_args(args)
File "/home/username/anaconda3/envs/neu/lib/python3.10/site-packages/ctranslate2/converters/converter.py", line 50, in convert_from_args
return self.convert(
File "/home/username/anaconda3/envs/neu/lib/python3.10/site-packages/ctranslate2/converters/converter.py", line 89, in convert
model_spec = self._load()
File "/home/username/anaconda3/envs/neu/lib/python3.10/site-packages/ctranslate2/converters/opennmt_py.py", line 181, in _load
check_opt(checkpoint["opt"], num_source_embeddings=len(src_vocabs))
File "/home/username/anaconda3/envs/neu/lib/python3.10/site-packages/ctranslate2/converters/opennmt_py.py", line 55, in check_opt
check.validate()
File "/home/username/anaconda3/envs/neu/lib/python3.10/site-packages/ctranslate2/converters/utils.py", line 106, in validate
raise_unsupported(self._unsupported_reasons)
File "/home/username/anaconda3/envs/neu/lib/python3.10/site-packages/ctranslate2/converters/utils.py", line 93, in raise_unsupported
raise ValueError(message)
ValueError: The model you are trying to convert is not supported by CTranslate2. We identified the following reasons:
I trained the model using Flash Attention in OpenNMT-py 3.5.1:
self_attn_type: scaled-dot-flash
If I modify the opennmt_py.py converter to accept scaled-dot-flash (by replacing scaled-dot with it), I then get this error:
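An alternative to patching the converter is to rewrite self_attn_type in the checkpoint itself, so the stock converter sees the value it expects. This is a sketch under the assumption that scaled-dot-flash and scaled-dot compute the same attention (FlashAttention changes the kernel, not the math), so the exported weights are unaffected:

```python
import torch

def rewrite_attn_type(ckpt_path: str, new_type: str = "scaled-dot") -> None:
    """Replace self_attn_type in an OpenNMT-py checkpoint's saved options."""
    # weights_only=False: the "opt" entry is a pickled options Namespace.
    checkpoint = torch.load(ckpt_path, map_location="cpu", weights_only=False)
    checkpoint["opt"].self_attn_type = new_type
    torch.save(checkpoint, ckpt_path)

# rewrite_attn_type("Saved_Data/Models/fr_en_step_195000.pt")
```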
Traceback (most recent call last):
File "/home/username/anaconda3/envs/neu/bin/ct2-opennmt-py-converter", line 8, in <module>
sys.exit(main())
File "/home/username/anaconda3/envs/neu/lib/python3.10/site-packages/ctranslate2/converters/opennmt_py.py", line 355, in main
OpenNMTPyConverter(args.model_path).convert_from_args(args)
File "/home/username/anaconda3/envs/neu/lib/python3.10/site-packages/ctranslate2/converters/converter.py", line 50, in convert_from_args
return self.convert(
File "/home/username/anaconda3/envs/neu/lib/python3.10/site-packages/ctranslate2/converters/converter.py", line 89, in convert
model_spec = self._load()
File "/home/username/anaconda3/envs/neu/lib/python3.10/site-packages/ctranslate2/converters/opennmt_py.py", line 200, in _load
return _get_model_spec_seq2seq(
File "/home/username/anaconda3/envs/neu/lib/python3.10/site-packages/ctranslate2/converters/opennmt_py.py", line 90, in _get_model_spec_seq2seq
set_transformer_spec(model_spec, variables)
File "/home/username/anaconda3/envs/neu/lib/python3.10/site-packages/ctranslate2/converters/opennmt_py.py", line 210, in set_transformer_spec
set_transformer_encoder(spec.encoder, variables)
File "/home/username/anaconda3/envs/neu/lib/python3.10/site-packages/ctranslate2/converters/opennmt_py.py", line 215, in set_transformer_encoder
set_input_layers(spec, variables, "encoder")
File "/home/username/anaconda3/envs/neu/lib/python3.10/site-packages/ctranslate2/converters/opennmt_py.py", line 241, in set_input_layers
set_position_encodings(
File "/home/username/anaconda3/envs/neu/lib/python3.10/site-packages/ctranslate2/converters/opennmt_py.py", line 341, in set_position_encodings
spec.encodings = _get_variable(variables, "%s.pe" % scope).squeeze()
File "/home/username/anaconda3/envs/neu/lib/python3.10/site-packages/ctranslate2/converters/opennmt_py.py", line 345, in _get_variable
return variables[name]
KeyError: 'encoder.embeddings.make_embedding.pe.pe'
Probably because it can't handle RoPE (rotary position embeddings); my settings are:
position_encoding: false
max_relative_positions: -1
The model trains and inferences without problems in OpenNMT-py.