eole-nlp / eole

Open language modeling toolkit based on PyTorch
https://eole-nlp.github.io/eole
MIT License

fastText pre-trained embeddings supported? Pydantic error: Extra inputs are not permitted #88

Open HURIMOZ opened 3 weeks ago

HURIMOZ commented 3 weeks ago

Hi everyone, I want to use pre-trained fastText embeddings truncated to 256 dimensions. I get this error:

(TY-EN) ubuntu@ip-172-31-2-199:~/TY-EN/eole/recipes/wmt17$ eole train --config wmt17_enty.yaml
Traceback (most recent call last):
  File "/home/ubuntu/TY-EN/TY-EN/bin/eole", line 33, in <module>
    sys.exit(load_entry_point('eole', 'console_scripts', 'eole')())
  File "/home/ubuntu/TY-EN/eole/eole/bin/main.py", line 39, in main
    bin_cls.run(args)
  File "/home/ubuntu/TY-EN/eole/eole/bin/run/train.py", line 68, in run
    config = cls.build_config(args)
  File "/home/ubuntu/TY-EN/eole/eole/bin/run/__init__.py", line 42, in build_config
    config = cls.config_class(**config_dict)
  File "/home/ubuntu/TY-EN/TY-EN/lib/python3.10/site-packages/pydantic/main.py", line 193, in __init__
    self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 2 validation errors for TrainConfig
model.transformer.embeddings.embeddings_type
  Extra inputs are not permitted [type=extra_forbidden, input_value='word2vec', input_type=str]
    For further information visit https://errors.pydantic.dev/2.8/v/extra_forbidden
model.transformer.embeddings.src_embeddings
  Extra inputs are not permitted [type=extra_forbidden, input_value='data/cc.en.256.txt', input_type=str]
    For further information visit https://errors.pydantic.dev/2.8/v/extra_forbidden

Are pre-trained embeddings supported? If so, is word2vec supported?

Here is the relevant part of my config file:

model:
    architecture: "transformer"
    hidden_size: 256
    share_decoder_embeddings: true
    share_embeddings: false
    layers: 6
    heads: 8
    transformer_ff: 256
    # Pre-trained embeddings configuration for the source language
    embeddings:
        word_vec_size: 256
        position_encoding_type: "SinusoidalInterleaved"
        embeddings_type: "word2vec"
        src_embeddings: data/cc.en.256.txt
francoishernandez commented 3 weeks ago

It should in theory be supported, but it has not been extensively retested in quite some time. I think your main issue here is that these options should not go under the model's embeddings config, but at the "root" level of the config. (Such things should be made clearer, I agree.) See the corresponding config definitions around here: https://github.com/eole-nlp/eole/blob/5120fdbd06132cd7d16b9fe65384c2affe95b199/eole/config/data.py#L52-L54
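For reference, here is a minimal, untested sketch of how the config above might be rearranged along those lines. The field names and values are copied from the original config and the error message; the assumption (per the comment above, not verified against the schema) is that embeddings_type and src_embeddings are accepted at the top level of the config rather than under model:

# Untested sketch: pre-trained embedding options moved out of
# model.embeddings to the root level of the config.
embeddings_type: "word2vec"
src_embeddings: data/cc.en.256.txt

model:
    architecture: "transformer"
    hidden_size: 256
    share_decoder_embeddings: true
    share_embeddings: false
    layers: 6
    heads: 8
    transformer_ff: 256
    embeddings:
        word_vec_size: 256
        position_encoding_type: "SinusoidalInterleaved"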