UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Failed on training embeddings on new language (French) #329

Open dragonlee97 opened 4 years ago

dragonlee97 commented 4 years ago

Following examples/training_multilingual/make_multilingual.py, I want to train a French model, but I hit this error when loading the student model:

word_embedding_model = models.Transformer(student_model_name, max_seq_length=max_seq_length)

OSError:

During handling of the above exception, another exception occurred:

OSError Traceback (most recent call last)

in
      3
      4 logging.info("Create student model from scratch")
----> 5 word_embedding_model = models.Transformer('xlm-roberta-base', max_seq_length=max_seq_length)
      6 # Apply mean pooling to get one fixed sized sentence vector
      7 pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())

/opt/conda/envs/rapids/lib/python3.6/site-packages/sentence_transformers/models/Transformer.py in __init__(self, model_name_or_path, max_seq_length, model_args, cache_dir)
     16 self.max_seq_length = max_seq_length
     17
---> 18 config = AutoConfig.from_pretrained(model_name_or_path, **model_args, cache_dir=cache_dir)
     19 self.auto_model = AutoModel.from_pretrained(model_name_or_path, config=config, cache_dir=cache_dir)
     20 self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, cache_dir=cache_dir)

/opt/conda/envs/rapids/lib/python3.6/site-packages/transformers/configuration_auto.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
    201
    202 """
--> 203 config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
    204
    205 if "model_type" in config_dict:

/opt/conda/envs/rapids/lib/python3.6/site-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
    249 f"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing a {CONFIG_NAME} file\n\n"
    250 )
--> 251 raise EnvironmentError(msg)
    252
    253 except json.JSONDecodeError:

OSError: Can't load config for 'xlm-roberta-base'. Make sure that:
- 'xlm-roberta-base' is a correct model identifier listed on 'https://huggingface.co/models'
- or 'xlm-roberta-base' is the correct path to a directory containing a config.json file

**Is this caused by an intranet restriction? I have upgraded all the Python packages and models.** Alternatively, is there a pre-trained model just for French? I tried the multilingual model, but it gives bad embeddings.

nreimers commented 4 years ago

Hi @dragonlee97

Multilingual models that include French are currently being trained.

It appears that transformers cannot download the model. Is your transformers version the most recent one? Can the system access the HuggingFace S3 bucket?

Try this code:

from transformers import AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

model = AutoModelWithLMHead.from_pretrained("xlm-roberta-base")
dragonlee97 commented 4 years ago

Yes, my transformers version is 3.0.2. I think the cause is the internet restriction at my company. How can I test access to the S3 bucket? Otherwise, is it possible to download the models manually?

nreimers commented 4 years ago

You can test it with the code above: if it loads the tokenizer and the model, the download host is reachable.
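
For a lighter reachability test that does not download any model weights, the following is a minimal sketch using only the Python standard library. The helper name `can_fetch` is hypothetical, and the URL is simply the model page mentioned in this thread; the host that transformers actually downloads from may differ, so a success here is only a hint, not proof that `from_pretrained` will work.

```python
import urllib.error
import urllib.request


def can_fetch(url, timeout=10.0):
    """Return (ok, detail) for a single HEAD request to `url`.

    Hypothetical helper for illustration; it is not part of transformers
    or sentence-transformers.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return True, f"HTTP {response.status}"
    except urllib.error.HTTPError as err:
        # The host answered, but with an error status (for example a proxy block page).
        return False, f"HTTP {err.code}"
    except OSError as err:
        # Covers URLError, DNS failures, refused connections and timeouts.
        return False, f"network error: {err}"


if __name__ == "__main__":
    # Assumption: the model page from this thread is used as the probe target.
    ok, detail = can_fetch("https://huggingface.co/xlm-roberta-base", timeout=5.0)
    print("reachable" if ok else "not reachable", "-", detail)
```

If the check fails behind a corporate proxy, the detail string usually indicates whether the request was blocked by the proxy (an HTTP status) or never left the network (a network error).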

Otherwise, you can find the model here: https://huggingface.co/xlm-roberta-base

There is a 'List all files' link on that page, where you can download the files needed for this model.
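
Once the files are downloaded, the model can be loaded from a local directory instead of by name, as the error message above suggests. The sketch below checks the directory first and then passes it to `models.Transformer`. The helper names, the list of required file names, and the default `max_seq_length` of 128 are assumptions for illustration; compare the file list against what the model page actually offers.

```python
from pathlib import Path

# Assumed file names for a manually downloaded xlm-roberta-base checkpoint;
# verify them against the files listed on the model page.
REQUIRED_FILES = ("config.json", "pytorch_model.bin", "sentencepiece.bpe.model")


def missing_model_files(model_dir, required=REQUIRED_FILES):
    """Return the required file names that are absent from `model_dir`."""
    directory = Path(model_dir)
    return [name for name in required if not (directory / name).is_file()]


def load_local_transformer(model_dir, max_seq_length=128):
    """Load a sentence-transformers Transformer module from a local directory.

    Hypothetical helper; max_seq_length=128 is a placeholder value.
    """
    missing = missing_model_files(model_dir)
    if missing:
        raise FileNotFoundError(f"{model_dir} is missing: {', '.join(missing)}")
    # Imported lazily so the file check also works where the library is not installed.
    from sentence_transformers import models

    # A directory path is accepted in place of the model identifier.
    return models.Transformer(str(model_dir), max_seq_length=max_seq_length)
```

With the files in place, `load_local_transformer("/path/to/xlm-roberta-base")` would replace the `models.Transformer('xlm-roberta-base', ...)` call in the training script; the path shown is a placeholder.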