Failed on training embeddings on new language (French)

dragonlee97 commented 4 years ago

Following the example in examples/training_multilingual/make_multilingual.py, I want to train a French model, but I get an error when loading the student model:

word_embedding_model = models.Transformer(student_model_name, max_seq_length=max_seq_length)

OSError:

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
in
      3
      4 logging.info("Create student model from scratch")
----> 5 word_embedding_model = models.Transformer('xlm-roberta-base', max_seq_length=max_seq_length)
      6 # Apply mean pooling to get one fixed sized sentence vector
      7 pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())

/opt/conda/envs/rapids/lib/python3.6/site-packages/sentence_transformers/models/Transformer.py in __init__(self, model_name_or_path, max_seq_length, model_args, cache_dir)
     16         self.max_seq_length = max_seq_length
     17
---> 18         config = AutoConfig.from_pretrained(model_name_or_path, **model_args, cache_dir=cache_dir)
     19         self.auto_model = AutoModel.from_pretrained(model_name_or_path, config=config, cache_dir=cache_dir)
     20         self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, cache_dir=cache_dir)

/opt/conda/envs/rapids/lib/python3.6/site-packages/transformers/configuration_auto.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
    201
    202         """
--> 203         config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
    204
    205         if "model_type" in config_dict:

/opt/conda/envs/rapids/lib/python3.6/site-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
    249                 f"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing a {CONFIG_NAME} file\n\n"
    250             )
--> 251             raise EnvironmentError(msg)
    252
    253         except json.JSONDecodeError:

OSError: Can't load config for 'xlm-roberta-base'. Make sure that:

- 'xlm-roberta-base' is a correct model identifier listed on 'https://huggingface.co/models'

- or 'xlm-roberta-base' is the correct path to a directory containing a config.json file

Is this caused by an intranet restriction? I have upgraded all the Python packages and models. Alternatively, is there a pre-trained model just for French? I tried the multilingual model, but it gives bad embeddings.

nreimers commented 4 years ago

Hi @dragonlee97

Multilingual models that include French are currently being trained.

It appears that transformers cannot download the model. Is your transformers version the most recent one? Can the system access the HuggingFace S3 bucket?

Try this code:

from transformers import AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

model = AutoModelWithLMHead.from_pretrained("xlm-roberta-base")
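If the snippet above fails, a check that does not involve transformers at all can help tell a blocked network apart from a library problem. The following is only a sketch using the Python standard library: `can_reach` and `report` are hypothetical helper names, and the URLs mentioned below are simply the ones that appear in this thread. The host that transformers actually downloads from depends on the installed version, so a successful check here is a hint and not proof that the download will work.

```python
import urllib.error
import urllib.request


def can_reach(url, timeout=10.0):
    """Return True if an HTTP(S) request to `url` gets any answer from the server."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server replied with an error status (e.g. 403 or 404),
        # so the host itself is reachable from this machine.
        return True
    except OSError:
        # DNS failure, refused connection, timeout, or a proxy blocking the request.
        return False


def report(urls, check=can_reach):
    """Map each URL to True (reachable) or False (not reachable)."""
    return {url: check(url) for url in urls}


# Example call (URLs taken from this thread, not necessarily the download host):
# print(report(["https://huggingface.co/models", "https://huggingface.co/xlm-roberta-base"]))
```

If a URL is reported as not reachable from the company network but opens fine in a browser, a proxy is a likely cause. urllib picks up the `http_proxy` and `https_proxy` environment variables, so setting them before running the check may change the result.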

dragonlee97 commented 4 years ago

Yes, my transformers version is 3.0.2. I think the cause is the internet restrictions at my company. How can I test access to the S3 bucket? Otherwise, is it possible to download the models manually?

nreimers commented 4 years ago

You can test the access with the code above.

Otherwise, you can find the model here: https://huggingface.co/xlm-roberta-base

There is a 'List all files' link on that page, where you can download the files needed for this model.
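Once the files are downloaded, the error message quoted earlier suggests the way forward: pass the path of a directory containing config.json instead of the model name. Below is a rough sketch of that idea, not an official recipe. The helper names `missing_model_files` and `load_local_student` are hypothetical, the file names in `REQUIRED_FILES` are an assumption about what xlm-roberta-base needs (check them against the 'List all files' page, and rename the downloads if they carry a different name), and the default `max_seq_length=128` is only a placeholder.

```python
import os

# Assumed file names for xlm-roberta-base: configuration, PyTorch weights,
# and the SentencePiece vocabulary. Verify against the model's file listing.
REQUIRED_FILES = ("config.json", "pytorch_model.bin", "sentencepiece.bpe.model")


def missing_model_files(model_dir, required=REQUIRED_FILES):
    """Return the names in `required` that are not files inside `model_dir`."""
    return [name for name in required
            if not os.path.isfile(os.path.join(model_dir, name))]


def load_local_student(model_dir, max_seq_length=128):
    """Create the word embedding model from a local directory, without any download."""
    missing = missing_model_files(model_dir)
    if missing:
        raise FileNotFoundError(
            "Missing in {}: {}".format(model_dir, ", ".join(missing)))
    # Imported here so the file check above also works on a machine
    # where sentence-transformers is not installed.
    from sentence_transformers import models
    return models.Transformer(model_dir, max_seq_length=max_seq_length)
```

The returned object can then take the place of the `models.Transformer('xlm-roberta-base', ...)` call in the training script. Checking for the files first gives a clearer message than the generic "Can't load config" error when something is missing from the directory.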

UKPLab / sentence-transformers

Failed on training embeddings on new language (French) #329