UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

Error when loading "paraphrase-multilingual-MiniLM-L12-v2" after a fine-tuning step #1019

Closed (lucasBOYER closed this issue 2 years ago)

lucasBOYER commented 3 years ago

Hi! First, thanks a lot for all your continuous work on this library; it provides so many valuable features!

Context of the issue

I've run into an issue with the "paraphrase-multilingual-MiniLM-L12-v2" model, though. I'm fine-tuning it on my domain-specific dataset using a triplet loss, but I get an error when I try to load the fine-tuned version of the model elsewhere.

I ran the exact same pipeline with two other multilingual models for benchmarking purposes, "paraphrase-multilingual-mpnet-base-v2" and "distiluse-base-multilingual-cased-v1". The error did not occur for either of those.

I suspect it is related to the model's tokenizer config, as described in other issues (#1010, #975), but perhaps also to the way the model is saved after a call to the fit method.

I'm using version 1.2.0 of the library.

Traceback

Assuming I've specified "./batch-hard-miniLM" as the output_path when fitting, here is the complete traceback after I run model = SentenceTransformer("batch-hard-miniLM"):

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-27-246a0c877bce> in <module>
----> 1 model = SentenceTransformer("batch-hard-miniLM")

/opt/conda/lib/python3.9/site-packages/sentence_transformers/SentenceTransformer.py in __init__(self, model_name_or_path, modules, device)
    119                 for module_config in contained_modules:
    120                     module_class = import_from_string(module_config['type'])
--> 121                     module = module_class.load(os.path.join(model_path, module_config['path']))
    122                     modules[module_config['name']] = module
    123 

/opt/conda/lib/python3.9/site-packages/sentence_transformers/models/Transformer.py in load(input_path)
    109         with open(sbert_config_path) as fIn:
    110             config = json.load(fIn)
--> 111         return Transformer(model_name_or_path=input_path, **config)
    112 
    113 

/opt/conda/lib/python3.9/site-packages/sentence_transformers/models/Transformer.py in __init__(self, model_name_or_path, max_seq_length, model_args, cache_dir, tokenizer_args, do_lower_case)
     27         config = AutoConfig.from_pretrained(model_name_or_path, **model_args, cache_dir=cache_dir)
     28         self.auto_model = AutoModel.from_pretrained(model_name_or_path, config=config, cache_dir=cache_dir)
---> 29         self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, cache_dir=cache_dir, **tokenizer_args)
     30 
     31 

/opt/conda/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    421             tokenizer_class_py, tokenizer_class_fast = TOKENIZER_MAPPING[type(config)]
    422             if tokenizer_class_fast and (use_fast or tokenizer_class_py is None):
--> 423                 return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    424             else:
    425                 if tokenizer_class_py is not None:

/opt/conda/lib/python3.9/site-packages/transformers/tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
   1707                 logger.info(f"loading file {file_path} from cache at {resolved_vocab_files[file_id]}")
   1708 
-> 1709         return cls._from_pretrained(
   1710             resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs
   1711         )

/opt/conda/lib/python3.9/site-packages/transformers/tokenization_utils_base.py in _from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs)
   1720         has_tokenizer_file = resolved_vocab_files.get("tokenizer_file", None) is not None
   1721         if (from_slow or not has_tokenizer_file) and cls.slow_tokenizer_class is not None:
-> 1722             slow_tokenizer = (cls.slow_tokenizer_class)._from_pretrained(
   1723                 copy.deepcopy(resolved_vocab_files),
   1724                 pretrained_model_name_or_path,

/opt/conda/lib/python3.9/site-packages/transformers/tokenization_utils_base.py in _from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs)
   1779         # Instantiate tokenizer.
   1780         try:
-> 1781             tokenizer = cls(*init_inputs, **init_kwargs)
   1782         except OSError:
   1783             raise OSError(

/opt/conda/lib/python3.9/site-packages/transformers/models/bert/tokenization_bert.py in __init__(self, vocab_file, do_lower_case, do_basic_tokenize, never_split, unk_token, sep_token, pad_token, cls_token, mask_token, tokenize_chinese_chars, strip_accents, **kwargs)
    191         )
    192 
--> 193         if not os.path.isfile(vocab_file):
    194             raise ValueError(
    195                 f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained "

/opt/conda/lib/python3.9/genericpath.py in isfile(path)
     28     """Test whether a path is a regular file"""
     29     try:
---> 30         st = os.stat(path)
     31     except (OSError, ValueError):
     32         return False

TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType
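
The TypeError at the bottom means the reload takes the slow-tokenizer path (see the has_tokenizer_file check in the traceback) and BertTokenizer then receives vocab_file=None, since this model's sentencepiece-based tokenizer has no vocab.txt. As a quick diagnostic (not part of my original script; "batch-hard-miniLM" is the output_path from the fine-tuning code below), listing what the save step actually wrote makes this visible:

import os

model_dir = "batch-hard-miniLM"  # the output_path passed to fit()

# Walk the saved model and print every file; the point is to check
# whether a usable tokenizer file (e.g. tokenizer.json) was written,
# because without one the reload falls back to the slow BertTokenizer.
for root, _, files in os.walk(model_dir):
    for name in files:
        print(os.path.join(root, name))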

Illustration of the fine-tuning code used

The code I ran for fine-tuning:

from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.readers import InputExample
from torch.utils.data import DataLoader

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# some_df is my domain-specific DataFrame with "content" and "label" columns
inputs = [
    InputExample(texts=[row.content], label=row.label)
    for i, row in some_df.loc[:, ["content", "label"]].iterrows()
]

train_dataloader = DataLoader(inputs, shuffle=True, batch_size=64)
epochs = 2
warmup_steps = int(len(train_dataloader) * epochs * 0.1)  # 10% of training steps
train_loss = losses.BatchHardTripletLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=epochs,
    warmup_steps=warmup_steps,
    output_path="batch-hard-miniLM",
)
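
(For illustration only; this was not in my original script. A reload right after fitting surfaces the error immediately instead of "elsewhere":)

# Hypothetical sanity check: reload the freshly saved model and encode
# one sentence, so a broken save fails fast.
reloaded = SentenceTransformer("batch-hard-miniLM")
print(reloaded.encode(["a quick smoke test"]).shape)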

Am I doing something wrong here? How can I fix it? Thanks :)

nreimers commented 3 years ago

Yes, somehow this model causes quite a few problems because of its unusual combination of model architecture and tokenizer.

Did you try updating tokenizers to the latest version:

pip install -U tokenizers

lucasBOYER commented 3 years ago

Thanks for the quick reply! I've just updated tokenizers to version 0.10.3 and tried loading the model again, but sadly the same error still pops up.

nreimers commented 3 years ago

Can you also update transformers to the latest version?
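
pip install -U transformers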

lucasBOYER commented 3 years ago

I've updated the transformers lib to 4.7.0, retrained the model, and now everything works fine. It was as simple as that, it seems! Thanks a lot :)

nreimers commented 3 years ago

Happy to hear that.

Yes, older versions of transformers / tokenizers have some issues with this mixture of model architecture & tokenizer.
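
If upgrading is not an option, a possible workaround (a sketch along the lines of similar issues, not verified here) is to re-save the original fast tokenizer into the fine-tuned model folder, so that a tokenizer.json exists and the slow-tokenizer fallback is never taken:

from transformers import AutoTokenizer

# Hypothetical workaround: fetch the fast tokenizer from the original
# checkpoint and write its files (including tokenizer.json) into the
# fine-tuned model directory. Depending on the sentence-transformers
# version, the correct target may be the Transformer module subfolder
# (e.g. "0_Transformer") rather than the root.
tok = AutoTokenizer.from_pretrained(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2", use_fast=True
)
tok.save_pretrained("batch-hard-miniLM")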

vjeronymo2 commented 2 years ago

I can also confirm that updating the transformers lib (to 4.11.3) solved the problem for me.