Closed · yuxinxu77 closed this issue 7 months ago
If `save_sentence_transformers_lib` isn't necessary for future work, then we can skip the following part. But just in case somebody is curious about the error, here is some info:
```
OSError: No such device (os error 19)

File /databricks/python/lib/python3.10/site-packages/transformers/trainer.py:2842, in Trainer.save_model(self, output_dir, _internal_call)
   2839     self.model_wrapped.save_checkpoint(output_dir)
   2841 elif self.args.should_save:
-> 2842     self._save(output_dir)
   2844 # Push to the Hub when `save_model` is called by the user.
   2845 if self.args.push_to_hub and not _internal_call:

File /Workspace/Users/user1/FlagEmbedding/finetune/trainer.py:37, in BiTrainer._save(self, output_dir, state_dict)
     35 # save the checkpoint for sentence-transformers library
     36 if self.is_world_process_zero():
---> 37     save_ckpt_for_sentence_transformers(output_dir,
     38                                         pooling_mode=self.args.sentence_pooling_method,
     39                                         normlized=self.args.normlized)

File /Workspace/Users/user1/FlagEmbedding/finetune/trainer.py:7, in save_ckpt_for_sentence_transformers(ckpt_dir, pooling_mode, normlized)
      5 def save_ckpt_for_sentence_transformers(ckpt_dir, pooling_mode: str = 'cls', normlized: bool = True):
      6     print(f"ckpt_dir: {ckpt_dir}")
----> 7     word_embedding_model = models.Transformer(ckpt_dir)
      8     pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode=pooling_mode)
      9     if normlized:

File /databricks/python/lib/python3.10/site-packages/sentence_transformers/models/Transformer.py:29, in Transformer.__init__(self, model_name_or_path, max_seq_length, model_args, cache_dir, tokenizer_args, do_lower_case, tokenizer_name_or_path)
     26 self.do_lower_case = do_lower_case
     28 config = AutoConfig.from_pretrained(model_name_or_path, **model_args, cache_dir=cache_dir)
---> 29 self._load_model(model_name_or_path, config, cache_dir)
     31 self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path if tokenizer_name_or_path is not None else model_name_or_path, cache_dir=cache_dir, **tokenizer_args)
     33 # No max_seq_length set. Try to infer from model

File /databricks/python/lib/python3.10/site-packages/sentence_transformers/models/Transformer.py:49, in Transformer._load_model(self, model_name_or_path, config, cache_dir)
     47     self._load_t5_model(model_name_or_path, config, cache_dir)
     48 else:
---> 49     self.auto_model = AutoModel.from_pretrained(model_name_or_path, config=config, cache_dir=cache_dir)

File /databricks/python/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py:566, in _BaseAutoModelClass.from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
    564 elif type(config) in cls._model_mapping.keys():
    565     model_class = _get_model_class(config, cls._model_mapping)
--> 566     return model_class.from_pretrained(
    567         pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, **kwargs
    568     )
    569 raise ValueError(
    570     f"Unrecognized configuration class {config.__class__} for this kind of AutoModel: {cls.__name__}.\n"
    571     f"Model type should be one of {', '.join(c.__name__ for c in cls._model_mapping.keys())}."
    572 )

File /databricks/python_shell/dbruntime/huggingface_patches/transformers.py:21, in _create_patch_function.

File /databricks/python/lib/python3.10/site-packages/transformers/modeling_utils.py:3359, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, *model_args, **kwargs)
   3339 resolved_archive_file, sharded_metadata = get_checkpoint_shard_files(
   3340     pretrained_model_name_or_path,
   3341     resolved_archive_file,
    (...)
   3351     _commit_hash=commit_hash,
   3352 )
   3354 if (
   3355     is_safetensors_available()
   3356     and isinstance(resolved_archive_file, str)
   3357     and resolved_archive_file.endswith(".safetensors")
   3358 ):
-> 3359     with safe_open(resolved_archive_file, framework="pt") as f:
   3360         metadata = f.metadata()
   3362     if metadata.get("format") == "pt":
```
The error is quite interesting... Since the error occurs inside transformers' `AutoModel` class, I directly imported `transformers.AutoModel` and called the `from_pretrained` method, and it works flawlessly. Not sure why it's having a problem when called by the `sentence-transformers` package.
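For reference, the quick check described above looks roughly like this (a minimal sketch; `output_dir` stands for the checkpoint directory that `trainer.save_model()` wrote, and the path is purely illustrative):

```python
from transformers import AutoModel, AutoTokenizer

# Illustrative path: the directory containing the safetensors checkpoint
# written by trainer.save_model() (the same directory the traceback points at).
output_dir = "/Workspace/Users/user1/output/finetuned-model"

# Loading directly through transformers works fine...
model = AutoModel.from_pretrained(output_dir)
tokenizer = AutoTokenizer.from_pretrained(output_dir)
print(model.config.model_type)

# ...while the same directory fails when loaded via
# sentence_transformers.models.Transformer (see traceback above).
```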
@yuxinxu77, `save_ckpt_for_sentence_transformers` is used to convert the model into the format of sentence_transformers. In this way, users can load the fine-tuned model with sentence_transformers. If you don't use sentence_transformers, you can skip this step.
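In other words, once the conversion has run, the output directory can be loaded directly as a sentence_transformers model, roughly like this (a sketch; the directory path is illustrative):

```python
from sentence_transformers import SentenceTransformer

# Illustrative path: the fine-tuned output directory after the conversion
# step has added the sentence-transformers config files (modules.json, etc.).
model = SentenceTransformer("/path/to/finetuned-output-dir")

embeddings = model.encode(["a sample query", "a sample passage"])
print(embeddings.shape)
```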
We haven't encountered this error. You can try changing the version of sentence_transformers (we use version 2.6.0).
Thank you for the clarification! Now I can proceed without worry.
Can someone explain why we need to save the checkpoint for sentence transformers? Since we are already saving the model as a safetensors file inside the output directory, what's the point of saving the checkpoints in the sentence-transformers way?
Reason I asked: when I fine-tune the embedding model and try to save using `trainer.save_model()`, it is able to save the model into a safetensors file, but the `save_sentence_transformers_lib` function at the end of the `_save` method always throws an error. Thank you!
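For context, based on the frames visible in the traceback above, the conversion step called at the end of `_save` does roughly the following (a sketch reconstructed from the traceback; the tail of the function past line 9 is not shown there, so the final assembly and save are assumptions):

```python
from sentence_transformers import SentenceTransformer, models

def save_ckpt_for_sentence_transformers(ckpt_dir, pooling_mode: str = "cls", normlized: bool = True):
    # Wrap the Hugging Face checkpoint (the safetensors weights already written
    # by trainer.save_model) in a sentence-transformers Transformer module.
    word_embedding_model = models.Transformer(ckpt_dir)
    # Add the pooling layer matching the pooling used during fine-tuning.
    pooling_model = models.Pooling(
        word_embedding_model.get_word_embedding_dimension(),
        pooling_mode=pooling_mode,
    )
    modules = [word_embedding_model, pooling_model]
    if normlized:
        # Append L2 normalization if embeddings were normalized during training.
        modules.append(models.Normalize())
    # Assumed ending: assemble the modules and write the sentence-transformers
    # config files (modules.json, etc.) back into the same directory.
    SentenceTransformer(modules=modules).save(ckpt_dir)
```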