UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Can't save quantized models #2923

Closed. lamashnikov closed this issue 1 month ago.

lamashnikov commented 1 month ago

Dear maintainers,

I've followed this tutorial to quantize the paraphrase-MiniLM-L6-v2 model: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/distillation/model_quantization.py and it works great: the model performs well, takes much less space, and I'm very happy with it.
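For reference, here is roughly how the model is produced (a minimal sketch following the linked example; the model name and the encode call are just for illustration):

    import torch
    from sentence_transformers import SentenceTransformer

    # Load the base model on CPU, then apply dynamic int8 quantization to
    # the Linear layers, as in the linked model_quantization.py example.
    model = SentenceTransformer("paraphrase-MiniLM-L6-v2", device="cpu")
    q_model = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )

    # Encoding works fine on the quantized model.
    embeddings = q_model.encode(["Hello world"])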

But when I try to save it (with save_to_hub or save_pretrained), I get this error:

    Traceback (most recent call last):
      File "/home/censored/perso/./my_project.py", line 190, in <module>
        main()
      File "/home/censored/perso/./my_project.py", line 171, in main
        q_model.save_pretrained("lamashnikov/cls-quantitized-paraphrase-MiniLM-L6-v2")
      File "/home/censored/.local/lib/python3.10/site-packages/sentence_transformers/SentenceTransformer.py", line 1072, in save_pretrained
        self.save(
      File "/home/censored/.local/lib/python3.10/site-packages/sentence_transformers/SentenceTransformer.py", line 1037, in save
        module.save(model_path, safe_serialization=safe_serialization)
      File "/home/censored/.local/lib/python3.10/site-packages/sentence_transformers/models/Transformer.py", line 180, in save
        self.auto_model.save_pretrained(output_path, safe_serialization=safe_serialization)
      File "/home/censored/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2698, in save_pretrained
        shared_names, disjoint_names = _find_disjoint(shared_ptrs.values(), state_dict)
      File "/home/censored/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 650, in _find_disjoint
        areas.append((tensor.data_ptr(), _end_ptr(tensor), name))
    AttributeError: 'torch.dtype' object has no attribute 'data_ptr'

I'm able to save it in pickle format and restore it, but I wanted to save it to the Hub, so this is kind of annoying.
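For the record, this is the workaround (a sketch: torch.save pickles the whole module, which round-trips locally but isn't a Hub-compatible checkpoint):

    import torch

    # Workaround: pickle the entire quantized module.
    torch.save(q_model, "quantized_model.pt")

    # Restoring requires full unpickling (and matching library versions).
    q_model_restored = torch.load("quantized_model.pt", weights_only=False)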

I can't save it in safetensors format either:

    save_file(q_model.state_dict(), 'model.safetensors')
      File "/home/censored/.local/lib/python3.10/site-packages/safetensors/torch.py", line 286, in save_file
        serialize_file(_flatten(tensors), filename, metadata=metadata)
      File "/home/censored/.local/lib/python3.10/site-packages/safetensors/torch.py", line 470, in _flatten
        raise ValueError(f"Key {k} is invalid, expected torch.Tensor but received {type(v)}")
    ValueError: Key 0.auto_model.embeddings.word_embeddings._packed_params.dtype is invalid, expected torch.Tensor but received <class 'torch.dtype'>

Do you have any hints to make this work? Do you know what's happening? My google-fu didn't help me here, and I'm sorry to bother you with this.

Regards

lamashnikov commented 1 month ago

After digging a little, it seems that quantized models carry a lot more information in their state dict than "normal" models, including metadata such as the dtype, and safetensors serialization doesn't expect such metadata. The key names also change between the quantized and non-quantized versions.
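This is how I inspected it (a small sketch, assuming q_model and model are the quantized and original models as above; it prints the type of every state-dict entry and produced the dumps below):

    # The quantized state dict contains torch.dtype objects and tuples of
    # packed parameters alongside the usual tensors.
    for name, value in q_model.state_dict().items():
        print(type(value), name)

    # The original state dict contains only tensors.
    for name, value in model.state_dict().items():
        print(type(value), name)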

quantized state dict:

    <class 'torch.dtype'> 0.auto_model.embeddings.word_embeddings._packed_params.dtype
    <class 'torch.Tensor'> 0.auto_model.embeddings.word_embeddings._packed_params._packed_weight
    <class 'torch.dtype'> 0.auto_model.embeddings.position_embeddings._packed_params.dtype
    <class 'torch.Tensor'> 0.auto_model.embeddings.position_embeddings._packed_params._packed_weight
    <class 'torch.dtype'> 0.auto_model.embeddings.token_type_embeddings._packed_params.dtype
    <class 'torch.Tensor'> 0.auto_model.embeddings.token_type_embeddings._packed_params._packed_weight
    <class 'torch.Tensor'> 0.auto_model.embeddings.LayerNorm.weight
    <class 'torch.Tensor'> 0.auto_model.embeddings.LayerNorm.bias
    <class 'torch.Tensor'> 0.auto_model.encoder.layer.0.attention.self.query.scale
    <class 'torch.Tensor'> 0.auto_model.encoder.layer.0.attention.self.query.zero_point
    <class 'torch.dtype'> 0.auto_model.encoder.layer.0.attention.self.query._packed_params.dtype
    <class 'tuple'> 0.auto_model.encoder.layer.0.attention.self.query._packed_params._packed_params
    <class 'torch.Tensor'> 0.auto_model.encoder.layer.0.attention.self.key.scale
    <class 'torch.Tensor'> 0.auto_model.encoder.layer.0.attention.self.key.zero_point
    <class 'torch.dtype'> 0.auto_model.encoder.layer.0.attention.self.key._packed_params.dtype
    <class 'tuple'> 0.auto_model.encoder.layer.0.attention.self.key._packed_params._packed_params
    <class 'torch.Tensor'> 0.auto_model.encoder.layer.0.attention.self.value.scale
    <class 'torch.Tensor'> 0.auto_model.encoder.layer.0.attention.self.value.zero_point
    <class 'torch.dtype'> 0.auto_model.encoder.layer.0.attention.self.value._packed_params.dtype
    <class 'tuple'> 0.auto_model.encoder.layer.0.attention.self.value._packed_params._packed_params
    <class 'torch.Tensor'> 0.auto_model.encoder.layer.0.attention.output.dense.scale
    <class 'torch.Tensor'> 0.auto_model.encoder.layer.0.attention.output.dense.zero_point
    <class 'torch.dtype'> 0.auto_model.encoder.layer.0.attention.output.dense._packed_params.dtype
    <class 'tuple'> 0.auto_model.encoder.layer.0.attention.output.dense._packed_params._packed_params
    <class 'torch.Tensor'> 0.auto_model.encoder.layer.0.attention.output.LayerNorm.weight
    <class 'torch.Tensor'> 0.auto_model.encoder.layer.0.attention.output.LayerNorm.bias
    <class 'torch.Tensor'> 0.auto_model.encoder.layer.0.intermediate.dense.scale
    <class 'torch.Tensor'> 0.auto_model.encoder.layer.0.intermediate.dense.zero_point
    <class 'torch.dtype'> 0.auto_model.encoder.layer.0.intermediate.dense._packed_params.dtype
    <class 'tuple'> 0.auto_model.encoder.layer.0.intermediate.dense._packed_params._packed_params
    <class 'torch.Tensor'> 0.auto_model.encoder.layer.0.output.dense.scale
    <class 'torch.Tensor'> 0.auto_model.encoder.layer.0.output.dense.zero_point
    <class 'torch.dtype'> 0.auto_model.encoder.layer.0.output.dense._packed_params.dtype
    <class 'tuple'> 0.auto_model.encoder.layer.0.output.dense._packed_params._packed_params
    <class 'torch.Tensor'> 0.auto_model.encoder.layer.0.output.LayerNorm.weight
    <class 'torch.Tensor'> 0.auto_model.encoder.layer.0.output.LayerNorm.bias
    (layers 1 through 5 repeat the same pattern)
    <class 'torch.Tensor'> 0.auto_model.pooler.dense.scale
    <class 'torch.Tensor'> 0.auto_model.pooler.dense.zero_point
    <class 'torch.dtype'> 0.auto_model.pooler.dense._packed_params.dtype
    <class 'tuple'> 0.auto_model.pooler.dense._packed_params._packed_params

not quantized state dict:

    <class 'torch.Tensor'> 0.auto_model.embeddings.word_embeddings.weight
    <class 'torch.Tensor'> 0.auto_model.embeddings.position_embeddings.weight
    <class 'torch.Tensor'> 0.auto_model.embeddings.token_type_embeddings.weight
    <class 'torch.Tensor'> 0.auto_model.embeddings.LayerNorm.weight
    <class 'torch.Tensor'> 0.auto_model.embeddings.LayerNorm.bias
    <class 'torch.Tensor'> 0.auto_model.encoder.layer.0.attention.self.query.weight
    <class 'torch.Tensor'> 0.auto_model.encoder.layer.0.attention.self.query.bias
    <class 'torch.Tensor'> 0.auto_model.encoder.layer.0.attention.self.key.weight
    <class 'torch.Tensor'> 0.auto_model.encoder.layer.0.attention.self.key.bias
    <class 'torch.Tensor'> 0.auto_model.encoder.layer.0.attention.self.value.weight
    <class 'torch.Tensor'> 0.auto_model.encoder.layer.0.attention.self.value.bias
    <class 'torch.Tensor'> 0.auto_model.encoder.layer.0.attention.output.dense.weight
    <class 'torch.Tensor'> 0.auto_model.encoder.layer.0.attention.output.dense.bias
    <class 'torch.Tensor'> 0.auto_model.encoder.layer.0.attention.output.LayerNorm.weight
    <class 'torch.Tensor'> 0.auto_model.encoder.layer.0.attention.output.LayerNorm.bias
    <class 'torch.Tensor'> 0.auto_model.encoder.layer.0.intermediate.dense.weight
    <class 'torch.Tensor'> 0.auto_model.encoder.layer.0.intermediate.dense.bias
    <class 'torch.Tensor'> 0.auto_model.encoder.layer.0.output.dense.weight
    <class 'torch.Tensor'> 0.auto_model.encoder.layer.0.output.dense.bias
    <class 'torch.Tensor'> 0.auto_model.encoder.layer.0.output.LayerNorm.weight
    <class 'torch.Tensor'> 0.auto_model.encoder.layer.0.output.LayerNorm.bias
    (layers 1 through 5 repeat the same pattern)
    <class 'torch.Tensor'> 0.auto_model.pooler.dense.weight
    <class 'torch.Tensor'> 0.auto_model.pooler.dense.bias

The place where it explodes (expecting a tensor, getting metadata) is in transformers/modeling_utils.py, line 650:

    for name in shared:
        tensor = state_dict[name]
        # Assumes every entry in the state dict is a tensor with a data
        # pointer; a torch.dtype entry has no data_ptr(), hence the crash.
        areas.append((tensor.data_ptr(), _end_ptr(tensor), name))
    areas.sort()
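The failure mode is easy to reproduce in isolation (a hypothetical snippet, just to show that a torch.dtype carries no storage pointer):

    import torch

    # A quantized state dict stores the packed-params dtype as a plain
    # torch.dtype object, not a tensor.
    entry = torch.qint8
    print(isinstance(entry, torch.dtype))  # True

    # Tensors expose data_ptr(); a torch.dtype does not, hence the crash
    # in _find_disjoint.
    entry.data_ptr()  # AttributeError: 'torch.dtype' object has no attribute 'data_ptr'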

It seems this isn't a Sentence Transformers issue after all.