MinishLab / model2vec

Distill a Small Static Model from any Sentence Transformer
https://minishlab.github.io/
MIT License

Question: Object has no attribute 'backend_tokenizer' #115

Closed: FahadEbrahim closed this issue 2 weeks ago

FahadEbrahim commented 2 weeks ago

Hi,

I'm experimenting with a model called PLBART, but I'm getting the error below. It seems this is because the model is missing a fast tokenizer. What are your suggestions or recommendations?

The code:

from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding

static_embedding = StaticEmbedding.from_distillation(
    "uclanlp/plbart-java-cs",
    device="cuda",
    pca_dims=256,
    apply_zipf=True,
)
model = SentenceTransformer(modules=[static_embedding])

The error:

AttributeError: 'PLBartTokenizer' object has no attribute 'backend_tokenizer'
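For what it's worth, the same error can be reproduced on the tokenizer alone, without sentence_transformers (a minimal sketch; AutoTokenizer is the standard Transformers loader, and backend_tokenizer only exists on fast tokenizers):

from transformers import AutoTokenizer

# Loads the slow PLBartTokenizer for this checkpoint.
tok = AutoTokenizer.from_pretrained("uclanlp/plbart-java-cs")
print(type(tok).__name__)  # PLBartTokenizer
tok.backend_tokenizer      # raises the same AttributeError as above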

Thanks, Fahad.

stephantul commented 2 weeks ago

Hey @FahadEbrahim ,

Thanks for your issue. I'm sorry to say that we don't have a workaround for this, and we are unlikely to have one soon. Fixing it would require us to rewrite the tokenization strategy, which is currently heavily optimized for speed.

I also want to point out that this happens because the PLBART model you are using does not handle tokenization fully in line with how a standard Transformers model does: it is missing various tokenization-related configuration files, which I think prevents auto-conversion to a fast tokenizer.
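As a quick sketch of what that means in practice (nothing model2vec-specific; use_fast and is_fast are regular Transformers APIs):

from transformers import AutoTokenizer

# Even when a fast tokenizer is explicitly requested, transformers ends up with
# the slow, sentencepiece-based PLBartTokenizer for this checkpoint, since it
# cannot be auto-converted to a fast tokenizer.
tok = AutoTokenizer.from_pretrained("uclanlp/plbart-java-cs", use_fast=True)
print(tok.is_fast)  # False here, so there is no backend_tokenizer to distill from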

Sorry to not have better news, Stéphan

FahadEbrahim commented 2 weeks ago

Hi @stephantul,

Thank you very much for your prompt assistance and feedback.

Regards, Fahad.