UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

TSDAE model compatibility (ALBERT, DeBERTa) #2771

Open bobox2997 opened 1 month ago

bobox2997 commented 1 month ago

Hi everyone.

Why doesn't TSDAE accept ALBERT or DeBERTa/DeBERTaV2 model types?

tomaarsen commented 1 month ago

Hello!

I'd have to verify to be sure, but I think this is because the TSDAE approach requires an encoder-decoder model to be loadable in transformers, and ALBERT/DeBERTa (v2) might not have the required modeling files in transformers for the decoder portion.
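For context, a minimal sketch of a typical TSDAE setup (closely following the TSDAE example from the sentence-transformers docs; the model name and sentences are placeholders). The DenoisingAutoEncoderLoss constructor is where the decoder gets loaded through transformers, and where an unsupported architecture would fail:

```python
from torch.utils.data import DataLoader

from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import DenoisingAutoEncoderDataset

model_name = "bert-base-uncased"  # works; "albert-base-v2" would fail when the loss is built
word_embedding_model = models.Transformer(model_name)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), "cls")
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# The dataset adds noise (token deletion) to the input sentences on the fly
train_sentences = ["This is an example sentence.", "Each sentence is damaged and then reconstructed."]
train_dataset = DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=2, shuffle=True)

# Ties encoder/decoder weights and loads the decoder via AutoModelForCausalLM,
# which is the step with no registered class for ALBERT or DeBERTa (v2/v3)
train_loss = losses.DenoisingAutoEncoderLoss(model, decoder_name_or_path=model_name, tie_encoder_decoder=True)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, scheduler="constantlr", weight_decay=0)
```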

Also, good job on your recent models, e.g. https://huggingface.co/bobox/DeBERTaV3-small-GeneralSentenceTransformer-v3-step1 with all those training datasets.

bobox2997 commented 1 month ago

I'd have to verify to be sure, but I think this is because the TSDAE approach requires an encoder-decoder model to be loadable in transformers, and ALBERT/DeBERTa (v2) might not have the required modeling files in transformers for the decoder portion.

Oh, that makes perfect sense, thank you! I apologize if the question seemed a bit naive; I'm still learning about all of this. Is there a specific technical reason for this limitation, or is it more a matter of these models not being as widely adopted or supported?

Also, good job on your recent models, e.g. https://huggingface.co/bobox/DeBERTaV3-small-GeneralSentenceTransformer-v3-step1 with all those training datasets.

Thank you so much for your kind words! I'm still working on finding the right balance of hyperparameters and dataset proportions to maintain good quality on both STS (Semantic Textual Similarity) and retrieval tasks before scaling these tests to DeBERTaV3-large and DeBERTaV2-xxl (1.5B). I've found that DeBERTa models, both v2 and the "ELECTRA-style" v3, seem to provide a good foundation for these tasks.

Given your experience with sentence transformers, do you have any general advice for someone working on fine-tuning models for both STS and retrieval tasks? Are there any common pitfalls I should watch out for, or any resources you'd recommend for optimizing performance across these different tasks?

Thank you again for your time and insights. I really appreciate the support from the sentence-transformers community. Looking forward to any additional thoughts you might have, and I'll be sure to share my findings as I progress with these experiments.

tomaarsen commented 1 month ago

Is there a specific technical reason for this limitation, or is it more a matter of these models not being as widely adopted or supported?

I believe it's the latter: a decoder is pretty separate from the encoder, so it should essentially always be possible to add.

Do you have any general advice for someone working on fine-tuning models for both STS and retrieval tasks? Are there any common pitfalls I should watch out for, or any resources you'd recommend for optimizing performance across these different tasks?

I believe most of the large model authors use "query prefixing" for the retrieval query texts, i.e. they add some prompt like "query: ", "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: ", or "Represent this sentence for searching relevant passages: ", like here, here or here.

Usually, the passages are not prefixed, and the STS texts are not prefixed. The idea is that with STS, texts with similar meanings are pushed together, whereas with retrieval, a query and a text that answers it are pushed together. These can conflict: e.g. "Who founded Apple?" and "Steve Jobs, Steve Wozniak, and Ronald Wayne" are not semantically similar, but they should be close for retrieval. The prefixed query (e.g. "query: Who founded Apple?") should, in theory, be placed near "Steve Jobs, Steve Wozniak, and Ronald Wayne", while the plain "Who founded Apple?" might be placed near "Who founded NVIDIA?" and "Who founded Facebook?" via the STS learning.
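To make that asymmetry concrete, here is a minimal inference-time sketch (not from this thread; the model name is a placeholder and the exact prefix string depends on how the model was trained):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("your-retrieval-model")  # placeholder name

queries = ["Who founded Apple?"]
passages = ["Steve Jobs, Steve Wozniak, and Ronald Wayne founded Apple in 1976."]

# Only the retrieval queries get the prefix; passages and STS-style texts
# are encoded without any prefix, matching the setup described above.
query_emb = model.encode([f"query: {q}" for q in queries])
passage_emb = model.encode(passages)

print(util.cos_sim(query_emb, passage_emb))
```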

One other thing: For STS learning, you can also adopt MultipleNegativesSymmetricRankingLoss. This is like the "normal" MultipleNegativesRankingLoss with in-batch negatives, but given (anchor, positive) pairs, MNRL only improves "given the anchor, find the positive", whereas the Symmetric variant also improves "given the positive, find the anchor". Because STS is a symmetric task, it can make sense to also train like it, and use all other anchors as the "in-batch negatives". The only downside is that there's no Cached variant of this loss, nor a GIST variant of this loss.
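A minimal training sketch of the symmetric loss (assuming the v3 Trainer API; the base model and the dataset/subset names are only examples, any (anchor, positive) pair dataset works):

```python
from datasets import load_dataset

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import (
    MultipleNegativesRankingLoss,
    MultipleNegativesSymmetricRankingLoss,
)

model = SentenceTransformer("microsoft/deberta-v3-small")

# (anchor, positive) pairs; dataset and subset names are examples
train_dataset = load_dataset("sentence-transformers/quora-duplicates", "pair", split="train")

# MNRL only learns "given the anchor, find the positive";
# the symmetric variant also learns "given the positive, find the anchor".
loss = MultipleNegativesSymmetricRankingLoss(model)
# loss = MultipleNegativesRankingLoss(model)  # the asymmetric counterpart

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```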

bobox2997 commented 1 month ago

Thank you so much for the reply! Yes, I used MultipleNegativesSymmetricRankingLoss for the Quora duplicate questions dataset and, in some runs, the sentence-compression dataset. Even though the sentence-compression dataset is not strictly "symmetric," I believe that both "given the passage, find the right summary" and "given the summary, find the right original passage" make sense as learning tasks. I achieved pretty good results with this approach, but as you mentioned, I had to remove the "symmetric" loss because using cached losses gave me better results.

The same applies to GIST, but I still need to determine when the guide model removes a false negative and when it removes a valuable hard negative.
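For reference, a minimal sketch of that GISTEmbedLoss setup (the model names below are only examples):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import GISTEmbedLoss

model = SentenceTransformer("microsoft/deberta-v3-small")  # model being trained (example)
guide = SentenceTransformer("all-MiniLM-L6-v2")            # guide model (example)

# The guide scores the in-batch candidates: negatives that the guide rates as
# more similar to the anchor than the positive are treated as likely false
# negatives and masked out. This is exactly the trade-off mentioned above,
# since a weak guide may also discard genuinely valuable hard negatives.
loss = GISTEmbedLoss(model, guide)
```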

Thanks again for your time!

tomaarsen commented 1 month ago

I had to remove the "symmetric" loss because using cached losses gave me better results.

Good to know!

The same applies to GIST, but I still need to determine when the guide model removes a false negative and when it removes a valuable hard negative.

Indeed, this is always a difficult trade-off to get right. I've had results with GISTEmbedLoss that were better than or identical to MNRL (never worse yet, but I've only tried it with a handful of different models).

bobox2997 commented 4 weeks ago

I believe it's the latter: a decoder is pretty separate from the encoder, so it should essentially always be possible to add. (@tomaarsen)

I'm sorry to bother you again, but... do you have any suggestions on how to do that? If I understand the TSDAE loss code correctly, the requirement is a decoder config and an XXXLMHead class, but I'm struggling to understand what that is, and I can't find any references or resources about it! Any kind of help would be sincerely appreciated, even if it's just pointing me to some resource where I can learn more! Thanks in advance!

tomaarsen commented 3 weeks ago

I'm sorry to bother you again

All good! I'm happy to help - I see you're making some cool models. Also apologies for my delay, I was on a short vacation.

In short, in transformers, models are organized based on the tasks for which they can be used (e.g. masked language modeling, causal language modeling, sequence classification, token classification, question answering, and so on).

Every single model architecture implements a modeling class for a certain number of these categories. For ALBERT, we have the https://github.com/huggingface/transformers/blob/main/src/transformers/models/albert/modeling_albert.py file, which implements AlbertModel, AlbertForPreTraining, AlbertForMaskedLM, AlbertForSequenceClassification, AlbertForTokenClassification, AlbertForQuestionAnswering, and AlbertForMultipleChoice, but no causal language modeling class.

For TSDAE (DenoisingAutoEncoderLoss), we use AutoModel for loading the encoder (this is just the "core" of the model, no specific "head" for e.g. classification) and AutoModelForCausalLM to try and load the decoder. This then uses this mapping to determine which class to use. E.g. for BERT it's BertLMHeadModel, for RoBERTa it's RobertaForCausalLM (most model classes are called ...ForCausalLM nowadays, with some exceptions of older architectures like GPT2 and BERT). As you can also see here, there is no class for albert or deberta_v2.

You can expand these architectures with a CausalLM model (i.e. a model that can generate text), but you'll have to add a language modeling head to it. You could take some of the other architectures, e.g. RoBERTa, as inspiration: https://github.com/huggingface/transformers/blob/048f599f3506e57e0a595b455d9d2834c8d45023/src/transformers/models/roberta/modeling_roberta.py#L860-L863 but the language modeling head is just a few linear layers (+activations/normalizations) which will have empty weights by default. This might or might not be an issue with TSDAE - I have no idea.
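A minimal sketch of that lookup (not the exact TSDAE internals, just the AutoModelForCausalLM resolution it relies on):

```python
from transformers import AutoConfig, AutoModelForCausalLM

for name in ["bert-base-uncased", "roberta-base", "albert-base-v2", "microsoft/deberta-v3-small"]:
    config = AutoConfig.from_pretrained(name)
    # TSDAE configures the decoder as a cross-attending decoder
    config.is_decoder = True
    config.add_cross_attention = True
    try:
        decoder = AutoModelForCausalLM.from_config(config)
        print(name, "->", type(decoder).__name__)  # BertLMHeadModel, RobertaForCausalLM, ...
    except ValueError as err:
        print(name, "-> no registered CausalLM class:", err)  # ALBERT and DeBERTa-v2 land here
```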

If you do decide to make an ALBERT or DeBERTav2 CausalLM model, then you can follow the "Building custom models" docs to "register" your model architecture. Then it can be loaded with AutoModelForCausalLM for you, and then you can use TSDAE with your ALBERT/DeBERTav2 models.
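A rough sketch of what that registration could look like (DebertaV2ForCausalLM and the module it lives in are hypothetical; you would have to write that class yourself, e.g. by adapting RobertaForCausalLM's LM head to DeBERTa-v2):

```python
from transformers import AutoModelForCausalLM, DebertaV2Config

# Hypothetical: your own DeBERTa-v2 decoder with a language modeling head,
# whose config_class attribute is set to DebertaV2Config
from my_modeling_deberta_v2 import DebertaV2ForCausalLM

# Register the class so AutoModelForCausalLM (and therefore TSDAE's decoder
# loading) can resolve DeBERTa-v2/v3 checkpoints to it
AutoModelForCausalLM.register(DebertaV2Config, DebertaV2ForCausalLM)

# The LM head weights are not in the checkpoint, so they start out randomly initialized
decoder = AutoModelForCausalLM.from_pretrained("microsoft/deberta-v3-small")
```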

That all said - this might be more work than it's worth, I'm not sure.

bobox2997 commented 2 weeks ago

Also apologies for my delay, I was on a short vacation.

Absolutely no need to apologize! Happy for you, I hope you enjoyed it!

I see you're making some cool models.

Well, my Hugging Face account is quite chaotic 😅. Anyway, I had interesting results using AdaptiveLayerLoss, focusing on the KL divergence component of the loss and its temperature, with all layers trained at every iteration but with smaller weights. I should clarify that the goal is not to make the model effective using fewer layers, but to use the loss as a kind of regularization and to build a different stratification between layers (as an example, I used some simpler datasets like sentence compression with much more weight on the earlier layers). The next step is to play with a custom loss inspired by AdaptiveLayerLoss but with different goals... and maybe see how a similar approach might work with a loss inspired by the Matryoshka loss (OK, here I'm just brainstorming).
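For reference, a rough sketch of that setup (the base model, the wrapped loss, and all parameter values are illustrative, not the ones actually used in the linked models):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import AdaptiveLayerLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("microsoft/deberta-v3-small")

base_loss = MultipleNegativesRankingLoss(model)
loss = AdaptiveLayerLoss(
    model,
    base_loss,
    n_layers_per_step=-1,     # train every layer at every step...
    prior_layers_weight=0.3,  # ...but with a smaller weight than the final layer
    kl_div_weight=1.5,        # emphasize the KL divergence component
    kl_temperature=1.0,       # and tune its temperature
)
```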

[...] You can expand these architectures with a CausalLM model (i.e. a model that can generate text), but you'll have to add a language modeling head to it. You could take some of the other architectures, e.g. RoBERTa, as inspiration: https://github.com/huggingface/transformers/blob/048f599f3506e57e0a595b455d9d2834c8d45023/src/transformers/models/roberta/modeling_roberta.py#L860-L863 but the language modeling head is just a few linear layers (+activations/normalizations) which will have empty weights by default. This might or might not be an issue with TSDAE - I have no idea.

If you do decide to make an ALBERT or DeBERTav2 CausalLM model, then you can follow the "Building custom models" docs to "register" your model architecture. Then it can be loaded with AutoModelForCausalLM for you, and then you can use TSDAE with your ALBERT/DeBERTav2 models.

Thanks so much for the clarification! Your explanation is really helpful; that "encoder-only model as decoder" idea really got me curious... I've taken a look at the RoBERTa architecture and configs, as well as the DeBERTa and DeBERTaV3 papers... I should be able to take the high-level concepts and some of the implementation from RoBERTa, but I think that fine-tuning the LM head weights would be required.

That all said - this might be more work than it's worth, I'm not sure.

It's not at the top of my task list, but I will dig deeper into this, even if only in my spare time... If not for TSDAE itself, it would be an interesting learning project.