UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Using TSDAE for unsupervised learning on the Danish language #1662

Open hansharhoff opened 2 years ago

hansharhoff commented 2 years ago

I am attempting unsupervised sentence-embedding learning with TSDAE on a corpus of Danish sentences. I have been running tests with the example code, which uses bert-base-uncased, but as I understand the model card, that model has only been trained on English.

My intention was to retrain all-mpnet-base-v2, as it is listed highest for sentence embeddings AND is trained on many languages. However, I get the following error:

When tie_encoder_decoder=True, the decoder_name_or_path will be invalid.
Model name or path "sentence-transformers/all-mpnet-base-v2" does not support being as a decoder. Please make sure the decoder model has an "XXXLMHead" class.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<command-1706707024041026> in <module>
     28 
     29 # Use the denoising auto-encoder loss
---> 30 train_loss = losses.DenoisingAutoEncoderLoss(model_retrain, decoder_name_or_path=model_name, tie_encoder_decoder=True)

/databricks/python/lib/python3.8/site-packages/sentence_transformers/losses/DenoisingAutoEncoderLoss.py in __init__(self, model, decoder_name_or_path, tie_encoder_decoder)
     55         except ValueError as e:
     56             logger.error(f'Model name or path "{decoder_name_or_path}" does not support being as a decoder. Please make sure the decoder model has an "XXXLMHead" class.')
---> 57             raise e
     58         assert model[0].auto_model.config.hidden_size == decoder_config.hidden_size, 'Hidden sizes do not match!'
     59         if self.tokenizer_decoder.pad_token is None:

/databricks/python/lib/python3.8/site-packages/sentence_transformers/losses/DenoisingAutoEncoderLoss.py in __init__(self, model, decoder_name_or_path, tie_encoder_decoder)
     52         kwargs_decoder = {'config': decoder_config}
     53         try:
---> 54             self.decoder = AutoModelForCausalLM.from_pretrained(decoder_name_or_path, **kwargs_decoder)
     55         except ValueError as e:
     56             logger.error(f'Model name or path "{decoder_name_or_path}" does not support being as a decoder. Please make sure the decoder model has an "XXXLMHead" class.')

/databricks/python/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
    446             model_class = _get_model_class(config, cls._model_mapping)
    447             return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)
--> 448         raise ValueError(
    449             f"Unrecognized configuration class {config.__class__} for this kind of AutoModel: {cls.__name__}.\n"
    450             f"Model type should be one of {', '.join(c.__name__ for c in cls._model_mapping.keys())}."

ValueError: Unrecognized configuration class <class 'transformers.models.mpnet.configuration_mpnet.MPNetConfig'> for this kind of AutoModel: AutoModelForCausalLM.
Model type should be one of QDQBertConfig, TrOCRConfig, GPTJConfig, RemBertConfig, RoFormerConfig, BigBirdPegasusConfig, GPTNeoConfig, BigBirdConfig, Speech2Text2Config, BlenderbotSmallConfig, BertGenerationConfig, CamembertConfig, XLMRobertaConfig, PegasusConfig, MarianConfig, MBartConfig, MegatronBertConfig, BartConfig, BlenderbotConfig, ReformerConfig, RobertaConfig, BertConfig, OpenAIGPTConfig, GPT2Config, TransfoXLConfig, XLNetConfig, XLMProphetNetConfig, ProphetNetConfig, XLMConfig, CTRLConfig, ElectraConfig.

I am not sure I understand why MPNet is not "allowed", as I do not know what an "XXXLMHead" class is or how to determine whether a given model has one. Further down, in what I expect to be the real issue, the error lists the allowed model types.

It is unclear to me, however, how these model types map to e.g. the list here:

https://www.sbert.net/docs/pretrained_models.html

and thus it is not clear to me which of the highly ranked pre-trained models are compatible with TSDAE (and, secondarily, multilingual ;) )

Any advice on how to find a suitable model candidate for TSDAE unsupervised learning with Danish support?
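For reference, this eligibility question can be checked programmatically: AutoModelForCausalLM keeps a mapping from model types to their language-modeling-head classes, and only model types with an entry (i.e. with a ...ForCausalLM / ...LMHead class) can be used as a TSDAE decoder. A minimal sketch, assuming a recent transformers 4.x that exposes the (internal) MODEL_FOR_CAUSAL_LM_MAPPING_NAMES mapping:

```python
# Check which model types have a causal-LM (decoder) head class registered.
# Note: this mapping is an internal transformers detail and may move between versions.
from transformers.models.auto.modeling_auto import MODEL_FOR_CAUSAL_LM_MAPPING_NAMES

for model_type in ("mpnet", "xlm-roberta", "roberta", "bert"):
    ok = model_type in MODEL_FOR_CAUSAL_LM_MAPPING_NAMES
    print(f"{model_type}: {'usable as TSDAE decoder' if ok else 'no causal-LM head registered'}")
```

This shows that "mpnet" has no causal-LM head in transformers (only a masked-LM head), which is exactly why the ValueError above is raised, while e.g. "xlm-roberta" does have one.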

hansharhoff commented 2 years ago

My own update on this: I am still not clear on what XXXLMHead is, but using the hint from the exception I went looking for a good pretrained multilingual model. I have chosen xlm-roberta-base and will run TSDAE on top of it.

mlmonk commented 2 years ago

I ran into the same issue. My understanding is that TSDAE can use any compatible decoder, since the decoder is only used during training and not during inference. To get around the error, I simply set my loss as follows: losses.DenoisingAutoEncoderLoss(model, decoder_name_or_path="sentence-transformers/paraphrase-distilroberta-base-v2", tie_encoder_decoder=False)

I used paraphrase-distilroberta-base-v2 because it has the same hidden size as all-mpnet-base-v2, and tie_encoder_decoder is False because the two architectures are different.

However, your original question of why all-mpnet-base-v2 does not work as a decoder is still a mystery to me.
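The hidden-size constraint mentioned above (which DenoisingAutoEncoderLoss asserts before training) can be verified up front by comparing configs; only the small config files are downloaded, not the model weights. A quick sketch:

```python
from transformers import AutoConfig

# Encoder and decoder must have matching hidden sizes for DenoisingAutoEncoderLoss
enc = AutoConfig.from_pretrained("sentence-transformers/all-mpnet-base-v2")
dec = AutoConfig.from_pretrained("sentence-transformers/paraphrase-distilroberta-base-v2")

print(enc.hidden_size, dec.hidden_size)  # both are 768
```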

patrick-vtc commented 1 year ago

Just a few thoughts about the comments: