UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

TSDAE for token classification? #1540

Open KrishnanJothi opened 2 years ago

KrishnanJothi commented 2 years ago

Hi,

I would like to know whether the TSDAE procedure is advisable for a token classification task, or is it better to go with MLM?

Can the TSDAE training code also be used with any encoder-based transformer model from Hugging Face?

Thanks, Krishnan

nreimers commented 2 years ago

We never tested it.

Sadly, not all encoder models implement the necessary encoder-decoder architecture in HF transformers.

KrishnanJothi commented 2 years ago

Okay, you mean not all the encoder transformers in HF give the flexibility to build the TSDAE encoder-decoder architecture for denoising?

nreimers commented 2 years ago

Correct. For example, DistilBERT does not have the necessary encoder-decoder architecture implemented. @kwang2049 created such a BERT2BERT architecture for the DistilBERT model.
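
To illustrate the point, here is a minimal sketch of the bert2bert idea in HF transformers (model names are just examples): tying two copies of the same checkpoint into an encoder-decoder model works for architectures that ship a decoder variant with cross-attention (like BERT), but not for DistilBERT, which is why the extra hack shared later in this thread is needed.

```python
from transformers import EncoderDecoderModel

# Works: BERT provides a decoder class with cross-attention,
# so two copies of the same checkpoint can be tied into bert2bert.
bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)

# Does not work out of the box: DistilBERT has no decoder implementation
# with cross-attention in transformers, so the line below raises an error.
# EncoderDecoderModel.from_encoder_decoder_pretrained(
#     "distilbert-base-uncased", "distilbert-base-uncased"
# )
```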

KrishnanJothi commented 2 years ago

Got it, thanks Nils!

kwang2049 commented 2 years ago

Hi @KrishnanJothi, thanks for your interest. The default setting of TSDAE is to use this bert2bert architecture, where exactly the same PLM is used as the initialization for both the encoder and the decoder. Another choice is to use a separate decoder PLM, e.g. BERT as the encoder and RoBERTa as the decoder. Sadly, I found the latter approach usually suffers from a performance drop (e.g. 3 points of MAP on retrieval tasks).
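
For reference, a minimal training sketch along these lines, assuming the sentence-transformers v2.x API (DenoisingAutoEncoderDataset / DenoisingAutoEncoderLoss); model names and sentences are placeholders:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, datasets, losses

model_name = "bert-base-uncased"
word_embedding_model = models.Transformer(model_name)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), "cls")
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Unlabeled sentences; the dataset adds the denoising noise (token deletion) on the fly.
train_sentences = ["Your unlabeled sentences go here.", "Another unlabeled sentence."]
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)

# Default bert2bert setting: the decoder is initialized from the same PLM and tied to the encoder.
train_loss = losses.DenoisingAutoEncoderLoss(model, tie_encoder_decoder=True)

# Alternative (usually a bit worse, as noted above): a separate decoder PLM, e.g.
# losses.DenoisingAutoEncoderLoss(model, decoder_name_or_path="roberta-base",
#                                 tie_encoder_decoder=False)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    weight_decay=0,
    scheduler="constantlr",
)
```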

For DistilBERT, my hack on the original HF model class is available here: https://gist.github.com/kwang2049/1f0e1f0ce119456284c0af048ba097a7. One can also mimic this to add support for other PLM architectures. Actually, there is a PR still open for this in the HF repo.

As for the first question about TSDAE for token classification: I am not really sure, but I think it is also very interesting. I can imagine that TSDAE + mean pooling could work as a good pre-training method. Since you are interested in token-level representations, another straightforward idea is to use BART pre-training, which uses all the token embeddings during its denoising training task.
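
To make the BART idea concrete, here is a minimal sketch of a single BART-style denoising step with HF transformers; the corruption shown (span masking) is just an illustrative example, and the checkpoint name is only a placeholder. The decoder reconstructs the original text, so every token position receives a training signal.

```python
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

original = "Sentence embeddings can be learned with denoising objectives."
corrupted = "Sentence embeddings <mask> with denoising objectives."  # illustrative span masking

inputs = tokenizer(corrupted, return_tensors="pt")
labels = tokenizer(original, return_tensors="pt").input_ids

# Cross-entropy loss over all decoder positions (token-level supervision)
outputs = model(**inputs, labels=labels)
outputs.loss.backward()
```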

KrishnanJothi commented 2 years ago

@kwang2049 Thank you for your comment, I will definitely take a look into BART pre-training.