Open KrishnanJothi opened 2 years ago

Hi,

I would like to know whether the TSDAE procedure is advisable for a token classification task, or is it better to go with MLM?

Also, can the TSDAE training code be used with any transformer (encoder-based) model from Hugging Face?

Thanks, Krishnan

We never tested it.
Sadly, not all encoder models implement the necessary encoder-decoder architecture in HF transformers.
Okay, you mean not all the encoder transformers in HF give the flexibility to build the TSDAE encoder-decoder architecture for denoising?
Correct. For example, DistilBERT does not have the necessary encoder-decoder architecture implemented. @kwang2049 created such a BERT2BERT-style architecture for the DistilBERT model architecture.
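A quick way to see whether a given encoder can serve as the decoder side of such a setup is to try loading it as a causal LM with cross-attention enabled, which is roughly what the denoising decoder needs. A rough, untested sketch (the behaviour depends on your transformers version, and the checkpoint names are just examples):

```python
from transformers import AutoModelForCausalLM


def supports_decoder_side(model_name: str) -> bool:
    """Try to load `model_name` as a decoder with cross-attention.

    Architectures without a registered *ForCausalLM class (e.g. DistilBERT
    in older transformers releases) raise a ValueError here.
    """
    try:
        AutoModelForCausalLM.from_pretrained(
            model_name,
            is_decoder=True,           # run the stack as a decoder
            add_cross_attention=True,  # attend over the encoder outputs
        )
        return True
    except ValueError:
        return False


print(supports_decoder_side("bert-base-uncased"))        # typically True
print(supports_decoder_side("distilbert-base-uncased"))  # typically False
```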
Got it, thanks Nils!
Hi @KrishnanJothi, thanks for your attention. So the default setting of TSDAE is to use this bert2bert architecture, where you have exactly the same PLM as the initialization for both encoder and decoder. Another choice could be to use a separate decoder PLM, e.g. BERT as encoder and RoBERTa as the decoder. Sadly, I found the latter approach usually suffers from some performance drop (e.g. 3 points of MAP on retrieval tasks).
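For reference, the choice between the default bert2bert setup and a separate decoder PLM is controlled by the `decoder_name_or_path` and `tie_encoder_decoder` arguments of the loss. A minimal sketch along the lines of the sentence-transformers TSDAE example (`train_sentences` and the checkpoint names are placeholders):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, datasets, losses

train_sentences = ["..."]  # placeholder: your unlabeled sentences

# Encoder: any supported HF PLM + CLS pooling (as in the TSDAE paper)
word_embedding_model = models.Transformer("bert-base-uncased")
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), "cls")
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Dataset that applies the deletion noise on the fly
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)

# Default bert2bert: decoder initialized from the same PLM, weights tied
train_loss = losses.DenoisingAutoEncoderLoss(
    model,
    decoder_name_or_path="bert-base-uncased",  # e.g. "roberta-base" for a separate decoder PLM
    tie_encoder_decoder=True,                  # set False when using a different decoder
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    scheduler="constantlr",
    optimizer_params={"lr": 3e-5},
    show_progress_bar=True,
)
```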
For distilbert, my hack on the original HF model class is available here: https://gist.github.com/kwang2049/1f0e1f0ce119456284c0af048ba097a7. One can also mimic this to add support for other PLM architectures. Actually, there is a PR still open for this in the HF repo.
For the first question about TSDAE for token classification: I am not really sure, but I think this is also very interesting. As far as I can imagine, TSDAE + mean pooling could work as a good pre-training method. Since you are interested in token-level representations, another straightforward idea is to use BART-style pre-training, which uses all the token embeddings during training plus denoising tasks.
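If you go the BART route, the denoising objective can be run directly with the seq2seq LM head by feeding the corrupted text as input and the original text as labels. A rough, untested sketch (the checkpoint name and the corruption step are placeholders; in practice the corruption would be applied programmatically over your corpus):

```python
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
model.train()

original = "TSDAE trains an encoder by reconstructing the original sentence."
corrupted = "TSDAE trains encoder by reconstructing original sentence."  # e.g. token deletion

inputs = tokenizer(corrupted, return_tensors="pt")
labels = tokenizer(original, return_tensors="pt").input_ids

# Seq2seq denoising loss: reconstruct the original text from the corrupted input
outputs = model(**inputs, labels=labels)
outputs.loss.backward()
```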
@kwang2049 Thank you for your comment, I will definitely take a look into BART pre-training.