aaronmueller opened this issue 4 years ago
I think the first step is simply to 'not stop pretraining' on the contexts of the task training data.
Then, here is what I had in mind for a new 'don't stop' hybrid objective, taking as inspiration the 'everything can be LM' idea from Figure 1 of the Raffel et al. T5 paper:
Have the fine-tuning step continue with a variant of the semi-supervised objective alongside task-label prediction. That is, a "document" would become a document + task-specification hybrid. With a bidirectional MLM model, this could take the form of MLM with an extra index or two appended for the task specification:
(original text)
this is a document
(text augmented, pre-mlm)
[cls] this is a document [sep] <token idx 1> <token idx 2> <dependency label>
(hybrid-MLM train examples)
1 [cls] this is a [mask] [sep] <token idx 1> <token idx 2> <dependency label>
2 [cls] this is a document [sep] <token idx 1> [mask] <dependency label>
3 [cls] this is a document [sep] <token idx 1> <token idx 2> [mask]
...
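As a sketch, the augmentation above could be generated like this (toy Python: token strings stand in for real tokenizer IDs, and the function name is made up; one example per maskable position, as in the listing above):

```python
# Hypothetical special tokens; in practice these would be the tokenizer's own.
CLS, SEP, MASK = "[cls]", "[sep]", "[mask]"

def hybrid_mlm_examples(doc_tokens, task_tokens):
    """Build hybrid-MLM training examples: the document and the appended
    task-specification tokens (e.g. <token idx 1> <token idx 2>
    <dependency label>) are masked under one shared MLM objective."""
    sequence = [CLS] + doc_tokens + [SEP] + task_tokens
    examples = []
    for pos, tok in enumerate(sequence):
        # [cls] and [sep] are never masked.
        if tok in (CLS, SEP):
            continue
        masked = list(sequence)
        masked[pos] = MASK
        examples.append((masked, tok))  # (masked input, target token)
    return examples

# Reproduces examples 1-3 from above (plus the remaining positions):
exs = hybrid_mlm_examples(
    ["this", "is", "a", "document"],
    ["<token idx 1>", "<token idx 2>", "<dependency label>"],
)
```

In a real setup you would sample mask positions stochastically (and mask ~15% of tokens per pass) rather than enumerating them all, but the enumeration makes the hybrid structure explicit.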
What might also be an interesting variant is an autoregressive model like mBART (an encoder-decoder model, roughly a BERT-style encoder plus an autoregressive decoder, with a couple of extra parameters) that learns unsupervised denoising -- i.e., takes a document that has been token/span-masked, shuffled, rotated, etc., and returns the original text. This would only make sense if we think having a seq2seq model would be useful.
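A toy version of that denoising setup (a hypothetical helper, not mBART's actual noising code) might look like the following; the training pair for the seq2seq model is (noisy input, original tokens):

```python
import random

MASK = "<mask>"

def add_denoising_noise(tokens, span_len=3, seed=0):
    """Sketch of BART/mBART-style noising: collapse one contiguous span
    to a single <mask> token, then rotate the document. A seq2seq model
    would be trained to map the noisy sequence back to `tokens`."""
    rng = random.Random(seed)
    # Span masking: pick a start position, replace the span with one <mask>.
    start = rng.randrange(max(1, len(tokens) - span_len))
    noisy = tokens[:start] + [MASK] + tokens[start + span_len:]
    # Document rotation: start the sequence at a random token.
    pivot = rng.randrange(len(noisy))
    return noisy[pivot:] + noisy[:pivot]
```

The real mBART recipe also shuffles sentence order and samples span lengths from a Poisson distribution; this sketch only shows the shape of the input/output pair.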
The hope is that the new objective would preserve more structural information by combining the fine-tuning objective with a language-modeling objective (possibly inaccurate, but that's the intuition).
Let's test how much structural information is preserved by doing some structural probing on this model vs. base mBERT when fine-tuned on various tasks. Let's also look at observational evidence à la Linzen/Goldberg.
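For the probing step, a minimal structural probe in the spirit of Hewitt & Manning (2019) learns a linear map B such that squared distances in the projected representation space approximate parse-tree distances between words. A toy numpy sketch (my own gradient-descent loop, not their implementation; hidden states and tree distances would come from the fine-tuned model and a treebank):

```python
import numpy as np

def train_distance_probe(H, D, rank=2, lr=0.05, steps=500, seed=0):
    """Learn B (rank x dim) so that ||B(h_i - h_j)||^2 ~ D[i, j],
    where H is (n_words, dim) hidden states and D is (n_words, n_words)
    parse-tree distances. Plain gradient descent on mean squared error."""
    rng = np.random.default_rng(seed)
    n, dim = H.shape
    B = rng.normal(scale=0.1, size=(rank, dim))
    for _ in range(steps):
        diffs = H[:, None, :] - H[None, :, :]   # (n, n, dim) pairwise h_i - h_j
        proj = diffs @ B.T                      # (n, n, rank) projected diffs
        pred = (proj ** 2).sum(-1)              # predicted squared distances
        err = pred - D
        # Gradient of mean((pred - D)^2) with respect to B.
        grad = 4 * np.einsum("ij,ijr,ijd->rd", err, proj, diffs) / (n * n)
        B -= lr * grad
    return B
```

Comparing probe fit (e.g., Spearman correlation between predicted and gold tree distances) on the hybrid-objective model vs. base mBERT after fine-tuning would quantify how much tree structure each one retains.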