aaronmueller / dont-stop-pretraining

Adapting the Don't Stop Pretraining approach for multilingual applications. Modified by Aaron Mueller and Nathaniel Weir.

structural information #2

Open aaronmueller opened 4 years ago

aaronmueller commented 4 years ago

The idea is that the new objective might preserve more structural information by combining the fine-tuning objective with a language modeling objective (that framing is probably imprecise, but it's the intuition).

Let's test how much structural information is preserved by doing some structural probing on this model vs. base mBERT when fine-tuned on various tasks. Let's also look at observational evidence à la Linzen/Goldberg.
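A minimal sketch of the comparison I have in mind, assuming HuggingFace `transformers`; the local checkpoint path `finetuned-mbert` is a placeholder, and the probe itself (e.g. a Hewitt & Manning-style distance probe) would be trained separately on top of these frozen representations:

```python
# Sketch: extract frozen layer representations from base mBERT vs. a
# fine-tuned checkpoint, to feed into a structural probe.
# The local path "finetuned-mbert" is hypothetical.
import torch
from transformers import AutoModel, AutoTokenizer

def layer_representations(model_name, sentences, layer=7):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
    model.eval()
    reps = []
    with torch.no_grad():
        for sent in sentences:
            enc = tokenizer(sent, return_tensors="pt")
            hidden = model(**enc).hidden_states[layer]  # (1, seq_len, dim)
            reps.append(hidden.squeeze(0))
    return reps

# Agreement-attractor sentence in the Linzen/Goldberg style.
sentences = ["The keys to the cabinet are on the table."]
base_reps = layer_representations("bert-base-multilingual-cased", sentences)
ours_reps = layer_representations("finetuned-mbert", sentences)
# A distance/depth probe would then be trained on each set of representations
# and compared (UUAS, root accuracy, etc.).
```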

nweir127 commented 4 years ago

I think the first step is to simply 'not stop pretraining' on the contexts of the task training data.
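A minimal sketch of that first step, assuming HuggingFace `transformers`/`datasets`; the file `task_contexts.txt` (the raw text of the task training data), the output directory, and the hyperparameters are placeholders:

```python
# Sketch: continue MLM pretraining ("don't stop pretraining") on the raw
# contexts of the task training data. Paths and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

dataset = load_dataset("text", data_files={"train": "task_contexts.txt"})["train"]
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mbert-tapt", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
trainer.save_model("mbert-tapt")
```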

Then, here is what I had in mind for a new 'don't stop' hybrid objective, taking as inspiration the 'everything can be framed as LM' idea from Figure 1 of the Raffel et al. T5 paper:

Have the 'fine-tuning' step be one that continues a variant of the semi-supervised objective alongside task-label prediction. That is, a "document" becomes a document + task-specification hybrid. With a bidirectional MLM model, this could take the form of MLM with an extra index or two for the task specification (a sketch of how to build such examples follows the listing below):

(original text) 
      this is a document 
(text augmented, pre-mlm) 
      [cls] this is a document [sep] <token idx 1> <token idx 2> <dependency label>
(hybrid-MLM train examples) 
1    [cls] this is a [mask] [sep] <token idx 1> <token idx 2> <dependency label>
2    [cls] this is a document [sep] <token idx 1> [mask]  <dependency label>
3    [cls] this is a document [sep] <token idx 1> <token idx 2> [mask]  
...
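A rough sketch of how those hybrid examples could be constructed, assuming the task specification (here a dependency arc: head index, dependent index, label) is serialized as extra tokens after [SEP]; the helper name and token format are illustrative, not a fixed design:

```python
# Sketch: build hybrid MLM examples in which the masked position is sometimes
# a document token and sometimes part of the appended task specification.
# The serialization format (<idx> tokens, label string) is illustrative.
def make_hybrid_examples(doc_tokens, head_idx, dep_idx, dep_label,
                         mask_token="[MASK]"):
    task_spec = [f"<{head_idx}>", f"<{dep_idx}>", dep_label]
    base = ["[CLS]"] + doc_tokens + ["[SEP]"] + task_spec
    examples = []
    # one training example per maskable position (document and task-spec slots)
    for pos in range(1, len(base)):
        if base[pos] == "[SEP]":
            continue
        masked = list(base)
        target = masked[pos]
        masked[pos] = mask_token
        examples.append((masked, pos, target))
    return examples

for toks, pos, target in make_hybrid_examples(
        ["this", "is", "a", "document"], head_idx=4, dep_idx=3, dep_label="det"):
    print(" ".join(toks), "->", target)
```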

Another interesting variant would be an autoregressive model like mBART (an encoder-decoder relative of BERT with a few extra parameters) that learns unsupervised denoising, i.e. it takes a document that has been token/span-masked, shuffled, rotated, etc. and returns the original text. This would only make sense if we think having a seq2seq model would be useful.
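A minimal sketch of that denoising setup, assuming HuggingFace `transformers` and the facebook/mbart-large-cc25 checkpoint; the corruption function below is a simplified stand-in for mBART's actual noise function, and the label tokenization glosses over mBART's target language-code conventions:

```python
# Sketch: mBART-style seq2seq denoising: corrupt the input (span mask + shuffle)
# and train the model to reconstruct the original text. The noising here is a
# simplified placeholder for mBART's real noise function.
import random
from transformers import MBartForConditionalGeneration, MBartTokenizer

tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-cc25")
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")

def corrupt(text, mask_token="<mask>"):
    words = text.split()
    start = random.randrange(len(words))
    words[start:start + random.randint(1, 3)] = [mask_token]  # span masking
    random.shuffle(words)  # crude stand-in for sentence permutation / rotation
    return " ".join(words)

original = "this is a document"
inputs = tokenizer(corrupt(original), return_tensors="pt")
labels = tokenizer(original, return_tensors="pt").input_ids  # reconstruction target
loss = model(**inputs, labels=labels).loss  # denoising reconstruction loss
loss.backward()
```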