aaronmueller opened this issue 4 years ago
I think the first step is simply to 'not stop pretraining' on the contexts of the task training data.
Then, here is what I had in mind for a new 'don't stop' hybrid objective, taking as inspiration the 'everything can be LM' idea from Figure 1 of the Raffel et al. T5 paper:
Have the fine-tuning step continue with a variant of the semi-supervised objective alongside task-label prediction. That is, a "document" would become a document + task-specification hybrid. With a bidirectional MLM model, this could take the form of MLM with an extra index or two appended for the task specification:
(original text)
this is a document
(text augmented, pre-mlm)
[cls] this is a document [sep] <token idx 1> <token idx 2> <dependency label>
(hybrid-MLM train examples)
1 [cls] this is a [mask] [sep] <token idx 1> <token idx 2> <dependency label>
2 [cls] this is a document [sep] <token idx 1> [mask] <dependency label>
3 [cls] this is a document [sep] <token idx 1> <token idx 2> [mask]
...
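As a sketch, the augmentation above could be generated like this (toy Python: token strings stand in for real tokenizer IDs, and the function name is made up; one example per maskable position, as in the listing above):

```python
# Hypothetical special tokens; in practice these would be the tokenizer's own.
CLS, SEP, MASK = "[cls]", "[sep]", "[mask]"

def hybrid_mlm_examples(doc_tokens, task_tokens):
    """Build hybrid-MLM training examples: the document and the appended
    task-specification tokens (e.g. <token idx 1> <token idx 2>
    <dependency label>) are masked under one shared MLM objective."""
    sequence = [CLS] + doc_tokens + [SEP] + task_tokens
    examples = []
    for pos, tok in enumerate(sequence):
        # [cls] and [sep] are never masked.
        if tok in (CLS, SEP):
            continue
        masked = list(sequence)
        masked[pos] = MASK
        examples.append((masked, tok))  # (masked input, target token)
    return examples

# Reproduces examples 1-3 from above (plus the remaining positions):
exs = hybrid_mlm_examples(
    ["this", "is", "a", "document"],
    ["<token idx 1>", "<token idx 2>", "<dependency label>"],
)
```

In a real setup you would sample mask positions stochastically (and mask ~15% of tokens per pass) rather than enumerating them all, but the enumeration makes the hybrid structure explicit.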
What might also be an interesting variant is an autoregressive model like mBART (an encoder-decoder model, roughly a BERT-style encoder plus an autoregressive decoder, with a couple of extra parameters) that learns unsupervised denoising -- i.e., takes a document that has been token/span-masked, shuffled, rotated, etc., and returns the original text. This would only make sense if we think having a seq2seq model would be useful.
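A toy version of that denoising setup (a hypothetical helper, not mBART's actual noising code) might look like the following; the training pair for the seq2seq model is (noisy input, original tokens):

```python
import random

MASK = "<mask>"

def add_denoising_noise(tokens, span_len=3, seed=0):
    """Sketch of BART/mBART-style noising: collapse one contiguous span
    to a single <mask> token, then rotate the document. A seq2seq model
    would be trained to map the noisy sequence back to `tokens`."""
    rng = random.Random(seed)
    # Span masking: pick a start position, replace the span with one <mask>.
    start = rng.randrange(max(1, len(tokens) - span_len))
    noisy = tokens[:start] + [MASK] + tokens[start + span_len:]
    # Document rotation: start the sequence at a random token.
    pivot = rng.randrange(len(noisy))
    return noisy[pivot:] + noisy[:pivot]
```

The real mBART recipe also shuffles sentence order and samples span lengths from a Poisson distribution; this sketch only shows the shape of the input/output pair.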
The hope is that the new objective would preserve more structural information by combining the fine-tuning objective with a language-modeling objective (possibly inaccurate, but that's the intuition).
Let's test how much structural information is preserved by doing some structural probing on this model vs. base mBERT when fine-tuned on various tasks. Let's also look at observational evidence à la Linzen/Goldberg.
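For the probing step, a minimal structural probe in the spirit of Hewitt & Manning (2019) learns a linear map B such that squared distances in the projected representation space approximate parse-tree distances between words. A toy numpy sketch (my own gradient-descent loop, not their implementation; hidden states and tree distances would come from the fine-tuned model and a treebank):

```python
import numpy as np

def train_distance_probe(H, D, rank=2, lr=0.05, steps=500, seed=0):
    """Learn B (rank x dim) so that ||B(h_i - h_j)||^2 ~ D[i, j],
    where H is (n_words, dim) hidden states and D is (n_words, n_words)
    parse-tree distances. Plain gradient descent on mean squared error."""
    rng = np.random.default_rng(seed)
    n, dim = H.shape
    B = rng.normal(scale=0.1, size=(rank, dim))
    for _ in range(steps):
        diffs = H[:, None, :] - H[None, :, :]   # (n, n, dim) pairwise h_i - h_j
        proj = diffs @ B.T                      # (n, n, rank) projected diffs
        pred = (proj ** 2).sum(-1)              # predicted squared distances
        err = pred - D
        # Gradient of mean((pred - D)^2) with respect to B.
        grad = 4 * np.einsum("ij,ijr,ijd->rd", err, proj, diffs) / (n * n)
        B -= lr * grad
    return B
```

Comparing probe fit (e.g., Spearman correlation between predicted and gold tree distances) on the hybrid-objective model vs. base mBERT after fine-tuning would quantify how much tree structure each one retains.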