bigscience-workshop / t-zero

Reproduce results and replicate training of T0 (Multitask Prompted Training Enables Zero-Shot Task Generalization)
Apache License 2.0

Prediction for multi-token multiple choice? #38

Closed AADeLucia closed 2 years ago

AADeLucia commented 2 years ago

The prompt dataset contains a mix of single-token and multi-token multiple-choice options. The run_eval.py code appears to be written only for single-token multiple choice; I only see a single call to forward:

https://github.com/bigscience-workshop/t-zero/blob/master/evaluation/run_eval.py#L348

How do you calculate the probability of each multi-token/phrase option? Is that code in this repo?

Thanks.

AADeLucia commented 2 years ago

@VictorSanh ?

VictorSanh commented 2 years ago

Hi @AADeLucia, thanks for your patience, I was heads down wrapping up a sprint.

The model's inference does support multi-token labels. That's why we need the label token masking in the forward pass: https://github.com/bigscience-workshop/t-zero/blob/a961d57704c682a8ef58f56bb9c8a41a8bd8f1a8/t0/model.py#L56, and we calculate the log probability of the sequence from that.
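In rough terms, the computation looks something like the sketch below. This is only an illustration with assumed variable names, not the repo's actual model.py code, but it shows how the mask keeps padding positions out of the sequence score:

```python
import torch
import torch.nn.functional as F

def sequence_log_prob(logits: torch.Tensor,
                      labels: torch.Tensor,
                      pad_token_id: int) -> torch.Tensor:
    """Summed log probability of each label sequence.

    logits: (batch, seq_len, vocab) from one teacher-forced forward pass
    labels: (batch, seq_len) token ids of the (possibly multi-token) option
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # Log probability the model assigns to each gold label token.
    token_log_probs = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    # Mask padding positions so they don't contribute to the sequence score.
    mask = (labels != pad_token_id).to(token_log_probs.dtype)
    return (token_log_probs * mask).sum(dim=-1)
```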

AADeLucia commented 2 years ago

Ah, thank you. I know this is outside the scope of T0 support, but could you point me to any resources that explain how the multi-token inference works? I'm familiar with autoregressive language models (e.g., GPT-2) but not with how this works for text infilling.

In the T5 paper:

The decoder in an encoder-decoder Transformer is used to autoregressively produce an output sequence. That is, at each output timestep, a token is sampled from the model’s predicted distribution and the sample is fed back into the model to produce a prediction for the next output timestep, and so on

This makes me think the model decodes greedily (i.e., feed in token1, then token2), but I only see a single call to forward to produce the multi-token output.

Is the probability calculated independently? i.e., "token1" following the input and "token2" following the input? Or does the order matter?
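For concreteness, here is roughly the step-by-step loop I'm picturing from the T5 description (the checkpoint name, prompt, and length limit are just placeholders, not anything from this repo):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("bigscience/T0_3B")  # placeholder checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0_3B").eval()

enc = tokenizer("Is this review positive or negative? Review: great film.",
                return_tensors="pt")
# Start the decoder with its start token and feed each prediction back in.
decoder_ids = torch.tensor([[model.config.decoder_start_token_id]])

with torch.no_grad():
    for _ in range(10):  # decode at most 10 tokens
        out = model(**enc, decoder_input_ids=decoder_ids)
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        decoder_ids = torch.cat([decoder_ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break

print(tokenizer.decode(decoder_ids[0], skip_special_tokens=True))
```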

VictorSanh commented 2 years ago

I would recommend this blog post if you want to better understand generation methods!

For T0, we are not generating output sequences. Instead, we take the multiple-choice options (the few classification options) and compute their probability (log probability, to be exact) under the model: the log probability of the option through the decoder, conditioned on the encoder, which has been fed the input. So we are literally feeding the input to the encoder, feeding each option to the decoder, and computing its log prob.
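To make that concrete, here is a minimal, self-contained sketch of the idea (the checkpoint name, prompt, and options below are placeholders, and this is not the actual run_eval.py script): feed the input to the encoder, teacher-force each option through the decoder in a single forward pass, and compare the options' log probabilities.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("bigscience/T0_3B")  # placeholder checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0_3B").eval()

prompt = "Review: a wonderful little film. Is this review positive or negative?"
options = ["positive", "negative"]  # each option may span several tokens

enc = tokenizer(prompt, return_tensors="pt")
scores = []
with torch.no_grad():
    for option in options:
        labels = tokenizer(option, return_tensors="pt").input_ids
        out = model(**enc, labels=labels)  # one forward pass per option
        # out.loss is the mean cross-entropy over the label tokens, so the
        # summed log probability of the option is -loss * number_of_tokens.
        # (When batching, padded label positions must be masked out instead,
        # which is what the label token masking in model.py is for.)
        scores.append(-out.loss.item() * labels.shape[-1])

print(options[scores.index(max(scores))])  # the highest-probability option wins
```

Because the labels are teacher-forced through the decoder, each option token is scored conditioned on the preceding option tokens, so the per-token probabilities are not independent and the order does matter.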

AADeLucia commented 2 years ago

Got it, I was confusing log prob with what you would get going through generate(). Thank you!