EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Attention masking over continuation enc #351

Closed. bmosaicml closed this issue 8 months ago.

bmosaicml commented 2 years ago

In BaseLM we pass the context and continuation into the model all in one tensor. Why do we not need to provide an attention mask to mask out the whole continuation? Won't this allow the models to attend over previous parts of the continuation while producing subsequent portions?

I think @leogao2 may have written this code, do you have any insight into why this is?

https://github.com/EleutherAI/lm-evaluation-harness/blob/2598f990372e17f39ca432b03f0c279f6fa6118b/lm_eval/base.py#L264-L269
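For reference, here is a minimal sketch of the pattern being asked about, assuming a Hugging-Face-style causal LM whose forward pass returns `.logits` (this is an illustration, not the harness's actual implementation): context and continuation are concatenated into one tensor, only the built-in causal mask is applied, and the continuation is scored from the shifted logit positions.

```python
import torch
import torch.nn.functional as F

def score_continuation(model, context_enc, continuation_enc):
    # context_enc, continuation_enc: 1-D LongTensors of token ids
    inp = torch.cat([context_enc, continuation_enc]).unsqueeze(0)       # [1, ctx+cont]
    with torch.no_grad():
        logits = model(inp).logits                                      # [1, ctx+cont, vocab]
    log_probs = F.log_softmax(logits, dim=-1)

    # The logit at position i predicts the token at position i+1, so the
    # continuation is scored by the logits starting at the last context token.
    cont_len = continuation_enc.shape[0]
    cont_logprobs = log_probs[0, -cont_len - 1 : -1, :]                 # [cont, vocab]

    target = continuation_enc.unsqueeze(-1)                             # [cont, 1]
    token_logprobs = torch.gather(cont_logprobs, 1, target).squeeze(-1) # [cont]
    greedy = cont_logprobs.argmax(dim=-1)
    is_greedy = bool((greedy == continuation_enc).all())
    return token_logprobs.sum().item(), is_greedy
```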

a-cavalcanti commented 2 years ago

I'm a bit confused about the evaluation. I was debugging the code (using the lambada task) and came across this situation. Is the input being cut at the wrong position?

In this example the word "drapes" appears as "drap", and the model correctly completes it with "es". The target is "drapes", so because of the wrong cut the match fails.

Can someone comment on this?

Input text (inp): Yanking back a fistful of the thief's hair and sinking her teeth into an unprotected throat. It was a strange, unsettling vision, but for a moment it almost seemed real. She became aware that Bonnie and Meredith were looking at her. "Well?" she said, feeling slightly uncomfortable. "I could tell you weren't listening," sighed Bonnie It can be anywhere from 4% up to 12%, with the average around 6% or 8%. The broker then turns around and shares his or her proceeds with the selling broker, who is the broker representing the buyer. (Confusing, I know.) In a net listing, however, the owner receives a specified — net — amount from the sale, with the excess going to the broker I wanted to shout myself hoarse at him, but I couldn't do it. I just wanted him gone. When I didn't move, he closed the heavy drapes around the bed, sealing me off in a dark little cage. 23 Antiquary I didn't sleep for hours. I could hear him at his desk, writing away, hidden from me only by the drap

Model output (greedy_tokens): tensor([[259, 299]]) es

Target (cont_toks): tensor([[95904, 299]]) drapes

bmosaicml commented 2 years ago

I am fairly confident the indexing for the targets is correct. Is your confusion possibly due to the fact that the n-th target corresponds to the prediction for the (n+1)-th token?

i.e. if the token "drap" is at index 100 in the input, the correct output at index 100 would be "es".

If possible, could you move your question into a separate issue, since I don't think it is relevant to the issue I raised? :)
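To make the off-by-one alignment concrete, here is a toy illustration (the token split below is made up for readability, not taken from a real tokenizer):

```python
# Hypothetical token split around the end of the LAMBADA passage above.
tokens = ["hidden", " from", " me", " only", " by", " the", " drap", "es"]

# A causal LM's logits at position i are a distribution over the token at
# position i + 1, so the prediction made *at* " drap" is scored against "es".
for i in range(len(tokens) - 1):
    print(f"logits at {tokens[i]!r} -> target {tokens[i + 1]!r}")
```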

Thanks


haileyschoelkopf commented 8 months ago

Why do we not need to provide an attention mask to mask out the whole continuation? Won't this allow the models to attend over previous parts of the continuation while producing subsequent portions?

This is a feature, not a bug: when measuring the loglikelihood of an N-token completion conditioned on a context string, the model must first predict token 1 of the completion; it then has that first completion token in context to condition on when predicting token 2, and so on for the remaining N-1 completion tokens.
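As a rough sketch of why no extra mask is needed (function names here are illustrative, not the harness's API): the loglikelihood of the continuation factorizes into per-token terms, each conditioned on the context plus the gold continuation prefix, which is exactly what a single causal-masked forward pass over the concatenated `[context; continuation]` tensor computes.

```python
import torch
import torch.nn.functional as F

def loglikelihood_factorized(model, context_enc, continuation_enc):
    total = 0.0
    for i in range(continuation_enc.shape[0]):
        # Condition on the context plus the *gold* continuation tokens so far.
        prefix = torch.cat([context_enc, continuation_enc[:i]]).unsqueeze(0)
        with torch.no_grad():
            logits = model(prefix).logits[0, -1]          # next-token distribution
        total += F.log_softmax(logits, dim=-1)[continuation_enc[i]].item()
    # Matches the single-forward-pass score up to numerical error.
    return total
```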