marcofarina84 opened 1 year ago
Great, thanks! Just one last clarification. I might be misunderstanding the code, but it seems like the function is feeding only the last generated token to the amateur, so the amateur is computing $p(x_i \mid x_{i-1})$. Can you confirm this? Section 3.4 of the paper, however, seems to state that the amateur is conditioned on the last token of the prompt plus all the generated tokens.
Hi,
I think the code is doing what section 3.4 states: conditioning on the last token of the prompt plus the generated tokens. You can verify this by printing the past_key_values argument. This works because of the caching implementation in Hugging Face Transformers: once a token is generated, it is encoded into past_key_values, which saves redundant computation, so only the newest token needs to be passed at each step.
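A minimal pure-Python sketch of the idea (this is not the Hugging Face internals, just an illustration; the class and token names are hypothetical). Even though each step receives only one token, the cache carries the full conditioning context:

```python
# Toy illustration of KV-style caching: each generation step receives only the
# newest token, but the cache (standing in for past_key_values) retains the
# full conditioning context, so the model effectively conditions on everything
# appended so far. The class and token strings are made up for illustration.

class ToyCachedModel:
    def __init__(self):
        self.cache = []  # plays the role of past_key_values

    def step(self, new_token):
        # Only `new_token` is passed in, mirroring how a single token is fed
        # once the cache is populated.
        self.cache.append(new_token)
        # Return the effective context length to show it grows each step,
        # even though the input is always a single token.
        return len(self.cache)

model = ToyCachedModel()
# Prime the cache with the last prompt token only (the section 3.4 setup):
model.step("last_prompt_token")
effective_context = [model.step(t) for t in ["gen_1", "gen_2", "gen_3"]]
print(effective_context)  # [2, 3, 4]
```

So the single-token input is an efficiency detail of the caching API, not a statement about what the model is conditioned on.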
Dear @XiangLi1999 and @ari-holtzman, if I understand the paper correctly, section 3.4 mentions that the amateur (student) model is conditioned on a context window that starts from the last token of the prompt. I cannot find any trace of such a choice in the code; for instance, here and here the whole input is passed to the amateur model, exactly as the expert sees it.
I cannot find the corresponding study in the ablation script either.
Am I missing some argument/logic that sets the amateur's context window somewhere else in the code?
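For illustration, this is the kind of input restriction I would have expected somewhere in the amateur's forward pass (token IDs are made up, and this slicing is my reading of section 3.4, not code from the repo):

```python
# Hypothetical sketch of the context restriction described in section 3.4:
# the amateur sees only the last prompt token plus all generated tokens,
# while the expert sees the full sequence. All token IDs are invented.

prompt_ids = [101, 7592, 2088]      # full prompt
generated_ids = [2023, 2003, 1037]  # tokens generated so far

expert_input = prompt_ids + generated_ids        # full context for the expert
amateur_input = prompt_ids[-1:] + generated_ids  # last prompt token + generated

print(expert_input)   # [101, 7592, 2088, 2023, 2003, 1037]
print(amateur_input)  # [2088, 2023, 2003, 1037]
```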
Best, Marco