alonj / Same-Task-More-Tokens

The code for the paper: "Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models"
https://arxiv.org/abs/2402.14848
Apache License 2.0

The experiment setting for the "next word prediction" experiment is a bit confusing. #1

Open Zcchill opened 3 months ago

Zcchill commented 3 months ago

I would like to understand how next-word accuracy was calculated in the paper as the number of distractors and the input length grew. Specifically, did the token positions used to compute next-word accuracy include tokens from the distractors? If they did, the conclusion seems straightforward. If they did not, I would like to know where the gold context was placed within the full context in this experiment. Your response would greatly help me better understand the experiments in your paper!

alonj commented 3 months ago

Thanks for your question and interest in our work! To clarify, when testing next-word prediction accuracy, we took samples straight from our dataset (which you can find in this repository or on Huggingface) and randomly sampled token positions at the lengths shown in the figure (roughly matching the points at which we sampled the sample lengths). We then took the continuation token predicted by each model and checked whether it was an exact match for the correct label token. The samples we used contained both information relevant and irrelevant to the original sample questions, but of course there is no such distinction when predicting next tokens sampled from the middle of the input without the questions appended.
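
For illustration, here is a minimal sketch (not the repository's actual evaluation code) of how next-word prediction accuracy at sampled token positions could be measured with a Huggingface causal LM. The model name, the `next_word_accuracy` helper, and the position values are placeholders, not taken from the paper:

```python
# Sketch: exact-match next-word prediction accuracy at chosen token positions.
# Assumes the Huggingface `transformers` library; "gpt2" is a placeholder model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def next_word_accuracy(texts, positions):
    """For each text, truncate at the sampled token position, predict the next
    token greedily, and count exact matches against the token that actually follows."""
    correct, total = 0, 0
    for text, pos in zip(texts, positions):
        ids = tokenizer(text, return_tensors="pt").input_ids[0]
        if pos < 1 or pos >= len(ids):  # skip positions outside the sample
            continue
        prefix = ids[:pos].unsqueeze(0)
        with torch.no_grad():
            logits = model(prefix).logits
        predicted = logits[0, -1].argmax().item()
        correct += int(predicted == ids[pos].item())
        total += 1
    return correct / total if total else 0.0

# Usage (hypothetical): `texts` are dataset samples without the questions
# appended; `positions` are token indices chosen to roughly match the input
# lengths plotted in the figure.
# acc = next_word_accuracy(texts, positions)
```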

Let me know if this answers your question, or if I misunderstood it.