FranxYao / Long-Context-Data-Engineering

Implementation of paper Data Engineering for Scaling Language Models to 128K Context

It seems the result we get is not the same as the repo shows #3

Open linbeyoung opened 7 months ago

linbeyoung commented 7 months ago

[image: needle-in-a-haystack heatmap from our reproduction]

This is the result we get with the code in this repo. We followed the README step by step, making sure the environment, model, and requirements are the same as in the repo, but we are puzzled that we cannot get the same scores, especially at around 4K tokens, where the score is very low. We reproduced using the repository's source code, model, and environment and obtained the figure above; could you advise where the problem might be?

FranxYao commented 7 months ago

Hi, thanks for the interest! I wonder:

FranxYao commented 7 months ago

OK, try this branch, which may fix the repetition issue at 4K length: https://github.com/FranxYao/Long-Context-Data-Engineering/tree/fix_rope

Where the difference is at the following line: https://github.com/FranxYao/Long-Context-Data-Engineering/blob/5b1fbd419e7a2ebc8644c7b78b8f7ed7186dcbe2/eval/needle/needle_in_haystack.py#L171
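For orientation only, the linked diff is the authoritative reference; RoPE-related differences in an evaluation script like this usually come down to how the RoPE configuration is set when the checkpoint is loaded. The sketch below shows the standard `transformers` knobs for that (`rope_scaling` and `rope_theta`); it is an illustrative assumption, not the actual content of the fix_rope branch, and the model id and values are placeholders.

```python
# Illustrative sketch only -- NOT the actual change in the fix_rope branch
# (see the diff linked above). It shows the usual knobs that control RoPE
# behaviour when loading a LLaMA-family checkpoint with transformers >= 4.33.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "yaofu/llama-2-7b-80k"  # assumed HF repo id; use the checkpoint from the README

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    # Option 1: position interpolation (placeholder values).
    # rope_scaling={"type": "linear", "factor": 16.0},
    # Option 2: a larger RoPE base frequency (placeholder value).
    # rope_theta=5_000_000.0,
)
```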

linbeyoung commented 7 months ago

llama-2-7b-80k-result.zip: this is the output we got yesterday. We are now trying the new branch and the new prompt.

FranxYao commented 7 months ago
[image: results from the fix_rope branch]

Here is what I get from the new branch.

The model behavior is quite interesting, though. If you can confirm you get similar results, I'll merge it into the main branch.

linbeyoung commented 7 months ago

> Here is what I get from the new branch. The model behavior is quite interesting though. If you could confirm you can get similar results I'll merge it to the main branch.

Does it use the original prompt?

FranxYao commented 7 months ago

Yes, it does (though it would also be interesting to compare the two prompts).

linbeyoung commented 7 months ago

[screenshot: results after switching to the new branch] We've run the new branch and observed improved results; it is indeed significantly better. However, the overall score is 0.848, which still shows a slight discrepancy compared with the results reported in your repository. Could there be any potential reasons we might have overlooked?

FranxYao commented 7 months ago

One comment before addressing the problem: I won't close this issue until more people have seen it and verified whether they can replicate my results.

Back to the problem: let me first list the relevant packages here:

- torch==2.0.0+cu118
- transformers==4.35.2
- flash-attn==2.3.6
- tensor_parallel==2.0.0

Could you check whether your torch is a different version?
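If it helps with the comparison, here is a small generic snippet (not part of the repo) that prints the locally installed versions of these packages via the standard-library `importlib.metadata`:

```python
# Quick environment check (not part of the repo): prints the installed
# versions of the packages listed above so they can be compared against
# the reference setup. Tries both hyphen and underscore spellings since
# distribution names can differ from import names.
from importlib.metadata import version, PackageNotFoundError

for name in ("torch", "transformers", "flash-attn", "flash_attn",
             "tensor-parallel", "tensor_parallel"):
    try:
        print(f"{name}: {version(name)}")
    except PackageNotFoundError:
        pass  # this spelling is not an installed distribution name
```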

Also, if you are using the default prompt, maybe try adding "The best thing to do in San Francisco is" after "Answer:"?

rooa commented 6 months ago

Hey @FranxYao, thanks for the great work on this paper. I was wondering: did you use the prompt in the code or the modified prompt above for the figure in the paper?

# prompt in the code
f"<|im_start|> This is a very long story book: <book> {context} </book>.\n Based on the content of the book, Question: {self.retrieval_question}\nAnswer:"

# prompt suggested above
f"<|im_start|> This is a very long story book: <book> {context} </book>.\n Based on the content of the book, Question: {self.retrieval_question}\nAnswer: The best thing to do in San Francisco is"

Thanks!

HannahBenita commented 6 months ago

@marcobellagente93 and I have noticed a similar inconsistency with the needle-in-a-haystack evaluation for another project. We have tracked the problem down to the reading of the Paul Graham essays with glob.glob, which is non-deterministic: the files are loaded in arbitrary order, leading to a different `{context}` being inserted into the prompt, which in turn leads to different model behaviour. Interestingly, this phenomenon only appears across different clones of the repo; within a single clone the `{context}` seems to be consistent. You can easily verify this behaviour by cloning the needle-in-a-haystack repo twice and printing the first 200 characters of the context in the read_context_files function.

For https://arxiv.org/abs/2402.17834 we therefore opted to report the mean and std across 10 runs with random context.
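A minimal sketch of both mitigations (the path, function name, and arguments below are illustrative, not the repo's exact code): sorting the glob.glob result makes the file order, and hence the context, deterministic across clones, while an explicit seed gives reproducible random orderings for mean/std reporting.

```python
import glob
import random

def read_context_files(essay_glob="PaulGrahamEssays/*.txt", seed=None):
    """Build the haystack context from the essay files.

    Sorting the glob result makes the file order (and hence the context)
    deterministic across clones; passing a seed instead gives a reproducible
    random order for mean/std style evaluations.
    """
    files = sorted(glob.glob(essay_glob))  # raw glob order is filesystem-dependent
    if seed is not None:
        random.Random(seed).shuffle(files)

    context = ""
    for path in files:
        with open(path, "r", encoding="utf-8") as f:
            context += f.read()
    return context

# Deterministic context, identical in every clone of the repo:
# context = read_context_files()
# One of several randomized runs for reporting mean/std:
# context = read_context_files(seed=3)
```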

rooa commented 6 months ago

Great find! I totally missed that. The file order being different after every clone but consistent within each clone is probably because the files are copied and registered by the file system in an arbitrary order.

FranxYao commented 6 months ago

> @marcobellagente93 and I have noticed a similar inconsistency with the needle-in-a-haystack evaluation for another project. We have tracked the problem down to the reading of the Paul Graham essays with glob.glob, which is non-deterministic: the files are loaded in arbitrary order, leading to a different `{context}` being inserted into the prompt, which in turn leads to different model behaviour. Interestingly, this phenomenon only appears across different clones of the repo; within a single clone the `{context}` seems to be consistent. You can easily verify this behaviour by cloning the needle-in-a-haystack repo twice and printing the first 200 characters of the context in the read_context_files function.
>
> For https://arxiv.org/abs/2402.17834 we therefore opted to report the mean and std across 10 runs with random context.

Thanks for the meticulous study! I guess this may be part of the reason.

Also, I have updated the code and merged the fix_rope branch. The model should give more stable output now.