FMInference / H2O

[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.

Cannot reproduce results with LLaMA-7B on OpenBookQA #24

Open AkideLiu opened 5 months ago

AkideLiu commented 5 months ago

Full Cache Baseline (huggyllama/llama-7b): bash scripts/lm_eval/full_cache.sh openbookqa huggyllama/llama-7b llama

{
  "results": {
    "openbookqa": {
      "acc": 0.446,
      "acc_stderr": 0.022252153078595897,
      "acc_norm": 0.49,
      "acc_norm_stderr": 0.022378596989230774
    }
  },
  "versions": {
    "openbookqa": 0
  }
}

H2O (huggyllama/llama-7b): bash scripts/lm_eval/h2o.sh openbookqa huggyllama/llama-7b llama

{
  "results": {
    "openbookqa": {
      "acc": 0.412,
      "acc_stderr": 0.02203367799374087,
      "acc_norm": 0.462,
      "acc_norm_stderr": 0.022318338119870537
    }
  },
  "versions": {
    "openbookqa": 0
  }
}

As shown in the paper:

[screenshot of the results reported in the paper]
PiotrNawrot commented 4 months ago

+1, I'm getting exactly the same results

Kyriection commented 4 months ago

Hi, the results in Table 6 are obtained from OPT-30B (as described in Section 5.3, Q3). For practical use, you can use the accumulated attention scores obtained from the whole prefilling stage. Since OpenBookQA only requires one decoding step, our current implementation is a simulation version that decomposes the original prefilling stage into two parts and treats the second part as a simulated decoding stage. In this simulation, we only use local statistics of the accumulated attention scores, which can be biased when the sequence length is extremely small.
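
For readers following the thread, here is a minimal sketch (not the repo's exact code) of how eviction based on accumulated attention scores can be expressed; the function name, shapes, and budget parameters below are illustrative assumptions:

```python
# Minimal sketch of heavy-hitter (H2O-style) KV selection from accumulated
# attention scores; names and shapes are assumptions, not the repo's code.
import torch

def select_kv_to_keep(attn_weights: torch.Tensor,
                      heavy_budget: int,
                      recent_budget: int) -> torch.Tensor:
    """attn_weights: [num_heads, q_len, kv_len] attention probabilities.

    Returns a boolean mask over the kv_len axis marking entries to keep.
    """
    # Accumulate the attention mass each key position has received so far.
    acc_scores = attn_weights.sum(dim=1)                 # [num_heads, kv_len]
    kv_len = acc_scores.shape[-1]

    keep = torch.zeros_like(acc_scores, dtype=torch.bool)
    # Always keep the most recent tokens (the "recent" window).
    keep[:, max(0, kv_len - recent_budget):] = True
    # Keep the heavy hitters: positions with the largest accumulated scores.
    heavy_idx = acc_scores.topk(min(heavy_budget, kv_len), dim=-1).indices
    keep.scatter_(-1, heavy_idx, True)
    return keep
```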

PiotrNawrot commented 4 months ago

Hey @Kyriection - thanks a lot for your response and the extra clarification. I'm having one more issue with reproducing Figure 8 from the latest version of the paper. I followed your setup exactly and haven't changed anything in the code - just calling the commands from the README. Below I paste a screenshot of an Excel sheet with my results - in my attempt the downstream scores degrade much more quickly than reported in Figure 8. Do you have any idea why I cannot reproduce those results? I'm using huggyllama/llama-7b and equal heavy and recent ratios.

[screenshot of the Excel sheet with the reproduced Figure 8 scores]

PiotrNawrot commented 4 months ago

Moreover, I'm also having issues with reproducing the Table 2 results from the paper for OPT-30B. Again, I believe I'm strictly following the commands from the README. It would be of great help if you could comment on this - and congrats once again on the amazing work!

[screenshot of the reproduced Table 2 results]

PiotrNawrot commented 4 months ago

"and for practical use, you can use the accumulation attention scores obtained from the whole prefilling stage"

Did you use scores from prefilling stage for any of the downstream results reported in the paper or did you use the simulated decoding? I believe that the implementation in the repo, at least for the LM-Eval, follows the simulated decoding approach.

Kyriection commented 4 months ago

Hi, we adjust the ratio of how much of the prefilling stage is used for the simulated decoding approach. Since some input samples contain only tens of tokens, using 20% of them for calculating accumulated attention scores is highly biased. For simplicity, you can directly use the whole prefilling stage for calculating the scores, which is a reasonable and practical setting.

PiotrNawrot commented 4 months ago

Yes, I understand - is this logic implemented somewhere in the code?

Also, do you have any idea what could be the reason behind my suboptimal results?

Kyriection commented 4 months ago

Hi, you can use the implementation here: https://github.com/FMInference/H2O/blob/main/h2o_hf/utils_lm_eval/modify_llama.py#L152. (I tested the current implementation with llama-1-7b on openbookqa; full-cache accuracy is 44.6 and H2O is 44.4.)

The previous simulation implementation directly used the first 20% of the prefilling stage for calculating accumulated attention scores, which is biased when input samples contain only tens of tokens. This might be the reason behind the suboptimal results. By increasing the ratio of the prefilling stage used for calculating the accumulated attention scores, or by directly using the whole prefilling stage (global statistics), this bias can be largely mitigated, resulting in better performance.
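
As a rough illustration of the local-vs-global distinction described above, the sketch below accumulates attention scores over either the first fraction of the prefill queries (local statistics) or all of them (global statistics); `prefill_ratio` is an assumed name for illustration, not a variable from the repo:

```python
# Illustrative sketch only; assumes attn_weights is the [q_len, kv_len]
# attention matrix produced by the prefill forward pass.
import torch

def accumulated_scores(attn_weights: torch.Tensor,
                       prefill_ratio: float = 1.0) -> torch.Tensor:
    """Sum attention over the first `prefill_ratio` fraction of query rows.

    prefill_ratio=0.2 mimics the biased "first 20%" local statistics;
    prefill_ratio=1.0 uses the whole prefilling stage (global statistics).
    """
    q_len = attn_weights.shape[-2]
    n_queries = max(1, int(q_len * prefill_ratio))
    return attn_weights[..., :n_queries, :].sum(dim=-2)  # [..., kv_len]
```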

yinwangsong commented 2 months ago

"and for practical use, you can use the accumulation attention scores obtained from the whole prefilling stage"

Did you use scores from prefilling stage for any of the downstream results reported in the paper or did you use the simulated decoding? I believe that the implementation in the repo, at least for the LM-Eval, follows the simulated decoding approach.

Hello, did you find the code for the "simulated decoding" in this repo? Thanks.