FMInference / H2O

[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.

Question about the reproduction of XSUM results #20

Open SherrySwift opened 7 months ago

SherrySwift commented 7 months ago

Hi, thanks for your great work! I have some questions about reproducing the XSUM results. I tried to run this command in the h2o_hf directory:

# Full baseline on XSUM
shots=5
GPU_ID=0
bash scripts/summarization/eval.sh xsum ${shots} full ${GPU_ID}

I tested all 1000 samples in xsum_5shot.jsonl with the LLaMA-7B model, but the ROUGE-2 score I got is only about 9%. According to Figure 4 in the paper, the full baseline for XSUM with LLaMA-7B is 12%. I can't figure out the reason for the gap. Could you please give me some advice? Thanks a lot!

Kyriection commented 7 months ago

Hi, thanks for your question. Did you use Llama-2-7b? The model used in the paper is "huggyllama/llama-7b".

SherrySwift commented 7 months ago

Hi, I used huggyllama/llama-7b, but I encountered the following error when I tried to run scripts/summarization/eval.sh:

Traceback (most recent call last):
  File "/data1/H2O-main/h2o_hf/run_summarization.py", line 138, in <module>
    output_sequences = model.generate(
  File "/usr/local/miniconda3/envs/atom/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data1/LLM/transformers/src/transformers/generation/utils.py", line 1719, in generate
    return self.sample(
  File "/data1/LLM/transformers/src/transformers/generation/utils.py", line 2837, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

When I load other models like Llama-2-7b, there is no such error. Do you have any ideas about it? Thanks a lot!

Kyriection commented 7 months ago

Hi, could you provide the exact command and the transformers version you used? I couldn't reproduce the issue on my side with huggyllama/llama-7b.

SherrySwift commented 7 months ago

Thanks for your reply. Here is the command: bash scripts/summarization/eval.sh xsum 5 full 0

The contents of scripts/summarization/eval.sh are:

task=$1
shots=$2
method=$3
GPU=$4
HH_SIZE=$5
RECENT_SIZE=$6

if [[ ${method} == 'h2o' ]]; then
    CUDA_VISIBLE_DEVICES=${GPU} python -u run_summarization.py \
        --input_path data/summarization_data/${task}_${shots}shot.jsonl \
        --output_path summary_results/${task}_${shots}shot_h2o_hh${HH_SIZE}_local${RECENT_SIZE}.jsonl \
        --model_name huggyllama/llama-7b \
        --hh_size ${HH_SIZE} \
        --recent_size ${RECENT_SIZE} \
        --cache_dir ../../llm_weights \
        --enable_h2o_cache
elif [[ ${method} == 'full' ]]; then
    CUDA_VISIBLE_DEVICES=${GPU} python -u run_summarization.py \
        --input_path data/summarization_data/${task}_${shots}shot.jsonl \
        --output_path summary_results/${task}_${shots}shot_full.jsonl \
        --model_name huggyllama/llama-7b
else
    echo 'unknown argument for method'
fi

As for the transformers version, I tried both 4.33.0 and 4.35.0 and encountered the same problem.

SherrySwift commented 7 months ago

By the way, the above error also occurs in the middle of evaluation when I use other models (such as Llama-2-7b). Here is part of the log:

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
rouge-1: 0.310912, rouge-2: 0.118365, rouge-l: 0.260621
 80%|███████████████████████████████████████████████████████████████████▋                 | 796/1000 [1:12:14<18:08,  5.33s/it]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
rouge-1: 0.310952, rouge-2: 0.118289, rouge-l: 0.260724
 80%|███████████████████████████████████████████████████████████████████▋                 | 797/1000 [1:12:19<18:08,  5.36s/it]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 80%|███████████████████████████████████████████████████████████████████▋                 | 797/1000 [1:12:23<18:26,  5.45s/it]
Traceback (most recent call last):
  File "/data1/H2O-main/h2o_hf/run_summarization.py", line 137, in <module>
    output_sequences = model.generate(
  File "/usr/local/miniconda3/envs/atom/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data1/LLM/transformers/src/transformers/generation/utils.py", line 1719, in generate
    return self.sample(
  File "/data1/LLM/transformers/src/transformers/generation/utils.py", line 2837, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

While searching for solutions, I found this issue. Is it possible that this error is related to the beam sampling used in the generation process?
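
To narrow it down, I may also check whether the logits already contain inf/nan before the sampling step. A rough debugging sketch (my own, not code from the repo; it assumes model and input_ids for the failing sample are already set up):

import torch

# Single forward pass on the problematic prompt; inspect the next-token
# logits directly, before any sampling happens.
with torch.no_grad():
    logits = model(input_ids).logits[:, -1, :]
print(torch.isnan(logits).any().item(), torch.isinf(logits).any().item())

If the installed transformers version supports it, generate() also accepts remove_invalid_values=True, which replaces inf/nan logits before sampling.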

Kyriection commented 7 months ago

Hi, I tested the samples from 795 to 800, but didn't encounter the same error.

[screenshot: generation output for samples 795-800, completing without the error]

Based on your error message, could you try specifying "pad_token_id=tokenizer.eos_token_id" in the model.generate() call?
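
A minimal sketch of what that would look like (the prompt variable and max_new_tokens value are placeholders; the model and tokenizer are assumed to be loaded as in run_summarization.py):

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_sequences = model.generate(
    **inputs,                             # also passes attention_mask, which silences the warning above
    max_new_tokens=64,                    # placeholder value
    pad_token_id=tokenizer.eos_token_id,  # explicit pad token for open-ended generation
)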

SherrySwift commented 7 months ago

Thanks for your patience, but specifying "tokenizer.pad_token_id=tokenizer.eos_token_id" still does not solve the problem. Since I couldn't come up with a better solution, I just skipped sample 797 in the end.

Also, I noticed that you set 'temperature=0.3, top_p=1, do_sample=True' in the model.generate() call in h2o_hf/run_summarization.py. Is there any particular reason for these parameter settings? Just wondering.

Kyriection commented 7 months ago

Hi, I followed the original HELM setup for these parameters. Generally, a larger temperature brings more diversity and makes the output less deterministic.
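
For reference, the sampling part of the call looks roughly like this (a sketch rather than the exact code in run_summarization.py; the generation length is a placeholder):

output_sequences = model.generate(
    input_ids,
    attention_mask=attention_mask,
    max_new_tokens=64,                    # placeholder
    do_sample=True,                       # sample instead of greedy/beam decoding
    temperature=0.3,                      # HELM-style setting
    top_p=1,
    pad_token_id=tokenizer.eos_token_id,
)

Note that with do_sample=False the next token is chosen by argmax and torch.multinomial is never called, so the RuntimeError above would not be triggered at that step, although any inf/nan logits would still be there.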

SherrySwift commented 7 months ago

Sorry to bother you again. In the h2o_hf/data directory there are several different jsonl files for the XSUM dataset. To reproduce the result in Figure 4 of the paper (i.e. ROUGE-2 of 12 for LLaMA-7B), which jsonl file should I use? I noticed that the contents of xsum_5shot.jsonl and xsum.jsonl are quite different, so I'm a little confused about that.

ThisisBillhe commented 6 months ago

Hi everyone, I have another question regarding reproducing XSUM results. In h2o_hf/scripts/summarization/eval.sh, it sets a fixed HH_SIZE and RECENT_SIZE, but the x-axis of figure 4 represents KV Cache Budget (%), so what is the relationship between size and percentage? The total number of tokens varies with each sample, right?
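
My current guess (not confirmed by the repo, just how I would map a percentage to the two sizes) is that the budget gets split between the heavy-hitter and recent pools and scaled by each sample's prompt length, something like:

# Hypothetical mapping from a relative KV-cache budget to hh_size/recent_size;
# the 50/50 split between heavy-hitter and recent tokens is my assumption.
def budget_to_sizes(prompt_len, budget=0.20):
    total = int(prompt_len * budget)  # e.g. a 20% budget
    hh_size = total // 2
    recent_size = total - hh_size
    return hh_size, recent_size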

slatter666 commented 6 months ago

Hi, I used huggyllama/llama-7b, but I encountered the following error when I tried to run scripts/summarization/eval.sh:

Traceback (most recent call last):
  File "/data1/H2O-main/h2o_hf/run_summarization.py", line 138, in <module>
    output_sequences = model.generate(
  File "/usr/local/miniconda3/envs/atom/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data1/LLM/transformers/src/transformers/generation/utils.py", line 1719, in generate
    return self.sample(
  File "/data1/LLM/transformers/src/transformers/generation/utils.py", line 2837, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

When I load other models like Llama-2-7b, there is no such error. Do you have any ideas about it? Thanks a lot!

I use Llama-2-7b but still get this error (with float16). I checked this piece of data: the prompt has 6768 tokens, so I guess the prompt is too long and the model collapses.
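
In case it helps anyone check this, I counted the prompt tokens roughly like this (the "article" field name is a guess on my part; replace it with whatever key xsum_5shot.jsonl actually uses):

import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
with open("data/summarization_data/xsum_5shot.jsonl") as f:
    for i, line in enumerate(f):
        prompt = json.loads(line)["article"]          # field name guessed
        n_tokens = len(tokenizer(prompt).input_ids)
        if n_tokens > 2048:                           # LLaMA-1 context window (4096 for Llama-2)
            print(i, n_tokens)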

zwxandy commented 5 months ago

Thanks for your patience, but specifying "tokenizer.pad_token_id=tokenizer.eos_token_id" still does not solve the problem. Since I couldn't come up with a better solution, I just skipped sample 797 in the end.

Also, I noticed that you set 'temperature=0.3, top_p=1, do_sample=True' in the model.generate() call in h2o_hf/run_summarization.py. Is there any particular reason for these parameter settings? Just wondering.

Hi, I have also hit the same bug when the generation process reaches 797/1000:

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

So I tried to test sample 797 by modifying line #117 to

requests = requests[795:]

As expected, the bug occurs at 2/205 again. So I went to check the dataset, i.e., xsum_5shot.jsonl, and found that this sample's line is flagged by my editor with

Tokenization is skipped for long lines for performance reasons. This can be configured via editor.maxTokenizationLineLength.

So the reason for the model collapse appears to be that the prompt is too long.
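
A possible workaround instead of skipping the sample by hand (my own sketch, not part of run_summarization.py; the "article" field name is a guess) is to filter or truncate over-length prompts before generation:

max_prompt_tokens = 2048                  # LLaMA-1-7B context window (4096 for Llama-2)
for request in requests:                  # the script already iterates over the loaded requests
    input_ids = tokenizer(request["article"], return_tensors="pt").input_ids.to(model.device)
    if input_ids.shape[1] > max_prompt_tokens:
        # either skip the over-length prompt, or truncate from the left instead:
        # input_ids = input_ids[:, -max_prompt_tokens:]
        continue
    # ... call model.generate(input_ids, ...) as before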

zwxandy commented 5 months ago

Hi everyone, I have another question regarding reproducing XSUM results. In h2o_hf/scripts/summarization/eval.sh, it sets a fixed HH_SIZE and RECENT_SIZE, but the x-axis of figure 4 represents KV Cache Budget (%), so what is the relationship between size and percentage? The total number of tokens varies with each sample, right?

Hi, thanks for your great work! I have some questions about reproducing the XSUM results. I tried to run this command in the h2o_hf directory:

# Full baseline on XSUM
shots=5
GPU_ID=0
bash scripts/summarization/eval.sh xsum ${shots} full ${GPU_ID}

I tested all 1000 samples in xsum_5shot.jsonl with the LLaMA-7B model, but the ROUGE-2 score I got is only about 9%. According to Figure 4 in the paper, the full baseline for XSUM with LLaMA-7B is 12%. I can't figure out the reason for the gap. Could you please give me some advice? Thanks a lot!

Hi, I also used huggyllama/llama-7b to run the XSUM task and reached the same conclusion as you:

rouge-1: 0.267594, rouge-2: 0.098886, rouge-l: 0.222643

Do you have any ideas about this?

zwxandy commented 5 months ago

Hi everyone, I have another question regarding reproducing XSUM results. In h2o_hf/scripts/summarization/eval.sh, it sets a fixed HH_SIZE and RECENT_SIZE, but the x-axis of figure 4 represents KV Cache Budget (%), so what is the relationship between size and percentage? The total number of tokens varies with each sample, right?

Hi, I would also like to know the answer to this question. Do you have any ideas?