RUCAIBox / LLMBox

A comprehensive library for implementing LLMs, including a unified training pipeline and comprehensive model evaluation.
MIT License

Evaluating multiple subsets together gives different results than evaluating each subset individually #275

Closed: xansar closed this issue 1 week ago

xansar commented 1 month ago

I ran the following command to evaluate qwen2-7b-instruct on two C-Eval subsets (basic_medicine, clinical_medicine), and found that evaluating both subsets together in one run gives different results from evaluating each subset separately. Command (only the -d subset argument differs):

CUDA_VISIBLE_DEVICES=1,2 python inference.py \
    -m /data/home/XXX/.model/Qwen/Qwen2-7B-Instruct \
    -d ceval:basic_medicine,clinical_medicine \
    --log_level debug \
    --evaluation_set "val" \
    --seed 42 \
    --vllm True \
    --flash_attention True \
    --model_type chat \
    --chat_template chatml \
    --model_backend vllm \
    --ranking_type generation \
    --max_tokens 20 \
    --vllm_gpu_memory_utilization 0.4 \
    --hf_mirror

Evaluating both subsets together (-d ceval:basic_medicine,clinical_medicine)

Results:

##### ceval:basic_medicine #####
Accuracy: 26.32
##### ceval:clinical_medicine #####
Accuracy: 13.64
##### ceval[Macro Average] #####
Accuracy: 19.98
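For reference, the macro average reported above is simply the unweighted mean of the per-subset accuracies:

```python
# Macro average: unweighted mean of the per-subset accuracies,
# regardless of how many examples each subset contains.
accuracies = {"basic_medicine": 26.32, "clinical_medicine": 13.64}
macro_avg = sum(accuracies.values()) / len(accuracies)
print(round(macro_avg, 2))  # → 19.98
```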

Evaluating basic_medicine alone (-d ceval:basic_medicine)

##### ceval:basic_medicine #####
Accuracy: 26.32

Evaluating clinical_medicine alone (-d ceval:clinical_medicine)

##### ceval:clinical_medicine #####
Accuracy: 9.09

The first subset is unaffected, but the results for every subsequent subset change. I suspect this may be related to the AutoBatchSizeSampler in batch_sampler, since an index-out-of-range error is also occasionally triggered in debug mode: https://github.com/RUCAIBox/LLMBox/issues/267#issue-2386605893
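To see why batching could matter here: when two subsets are concatenated and batched together, examples near the subset boundary end up in different batches than when each subset is batched on its own. A minimal sketch (hypothetical examples and a plain fixed-size batcher, not LLMBox's actual AutoBatchSizeSampler):

```python
def make_batches(examples, batch_size):
    """Group examples into consecutive fixed-size batches."""
    return [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]

subset_a = ["a1", "a2", "a3"]  # hypothetical examples
subset_b = ["b1", "b2", "b3"]

# Batching each subset alone: b1 starts a fresh batch.
alone = make_batches(subset_a, 2) + make_batches(subset_b, 2)
# Batching the concatenated subsets: b1 shares a batch with a3.
combined = make_batches(subset_a + subset_b, 2)

print(alone)     # [['a1', 'a2'], ['a3'], ['b1', 'b2'], ['b3']]
print(combined)  # [['a1', 'a2'], ['a3', 'b1'], ['b2', 'b3']]
```

If generation is not fully deterministic (e.g. due to sampling or batch-dependent kernels), this shift in batch composition can change the outputs for everything after the first subset, matching the symptom reported above.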

huyiwen commented 1 month ago

I will try to replicate the results.

huyiwen commented 1 month ago

This is due to a multi-GPU issue. I will try to fix it.

huyiwen commented 1 month ago

Update: to reproduce results with vLLM exactly, make sure the temperature is manually set to 0, since the default setting is a non-zero temperature. This keeps the generated outputs consistent across runs.
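To illustrate the point: with a non-zero temperature, decoding samples from a softmax distribution over the logits, so repeated runs can pick different tokens, while temperature 0 is conventionally treated as greedy argmax decoding and is deterministic. A minimal sketch in plain Python (not vLLM's API):

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick a token id from logits; greedy argmax when temperature == 0."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature-scaled softmax sampling.
    weights = [math.exp(l / temperature) for l in logits]
    return rng.choices(range(len(logits)), weights=weights)[0]

logits = [2.0, 1.5, 0.5]

# Greedy decoding (temperature 0) picks token 0 on every run...
assert all(sample_token(logits, 0, random.Random(s)) == 0 for s in range(10))

# ...while temperature sampling yields different tokens across runs.
samples = {sample_token(logits, 1.0, random.Random(s)) for s in range(100)}
print(samples)  # more than one distinct token id
```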