Closed: chunniunai220ml closed this issue 4 months ago.
When running the MMLU eval on a single GPU, this succeeds:

```bash
lm_eval --model hf \
    --model_args pretrained="Qwen/Qwen1.5-7B-Chat",dtype=bfloat16 \
    --tasks mmlu \
    --device cuda \
    --batch_size 32 \
    --trust_remote_code \
    --cache_requests true
```
But with `--num_fewshot 5`, the batch size must be <= 4, otherwise it OOMs.
It seems the few-shot setting adds GPU memory overhead. Any suggestions to reduce GPU memory usage and speed up the evaluation?
This is expected: few-shot examples result in larger model inputs. I'd recommend checking out vLLM for faster inference!
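For reference, a minimal sketch of the same eval through the harness's vLLM backend; the `gpu_memory_utilization` value and `--batch_size auto` are assumptions to tune for your GPU, not settings from this issue:

```bash
# Sketch: MMLU 5-shot via the vLLM backend instead of hf.
# tensor_parallel_size=1 assumes a single GPU; lower
# gpu_memory_utilization if you still hit OOM.
lm_eval --model vllm \
    --model_args pretrained="Qwen/Qwen1.5-7B-Chat",dtype=bfloat16,tensor_parallel_size=1,gpu_memory_utilization=0.8 \
    --tasks mmlu \
    --num_fewshot 5 \
    --batch_size auto
```

`--batch_size auto` lets the harness probe for the largest batch that fits, which avoids hand-tuning the batch size as prompt length grows with the number of few-shot examples.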