Closed: guillemram97 closed this issue 2 days ago
I'm getting the same issue. Can anyone answer?
I think I've isolated part of the issue. When I exclude one GPU, the model is split across the GPUs correctly:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6
I don't know whether the issue is version-specific or only appears in setups with more than 7 GPUs. Interestingly, 8 GPUs worked fine for Mistral-7B.
This was an issue with accelerate; the fix is here: https://github.com/huggingface/accelerate/pull/3244
System Info
Hardware: Amazon Linux EC2 instance, 8× NVIDIA A10G (23 GB each)
Reproduction
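The original reproduction code is not shown in this excerpt. As a minimal sketch of the kind of call that triggers the problem (the model name and the 4-bit settings below are assumptions, not taken from the report), loading with a bitsandbytes quantization_config and device_map="auto" looks roughly like this:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-70b-hf"  # hypothetical model; the issue does not name one here

# 4-bit quantization via bitsandbytes (assumed settings)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",  # expected to shard the model across all visible GPUs
)

# Inspect where the weights actually landed; with the bug, most layers end up on the last GPU
print(model.hf_device_map)
```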
However, if I load without the quantization_config, there is no issue at all:
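(The corresponding snippet is also missing from this excerpt; a sketch of the same call without quantization, under the same assumptions as above, would be:)

```python
# Same call minus quantization_config: the model shards across the GPUs as expected
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
print(model.hf_device_map)  # layers spread over cuda:0 ... cuda:7
```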
Expected behavior
The model is (mostly) being loaded onto the last GPU, whereas I would expect it to be spread across the different GPUs. Moreover, infer_auto_device_map does not seem to work. I have run into this same issue on different hardware.
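For reference, one way to check what device map accelerate would compute, independently of loading the weights, is to call infer_auto_device_map on an empty-weight model. This is a minimal sketch, assuming the same hypothetical model as above and a per-GPU budget a bit below the 23 GB of an A10G:

```python
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained(model_id)

# Build the model skeleton without allocating weights, then ask accelerate
# how it would place the modules given the per-GPU memory budget
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

device_map = infer_auto_device_map(
    empty_model,
    max_memory={i: "20GiB" for i in range(8)},  # 8 x A10G, leaving headroom below 23 GB
)
print(device_map)
```

With the bug described here, the map produced at load time does not match this expected spread and piles most modules onto the last GPU.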