Open ZJXNEFU opened 6 months ago
I notice you are using device_map="balanced"; setting it to "auto" might be helpful.
If you are able to load the model on one GPU, don't spread it across multiple, since moving tensors between GPUs is really slow. Try to use as few GPUs as possible for model parallelism.
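As a minimal sketch of that suggestion: `max_memory` can be used alongside `device_map="auto"` to keep placement on a single GPU (the memory caps and `model_id` below are illustrative placeholders, not from the original post):

```python
# Sketch: keep the model on one GPU when it fits, instead of sharding it
# across every visible GPU. The memory caps below are placeholders.
load_kwargs = dict(
    device_map="auto",                        # let accelerate place the weights
    max_memory={0: "46GiB", "cpu": "60GiB"},  # only GPU 0 is allowed, rest offloads to CPU
)

# These kwargs would then be passed to the usual loader, e.g.:
# AutoModelForCausalLM.from_pretrained(model_id, **load_kwargs)
```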
You could also try quantizing the model (4-bit models take roughly 1 GB per billion parameters):
https://huggingface.co/blog/4bit-transformers-bitsandbytes
Also, loading the model with use_flash_attention_2=True could speed it up.
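For example, something like the following (a sketch, not a tested recipe: it assumes the flash-attn package is installed and the GPU supports it; the model id is a placeholder):

```python
# Sketch: enable FlashAttention-2 at load time. Requires the flash-attn
# package and a GPU that supports it; fp16/bf16 weights are required.
import torch

load_kwargs = dict(
    torch_dtype=torch.float16,    # FlashAttention needs half precision
    use_flash_attention_2=True,   # newer transformers spell this attn_implementation="flash_attention_2"
    device_map="auto",
)

# model = AutoModelForCausalLM.from_pretrained("your-33b-checkpoint", **load_kwargs)
```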
When I load the 33B model the way shown below, generation is very slow: each token takes about 2.9 s.
Is there any solution for this?