Open ZJXNEFU opened 6 months ago
I notice you are using device_map="balanced"; setting it to "auto" might be helpful.
If you are able to load the model on one GPU, don't spread it across multiple, since moving tensors between GPUs is really slow. Try to use as few GPUs as possible for model parallelism.
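As a minimal sketch of that suggestion: `max_memory` can be used alongside `device_map="auto"` to keep placement on a single GPU (the memory caps and `model_id` below are illustrative placeholders, not from the original post):

```python
# Sketch: keep the model on one GPU when it fits, instead of sharding it
# across every visible GPU. The memory caps below are placeholders.
load_kwargs = dict(
    device_map="auto",                        # let accelerate place the weights
    max_memory={0: "46GiB", "cpu": "60GiB"},  # only GPU 0 is allowed, rest offloads to CPU
)

# These kwargs would then be passed to the usual loader, e.g.:
# AutoModelForCausalLM.from_pretrained(model_id, **load_kwargs)
```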
You could also try quantizing the model (4-bit models take roughly 1 GB per billion parameters):
https://huggingface.co/blog/4bit-transformers-bitsandbytes
Also, loading the model with use_flash_attention_2=True could speed it up.
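For example, something like the following (a sketch, not a tested recipe: it assumes the flash-attn package is installed and the GPU supports it; the model id is a placeholder):

```python
# Sketch: enable FlashAttention-2 at load time. Requires the flash-attn
# package and a GPU that supports it; fp16/bf16 weights are required.
import torch

load_kwargs = dict(
    torch_dtype=torch.float16,    # FlashAttention needs half precision
    use_flash_attention_2=True,   # newer transformers spell this attn_implementation="flash_attention_2"
    device_map="auto",
)

# model = AutoModelForCausalLM.from_pretrained("your-33b-checkpoint", **load_kwargs)
```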
When I load the 33B model the way shown below, generation is very slow: each token takes about 2.9 s.
Is there any solution for this?