InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Feature] Low priority: Allow specifying HuggingFace model/repo name in `lmdeploy convert` #1749

Open josephrocca opened 5 months ago

josephrocca commented 5 months ago

Motivation

This is not an important feature, but I figured I'd mention it because it was a small point of friction that I think could be improved in the future. Currently my script does this:

# Install the HF CLI with accelerated-transfer support, and enable it
pip install 'huggingface_hub[cli,hf_transfer]==0.23.2'
export HF_HUB_ENABLE_HF_TRANSFER=1
# Download the pre-quantized AWQ weights to a local directory
huggingface-cli download lmdeploy/llama2-chat-70b-4bit --local-dir /root/llama2-chat-70b-4bit
# Convert to turbomind format, sharding across all visible GPUs
lmdeploy convert llama2 /root/llama2-chat-70b-4bit --model-format awq --group-size 128 --tp $(nvidia-smi -L | wc -l) --dst-path /root/turbomind-model-files

Ideally I could just write this:

lmdeploy convert llama2 lmdeploy/llama2-chat-70b-4bit --model-format awq --group-size 128 --tp $(nvidia-smi -L | wc -l) --dst-path /root/turbomind-model-files

As a bonus nice-to-have, it would also be cool if I only needed enough disk space for one copy of the model - e.g. for the 70B 4-bit model, ~40GB of disk space would ideally be enough. Currently ~80GB is required, because both the AWQ and turbomind formats must be stored on disk at the same time until the turbomind conversion is complete. But this feature is not very important, since disk space is cheap.
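To make that concrete, here's a rough sketch of the current disk footprint, using the paths from the script above (sizes are approximate):

du -sh /root/llama2-chat-70b-4bit    # AWQ checkpoint, roughly 40GB for the 70B 4-bit model
du -sh /root/turbomind-model-files   # turbomind output, roughly the same size again
rm -rf /root/llama2-chat-70b-4bit    # the source copy can only be freed once conversion finishes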

Related resources

No response

Additional context

No response

josephrocca commented 5 months ago

By the way, an even better feature than the one above (for my use case) would be allowing a cache location to be specified for the turbomind-converted model files when using the lmdeploy serve api_server command. That command accepts a Hugging Face repo name and converts the model automatically, but it does not save the converted files to the mounted directory, so a restart loses them and the model has to be converted all over again the next time lmdeploy serve api_server runs.

The reason I had to use lmdeploy convert is that my server restarts/crashes every 10 minutes or so due to this issue:

So it's important that the startup process is fast - hence the need to pre-convert the model to turbomind format (which takes about 2 minutes), so it doesn't need to be re-converted every time the server starts.
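For reference, this is the workaround I'm using - convert once into a persistent directory, then point the server at the converted files so each restart skips the ~2 minute conversion (this assumes api_server accepts the converter's output directory as its model path):

lmdeploy convert llama2 /root/llama2-chat-70b-4bit --model-format awq --group-size 128 --tp $(nvidia-smi -L | wc -l) --dst-path /root/turbomind-model-files
lmdeploy serve api_server /root/turbomind-model-files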

lvhan028 commented 5 months ago

I understand your concern.

For the first feature, "Allow specifying HuggingFace model/repo name in lmdeploy convert", we can accept it, but I'm afraid it cannot help save disk space: if the model/repo is not a local path, lmdeploy will use snapshot_download from huggingface_hub to download the model to the local cache path first.
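For illustration, that download step is roughly equivalent to what your script already does by hand, just targeting the default HF cache instead of a directory you choose:

huggingface-cli download lmdeploy/llama2-chat-70b-4bit    # writes the full AWQ checkpoint to ~/.cache/huggingface/hub

so the AWQ copy and the turbomind copy would still coexist on disk during conversion.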

Regarding the fast-startup proposal, I would rather not develop a turbomind-model caching feature, since it would introduce cache-management logic and bring a maintenance burden.

In my opinion, the key is to fix #1744 so as to avoid the frequent restarts caused by the crash. Secondly, we can try to optimize the converter to make the process faster, e.g., by converting the layers concurrently.