Open josephrocca opened 5 months ago
By the way, an even better feature than the one above (for my use case) would be to allow specifying a cache location for the turbomind-converted model files when using the `lmdeploy serve api_server` command. This command accepts a HuggingFace repo and converts the model automatically, but it does not save the converted files to the mounted directory, so they are lost on restart and the model has to be converted again the next time `lmdeploy serve api_server` runs.
The reason I had to use `lmdeploy convert` is that my server restarts/crashes every 10 minutes or so due to this issue:
So it's important that the startup process is fast - hence the need to pre-convert the model to turbomind format (which takes about 2 mins), so it doesn't need to re-convert every time the server starts.
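A minimal sketch of the "convert once, reuse on restart" idea as a user-side workaround, not an lmdeploy feature. The `ensure_converted` helper, the marker file name, and the `convert` callback are all hypothetical; the callback stands in for whatever `lmdeploy convert` invocation the startup script already uses.

```python
from pathlib import Path
from typing import Callable

# Hypothetical marker file written only after a conversion finishes,
# so a crash mid-conversion is not mistaken for a usable cache.
MARKER = ".conversion_complete"


def ensure_converted(dst: Path, convert: Callable[[Path], None]) -> bool:
    """Run convert(dst) unless dst already holds a completed conversion.

    Returns True if a conversion was performed, False on a cache hit.
    """
    marker = dst / MARKER
    if marker.exists():
        return False
    dst.mkdir(parents=True, exist_ok=True)
    convert(dst)    # e.g. a subprocess call to the existing convert command
    marker.touch()  # record success only after convert() returns
    return True
```

If `dst` lives on the mounted directory, a restart would skip the roughly 2-minute conversion and go straight to serving.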
I understand your concern.
For the first feature, "Allow specifying HuggingFace model/repo name in lmdeploy convert", we can accept it, but I am afraid it will not help save disk space. If model/repo is not a local path, lmdeploy will use `snapshot_download` from `huggingface_hub` to download the model to the local cache path.
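The behavior described here can be sketched roughly as follows. This is an illustration under assumptions, not lmdeploy's actual code: `resolve_model_path` and its `download` parameter are hypothetical names, and only `huggingface_hub.snapshot_download` is a real API.

```python
import os
from typing import Callable, Optional


def resolve_model_path(
    model: str,
    download: Optional[Callable[[str], str]] = None,
) -> str:
    """Return a local directory for `model`.

    A path that exists locally is used as-is. Anything else is treated as a
    HuggingFace repo id and downloaded, so the original weights still end up
    on disk in the local cache.
    """
    if os.path.exists(model):
        return model
    if download is None:
        # Imported lazily so the local-path case needs no extra dependency.
        from huggingface_hub import snapshot_download
        download = snapshot_download
    return download(model)
```

Either branch leaves a full copy of the source weights on disk before conversion starts, which is why accepting a repo name would not reduce the space required.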
Regarding the fast-startup proposal, I would prefer not to develop a turbomind model caching feature, since it would introduce cache management functionality and add a maintenance burden.
In my opinion, the key is to fix #1744 to avoid the frequent restarts caused by crashes. Secondly, we can try to optimize the converter to make the process faster, e.g., by converting the layers concurrently.
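As a rough illustration of what converting layers concurrently could look like, here is a generic thread-pool sketch. It is not the lmdeploy converter: `convert_layers` and the per-layer `convert_layer` callback are hypothetical, and it assumes that layers can be converted independently of each other.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def convert_layers(
    layers: Iterable[T],
    convert_layer: Callable[[T], R],
    max_workers: int = 4,
) -> List[R]:
    """Apply convert_layer to every layer using a pool of worker threads.

    Results come back in the same order as the input layers.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(convert_layer, layers))
```

Threads only help if the per-layer work releases the GIL, as file I/O and most tensor operations do. Pure-Python work would need processes instead.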
Motivation
This is not an important feature, but I figured I'd mention it because it was a small point of friction that I think could be improved in the future. Currently my script does this:
Ideally I could just write this:
And as a bonus nice-to-have, it would be cool if I only needed enough disk space to fit one version of the model. For example, for a 70B 4-bit model, a 40GB disk would ideally be enough. Currently ~80GB is required, because both the AWQ and turbomind formats must be stored on disk at the same time until the turbomind conversion is complete. But this feature is not very important because disk space is cheap.
Related resources
No response
Additional context
No response