InftyAI / llmaz

☸️ Easy, advanced inference platform for large language models on Kubernetes
Apache License 2.0

[ModelLoader] Some huggingface models may contain duplicated weights #163

Closed · kerthcet closed this 1 week ago

kerthcet commented 1 month ago

What would you like to be added:

Take Mistral for example: the repo contains not only the chunked model weights but also a consolidated copy of the same weights. When downloading models from huggingface, we should account for this, or we will download two copies of the weights.
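A quick way to see the duplication, assuming the huggingface_hub client and an illustrative Mistral repo id:

```python
from huggingface_hub import list_repo_files

files = list_repo_files("mistralai/Mistral-7B-Instruct-v0.2")

# The listing contains both the chunked weights ...
print([f for f in files if f.startswith("model-")])
# ... and a consolidated copy of the same weights:
print([f for f in files if f.startswith("consolidated")])
```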

Why is this needed:

Fast model loading.

Completion requirements:

This enhancement requires the following artifacts:

The artifacts should be linked in subsequent comments.

kerthcet commented 1 month ago

/kind feature

qinguoyi commented 1 month ago

In another issue, https://github.com/InftyAI/llmaz/pull/175#issuecomment-2372716947, there is a new project that shares model weights across the cluster and may change how the code handles models.

So I want to know: is it still necessary to develop this feature? This project downloads models with Python, but the new project downloads models with Go.

kerthcet commented 1 month ago

Yes, we need this, because Manta may leverage the code as well; we don't want to rewrite the client code in other languages anymore.

What I'm concerned about is how to make this a more general approach. Maybe we can add two fields to the ModelHub, allow_patterns and ignore_patterns, which will be passed to the lib directly. You can refer to the huggingface snapshot_download func for details. modelScope has similar parameters as well.

I also have a few other suggestions:

  • Remove the ThreadPoolExecutor for modelScope, because there's only one thread.
  • When downloading a single file with the huggingface lib, use hf_hub_download.
  • When downloading the whole repo with the huggingface lib, use snapshot_download, which downloads files concurrently, so we can remove the ThreadPoolExecutor there as well.

WDYT?
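A minimal sketch of how those two fields could be forwarded, assuming huggingface_hub; the field names and the download_model wrapper below are illustrative, mirroring the proposal rather than the actual llmaz code:

```python
from huggingface_hub import snapshot_download

def download_model(repo_id: str, allow_patterns=None, ignore_patterns=None):
    # snapshot_download already fetches files concurrently, so no extra
    # ThreadPoolExecutor is needed around it.
    return snapshot_download(
        repo_id=repo_id,
        allow_patterns=allow_patterns,
        ignore_patterns=ignore_patterns,
    )

# e.g. skip the consolidated copy so only the chunked weights are fetched
# (illustrative pattern for a Mistral-style repo):
download_model(
    "mistralai/Mistral-7B-Instruct-v0.2",
    ignore_patterns=["consolidated*"],
)
```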

qinguoyi commented 1 month ago

I agree with you, I will implement this feature soon.

qinguoyi commented 1 month ago

While developing the suggestions above, I found we can download a single file with snapshot_download as well, using allow_patterns to select one or more files.

I pushed a pull request here: https://github.com/InftyAI/llmaz/pull/178#issue-2553977136 PTAL.
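A sketch of that single-file case, again with an illustrative repo id and file name: snapshot_download with allow_patterns fetches just the selected file(s), so a separate hf_hub_download path isn't strictly needed.

```python
from huggingface_hub import snapshot_download

snapshot_download(
    "mistralai/Mistral-7B-Instruct-v0.2",
    allow_patterns=["config.json"],  # only this file is downloaded
)
```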

qinguoyi commented 1 week ago

Could we close this issue now? @kerthcet

kerthcet commented 1 week ago

Absolutely, fixed by https://github.com/InftyAI/llmaz/pull/178

/close