eedmond opened this issue 1 month ago
It seems Mistral v0.3 incorrectly put `consolidated.safetensors` in the repo; you need to remove that file.
OK, thanks. It keeps trying to re-download this file, so to keep it simple, I'll just try Llama3, which is smaller.
Regarding continuous batching, I don't see much in the docs outlining how to accomplish this (I want to execute more prompts than would fit in memory, for example). Is there a good resource on how to loop and constantly add parallel prompts? Does it work to simply call `llm.generate` in a loop with a certain number of prompts?
Thanks!
You can point `model` to the local cache (usually under `~/.cache/huggingface`) after removing that file. By design, Aphrodite reads all `*.safetensors` files in the repo, and that file is superfluous.
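Concretely, the cleanup looks something like this. The snapshot path below is illustrative (a temp directory stands in for the real Hugging Face cache directory, whose exact hash-named path varies per machine):

```shell
# Stand-in for ~/.cache/huggingface/hub/models--mistralai--Mistral-7B-v0.3/snapshots/<hash>
SNAPSHOT=$(mktemp -d)

# Simulate the repo contents: sharded weights plus the superfluous file.
touch "$SNAPSHOT/model-00001-of-00003.safetensors" "$SNAPSHOT/consolidated.safetensors"

# Remove the duplicate consolidated weights so Aphrodite only sees the shards.
rm "$SNAPSHOT/consolidated.safetensors"
ls "$SNAPSHOT"
```

After this, passing the snapshot directory as `model` should load only the sharded `*.safetensors` files.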
If you want continuous batching without using the API server, you need to use the `AsyncAphrodite` class in your program instead of the `LLM` class: https://github.com/PygmalionAI/aphrodite-engine/blob/0178b4d97682dc165ecba184e7db509776847e33/aphrodite/engine/async_aphrodite.py#L281
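The submission pattern around `AsyncAphrodite` can be sketched with plain asyncio. Since the real engine needs a GPU, the snippet below uses a stub coroutine in place of `AsyncAphrodite.generate`, and the idea that you construct the engine via `from_engine_args` and iterate a per-request async generator is an assumption based on the linked source; only the looping pattern itself is the point here:

```python
import asyncio

# Stub standing in for AsyncAphrodite.generate (hypothetical shape): in the
# real engine you would create it once with AsyncAphrodite.from_engine_args(...)
# and consume a streaming async generator per request_id.
async def generate(prompt: str, request_id: str) -> str:
    await asyncio.sleep(0.01)  # simulate per-request decode time
    return f"{request_id}: completion for {prompt!r}"

async def main() -> list[str]:
    prompts = [f"prompt {i}" for i in range(100)]  # more work than fits at once
    sem = asyncio.Semaphore(8)                     # cap in-flight requests

    async def submit(i: int, prompt: str) -> str:
        async with sem:  # the engine batches whatever is in flight together
            return await generate(prompt, f"req-{i}")

    # Submit everything up front; the semaphore keeps only 8 running, and as
    # each request finishes the next one starts -- the continuous-batching loop.
    return await asyncio.gather(*(submit(i, p) for i, p in enumerate(prompts)))

results = asyncio.run(main())
print(len(results))
```

The key difference from calling `llm.generate` in a loop is that requests are added and retired independently, so the batch never drains before new prompts join it.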
Your current environment
How would you like to use Aphrodite?
I want to run Mistral-7B-v0.3 and send it repeated prompts using the continuous batching feature of Aphrodite. I'm following the wiki page for [Offline Inference], but it keeps crashing with an OOM error when running the simple Python script.
The only changes I've made from the wiki are that `tensor_parallel_size` is set to 1 and the model is set to v0.3 instead of v0.1 (and I added a closing `)` to the `llm.generate` line). Here is the output of executing that script with the OOM error:

Any help on how to get this to run would be appreciated. Thanks!