ZachZimm opened this issue 3 months ago
Hey @ZachZimm
Thanks a lot!
This makes a lot of sense.
I thought having 2 workers would give all users parallel calls by default.
But it seems I didn't account for this issue.
I will make workers default to 1 and investigate how to have all workers talk to each other, or have the master node handle routing.
I'd like to implement a memory-sharing solution using `multiprocessing.Manager`, essentially swapping out the `ModelProvider.models` dict for a shared `multiprocessing.Manager().dict()`. But I would appreciate it if you would merge my previous PR first, as I am not very familiar with git and would like to avoid having 2 working branches.
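Roughly what I have in mind is sketched below. This is not the actual fastmlx code: the `load` method and the placeholder string standing in for real model weights are made up for illustration, only the `ModelProvider.models` dict mirrors the real class.

```python
import multiprocessing


class ModelProvider:
    def __init__(self, shared_models):
        # Manager-backed proxy dict instead of a plain per-process dict, so
        # every worker attached to the same manager sees the same contents.
        self.models = shared_models

    def load(self, model_name):
        if model_name not in self.models:
            # Whatever is stored here must be picklable to cross the proxy
            # boundary; a placeholder string is, real MLX weights are not.
            self.models[model_name] = f"<placeholder for {model_name}>"
        return self.models[model_name]


if __name__ == "__main__":
    manager = multiprocessing.Manager()
    provider = ModelProvider(manager.dict())
    provider.load("mlx-community/some-model")
    print(dict(provider.models))
```

The catch with `uvicorn --workers N` is that every worker would have to attach to the same manager process rather than create its own at import time, otherwise the dict is only "shared" within a single worker.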
Hey @ZachZimm
Thanks for your patience, I was unavailable this last week.
That would be fantastic, no problem!
I left some comments
So I tried implementing shared memory with `multiprocessing.Manager`, but I found that mlx's `lm_load` would hang and eventually complain about something not being serializable. I really don't know what the proper way to share the model in memory is if this approach doesn't work, but maybe building `ModelProvider` up into a process separate from the FastAPI app (exclusively for model management, providing FastAPI workers with a pointer to the model object) would be a workable approach?
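Something along these lines is what I mean, using `multiprocessing.managers.BaseManager`. Everything here (method names, address, authkey) is made up for illustration, and a proxy can't really hand back a raw pointer, so generation would have to happen inside the manager process:

```python
from multiprocessing.managers import BaseManager


class ModelProvider:
    """Owns all loaded models; meant to live in one dedicated process."""

    def __init__(self):
        self.models = {}

    def load_model(self, model_name):
        if model_name not in self.models:
            # In the real app this would be the mlx_lm load call; a string
            # placeholder keeps the sketch runnable without MLX installed.
            self.models[model_name] = f"<weights for {model_name}>"

    def generate(self, model_name, prompt):
        self.load_model(model_name)
        # Inference has to run inside this process: the loaded model object
        # itself is not picklable, so it can never be shipped to a worker.
        return f"completion for {prompt!r} from {model_name}"


# A single module-level instance so every connecting worker gets the same one.
_provider = ModelProvider()


def _get_provider():
    return _provider


class ProviderManager(BaseManager):
    pass


ProviderManager.register("get_provider", callable=_get_provider)


if __name__ == "__main__":
    # The dedicated model-management process:
    server = ProviderManager(address=("127.0.0.1", 0), authkey=b"fastmlx")
    server.start()

    # Each FastAPI worker would instead do this once at startup and keep the
    # proxy around, calling methods on it instead of loading models itself.
    client = ProviderManager(address=server.address, authkey=b"fastmlx")
    client.connect()
    provider = client.get_provider()
    print(provider.generate("mlx-community/some-model", "hello"))

    server.shutdown()
```

Only picklable results (e.g. generated text) would ever cross the process boundary; the model weights stay put in the manager process.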
In the current workers implementation, each worker creates its own `ModelProvider`, so the workers do not share information about which models have been loaded into memory. This may be the cause of #26, though that user may simply have been running into the fact that MLX models' memory usage grows with context length (unlike GGUF).
To reproduce the issue (note: due to the nature of FastAPI workers, the following is more likely to occur the higher the number of workers):
fastmlx % uvicorn fastmlx:app --workers 2
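A quick way to confirm the duplication (hypothetical snippet, not the actual fastmlx code) is to log the worker PID and the provider's identity at startup:

```python
import os

from fastapi import FastAPI


class ModelProvider:
    def __init__(self):
        self.models = {}


app = FastAPI()
model_provider = ModelProvider()  # module level, so one per worker process


@app.on_event("startup")
async def report_provider():
    # With --workers 2 this prints two different PIDs and two different ids,
    # i.e. two independent ModelProvider instances that never share state.
    print(f"worker pid={os.getpid()} provider id={id(model_provider)}")
```

With `--workers 2` you should see two different PIDs and two different object ids, meaning two providers that never share loaded models.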
Suggested: Change the default number of workers to 1 until a better approach to CPU parallelism is implemented.