arcee-ai / fastmlx

FastMLX is a high-performance, production-ready API for hosting MLX models.
https://arcee-ai.github.io/fastmlx/

Multiple workers do not share memory, which causes a full model reload for each message generation. #29

Open ZachZimm opened 3 months ago

ZachZimm commented 3 months ago

In the current multi-worker implementation, each worker creates its own ModelProvider, so the workers do not share information about which models have been loaded into memory. This may be the cause of #26, but they may simply have been experiencing the fact that MLX models increase their memory usage as the context grows (unlike GGUF).
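To illustrate the problem, here is a simplified sketch (not FastMLX's actual code): module-level state such as the ModelProvider instance is created once per uvicorn worker process, so a model loaded in one worker is invisible to the others.

```python
def load_model(name: str):
    """Placeholder for the real MLX loader (e.g. mlx_lm's load)."""
    print(f"Model {name} loaded")  # printed once per worker, not once per server
    return object()

class ModelProvider:
    def __init__(self):
        self.models = {}  # per-process cache, not visible to other workers

    def get_model(self, name: str):
        if name not in self.models:  # every worker misses on its first request
            self.models[name] = load_model(name)
        return self.models[name]

model_provider = ModelProvider()  # instantiated independently in every worker process
```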

To reproduce the issue (note: because of how FastAPI distributes requests across workers, the following is more likely to occur the higher the worker count):

  1. Start the server with uvicorn fastmlx:app --workers 2
  2. Send a response request, streaming or otherwise (see the example request after this list)
  3. Note the model load time in the server output and check memory usage
  4. Send another response request
  5. Notice another 'Model loaded in X seconds' message as well as doubled memory usage.
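For steps 2 and 4, a request along these lines works, assuming the OpenAI-compatible /v1/chat/completions route on uvicorn's default port; the model path is only an example:

```python
import requests

# Example generation request; substitute whichever MLX model the server is hosting.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "mlx-community/Meta-Llama-3-8B-Instruct-4bit",
        "messages": [{"role": "user", "content": "Hello"}],
    },
)
print(resp.json())
```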

Suggested: Change the default number of workers to 1 until a better approach to CPU parallelism is implemented.

Blaizzy commented 2 months ago

Hey @ZachZimm

Thanks a lot!

This makes a lot of sense.

I thought having 2 workers would allow all users to have parallel calls enabled by default.

But it seems I didn't account for this issue.

I will make workers default to 1 and investigate how to have all workers communicate, or have the master node handle routing.

ZachZimm commented 2 months ago

I'd like to implement a memory-sharing solution using multiprocessing.Manager, essentially swapping out the ModelProvider.models dict for a shared multiprocessing.Manager().dict(). But I would appreciate it if you would merge my previous PR first, as I am not very familiar with git and would like to avoid having two working branches.
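Roughly, the change I have in mind looks like this sketch (illustrative only, not the actual ModelProvider code):

```python
from multiprocessing import Manager

class ModelProvider:
    def __init__(self, models):
        self.models = models          # shared proxy dict instead of a plain per-process dict

if __name__ == "__main__":
    manager = Manager()               # starts a helper process that owns the underlying dict
    shared_models = manager.dict()    # proxy object usable from multiple processes
    provider = ModelProvider(shared_models)
    shared_models["example-model"] = "loaded"  # visible to any process holding the proxy
    print(dict(shared_models))
```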

Blaizzy commented 2 months ago

hey @ZachZimm

Thanks for your patience, I was unavailable this last week.

That would be fantastic, no problem!

Blaizzy commented 2 months ago

I left some comments

ZachZimm commented 2 months ago

So I tried implementing shared memory with multiprocessing.Manager, but I found that mlx_lm's load would hang and eventually complain that something was not serializable. I don't know the proper way to share the model in memory if this approach doesn't work, but maybe building ModelProvider up into a process separate from the FastAPI app (exclusively for model management, providing FastAPI workers a pointer to the model object) would be a workable approach?
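Very roughly, that separate-process idea could look like this sketch, where only one process ever holds the weights and the workers send prompts over queues (load_model and generate_text are placeholders, not real mlx_lm or FastMLX APIs):

```python
import multiprocessing as mp

def load_model(name: str):
    print(f"loading {name} once, in the model-management process")
    return {"name": name}                       # stands in for the real MLX model object

def generate_text(model, prompt: str) -> str:
    return f"[{model['name']}] echo: {prompt}"  # stands in for real generation

def model_server(requests: mp.Queue, responses: mp.Queue) -> None:
    model = load_model("example-model")         # loaded exactly once
    while True:
        req_id, prompt = requests.get()
        if prompt is None:                      # simple shutdown signal
            break
        responses.put((req_id, generate_text(model, prompt)))

if __name__ == "__main__":
    req_q, resp_q = mp.Queue(), mp.Queue()
    server = mp.Process(target=model_server, args=(req_q, resp_q), daemon=True)
    server.start()

    # A FastAPI worker would do something like this instead of loading the model itself:
    req_q.put((1, "Hello"))
    print(resp_q.get())                         # -> (1, '[example-model] echo: Hello')
    req_q.put((None, None))
```

Raw object pointers can't cross process boundaries in Python, which is why this sketch passes prompts and results over queues rather than handing workers a direct reference to the model.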