abgulati / LARS

An application for running LLMs locally on your device, with your documents, facilitating detailed citations in generated responses.
https://www.youtube.com/watch?v=Mam1i86n8sU&ab_channel=AbheekGulati
GNU Affero General Public License v3.0

PyTorch memory usage balloons with each subsequent inference or query #23

Closed: synchronic1 closed this issue 1 month ago

synchronic1 commented 2 months ago

This is usually the point at which any verbose response is lost during inference:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 15.33 GiB. GPU 0 has a total capacity of 24.00 GiB of which 0 bytes is free. Of the allocated memory 26.41 GiB is allocated by PyTorch, and 23.08 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

LLM stream done, releasing semaphore
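For reference, the PYTORCH_CUDA_ALLOC_CONF setting suggested in the error message has to be present in the environment before PyTorch initializes its CUDA allocator. A minimal sketch of the general pattern (not LARS-specific code); it can equally be exported in the shell before launching the server:

```python
# Set the allocator config before torch touches the GPU; this only mitigates
# fragmentation of reserved-but-unallocated memory, it does not add VRAM.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # imported after the variable is set so the allocator picks it up
```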

Screenshots attached, along with hf_server_log.log and lars_server_log.log

abgulati commented 2 months ago

Hi @synchronic1 thanks for reporting and sharing these logs. I don't see any issue with them:

  1. lars_server_log: No issues seen here; the initial ECONNREFUSED events are expected, as LARS waits for the LLM server to load up and come online
  2. hf_server_log: The 'Phi3Config' object has no attribute 'head_dim' error is expected for Phi3, since that attribute genuinely doesn't exist on its config. The /health endpoint used to check whether the server is online queries several model details, and different LLMs may or may not define values for certain architectural parameters. When an attribute is missing it's noted and the application moves on, returning whichever details it can, which is what we're seeing here (see the sketch after this list). I will update the server output to further clarify this in a future update. The LLM is loaded at this point and will work correctly as intended, as it's not affected by this error!
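
To illustrate point 2, here is a rough sketch of that kind of graceful fallback. This is illustrative only, not the actual HF-Waitress implementation, and the attribute names are just examples of common config fields:

```python
# Illustrative only: collect whichever architectural details a model's config
# exposes, noting (rather than failing on) attributes a given architecture
# such as Phi3Config does not define.
from transformers import AutoConfig

def collect_model_details(model_id: str) -> dict:
    config = AutoConfig.from_pretrained(model_id)
    details = {}
    for attr in ("hidden_size", "num_attention_heads", "num_hidden_layers", "head_dim"):
        value = getattr(config, attr, None)
        if value is None:
            # Noted and skipped: the server keeps running and returns what it can.
            print(f"Note: {type(config).__name__} has no attribute '{attr}'")
        else:
            details[attr] = value
    return details
```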

Do share your hf_config.json file so I can investigate further!

abgulati commented 1 month ago

Hi @synchronic1,

I did some testing today and could not reproduce this. Any updates from your end on this issue?

Also, heads-up: a new v2.0-beta2 was released today with various bug fixes, reporting enhancements, and a new feature, Enhanced HF-Waitress LLM Management, which lets you add new model_ids, search-filter & sort the list of LLMs, and delete LLM IDs from the HF-Waitress LLM dropdown list. So do git pull and try out the latest update! There are no changes to dependencies, so just pull & run.

abgulati commented 1 month ago

Closing as no update received. On the note of increasing memory usage: transformer models are loaded one shard at a time, so when a model is first loaded you will notice memory usage climbing gradually. This is normal. Beyond that, you may see increased usage as the context window fills up, but this is not unique to HF-Waitress. There are no other unexpected increases in memory usage.
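
For anyone who wants to observe this behaviour themselves, a minimal sketch using generic transformers/PyTorch code (not HF-Waitress internals); it assumes a CUDA GPU plus the transformers and accelerate packages, and the model ID is only an example:

```python
# Watch allocated GPU memory at three points: before load, after the sharded
# load completes, and after a generation call (where KV-cache growth shows up).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "microsoft/Phi-3-mini-4k-instruct"  # example model ID

def gpu_mem_gib() -> float:
    return torch.cuda.memory_allocated() / 1024**3

print(f"Before load:    {gpu_mem_gib():.2f} GiB")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
print(f"After load:     {gpu_mem_gib():.2f} GiB")  # rises shard by shard during loading

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
model.generate(**inputs, max_new_tokens=64)
print(f"After generate: {gpu_mem_gib():.2f} GiB")  # further growth is largely the KV cache
```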