Hi @synchronic1, thanks for reporting and for sharing these logs. I don't see any issue in them.
Do share your hf_config.json file so I can investigate further!
Hi @synchronic1,
I did some testing today and could not reproduce this. Any updates from your end on this issue?
Also, heads-up: a new v2.0-beta2 was released today with various bug fixes, reporting enhancements, and a new feature, Enhanced HF-Waitress LLM Management: you can now add new model_ids, search, filter and sort the list of LLMs, and delete LLM IDs from the HF-Waitress LLM dropdown list. So do git pull and try out the latest update! No changes to dependencies, so just pull & run.
Closing as no update has been received. On the note of increasing memory usage: transformer models are loaded one shard at a time, so while the model is first loading you will see memory usage climb gradually. This is normal. Beyond that, you may see increased usage as the context window fills up, but that is not unique to HF-Waitress. There are no other unexpected increases in memory usage.
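If you want to watch that loading behaviour for yourself, here is a minimal sketch using plain transformers (the model_id, dtype and prompt below are placeholders for illustration, not what HF-Waitress runs internally):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model_id -- substitute whichever model you load through HF-Waitress.
model_id = "microsoft/Phi-3-mini-4k-instruct"

def gpu_mem_gib() -> float:
    """Current CUDA memory allocated by PyTorch, in GiB."""
    return torch.cuda.memory_allocated() / 1024**3

print(f"before load:    {gpu_mem_gib():.2f} GiB")

# Sharded checkpoints are materialised one shard at a time, so memory
# climbs gradually during this call rather than jumping all at once.
# (device_map="auto" requires the accelerate package.)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
print(f"after load:     {gpu_mem_gib():.2f} GiB")

# Generating grows the KV-cache, i.e. the context-window usage mentioned
# above -- again expected behaviour, not a leak.
inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
model.generate(**inputs, max_new_tokens=64)
print(f"after generate: {gpu_mem_gib():.2f} GiB")
```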
This is usually the point at which you lose any kind of verbose response during inference:
```
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 15.33 GiB. GPU 0 has a total capacity of 24.00 GiB of which 0 bytes is free. Of the allocated memory 26.41 GiB is allocated by PyTorch, and 23.08 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
LLM stream done, releasing semaphore
```
Attachments: hf_server_log.log, lars_server_log.log
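For reference, the `PYTORCH_CUDA_ALLOC_CONF` hint in that traceback is an environment variable that has to be set before PyTorch initialises CUDA, either in the shell that launches the server or, if you control the entry-point script yourself, at the very top before torch is imported. A minimal sketch of the latter (the entry-point shown is hypothetical):

```python
# Apply the allocator flag suggested by the error before torch touches CUDA.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # must be imported (and CUDA used) only after the env var is set

print(torch.cuda.is_available())
```

Keep in mind this only mitigates fragmentation: a single 15.33 GiB allocation on a 24 GiB card with 0 bytes free will still fail, so a smaller or quantized model, or a shorter context, is the more likely fix.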