dusty-nv / NanoLLM

Optimized local inference for LLMs with HuggingFace-like APIs for quantization, vision/language models, multimodal agents, speech, vector DB, and RAG.
https://dusty-nv.github.io/NanoLLM/
MIT License

Steady RAM Usage Increase During Video Inference using video.py #39

Open · chain-of-immortals opened this issue 3 months ago

chain-of-immortals commented 3 months ago

Hello,

I’ve been running some tests using the nano_llm.vision.video module with live camera streaming on an AGX Orin 64GB.

with the following parameters:

--model Efficient-Large-Model/VILA1.5-13b \
--max-images 5 \
--max-new-tokens 3 \
--prompt 'do you see a monitor in the frame? reply in binary 0 is no and 1 is yes'
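
For context, those parameters sit inside an invocation along these lines (a sketch: the jetson-containers wrapper and the --video-input device path are placeholders for how I launch it with my camera):

jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.vision.video \
    --model Efficient-Large-Model/VILA1.5-13b \
    --max-images 5 \
    --max-new-tokens 3 \
    --video-input /dev/video0 \
    --prompt 'do you see a monitor in the frame? reply in binary 0 is no and 1 is yes'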

I noticed a steady increase in RAM usage during these tests and wanted to get some clarification on what might be causing this.

Here are the details:

Setup: First, I used a USB camera streaming at 640x480 resolution. Then, I tested with another camera streaming at 4K resolution. I have attached a graph of the RAM usage in both cases.

[RAM usage chart: 640x480 vs 4K camera]

Observation: In both cases, I observed a continuous climb in RAM usage over time, which persisted throughout the streaming session, with a much quicker ramp-up in the case of 4K images. I’m wondering if this behavior could be attributed to how frames are handled, or to some other aspect of the video processing pipeline in the script. Is there any known issue or specific configuration I should be aware of that might help address this?

Also, how should I think about the optimal size of the video frames I should be feeding this VILA1.5-13B model?

Any insights or suggestions would be greatly appreciated.

Thank you!

ms1design commented 3 months ago

@dusty-nv bump.

Basically this is what I mentioned a few times during our conversations. In my use case, where I run inference on an MLCModel (no vision) in a loop, I hit OOM after around 100 samples and the process gets killed.

I tried running gc after each inference iteration; even chat_history.reset() doesn't help:
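
Roughly the cleanup I run at the end of every iteration (a sketch; chat_history is the ChatHistory instance from my benchmark loop):

import gc

def cleanup(chat_history):
    # what I tried after each inference iteration - RAM still climbs
    chat_history.reset()   # drop the chat messages and their cached embeddings
    gc.collect()           # force a garbage collection pass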

[memory usage chart: Meta-Llama-3.1-8B-Instruct]

I did some memory profiling and it looks like the culprit is chat_history.embed_chat(), where embeddings are joined together using np.concatenate:

Filename: /opt/NanoLLM/nano_llm/chat/history.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
   344   1949.5 MiB   1949.5 MiB           1       @profile
   345                                             def embed_chat(self, use_cache=True, max_tokens=None, wrap_tokens=None, **kwargs):
   346                                                 """
   347                                                 Assemble the embedding of either the latest or entire chat.
   348                                                 
   349                                                 If ``use_cache=True`` (the default), and only the new embeddings will be returned.
   350                                                 If ``use_cache=False``, then the entire chat history will be returned.
   351                                                 
   352                                                 This function returns an ``(embedding, position)`` tuple, where the embedding array
   353                                                 contains the new embeddings (or tokens) from the chat, and position is the current
   354                                                 overall position in the history (up to the model's context window length)
   355                                                 
   356                                                 If the number of tokens in the chat history exceeds the length given in ``max_tokens`` argument
   357                                                 (which is typically the model's context window, minus the max generation length),
   358                                                 then the chat history will drop all but the latest ``wrap_tokens``, starting with a user prompt.
   359                                                 If `max_tokens` is provided but `wrap_tokens` is not, then the overflow tokens will be truncated.
   360                                                 """
   361   1949.5 MiB      0.0 MiB           1           embeddings = []
   362   1949.5 MiB      0.0 MiB           1           position = 0
   363                                               
   364   1976.4 MiB      0.0 MiB           5           for n, msg in enumerate(self.messages):
   365   1976.4 MiB      0.0 MiB           4               if use_cache:
   366                                                         if msg.cached:
   367                                                             position += msg.num_tokens
   368                                                         else:
   369                                                             embeddings.append(msg.embed())
   370                                                             use_cache = False  # all entries after this need to be included
   371                                                     else:
   372   1976.4 MiB     26.9 MiB           4                   embeddings.append(msg.embed())
   373                                                       
   374   1976.4 MiB      0.0 MiB           4               if not use_cache and logging.getLogger().isEnabledFor(logging.DEBUG) and (len(self.messages) - n < 5):
   375                                                         logging.debug(f"chat msg {n}  role={msg.role}  type={msg.type}  tokens={msg.num_tokens}  `{msg.template if msg.template else msg.content if isinstance(msg.content, str) else ''}`".replace('\n', '\\n'))
   376                                         
   377   1976.4 MiB      0.0 MiB           1           entries = len(embeddings)
   378   2000.2 MiB     23.8 MiB           1           embeddings = np.concatenate(embeddings, axis=1) #, position
   379                                         
   380   2000.2 MiB      0.0 MiB           1           '''
   381                                                 if max_tokens and position + embeddings.shape[1] > max_tokens:
   382                                                     if wrap_tokens:
   383                                                         self.reset(wrap_tokens=wrap_tokens)
   384                                                         embeddings, position = self.embed_chat(use_cache=False, max_tokens=max_tokens, wrap_tokens=wrap_tokens, **kwargs)
   385                                                         logging.warning(f"Chat overflow, max history lenth {max_tokens} tokens exceeded (keeping the most recent {embeddings.shape[1]} tokens)")
   386                                                     else:
   387                                                         logging.warning(f"Truncating chat history overflow to {max_tokens} tokens")
   388                                                         return embeddings[:,:max_tokens,:], position
   389                                                 '''
   390                                                     
   391   2000.2 MiB      0.0 MiB           1           logging.debug(f"chat embed  entries={entries}  shape={embeddings.shape}  position={position}")
   392   2000.2 MiB      0.0 MiB           1           return embeddings, position 
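
For reference, the trace above comes from line-by-line profiling with memory-profiler. A standalone sketch of the approach (this is not NanoLLM code, just an illustration of decorating a suspect function and of the np.concatenate allocation pattern):

import numpy as np
from memory_profiler import profile   # pip install memory-profiler

@profile   # prints a "Line # / Mem usage / Increment" table when the function returns
def embed_like(num_msgs=4, tokens_per_msg=512, dim=4096):
    # mimic embed_chat(): build per-message embeddings, then join them
    embeddings = [np.zeros((1, tokens_per_msg, dim), dtype=np.float16)
                  for _ in range(num_msgs)]
    # np.concatenate allocates a new contiguous array on every call
    return np.concatenate(embeddings, axis=1)

if __name__ == '__main__':
    embed_like()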
dusty-nv commented 3 months ago

Hi guys, thanks for reporting this and providing the charts - will look into this. @ms1design are you using streaming mode for generation? Is your generation script essentially like nano_llm/chat/example.py ?

Can you try setting this line to self.state = self.model._create_kv_cache(use_cache=False) to see if it is related to kv_cache caching? You can edit the NanoLLM sources by cloning/mounting it in 'dev mode' like this: https://www.jetson-ai-lab.com/agent_studio.html#dev-mode
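
For reference, dev mode boils down to cloning the repo onto the device and mounting it over the container's copy, roughly like this (see the linked page for the exact steps):

git clone https://github.com/dusty-nv/NanoLLM
jetson-containers run \
  -v ${PWD}/NanoLLM:/opt/NanoLLM \
  $(autotag nano_llm)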

Also I take it you are running the normal latest NanoLLM container on JetPack 6? Thanks for the debugging info you have gathered!

chain-of-immortals commented 3 months ago

On my end, yes, I'm running the latest NanoLLM container on JetPack 6. Thanks for looking into this issue.

dusty-nv commented 3 months ago

Ok yeah, thanks. The weird thing here is that I have recently been running VLM/VLA benchmarks for hours at a time and have not encountered this. I wonder if your case is resolved on the main branch? I will have the 24.8 container release out in the next couple of days.


chain-of-immortals commented 3 months ago

I noticed a much more pronounced RAM increase when streaming 4K images versus lower-resolution streams, where it is not as noticeable. Looking forward to testing the new release when it's available. Thanks.

ms1design commented 3 months ago

@ms1design are you using streaming mode for generation? Is your generation script essentially like nano_llm/chat/example.py ?

It's streaming, and yes, it's still not using Plugins.

Can you try setting this line to self.state = self.model._create_kv_cache(use_cache=False) to see if it is related to kv_cache caching?

@dusty-nv yes, I tried that as well, but unfortunately with the same result:

[memory usage chart]

@dusty-nv would you be so kind as to share your benchmark logic? Let me explain how mine works; maybe the issue is in my loop (a code sketch follows the list):

  1. Load model (streaming=True, use_cache=True)
  2. Inference with empty chat history
  3. reset chat history
  4. repeat point 2 until end of samples
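
In code, the loop is essentially this (a sketch following the public chat example; the model name and samples list stand in for my actual benchmark data):

from nano_llm import NanoLLM, ChatHistory

samples = ['prompt 1', 'prompt 2']   # placeholder benchmark prompts

# 1. load model (streaming generation, kv-cache reuse enabled)
model = NanoLLM.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct', api='mlc')
chat_history = ChatHistory(model, system_prompt='You are a helpful assistant.')

for sample in samples:
    # 2. inference starting from an empty chat history
    chat_history.append(role='user', msg=sample)
    embedding, position = chat_history.embed_chat()

    reply = model.generate(embedding,
                           streaming=True,
                           kv_cache=chat_history.kv_cache,
                           stop_tokens=chat_history.template.stop,
                           max_new_tokens=128)

    for token in reply:
        pass   # consume the streamed tokens

    # 3. reset the chat history before the next sample
    chat_history.reset()
    # 4. the loop repeats until the samples are exhausted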

Also I take it you are running the normal latest NanoLLM container on JetPack 6?

I'm running dustynv/nano_llm:humble-r36.3.0 image (sha256:6944c57c8b1381fc430dc3ebd0ad5ceec1a63a21853dd1c2c544f7959939506f) on:

root@ubuntu:/data/nano_llm_ha# cat /etc/nv_tegra_release
# R36 (release), REVISION: 2.0, GCID: 34956989, BOARD: generic, EABI: aarch64, DATE: Thu Nov 30 19:03:58 UTC 2023
# KERNEL_VARIANT: oot
TARGET_USERSPACE_LIB_DIR=nvidia
TARGET_USERSPACE_LIB_DIR_PATH=usr/lib/aarch64-linux-gnu/nvidia

@dusty-nv any hints on this?

ms1design commented 3 months ago

I noticed a much more pronounced RAM increase when streaming 4K images versus lower-resolution streams.

@chain-of-immortals Similar for me: when I reduce the length of the system prompt, I can go beyond 250 samples before hitting OOM.