chain-of-immortals opened this issue 3 months ago
@dusty-nv bump.
Basically this is what I mentioned a few times during our conversations. In my use case, where I run inference with MLCModel (no vision) in a loop, I get an OOM after around 100 samples and the process gets killed.
Running gc after each inference iteration, and even chat_history.reset(), doesn't help:
I did some memory profiling, and it looks like the culprit is chat_history.embed_chat(), where the embeddings are joined together using np.concatenate:
Filename: /opt/NanoLLM/nano_llm/chat/history.py
Line # Mem usage Increment Occurrences Line Contents
=============================================================
344 1949.5 MiB 1949.5 MiB 1 @profile
345 def embed_chat(self, use_cache=True, max_tokens=None, wrap_tokens=None, **kwargs):
346 """
347 Assemble the embedding of either the latest or entire chat.
348
349 If ``use_cache=True`` (the default), and only the new embeddings will be returned.
350 If ``use_cache=False``, then the entire chat history will be returned.
351
352 This function returns an ``(embedding, position)`` tuple, where the embedding array
353 contains the new embeddings (or tokens) from the chat, and position is the current
354 overall position in the history (up to the model's context window length)
355
356 If the number of tokens in the chat history exceeds the length given in ``max_tokens`` argument
357 (which is typically the model's context window, minus the max generation length),
358 then the chat history will drop all but the latest ``wrap_tokens``, starting with a user prompt.
359 If `max_tokens` is provided but `wrap_tokens` is not, then the overflow tokens will be truncated.
360 """
361 1949.5 MiB 0.0 MiB 1 embeddings = []
362 1949.5 MiB 0.0 MiB 1 position = 0
363
364 1976.4 MiB 0.0 MiB 5 for n, msg in enumerate(self.messages):
365 1976.4 MiB 0.0 MiB 4 if use_cache:
366 if msg.cached:
367 position += msg.num_tokens
368 else:
369 embeddings.append(msg.embed())
370 use_cache = False # all entries after this need to be included
371 else:
372 1976.4 MiB 26.9 MiB 4 embeddings.append(msg.embed())
373
374 1976.4 MiB 0.0 MiB 4 if not use_cache and logging.getLogger().isEnabledFor(logging.DEBUG) and (len(self.messages) - n < 5):
375 logging.debug(f"chat msg {n} role={msg.role} type={msg.type} tokens={msg.num_tokens} `{msg.template if msg.template else msg.content if isinstance(msg.content, str) else ''}`".replace('\n', '\\n'))
376
377 1976.4 MiB 0.0 MiB 1 entries = len(embeddings)
378 2000.2 MiB 23.8 MiB 1 embeddings = np.concatenate(embeddings, axis=1) #, position
379
380 2000.2 MiB 0.0 MiB 1 '''
381 if max_tokens and position + embeddings.shape[1] > max_tokens:
382 if wrap_tokens:
383 self.reset(wrap_tokens=wrap_tokens)
384 embeddings, position = self.embed_chat(use_cache=False, max_tokens=max_tokens, wrap_tokens=wrap_tokens, **kwargs)
385 logging.warning(f"Chat overflow, max history lenth {max_tokens} tokens exceeded (keeping the most recent {embeddings.shape[1]} tokens)")
386 else:
387 logging.warning(f"Truncating chat history overflow to {max_tokens} tokens")
388 return embeddings[:,:max_tokens,:], position
389 '''
390
391 2000.2 MiB 0.0 MiB 1 logging.debug(f"chat embed entries={entries} shape={embeddings.shape} position={position}")
392 2000.2 MiB 0.0 MiB 1 return embeddings, position
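For reference, I got the line-by-line numbers above with memory_profiler's @profile decorator added onto embed_chat(). The snippet below is just a minimal, self-contained sketch of how that kind of profile is captured - it only mimics the embed-then-concatenate allocation pattern and is not the NanoLLM code:

# pip3 install memory-profiler
import numpy as np
from memory_profiler import profile

@profile
def embed_chat_like(num_messages=4, tokens_per_msg=512, dim=4096):
    # mimic ChatHistory.embed_chat(): build one array per message, then join them
    embeddings = [np.ones((1, tokens_per_msg, dim), dtype=np.float32)   # ~8 MiB per message here
                  for _ in range(num_messages)]
    return np.concatenate(embeddings, axis=1)                           # one more copy of the whole chat

if __name__ == "__main__":
    for _ in range(5):      # called repeatedly, like my inference loop
        embed_chat_like()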
Hi guys, thanks for reporting this and providing the charts - will look into this. @ms1design are you using streaming mode for generation? Is your generation script essentially like nano_llm/chat/example.py ?
Can you try setting this line to self.state = self.model._create_kv_cache(use_cache=False)
to see if it is related to kv_cache caching? You can edit the NanoLLM sources by cloning/mounting it in 'dev mode' like this: https://www.jetson-ai-lab.com/agent_studio.html#dev-mode
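If it helps, dev mode is basically just cloning the repo and mounting it over the container's copy of the sources - something like this (adjust the paths/tag to your setup):

git clone https://github.com/dusty-nv/NanoLLM
jetson-containers run \
  -v ${PWD}/NanoLLM:/opt/NanoLLM \
  $(autotag nano_llm)

Then any edits you make to the cloned nano_llm sources take effect inside the container without rebuilding it.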
Also I take it you are running the normal latest NanoLLM container on JetPack 6? Thanks for the debugging info you have gathered!
On my end, yes - I'm running the latest NanoLLM container on JetPack 6. Thanks for looking into this issue.
Ok yeah, thanks. The weird thing here is that I have recently been running VLM/VLA benchmarks for hours at a time and have not encountered this. I wonder if your issue is resolved in the main branch? I will have the 24.8 container release out in the next couple of days.
I noticed a much more pronounced RAM increase when streaming 4K images versus lower-resolution streams. Looking forward to testing the new release when it's available. Thanks.
@ms1design are you using streaming mode for generation? Is your generation script essentially like nano_llm/chat/example.py ?
It's streaming, and yes - it's still not using Plugins yet.
Can you try setting this line to self.state = self.model._create_kv_cache(use_cache=False) to see if it is related to kv_cache caching?
@dusty-nv yes, I did that also, but unfortunately with the same results:
@dusty-nv would you be so kind as to share your benchmark logic? Let me explain how mine works, maybe the issue is in my loop:
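Simplified, it follows the pattern from nano_llm/chat/example.py - roughly like this (paraphrased from memory, so details may differ slightly from my actual script; the model name, system prompt, and load_samples() are placeholders):

import gc
from nano_llm import NanoLLM, ChatHistory

# placeholder model - in my case a text-only model loaded through the MLC backend
model = NanoLLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    api="mlc",
    quantization="q4f16_ft",
)

chat_history = ChatHistory(model, system_prompt="<long system prompt>")

for sample in load_samples():   # load_samples() stands in for my dataset iterator
    chat_history.append(role="user", msg=sample)
    embedding, position = chat_history.embed_chat()

    reply = model.generate(
        embedding,
        streaming=True,
        kv_cache=chat_history.kv_cache,
        stop_tokens=chat_history.template.stop,
        max_new_tokens=128,
    )

    bot_reply = ""
    for token in reply:         # consume the streamed tokens
        bot_reply += token

    chat_history.append(role="bot", msg=bot_reply)
    chat_history.kv_cache = reply.kv_cache

    # per-sample cleanup I added while debugging - does not stop the RAM growth
    chat_history.reset()
    gc.collect()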
Also I take it you are running the normal latest NanoLLM container on JetPack 6?
I'm running the dustynv/nano_llm:humble-r36.3.0 image (sha256:6944c57c8b1381fc430dc3ebd0ad5ceec1a63a21853dd1c2c544f7959939506f) on:
root@ubuntu:/data/nano_llm_ha# cat /etc/nv_tegra_release
# R36 (release), REVISION: 2.0, GCID: 34956989, BOARD: generic, EABI: aarch64, DATE: Thu Nov 30 19:03:58 UTC 2023
# KERNEL_VARIANT: oot
TARGET_USERSPACE_LIB_DIR=nvidia
TARGET_USERSPACE_LIB_DIR_PATH=usr/lib/aarch64-linux-gnu/nvidia
@dusty-nv any hints on this?
I noticed a much more pronounced RAM increase when streaming 4K images versus lower-resolution streams.
@chain-of-immortals Similar for me - when I reduce the length of the system prompt, I can get beyond 250 samples before hitting OOM.
Hello,
I've been running some tests using the nano_llm.vision.video module with live camera streaming on an AGX Orin 64GB, with the following parameters: --model Efficient-Large-Model/VILA1.5-13b --max-images 5 --max-new-tokens 3 --prompt 'do you see a monitor in the frame? reply in binary 0 is no and 1 is yes'
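For context, the full invocation looks roughly like this (the --video-input value is just a placeholder for whichever camera is under test - the 640x480 USB cam or the 4K one - and I've left out the output flags):

python3 -m nano_llm.vision.video \
    --model Efficient-Large-Model/VILA1.5-13b \
    --max-images 5 \
    --max-new-tokens 3 \
    --video-input /dev/video0 \
    --prompt 'do you see a monitor in the frame? reply in binary 0 is no and 1 is yes'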
I noticed a steady increase in RAM usage during these tests and wanted to get some clarification on what might be causing this.
Here are the details:
Setup: First, I used a USB camera streaming at 640x480 resolution. Then, I tested with another camera streaming at 4K resolution. I have attached graphs of the RAM usage in both cases.
Observation: In both cases, I observed a continuous climb in RAM usage over time, which persisted throughout the streaming session; the ramp-up was much quicker with 4K images. I'm wondering if this behavior could be attributed to how frames are handled, or to some other aspect of the video processing pipeline in the script. Is there any known issue or specific configuration I should be aware of that might help address this?
Also, how should I think about the optimal size of the video frames I should be feeding this VILA1.5-13B model?
Any insights or suggestions would be greatly appreciated.
Thank you!