dusty-nv / jetson-containers

Machine Learning Containers for NVIDIA Jetson and JetPack-L4T

Ros2 VLM Node: Options for Video Input & Output #612

Closed Fibo27 closed 2 months ago

Fibo27 commented 2 months ago

@dusty-nv: In your ROS Deep Learning repo (https://github.com/dusty-nv/ros_deep_learning), you have included the choice of different video input types and codecs, and likewise provided choices for the output type. In the launch folder of that package, you have options such as https://github.com/dusty-nv/ros_deep_learning/blob/master/launch/video_output.ros2.launch and https://github.com/dusty-nv/ros_deep_learning/blob/master/launch/video_source.ros2.launch. Further, to incorporate these options you have specified what needs to be included in https://github.com/dusty-nv/ros_deep_learning/blob/master/package.ros2.xml, so these options get pulled in during the ROS2 package build process. These launch files are extremely useful and are still relevant for a variety of use cases. Honestly, I haven't come across any other ROS2 utility that does this so well - thank you for that.
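For reference, this is roughly how I picture reusing those launch files from another package's ROS2 Python launch file. This is an untested sketch, and the argument names (`input`, `input_codec`, `output`) are just my reading of those launch files, so they may need adjusting:

```python
# Rough sketch (untested): including ros_deep_learning's ROS2 launch files from a
# Python launch file in another package. Argument names are assumptions taken
# from reading video_source.ros2.launch and video_output.ros2.launch.
import os

from ament_index_python.packages import get_package_share_directory
from launch import LaunchDescription
from launch.actions import IncludeLaunchDescription
from launch.launch_description_sources import AnyLaunchDescriptionSource


def generate_launch_description():
    launch_dir = os.path.join(
        get_package_share_directory('ros_deep_learning'), 'launch')

    # Video source node: receives the RTP stream from the remote robot
    video_source = IncludeLaunchDescription(
        AnyLaunchDescriptionSource(
            os.path.join(launch_dir, 'video_source.ros2.launch')),
        launch_arguments={'input': 'rtp://@:1234', 'input_codec': 'mjpeg'}.items(),
    )

    # Video output node: renders to the local display
    video_output = IncludeLaunchDescription(
        AnyLaunchDescriptionSource(
            os.path.join(launch_dir, 'video_output.ros2.launch')),
        launch_arguments={'output': 'display://0'}.items(),
    )

    return LaunchDescription([video_source, video_output])
```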

I was wondering if the same options could be incorporated for the nano LLMs as well.

The set-up for the use case I have in mind:

1. A remote mobile robot with limited compute power, such as an Orin Nano, transmits its video feed using RTP/RTSP. It also has other sensors (lidar, sonar, IMU, etc.) and runs ROS2 nodes publishing and subscribing to various topics.
2. A larger stationary compute environment, such as an Orin AGX/NX, runs the navigation server and optimizes the path for the robot (or a cluster of robots) by inferencing on the received data, which among other things includes the output of a nano LLM package. The navigation commands for the robot are published as a ROS2 topic, which the motor driver then uses to navigate the remote robot.

I am trying to create a ROS node using the VLM example in https://github.com/NVIDIA-AI-IOT/ros2_nanollm that you directed me to earlier.

My query: how can I add video input and output functionality similar to that in your ros_deep_learning repo?

FYI, without using ROS, I can receive the video feed from a remote robot with the following command, and it runs without any issue:

```bash
jetson-containers run $(autotag nano_llm) python3 -m nano_llm.agents.video_query --api=mlc \
  --model Efficient-Large-Model/VILA1.5-3b --max-context-len 256 --max-new-tokens 32 \
  --video-input-codec mjpeg --video-input rtp://@:1234 --video-output display://0
```

However, I want to run this as a ROS2 package.
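For what it's worth, here is a rough sketch of how I imagine those same CLI options could be exposed as ROS2 node parameters instead of being hard-coded. The parameter names are my own invention, and the `VideoQuery` keyword arguments simply mirror the flags in the command above:

```python
# Sketch only: a ROS2 node that reads the video I/O settings from node
# parameters and passes them to nano_llm's VideoQuery agent. The parameter
# names are illustrative; the VideoQuery kwargs mirror the CLI flags above.
import rclpy
from rclpy.node import Node
from nano_llm.agents.video_query import VideoQuery


class VideoQueryNode(Node):

    def __init__(self):
        super().__init__('video_query_node')

        # Declare parameters with the same defaults as the CLI invocation
        self.declare_parameter('video_input', 'rtp://@:1234')
        self.declare_parameter('video_input_codec', 'mjpeg')
        self.declare_parameter('video_output', 'display://0')
        self.declare_parameter('model', 'Efficient-Large-Model/VILA1.5-3b')

        self.agent = VideoQuery(
            api='mlc',
            model=self.get_parameter('model').value,
            max_context_len=256,
            max_new_tokens=32,
            video_input=self.get_parameter('video_input').value,
            video_input_codec=self.get_parameter('video_input_codec').value,
            video_output=self.get_parameter('video_output').value,
        )


def main(args=None):
    rclpy.init(args=args)
    node = VideoQueryNode()
    node.agent.run()  # blocks; parameters were read at construction time
    node.destroy_node()
    rclpy.shutdown()


if __name__ == '__main__':
    main()
```

The values could then be overridden at launch time or on the command line, e.g. `ros2 run <pkg> <node> --ros-args -p video_input:=rtp://@:1234`.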

Fibo27 commented 2 months ago

I tried some more and the following code works!

I would like to somehow extract certain words from the inference that could trigger an alarm (a rough sketch of one way to do that follows after the code below). Also, as you will note, there is ROS2 output that can be subscribed to, i.e. an Image or a Message. I look forward to suggestions. I am also not sure if this is the right forum for this post or whether I should use https://github.com/NVIDIA-AI-IOT instead; please let me know. Thanks.

```python
import rclpy
from rclpy.node import Node
from nano_llm.agents.video_query import VideoQuery


class Video_Query_Subscriber(Node):

    def __init__(self):
        super().__init__('video_query_subscriber')

        # Start the VideoQuery agent: it pulls frames from the RTP stream,
        # runs the VLM query loop, and renders the output to the display.
        self.output = VideoQuery(
            api='mlc',
            model='Efficient-Large-Model/VILA1.5-3b',
            max_new_tokens=32,
            max_context_len=256,
            video_input_codec='mjpeg',
            video_input='rtp://@:1234',
            video_output='display://0',
        ).run()


def main(args=None):
    rclpy.init(args=args)

    video_query_subscriber = Video_Query_Subscriber()

    rclpy.spin(video_query_subscriber)

    # Destroy the node explicitly
    # (optional - otherwise it will be done automatically
    # when the garbage collector destroys the node object)
    video_query_subscriber.destroy_node()
    rclpy.shutdown()


if __name__ == '__main__':
    main()
```
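As a starting point for the keyword-triggered alarm, here is a rough sketch of a separate subscriber node. It assumes the VLM's text output is republished on a `std_msgs/String` topic; the topic names `/vlm/output` and `/vlm/alarm` are purely placeholders, not the actual topics published by ros2_nanollm, so they would need to be remapped to whatever the query node really publishes:

```python
# Sketch only: listens to a text topic carrying the VLM output and raises an
# alarm when any of the configured keywords appear. Topic names are placeholders.
import rclpy
from rclpy.node import Node
from std_msgs.msg import Bool, String


class KeywordAlarm(Node):

    def __init__(self):
        super().__init__('keyword_alarm')

        # Words that should trigger an alarm; exposed as a string-array parameter
        self.declare_parameter('keywords', ['fire', 'smoke', 'person'])

        self.subscription = self.create_subscription(
            String, '/vlm/output', self.on_text, 10)   # placeholder topic
        self.alarm_pub = self.create_publisher(Bool, '/vlm/alarm', 10)

    def on_text(self, msg: String):
        keywords = self.get_parameter('keywords').value
        text = msg.data.lower()
        hit = any(word.lower() in text for word in keywords)
        if hit:
            self.get_logger().warn(f'Alarm keyword detected in: "{msg.data}"')
        self.alarm_pub.publish(Bool(data=hit))


def main(args=None):
    rclpy.init(args=args)
    node = KeywordAlarm()
    rclpy.spin(node)
    node.destroy_node()
    rclpy.shutdown()


if __name__ == '__main__':
    main()
```

Running it alongside the query node would then just be a matter of remapping the placeholder topic and subscribing to `/vlm/alarm` from the navigation stack.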