RAI is a multi-vendor agent framework for robotics, built on LangChain and ROS 2 tools, that supports complex actions, defined scenarios, free interface execution, log summaries, voice interaction, and more.
Is your feature request related to a problem? Please describe.
Vision Language Models (VLMs) are useful for image-based understanding. In robotics, the environment is dynamic and images from camera sensor(s) arrive at a high frequency. However, VLMs have high response latencies, so there is a gap in how robots can perceive their environment. Key-frame extraction should be configurable and happen in real time; data up to one second old is acceptable for the last frame. Key frames should be matched with poses, and possibly other data, at their timestamps.
Describe the solution you'd like
A node that continuously processes visual data from an Image topic (we can start with one) and extracts key frames for VLMs, which can be presented as an image mosaic (provided the VLM can understand mosaics).
This node should be multi-purpose and also able to output an entire task or runtime visual history as a series of key frames for memory and reporting purposes. Such features need not be in the first implementation, but should be kept in mind for the design. A service to get all recent images (since the last call of the service) should be part of the node interface.
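As a sketch of the mosaic idea: accepted key frames can be tiled into a single grid image so one VLM call sees the whole interval. The frames here are plain grayscale buffers; in the node they would come from converted `sensor_msgs/Image` messages. `make_mosaic` and `Frame` are illustrative names, not an existing RAI API.

```cpp
// Tile key frames into one mosaic image for a single VLM query.
#include <cstddef>
#include <vector>

using Frame = std::vector<unsigned char>;  // row-major grayscale, w*h pixels

// Arrange `frames` (each w x h) into a grid with `cols` columns.
// Unused tiles in the last row are left black.
Frame make_mosaic(const std::vector<Frame>& frames,
                  std::size_t w, std::size_t h, std::size_t cols) {
    const std::size_t rows = (frames.size() + cols - 1) / cols;
    Frame out(cols * w * rows * h, 0);
    for (std::size_t i = 0; i < frames.size(); ++i) {
        const std::size_t tx = (i % cols) * w;  // tile origin, x
        const std::size_t ty = (i / cols) * h;  // tile origin, y
        for (std::size_t y = 0; y < h; ++y)
            for (std::size_t x = 0; x < w; ++x)
                out[(ty + y) * cols * w + tx + x] = frames[i][y * w + x];
    }
    return out;
}
```

A design note: building the mosaic lazily, only when a service call arrives, keeps the per-frame hot path cheap.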
Describe alternatives you've considered
Capturing an image right at the time of the status update; however, this can miss things that happened in between updates.
Additional context
This is well suited for an rclcpp node.
After giving it some thought, the interface should be:
A service to query whether there was anything relevant / changed in the camera feed relative to last query.
A service to get a single (image + text) pair that summarizes, visually and textually, what happened in the interval since the last query, as well as possible.
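The two services above could be sketched as ROS 2 service definitions. These names and fields are hypothetical, purely to make the proposed interface concrete:

```
# FeedChanged.srv (hypothetical) -- anything relevant since the last query?
---
bool changed

# FeedSummary.srv (hypothetical) -- one image + text pair for the interval
---
sensor_msgs/Image key_frame   # representative image (or mosaic) for the interval
string description            # textual summary of what happened
```

Both are request-free (empty request part), since "since the last query" state would live inside the node.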
Implementation might use a similarity index (such as SSIM), methods to caption / describe images, semantic-distance methods such as bag of words, etc.