RobotecAI / rai

RAI is a multi-vendor agent framework for robotics, utilizing Langchain and ROS 2 tools to perform complex actions, defined scenarios, free interface execution, log summaries, voice interaction and more.
Apache License 2.0

Key frames node for camera #213

Open adamdbrw opened 1 week ago

adamdbrw commented 1 week ago

Is your feature request related to a problem? Please describe.

Vision Language Models (VLMs) are useful for image-based understanding. In robotics, the environment is dynamic and images from camera sensors arrive at a high frequency. However, VLMs have high response latencies, so a robot cannot pass every frame to the model, which leaves a gap in how it perceives its environment. Extracting key frames from the image stream can help close this gap. Key frame extraction should be configurable and happen in real time. It is acceptable for the latest key frame to be up to one second old. Key frames should be matched with poses, and possibly other data, at their timestamp.

Describe the solution you'd like

A node that continuously processes visual data from an Image topic (we can start with one topic) and extracts key frames for VLMs. The key frames can be presented as an image mosaic, provided the VLM can understand mosaics.
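
To make the extraction step concrete, here is a minimal sketch of one possible key frame criterion: accept a frame when its mean absolute pixel difference from the last accepted key frame reaches a configurable threshold. The issue does not prescribe a criterion, so this is an assumption, and `Frame`, `mean_abs_diff` and `KeyFrameSelector` are hypothetical names that use plain C++ types rather than ROS 2 messages.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical grayscale frame: one byte per pixel plus a timestamp in nanoseconds.
struct Frame {
  std::int64_t stamp_ns;
  std::vector<std::uint8_t> pixels;
};

// Mean absolute per-pixel difference between two frames of equal, non-zero size.
double mean_abs_diff(const Frame& a, const Frame& b) {
  assert(!a.pixels.empty() && a.pixels.size() == b.pixels.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < a.pixels.size(); ++i) {
    sum += std::abs(static_cast<int>(a.pixels[i]) - static_cast<int>(b.pixels[i]));
  }
  return sum / static_cast<double>(a.pixels.size());
}

// Accepts a frame as a key frame when it differs enough from the last key frame.
// The threshold is the configurable part (assumed criterion, not from the issue).
class KeyFrameSelector {
 public:
  explicit KeyFrameSelector(double threshold) : threshold_(threshold) {}

  // Returns true if `frame` was accepted as a new key frame.
  bool offer(const Frame& frame) {
    if (!has_last_ || mean_abs_diff(last_key_, frame) >= threshold_) {
      last_key_ = frame;
      has_last_ = true;
      return true;
    }
    return false;
  }

 private:
  double threshold_;
  bool has_last_ = false;
  Frame last_key_{};
};
```

In a real node, `offer` would be called from the Image subscription callback. Comparing against the last key frame rather than the previous frame means slow, gradual scene changes still produce a key frame eventually.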

This node should be multipurpose and also output an entire task or runtime visual history as a series of key frames for memory and reporting purposes. Such features need not be in the first implementation, but should be kept in mind for the design. A service to get all recent images (since the last call of the service) should be part of the node interface.
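
The "all recent images since the last call" behaviour could be backed by a small thread-safe buffer that is drained on each request. This is only a sketch under assumptions: `KeyFrame` and `RecentKeyFrames` are hypothetical names, and the actual ROS 2 service definition is not specified in this issue.

```cpp
#include <cstdint>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical key frame record; a real node would likely hold a sensor image message.
struct KeyFrame {
  std::int64_t stamp_ns;
  std::vector<std::uint8_t> pixels;
};

// Collects key frames and hands out everything gathered since the previous request.
class RecentKeyFrames {
 public:
  // Called by the extraction side whenever a new key frame is selected.
  void push(KeyFrame kf) {
    std::lock_guard<std::mutex> lock(mutex_);
    pending_.push_back(std::move(kf));
  }

  // Called by the service handler: returns the pending key frames and clears them,
  // so the next call only sees frames added after this one.
  std::vector<KeyFrame> take_recent() {
    std::lock_guard<std::mutex> lock(mutex_);
    std::vector<KeyFrame> out;
    out.swap(pending_);
    return out;
  }

 private:
  std::mutex mutex_;
  std::vector<KeyFrame> pending_;
};
```

The mutex is there because the subscription callback and the service callback may run on different executor threads. Draining on read keeps the "since the last call" semantics simple, at the cost of supporting only one consumer.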

Describe alternatives you've considered

Capturing a single image right at the time of the status update, but this can miss things that happened in between.

Additional context

This is well suited for an rclcpp node.

adamdbrw commented 1 week ago

After giving it a bit of thought, the interface should be:

adamdbrw commented 1 week ago

The same node could keep a history of images as (pose, image, timestamp) entries.
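
A rough sketch of such a history, assuming a fixed-capacity buffer and nearest-timestamp pose matching. Neither choice is specified in this thread, and `Pose`, `HistoryEntry`, `nearest_pose` and `KeyFrameHistory` are hypothetical names:

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <iterator>
#include <map>
#include <utility>
#include <vector>

// Hypothetical planar pose; a real node would likely use a ROS 2 pose message.
struct Pose {
  double x;
  double y;
  double yaw;
};

// One history entry, mirroring the (pose, image, timestamp) tuple.
struct HistoryEntry {
  std::int64_t stamp_ns;
  Pose pose;
  std::vector<std::uint8_t> image;
};

// Returns the recorded pose closest in time to stamp_ns, or nullptr if none exist.
// The pointer stays valid only while `poses` is not modified.
const Pose* nearest_pose(const std::map<std::int64_t, Pose>& poses,
                         std::int64_t stamp_ns) {
  if (poses.empty()) return nullptr;
  auto after = poses.lower_bound(stamp_ns);  // first pose at or after stamp_ns
  if (after == poses.begin()) return &after->second;
  auto before = std::prev(after);
  if (after == poses.end()) return &before->second;
  const bool before_is_closer =
      (stamp_ns - before->first) <= (after->first - stamp_ns);
  return before_is_closer ? &before->second : &after->second;
}

// Bounded history: once full, the oldest entry is dropped for each new one.
class KeyFrameHistory {
 public:
  explicit KeyFrameHistory(std::size_t capacity) : capacity_(capacity) {}

  void add(HistoryEntry entry) {
    if (capacity_ == 0) return;
    if (entries_.size() == capacity_) entries_.pop_front();
    entries_.push_back(std::move(entry));
  }

  const std::deque<HistoryEntry>& entries() const { return entries_; }

 private:
  std::size_t capacity_;
  std::deque<HistoryEntry> entries_;
};
```

A bounded buffer keeps memory use predictable for long runs. A full task or runtime history for reporting would probably need a different retention policy, such as writing older entries to disk.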