Our framework runs and has been tested on a TIAGo robot. The higher levels of the pipeline (up to plan generation) should be executable on any ROS-based system with RGB-D cameras (camera topic names may differ).
However, the low-level actions should be designed according to the specific robot, since different robots have different capabilities (mobile vs. fixed base, arm vs. no arm, etc.).
Therefore, the GPT prompts in the src/agents.py file and the low-level grounders in src/low_level_execution.py and src/primitive_actions.py should be adapted to the desired robotic platform.
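As an illustration only, a navigation primitive for a mobile base could wrap a standard move_base action client; the function name, signature, and goal layout below are assumptions for this sketch and do not reflect the actual contents of src/primitive_actions.py.

```python
# Illustrative sketch of grounding a "go to" primitive on a mobile base.
# Call this from an initialized ROS node; names and signature are assumptions.
import rospy
import actionlib
from move_base_msgs.msg import MoveBaseAction, MoveBaseGoal


def go_to(x, y, orientation=(0.0, 0.0, 0.0, 1.0), frame="map"):
    """Send the base to (x, y) in the given frame and wait for the result."""
    client = actionlib.SimpleActionClient("move_base", MoveBaseAction)
    client.wait_for_server()

    goal = MoveBaseGoal()
    goal.target_pose.header.frame_id = frame
    goal.target_pose.header.stamp = rospy.Time.now()
    goal.target_pose.pose.position.x = x
    goal.target_pose.pose.position.y = y
    (goal.target_pose.pose.orientation.x,
     goal.target_pose.pose.orientation.y,
     goal.target_pose.pose.orientation.z,
     goal.target_pose.pose.orientation.w) = orientation

    client.send_goal(goal)
    client.wait_for_result()
    return client.get_state()
```

A fixed-base platform would instead omit or stub out navigation primitives of this kind and keep only the manipulation ones.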
To download the YOLO-World model, visit the official repository, download the .pth weights, and put them in the config/yolow/ directory.
To download the pre-trained weights and the encoder and decoder ONNX models, follow the instructions on the official repository. After the download, put the models and weights in the config/efficientvitsam/ directory.
For our tests, we used the l2 model.
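As a quick sanity check, you can verify that the downloaded files ended up in the expected directories. The filenames in the sketch below are placeholders, not the actual names of the released weights; replace them with the files you downloaded.

```python
# Quick sanity check that the downloaded weights are where the pipeline expects them.
# The filenames below are placeholders: replace them with the files you actually downloaded.
import os

expected_files = [
    "config/yolow/yolo_world_weights.pth",      # YOLO-World .pth weights (placeholder name)
    "config/efficientvitsam/l2.pt",             # EfficientViT-SAM l2 checkpoint (placeholder name)
    "config/efficientvitsam/l2_encoder.onnx",   # encoder ONNX model (placeholder name)
    "config/efficientvitsam/l2_decoder.onnx",   # decoder ONNX model (placeholder name)
]

for path in expected_files:
    status = "OK" if os.path.isfile(path) else "MISSING"
    print(f"{status:8s} {path}")
```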
Download the spaCy English model:
python -m spacy download en_core_web_sm
Create the conda environment:
conda create -n empower python=3.8.18
then activate it with
conda activate empower
then install the dependencies
pip install -r requirements.txt
mim install mmcv==2.0.0
mim install mmyolo mmdet
Also, set your OpenAI API key:
conda env config vars set OPENAI_API_KEY=<YOUR API KEY>
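Note that conda environment variables only take effect after the environment is reactivated. A minimal check (sketch) that the key is then visible to the Python scripts:

```python
# Verify the API key is exposed in the activated conda environment.
import os

assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set in this environment"
```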
Clone this repo inside your catkin workspace:
git clone https://github.com/Lab-RoCoCo-Sapienza/empower.git
Build the package with catkin
catkin build
Before starting, set the following ROS parameters:
rosparam set /use_case <name>
to name the task you want to solve. If your robotic platform has speakers, you may also want to enable speech output:
rosparam set /speech True
otherwise set it to False.
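These parameters are read at runtime by the pipeline's nodes. Below is a minimal sketch of how a script might consume them; the node name and default values are assumptions.

```python
# Sketch of reading the parameters set above from a ROS node.
# The node name and defaults are illustrative assumptions.
import rospy

rospy.init_node("empower_params_example", anonymous=True)
use_case = rospy.get_param("/use_case", "my_task")
speech = rospy.get_param("/speech", False)
rospy.loginfo("Use case: %s (speech enabled: %s)", use_case, speech)
```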
Run:
rosrun vlm_grasping create_pcl.py
to create the point cloud and the image of the desired scene.
Modify the file src/detection.py to add the task you want to perform to the task dictionary. The key should be the task name you set in the /use_case ROS parameter, while the value is a natural-language description of the task.
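For example, assuming the task dictionary is a plain name-to-description mapping (the exact variable name in src/detection.py may differ), an entry could look like this:

```python
# Hypothetical entry; the dictionary's actual name in src/detection.py may differ.
tasks = {
    # key: the name set with `rosparam set /use_case <name>`
    # value: a natural-language description of the task
    "clear_table": "Pick up all the objects on the table and place them inside the box.",
}
```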
In two different terminals run:
python3 models_cacher.py <name>
and
python3 execute_task.py
The models cacher keeps the models loaded in memory, so that the execute_task script can be run multiple times without reloading them, saving execution time.
Once it finishes, execute_task will produce the plan and dump it in the corresponding folder.
Now, run:
rosrun vlm_grasping color_pcl.py
to obtain the grounded point cloud with the segmentation masks projected onto it.
Then, run:
rosrun vlm_grasping spawn_marker_centroids.py
to ground the detections in the RViz scene and to keep it populated.
Finally, in another terminal run:
rosrun vlm_grasping low_level_execution.py
to execute the actions needed to achieve the task.