ger01d / opencat-gym

Gym reinforcement learning environment for OpenCat robots.
MIT License

How do you calculate the reward for Bittle robot? #2

Closed: metalwhale closed this issue 4 months ago

metalwhale commented 4 months ago

Hello! First of all, thank you so much for sharing your work. This is awesome and it will help me a lot when learning RL for real-life robots. I have some questions. Could you please answer when you have time? (I also asked on Petoi forum. I'm sorry for any inconvenience, but I think asking on GitHub can help more people who have the same question as me.)

How does your model make the robot move forward? As I understand it, you use IMU acceleration data along with yaw, pitch, roll, and the current joint angles as inputs for the neural network. However, this data alone doesn't indicate whether the robot is moving forward; velocity, not just acceleration, is needed to calculate the reward.

Am I missing something? Thank you in advance!

ger01d commented 4 months ago

Hello metalwhale,

during training in simulation, the reward function receives the x-position of the robot (the forward direction) as an input and calculates the reward for each step. Later, the speed and position of the robot are unknown to it, but it is still able to move forward based on its orientation angles and its joint angles (and the joint-angle history). Please note that the speed and x-position aren't part of the agent's observation space.

Check lines 74-75, 181, and also 196:

# The observation space consists of the torso roll, pitch, the
# angular velocities, and a history of the last 30 joint angles.

# [0][0] selects the x-component of the base position returned by PyBullet.
current_position = p.getBasePositionAndOrientation(self.robot_id)[0][0]
movement_forward = current_position - last_position
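
For orientation, here is a minimal sketch of how such a per-step forward reward could be computed and carried between steps. robot_id and last_x are placeholders, so treat this as an illustration rather than the repository's actual implementation:

import pybullet as p

def step_reward(robot_id, last_x):
    """Reward the forward progress (change in base x-position) since the last step."""
    # getBasePositionAndOrientation returns ((x, y, z), quaternion); [0][0] is x.
    current_x = p.getBasePositionAndOrientation(robot_id)[0][0]
    reward = current_x - last_x   # positive when the robot moved forward
    return reward, current_x      # current_x becomes last_x for the next step
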
metalwhale commented 4 months ago

@ger01d

Please note that the speed and x-position aren't part of the agent's observation space.

Thank you for sharing! This is interesting. I will dig deeper into your code.

I have 2 other questions:

metalwhale commented 4 months ago

@ger01d

I think I get it. The observation space includes joint angles and gyro data, which represent the state of the robot itself and don't include information about the surrounding environment. When you trained the model in simulation, you calculated the reward using speed and x-position. The model will later be deployed to a real robot, but this time we don't need to retrieve the speed and x-position, since they were only needed during training.

Am I correct? :D

If this is true, I wonder whether it is possible, and how much the performance could be improved, if we used real data to train the model rather than data from simulation.
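
As a rough illustration of that split, here is a sketch of how an observation vector like the one described in the code comment above (roll, pitch, angular velocities, and a joint-angle history) might be assembled in PyBullet. The function and variable names are hypothetical, not taken from the repository:

from collections import deque

import numpy as np
import pybullet as p

# History of the last 30 joint-angle vectors, as mentioned in the repository's comment.
joint_angle_history = deque(maxlen=30)

def build_observation(robot_id, joint_ids):
    """Observation built from on-board quantities only: no world x-position or speed."""
    _, orientation = p.getBasePositionAndOrientation(robot_id)
    roll, pitch, _yaw = p.getEulerFromQuaternion(orientation)
    _, angular_velocity = p.getBaseVelocity(robot_id)  # gyro-like (wx, wy, wz)
    joint_angles = [p.getJointState(robot_id, j)[0] for j in joint_ids]
    joint_angle_history.append(joint_angles)
    # In a real environment the history would be pre-filled so the shape stays fixed.
    return np.concatenate([[roll, pitch], angular_velocity,
                           np.ravel(list(joint_angle_history))])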

ger01d commented 4 months ago

Yes, that's correct. ;)

I already tried training Bittle on the real hardware with the RL algorithm SAC (Soft Actor-Critic), which is said to be more sample efficient. In that case I used a cheap high-speed camera to track an ArUco marker (using the aruco library) to retrieve the movement in the x-direction. But the training did not make good progress, and it was quite time-consuming to set the robot back each time an episode restarted. Training in simulation has the benefit that you can generate a lot of training data in a decent amount of time, and it's safer for the robot. The downside is the sim2real transfer, which is quite challenging because of the differences between simulation and reality.
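
For reference, a rough sketch of what camera-based x-tracking with OpenCV's aruco module could look like (using the older, pre-4.7 opencv-contrib-python API). The calibration values and marker size here are placeholders, not the ones used in the experiment above:

import cv2
import numpy as np

# Placeholder intrinsics; a real setup would use values from a camera calibration.
camera_matrix = np.array([[600.0, 0.0, 320.0],
                          [0.0, 600.0, 240.0],
                          [0.0, 0.0, 1.0]])
dist_coeffs = np.zeros(5)
marker_length = 0.04  # marker side length in metres (assumed)

aruco_dict = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)

def marker_x_position(frame):
    """Return the marker's x-coordinate in the camera frame, or None if not visible."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    corners, ids, _rejected = cv2.aruco.detectMarkers(gray, aruco_dict)
    if ids is None:
        return None
    _rvecs, tvecs, _ = cv2.aruco.estimatePoseSingleMarkers(
        corners, marker_length, camera_matrix, dist_coeffs)
    return tvecs[0][0][0]  # x-component of the first detected marker's translation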

metalwhale commented 4 months ago

@ger01d

That makes sense!

I'm wondering if it is possible to use the acceleration data from the IMU to determine which direction the robot is moving. Theoretically (IMHO), at the moment we send a command to the motors to make the robot move forward, the acceleration on the x-axis becomes positive. If we can capture this data at that exact time, we can use it to compute the reward: the larger the cumulative positive x-axis acceleration, the further the robot has moved forward.

What do you think? TBH I'm not sure, but I want to try this.

ger01d commented 4 months ago

Theoretically you can of course use the IMU accelerations to determine the movement in space, since velocity is the integral of acceleration and position is the integral of velocity. On the practical side you will have to find methods to reduce the noise in the signal, because the RL algorithm will be very sensitive to the data. For instance, if the sensor happens to report a negative acceleration at the moment you collect the data, your reward will be negative. If that reading was just noise and the real movement was in the positive direction, the RL algorithm will draw the wrong conclusions.
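
To make that concrete, here is a toy sketch of double-integrating a noisy x-acceleration signal with a simple exponential low-pass filter. It is purely illustrative, not code from the project; a real pipeline would also need bias removal and gravity compensation:

def integrate_acceleration(accel_samples, dt, alpha=0.2):
    """Estimate velocity and displacement from x-axis acceleration samples.

    alpha is the coefficient of a simple exponential low-pass filter;
    even small bias or noise accumulates quickly after double integration.
    """
    filtered = 0.0
    velocity = 0.0
    displacement = 0.0
    for a in accel_samples:
        filtered = alpha * a + (1.0 - alpha) * filtered  # smooth the raw reading
        velocity += filtered * dt                        # v is the integral of a
        displacement += velocity * dt                    # x is the integral of v
    return velocity, displacement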

metalwhale commented 4 months ago

@ger01d

Thank you so much for your kind support! I'm still a newbie to reinforcement learning and really appreciate your help.

I have no more questions, but if you don't mind, please let me ask again later when I have other ideas to discuss. Closing this issue with faith ;)