charliexchen / Ant-IRL

Building a physical robot of the ant from OpenAI Gym, then training it to walk with actor-critic.

Ant-IRL

Ant-v2 (a.k.a. Antony) is now a fairly standard RL task from the OpenAI Gym library. This project aims to achieve three goals:

...in addition to staying sane over the third lockdown in the UK.

Improvement in the robot's speed after training with AAC and experience replay in the simplified environment. Left: pretrained agent with some random noise in its actions. Middle: agent after optimising for 33 episodes. Right: agent after optimising for 91 episodes.

Building the Robot

First, the robot was designed in Siemens Solid Edge CE (a powerful but free CAD package) and then 3D printed on my Creality Ender 3. The design consists of 8 servos in a similar configuration to Ant-v2 (albeit with shorter forelegs to reduce the load on the tiny 9g servos). For control, I used an Arduino Nano, which communicates with my desktop over USB serial and with the servo controller over I2C. The position of each servo can be set by sending two bytes of data.
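As an illustrative sketch of what the host side of such a two-byte command could look like (the port name, baud rate, byte layout and `set_servo` helper are all assumptions, not the project's exact protocol):

```python
import serial  # pyserial

# Port and baud rate are assumptions for illustration.
link = serial.Serial("/dev/ttyUSB0", baudrate=115200, timeout=1)

def set_servo(servo_index: int, position: int) -> None:
    """Send a two-byte command: one byte selects the servo, one sets its position."""
    assert 0 <= servo_index < 8 and 0 <= position <= 255
    link.write(bytes([servo_index, position]))

# Example: centre the first servo (assumed index/position mapping).
set_servo(0, 127)
```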

For sensing, the Arduino is also connected to a gyro/accelerometer over I2C, which gives us acceleration, the gravity vector and Euler angles. Thanks to the MPU-6050 sensor board's onboard DMP feature, no further noise reduction (such as Kalman or complementary filters) is needed for these sensors. Unfortunately, I eventually had to discard this data since it just wasn't that helpful for my robot.

As a future upgrade, I have designed microswitch holders for the forelegs, which will let the robot detect when its legs are in contact with the ground.

The robot runs on a 5V 2A DC power supply. Power and the USB connection are maintained through six thin enamelled copper wires (the kind usually used for electromagnets). This minimises any forces a stiff USB/power cable might impart on the robot.

Location Detection

The robot's location and orientation relative to the environment are detected via the large ArUco marker on top of the robot and the markers on the corners of the environment, using the aruco module in OpenCV. Under the right lighting conditions we get the robot's location in over 99% of frames, and we can fall back to the last frame's position or interpolate for any missing frames.
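A minimal sketch of the marker-detection step, assuming a 4x4 dictionary and marker ID 0 for the robot (both assumptions), and using the older function-style aruco API (newer OpenCV versions expose an ArucoDetector class instead):

```python
import cv2
import numpy as np

ARUCO_DICT = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
ROBOT_ID = 0  # assumed ID of the large marker on top of the robot

def locate_robot(frame):
    """Return the centre and heading angle of the robot marker, or None if not seen."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    corners, ids, _ = cv2.aruco.detectMarkers(gray, ARUCO_DICT)
    if ids is None or ROBOT_ID not in ids.flatten():
        return None  # caller falls back to the last known position / interpolation
    quad = corners[list(ids.flatten()).index(ROBOT_ID)][0]  # 4x2 corner array
    centre = quad.mean(axis=0)
    forward = quad[1] - quad[0]  # top edge of the marker gives the heading
    return centre, np.arctan2(forward[1], forward[0])
```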

Using OpenCV to correct the raw webcam input. Note the small amount of parallax when the camera is not directly overhead; this is usually small enough not to affect the results.

The capture setup consists of a cheap webcam on an angled tripod, pointing downwards. With the locations of the corners of the environment known, perspective and fisheye distortion can be corrected with standard OpenCV operations.
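A minimal sketch of the perspective-correction step, assuming the corner markers have already been located and that fisheye undistortion is handled separately (the function name, output size and corner ordering are assumptions):

```python
import cv2
import numpy as np

def correct_view(frame, corner_pixels, out_size=(800, 800)):
    """Warp the webcam frame so the arena corners map to a square image.

    corner_pixels: pixel coordinates of the four corner markers, ordered
    top-left, top-right, bottom-right, bottom-left (an assumed convention).
    """
    w, h = out_size
    src = np.float32(corner_pixels)
    dst = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    homography = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(frame, homography, out_size)
```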

Camera setup and simple walk. Note the trailing wire connected to power and USB.

Building an Environment

With the above setup, we can implement a state-action-reward loop, which forms the basis of an MDP. The robot walks forward, collects data, trains on that data using the RL algorithm of choice, and then resets.
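A rough sketch of what this loop can look like (the env/agent interfaces here are assumptions, not the repo's actual classes):

```python
def run_episode(env, agent, max_steps=200):
    """One pass of the state-action-reward loop; transitions feed the learner."""
    transitions = []
    state = env.reset()  # hand-engineered gait walks the robot back to the start
    for _ in range(max_steps):
        action = agent.act(state)
        next_state, reward, done = env.step(action)  # send servo targets, read camera
        transitions.append((state, action, reward, next_state, done))
        state = next_state
        if done:
            break
    return transitions
```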

A hand-engineered walk cycle is also implemented. This was done for three reasons:

Walking to specified locations by combining the camera feed and the hand-engineered gait. This is used to reset the environment.
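For illustration, an open-loop gait of this kind can be as simple as phase-offset sinusoids per leg; the constants and phase pattern below are assumptions, not the gait actually used here:

```python
import numpy as np

def gait_targets(t, n_legs=4, amplitude=0.4, frequency=1.5):
    """Servo targets (radians) for a simple open-loop walk cycle.

    Diagonal leg pairs are driven half a cycle out of phase, giving a
    trot-like pattern; all constants here are illustrative assumptions.
    """
    phase = 2.0 * np.pi * frequency * t
    targets = []
    for leg in range(n_legs):
        offset = np.pi * (leg % 2)      # alternate diagonal pairs
        hip = amplitude * np.sin(phase + offset)
        knee = amplitude * np.cos(phase + offset)
        targets.extend([hip, knee])
    return np.array(targets)  # 8 values, one per servo
```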

Implementing and Running AAC

I implemented Advantage Actor-Critic (AAC) using JAX and Haiku. JAX was a fun framework to use, since it is a lot more flexible than others I've tried in the past, such as Keras. Almost any numpy operation can be accelerated, vectorised or differentiated, so I can see it handling more unconventional architectures much more gracefully, even though the overall feature gap compared to other frameworks isn't huge. In particular, I liked how autograd is exposed through grad, which is treated as an operator on functions, much like how gradients are written mathematically.
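A small example of this: grad takes a function and returns its gradient function, which composes with jit and vmap (the toy loss below is purely illustrative):

```python
import jax
import jax.numpy as jnp

def loss(params, x, y):
    """Squared error of a linear model -- any numpy-style function works."""
    return jnp.mean((x @ params - y) ** 2)

# grad acts as an operator on functions, returning d(loss)/d(params),
# which can itself be jit-compiled or vectorised with vmap.
grad_fn = jax.jit(jax.grad(loss))

params = jnp.zeros(3)
x = jnp.ones((5, 3))
y = jnp.ones(5)
print(grad_fn(params, x, y))  # gradient with the same shape as params
```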

I made the predictor and AAC classes as generic as possible for future projects, and tested that they work on CartPole.

Since this is a physical environment with a single agent, we are very data-constrained. As such, I first ran a number of episodes using the fixed walk cycle, and used the collected data to pretrain the value function and an actor based on the fixed gait. The AAC implementation above, combined with experience replay, is then used to improve the agent. The initial data also allows for some hyperparameter selection, which was done with a simple random search.
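Building on the run_episode sketch above, the overall procedure looks roughly like the following (the episode counts, batch size, replay size and the agent.pretrain/agent.update interfaces are assumptions):

```python
import random
from collections import deque

N_PRETRAIN_EPISODES = 20   # assumed counts, for illustration only
N_RL_EPISODES = 100
BATCH_SIZE = 256

replay = deque(maxlen=10_000)

# 1. Roll out the fixed walk cycle and pretrain on the collected transitions.
for _ in range(N_PRETRAIN_EPISODES):
    replay.extend(run_episode(env, fixed_gait_agent))
agent.pretrain(list(replay))  # fit the critic and imitate the fixed gait

# 2. Improve the agent with AAC plus experience replay.
for _ in range(N_RL_EPISODES):
    replay.extend(run_episode(env, agent))
    batch = random.sample(list(replay), min(BATCH_SIZE, len(replay)))
    agent.update(batch)  # one actor-critic update on the sampled batch
```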

Running multiple episodes with the fixed agent in the environment to collect training data. Red rectangles terminate the episode with negative reward, green ones with positive reward. After an episode, the agent returns to the red circle to reset the environment. Note the ArUco markers in the corners of the environment, used to ensure a consistent perspective.

Other things to note about the AAC implementation:

Fig 1: loss of the value critic at each episode. Since AAC is on-policy, we should not expect this value to decrease while the policy has not yet converged. Fig 2: objective of the actor policy. This is the log-likelihood scaled by the advantage, so we expect it to go up as the agent improves. Fig 3: episodic cumulative reward -- the agent moves faster as the policy improves.
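Concretely, the quantities plotted above correspond to loss/objective terms along these lines (a simplified sketch; the actual implementation in this repo may differ):

```python
import jax
import jax.numpy as jnp

def critic_loss(values, returns):
    """Mean squared error between predicted state values and observed returns (Fig 1)."""
    return jnp.mean((returns - values) ** 2)

def actor_objective(log_probs, values, returns):
    """Log-likelihood of the taken actions, scaled by the advantage (Fig 2).

    stop_gradient keeps the advantage fixed so only the policy is updated here.
    """
    advantage = jax.lax.stop_gradient(returns - values)
    return jnp.mean(log_probs * advantage)
```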

Tuning RL algorithms is hard. To make my life easier, I followed some best practices:

References

Main papers I consulted: