alxndrTL / Landing-Starships

Make autonomous landing rockets using Deep Reinforcement Learning (Unity ML-Agents)

Using Reinforcement Learning πŸ€– to make rockets land πŸš€

This project uses artificial intelligence, and more precisely deep reinforcement learning, to make an agent learn by itself how to land an orbital-class rocket, Starship. The agent (the algorithm) observes the environment and chooses actions in order to successfully land the rocket. Before training, the agent is freshly initialized and basically chooses actions at random. As training goes on, the agent gets a reward for doing the task we want it to do: landing the rocket. Based on this reward and how it got it, the agent updates the parameters of its neural network to make the actions that led to this reward more probable. This project uses the Unity ML-Agents toolkit.

What is Starship?

Starship is a fully reusable launch vehicle currently being developed by SpaceX. "Starship" refers to two things: the rocket as a whole (i.e. the first stage, named Super Heavy, plus the second stage) as well as the second stage itself. In this project, Starship refers to the second stage. This whole next-gen rocket, with its huge 9-meter diameter, will be capable of delivering more than 100 tons to Low Earth Orbit while having a reusable first and second stage. Starships are currently being built in Boca Chica Village, Texas, and, maybe one day, will bring humans to Mars.

If you want daily updates about the development of Starship, I highly suggest following NASASpaceflight on YouTube and Twitter, as well as taking a look at LabPadre's livestream of the Boca Chica site.

The second stage of the rocket, Starship, will perform a special reentry profile to slow down enough, as shown below (it will be coming from orbit, so it will have a high speed relative to the surrounding air): it will keep a high angle of attack (60°) during its descent and fall like a skydiver, belly first. A few seconds before touchdown, it will execute a maneuver to go from belly-flop to tail-down and actually land vertically, on its landing legs. This is the part of the landing we're recreating here.

What is Reinforcement Learning?

Reinforcement learning is a subfield of AI that enables autonomous agents to learn by trial and error. RL lets an agent interact with an environment (here, the physical environment with the rocket, the landing platform, etc.) and learn from this interaction. The learning is driven by rewards, which tell the agent what task we want it to perform. In this case, the agent receives a reward of +1 for successfully landing the rocket, upright and at almost no speed. If it fails, it receives no reward. Based on the rewards it receives, the agent then deduces which actions are good (for example, firing the main engine near the landing pad to slow down) and which actions are bad (for example, firing the side thrusters when the rocket is perfectly upright). One last thing: in order to perform the right actions at the right time, the agent must know where the rocket is, what its orientation is, its speed, and more. We thus give it, at each timestep (each time it must take an action), an observation of the environment. In this project, the agent is given 11 observations at each timestep, ranging from the position and speed to the rotation of the vehicle.
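As a rough illustration of that setup (not the actual project code, which lives in the Unity C# scripts; the field names and the exact 11-value split are assumptions), the observation and the sparse reward could look like this in Python:

```python
import numpy as np

def get_observation(rocket):
    """Hypothetical 11-value observation vector the agent could see each timestep.
    The exact split (position, velocity, rotation, angular rates) is an assumption."""
    return np.concatenate([
        rocket.position,        # 3 values: where the rocket is relative to the pad
        rocket.velocity,        # 3 values: how fast it is moving, and in which direction
        rocket.rotation,        # 3 values: orientation of the vehicle
        rocket.angular_rates,   # 2 values: how fast the orientation is changing
    ])

def get_reward(landed, upright, speed):
    """Sparse reward: +1 only for landing upright at almost no speed, 0 otherwise."""
    return 1.0 if (landed and upright and speed < 1.0) else 0.0  # threshold is an assumption
```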

The adjective "deep" refers to the fact that the algorithm used here relies on "deep" neural networks, i.e. neural networks with multiple hidden layers. This depth allows the agent to learn complex relationships between the state it observes and the reward it receives.

This learning is done, in this case, using the PPO (Proximal Policy Optimization) algorithm, developed by OpenAI and implemented in Python by the Unity team. Learn more about the deep RL implementation in the Technical Details section.
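To give a flavour of what PPO optimizes, here is a minimal PyTorch-style sketch of its clipped surrogate objective (a generic illustration, not the ML-Agents implementation):

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective at the heart of PPO (returned negated,
    so that a gradient-descent optimizer ends up maximizing it)."""
    ratio = torch.exp(log_probs_new - log_probs_old)   # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```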

How to reproduce this?

The Unity project of this experiment is located inside the LandingStarshipsProject folder of the repo.

To run the model and see the rockets land in action, you have to install the Unity ML-Agents toolkit on your machine. Link to the installation (it should take a few minutes)

To train the model (from scratch or using the pre-trained model, the RocketLanding.nn file), you also need to install the Unity ML-Agents toolkit. You can then follow this guide to start your training. The doc is really useful. Note also that I would suggest using the SN20 Unity project for training, as the LandingStarshipsProject Unity project is really made for creating videos and cool clips (it has a map, particle effects, animations...).

Of course, you can message me if you have trouble with that, or if you have a question. I'm more likely to respond on Twitter.

If you want to learn more about RL, here are some pointers and resources to get you started.

To experiment with Deep RL, you could technically go straight to the implementation, skip the theory and still get some good results. But of course, learning the theory first is essential if you want to understand (or should I say, have a better intuition about) your RL algorithm and tweak it appropriately. So first, for the RL side, I would highly suggest starting with Reinforcement Learning: An Introduction, by Sutton & Barto, which covers the basic concepts of RL. Also, if you're more into learning by video, David Silver's series of videos on YouTube covers essentially the same material as the book.

Now, RL is really great, but applying it to "real-world-ish" problems usually requires deep learning, i.e. deep neural networks. This is called Deep RL. Learning Deep RL isn't hard if you already know RL, though you need some background in supervised learning (gradient descent, neural network basics, backpropagation... Andrew Ng's course on Coursera is really good for a first look at these subjects). Then, I would say you can go with OpenAI's Spinning Up in Deep RL, as well as the Deep RL Bootcamp organized by UC Berkeley.

If you're French, or at ease with the French language, I'm currently posting a course on YouTube about all of this: standard RL, basic supervised learning concepts and Deep RL. As of now, I've covered most of the standard RL part (not finished yet, but soon). I would highly advise you to take a look at it! Link to the course

Discussion and Technical Details

Results

The total number of parameters in the neural network is about 40k. The algorithm trained for a total of about 200 million timesteps, which took about 20 hours on my machine (Intel i7 processor). A trained agent has a mean reward of about 0.98, meaning that after training, the agent lands Starship successfully ~98% of the time. I could have stopped training way before 20 hours and still gotten good performance (>95%), but of course that performance, even the 98% one, isn't acceptable for practical use. With more training and good hyperparameter tuning, I do believe performance could go much higher, but again, the "black box" nature of RL makes its practical use for this task questionable.

For comparison, DeepMind's famous implementation of Deep Q-learning (DQN) on Atari games uses fewer than 10K parameters and trained for a total of 10 million steps with a CNN (i.e. the agent sees screenshots of the game, as we humans do). Two reasons can explain the big difference in the number of steps required to learn: first, well, I made this project on my own over a few weeks. Second, the algorithm used by Google DeepMind, DQN, is fundamentally different from PPO (it's still deep reinforcement learning though). DQN has the advantage of being very data efficient: it can learn from experience generated at any time during training. Conversely, PPO throws data away as soon as it has made an update with it: it can't reuse data from past trajectories.
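To make that difference concrete, here is a toy sketch of the two data regimes (my own illustration, not either implementation):

```python
import random

# Off-policy (DQN-style): a replay buffer keeps old experience reusable.
replay_buffer = []

def dqn_step(transition, update_fn, capacity=100_000, batch_size=64):
    replay_buffer.append(transition)
    if len(replay_buffer) > capacity:
        replay_buffer.pop(0)                      # drop only the oldest experience
    batch = random.sample(replay_buffer, k=min(batch_size, len(replay_buffer)))
    update_fn(batch)                              # can learn from old trajectories

# On-policy (PPO-style): each batch is used for the update, then discarded.
def ppo_iteration(collect_rollouts_fn, update_fn):
    batch = collect_rollouts_fn()                 # fresh trajectories from the current policy
    update_fn(batch)                              # update the policy...
    del batch                                     # ...and throw the data away afterwards
```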

The main difficulty while making this project was getting the reward function right. At the beginning I tried denser reward functions, which told the agent how to do the task (distance to the pad, orientation of the rocket, collision speed...), but these were very hard to use, as the agent maximized them in weird ways. For example, if you penalize the agent for colliding with the ground too fast, then it won't collide at all and thus gets no negative reward. In general, you want a reward function that tells the agent what to do, not how to do it, so that the agent is free to figure out on its own how to achieve what you want. The best reward function is thus one that is maximized if and only if the agent performs the task you want it to do.
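As an illustration of that difference (hypothetical quantities and weights, not the project's actual code), a shaped reward like the first function below is easy for the agent to exploit, while the sparse one only pays out for the behaviour we actually want:

```python
def shaped_reward(rocket):
    """Dense reward telling the agent *how* to land: easy to exploit.
    E.g. an agent penalized for hitting the ground fast can simply hover
    forever and never collide at all."""
    return (-0.01 * rocket.distance_to_pad      # get closer to the pad
            - 0.01 * rocket.tilt_angle          # stay upright
            - 1.0 * rocket.collision_speed)     # don't hit the ground hard

def sparse_reward(landed_upright_and_slow: bool) -> float:
    """Sparse reward telling the agent *what* to do: maximized if and only
    if the rocket actually lands upright at almost no speed."""
    return 1.0 if landed_upright_and_slow else 0.0
```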

Before testing curriculum learning and imitation learning as exploration strategies, I tried implementing an ICM (Intrinsic Curiosity Module), which gives the agent a kind of intrinsic curiosity reward that encourages it to explore the environment. However, ICM didn't show great results.
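Roughly, ICM gives the agent a bonus reward proportional to how badly a learned forward model predicts the next state; a minimal sketch of that intrinsic reward (a simplification that leaves out the feature encoder and inverse model of the original method) could look like this:

```python
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    """Predicts the next state from the current state and action."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def intrinsic_reward(model, state, action, next_state, scale=0.01):
    """Curiosity bonus: large when the forward model is surprised."""
    pred = model(state, action)
    return scale * (pred - next_state).pow(2).mean(dim=-1)
```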

Concerning hyperparameters, I used a three-layer neural network (2 hidden layers of 128 units), a learning rate of 3.0e-4 and a batch size of 64 (later increased to 128). For more information, please see the configuration files.
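For a sense of scale, a policy network matching those settings (two hidden layers of 128 units, Adam with a 3.0e-4 learning rate) can be sketched in PyTorch as below; the actual ML-Agents model differs in its details (it also has a value head, and the real settings live in the YAML configuration files), and the action count here is an assumption:

```python
import torch
import torch.nn as nn

obs_size, num_actions = 11, 4   # 4 actions is an assumption for this sketch

policy = nn.Sequential(
    nn.Linear(obs_size, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, num_actions),    # action logits / parameters
)
optimizer = torch.optim.Adam(policy.parameters(), lr=3.0e-4)

# This sketch alone has roughly 18k parameters; the full ML-Agents model,
# with its value head and other pieces, is larger (around the ~40k mentioned above).
```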

The algorithm was used "as is" and didn't get tweaked. Also, the hyperparameters were almost the default ones, and I didn't really have to modify them to get results in the first place. This is quite important, as it shows the ease of use of PPO. However, more hyperparameter tuning was (and still is) required to make the learning more stable.

The project is not finished, and never will be, haha. Although I'm very happy with the results I got, the training is only stable in a limited sense: it can be made more stable. A lot of curriculum steps are needed for the agent to learn, and I think that reducing this number of steps would give faster and smoother training. Also, from a simulation point of view, the agent can currently turn the main engine on and off an unlimited number of times (it can even fire the engine during one timestep and turn it off instantly on the next timestep); this needs to be addressed. Also, maybe consider reducing the mass of the vehicle as the fuel is used? And handle the fuel! It is considered infinite in the simulation for now. So many things to do! With a fellow redditor (from r/SpaceXLounge or r/reinforcementlearning), we thought it would be a good idea to reduce the probability of the engine actually firing when it is asked for a large number of times (i.e. make the engine fire with only, say, an 80% probability if the agent has already lit it once). Again, this could be introduced gently using curriculum learning.
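A hypothetical sketch of that engine-reliability idea (not implemented in the project):

```python
import random

def engine_fires(requested: bool, ignitions_so_far: int,
                 relight_probability: float = 0.8) -> bool:
    """Hypothetical engine reliability rule: the first ignition always works,
    but every relight after a shutdown only succeeds with some probability."""
    if not requested:
        return False
    if ignitions_so_far == 0:
        return True
    return random.random() < relight_probability
```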