Actions could be any movement of the sub. Turning on each individual motor could count as a separate action, or we could treat the sub as a whole and define movements such as left, right, and up.
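A minimal sketch of the two options, to make the trade-off concrete. The motor names and the mapping from each movement to the motors it switches on are assumptions for illustration, not the sub's real thruster layout.

```python
from enum import Enum

# Hypothetical motor names; the real sub's thruster layout may differ.
MOTORS = ("left_thruster", "right_thruster", "vertical_thruster")


class Move(Enum):
    """High-level, whole-sub movements."""
    LEFT = 0
    RIGHT = 1
    UP = 2


# Assumed mapping from each movement to the motors switched on.
# Turning left by running only the right thruster is a guess at the geometry.
MOVE_TO_MOTORS = {
    Move.LEFT: {"right_thruster"},
    Move.RIGHT: {"left_thruster"},
    Move.UP: {"vertical_thruster"},
}


def motor_command(move):
    """Return an on/off flag per motor for the given high-level movement."""
    active = MOVE_TO_MOTORS[move]
    return {name: name in active for name in MOTORS}
```

With per-motor actions the agent would choose the on/off flags directly; with whole-sub actions it chooses a `Move` and `motor_command` fills in the flags, which keeps the action space at three choices.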
States could be either X, Y, Z coordinates or some measure of the sub's orientation relative to the finishing cones.
Rewards could be based on time: the less time the sub takes to find the finish, the greater the reward. Runs would also have a time limit. If distance from the cones is available as a variable, it would also factor into the reward.
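One possible shape for such a reward, as a rough sketch. The time limit, the distance weight, and the specific reward values are all placeholder assumptions that would need tuning; positions are assumed to be (x, y, z) tuples.

```python
import math

TIME_LIMIT_S = 60.0      # assumed time limit per run, in seconds
DISTANCE_WEIGHT = 0.01   # assumed weight on the distance penalty


def reward(elapsed_s, reached_finish, position=None, cones=None):
    """Reward for one step of a run.

    position and cones are optional (x, y, z) tuples; the distance term
    is only applied when both are available.
    """
    if reached_finish:
        # Faster finishes earn more: 2.0 at t=0, down to 1.0 at the limit.
        return 1.0 + max(0.0, TIME_LIMIT_S - elapsed_s) / TIME_LIMIT_S
    if elapsed_s >= TIME_LIMIT_S:
        # The run timed out without finding the finish.
        return -1.0
    if position is not None and cones is not None:
        # Small penalty that shrinks as the sub approaches the cones.
        return -DISTANCE_WEIGHT * math.dist(position, cones)
    return 0.0
```

For example, `reward(15.0, True)` returns 1.75 and `reward(30.0, True)` returns 1.5, so a quicker finish scores higher, while `reward(60.0, False)` returns -1.0 for a timed-out run.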
Create Python files that send ROS messages to control the model via its plugin.
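A rough sketch of what one such file might look like, assuming ROS 1 with `rospy` and assuming the plugin accepts `geometry_msgs/Twist` messages. The topic name `/sub/cmd_vel`, the node name, and the velocity values are hypothetical; the plugin's actual topic and message type need to be checked first.

```python
# Assumed (linear_x, linear_z, angular_z) velocities for each movement.
# The values and axis conventions are placeholders.
MOVE_VELOCITIES = {
    "left": (0.0, 0.0, 0.5),
    "right": (0.0, 0.0, -0.5),
    "up": (0.0, 0.5, 0.0),
}

CMD_TOPIC = "/sub/cmd_vel"  # hypothetical; use the topic the plugin subscribes to


def velocity_for(move):
    """Look up the (linear_x, linear_z, angular_z) tuple for a movement name."""
    return MOVE_VELOCITIES[move]


def main(move="up"):
    """Publish a single movement command (requires a running ROS 1 master)."""
    # Imported here so the rest of the file loads on machines without ROS.
    import rospy
    from geometry_msgs.msg import Twist

    rospy.init_node("sub_controller")
    pub = rospy.Publisher(CMD_TOPIC, Twist, queue_size=1)
    rospy.sleep(1.0)  # give the publisher time to connect to subscribers

    msg = Twist()
    msg.linear.x, msg.linear.z, msg.angular.z = velocity_for(move)
    pub.publish(msg)
```

Calling `main("left")` from a machine with ROS set up would send one command; keeping the movement table separate from the ROS calls means the mapping can be checked without a simulator running.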