Implement basic RL for simulator

Self-explanatory: attempt basic value iteration on the simulator. If that fails, we will re-evaluate whether heuristic state-space pruning or sample-based Q-learning is the more prudent approach.
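
For reference, a minimal sketch of tabular value iteration, assuming the simulator can enumerate its states and expose transition probabilities. The toy "craft" here (two steps, a `careful` action that always adds 1 quality, a `risky` action that adds 2 with probability 0.6) is a made-up placeholder and not the real game model; every name in it is hypothetical.

```python
from typing import Callable, Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable
# transitions(state, action) -> list of (probability, next_state, reward)
Transitions = Callable[[State, Action], List[Tuple[float, State, float]]]


def q_value(values: Dict[State, float], transitions: Transitions,
            state: State, action: Action, gamma: float) -> float:
    """Expected return of taking `action` in `state`, then following `values`."""
    return sum(p * (r + gamma * values[nxt])
               for p, nxt, r in transitions(state, action))


def value_iteration(states: List[State],
                    actions: Callable[[State], List[Action]],
                    transitions: Transitions,
                    gamma: float = 1.0,
                    tol: float = 1e-9,
                    max_sweeps: int = 10_000):
    """Sweep Bellman optimality backups until the largest change is below `tol`."""
    values = {s: 0.0 for s in states}
    for _ in range(max_sweeps):
        delta = 0.0
        for s in states:
            acts = actions(s)
            if not acts:  # terminal state: value stays 0
                continue
            best = max(q_value(values, transitions, s, a, gamma) for a in acts)
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = {s: max(actions(s),
                     key=lambda a: q_value(values, transitions, s, a, gamma))
              for s in states if actions(s)}
    return values, policy


# --- Toy problem (placeholder, NOT the real crafting rules) ---
MAX_QUALITY = 4
STATES = [(steps, q) for steps in range(3) for q in range(MAX_QUALITY + 1)]


def toy_actions(state):
    steps_left, _quality = state
    return [] if steps_left == 0 else ["careful", "risky"]


def toy_transitions(state, action):
    steps_left, quality = state

    def outcome(prob, gain):
        new_q = min(quality + gain, MAX_QUALITY)
        # Reward is paid only on the final step and equals the final quality.
        reward = float(new_q) if steps_left == 1 else 0.0
        return (prob, (steps_left - 1, new_q), reward)

    if action == "careful":
        return [outcome(1.0, 1)]
    return [outcome(0.6, 2), outcome(0.4, 0)]


values, policy = value_iteration(STATES, toy_actions, toy_transitions)
```

Exact sweeps like this only work while the state space is small enough to enumerate, which is the point at which pruning or sample-based Q-learning would come in.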

This may be split into multiple issues down the line. It may be worth trying both structured RL and simpler real-number-based RL. For HQ/NQ this should be very similar, except for difficult recipes that can be failed (large negative penalty). For collectibles it may be more interesting to evaluate the pros and cons of how the tiers are rewarded: is simply maximizing quality good enough, or should we reward each tier separately, since a single point of quality can make a huge difference in reward?
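
To make the two reward options concrete, here is a rough sketch. The thresholds, tier values, and failure penalty are invented placeholders, not real recipe data, and the function names are hypothetical.

```python
from bisect import bisect_right
from typing import Sequence


def quality_reward(quality: int, max_quality: int) -> float:
    """Option A: reward is just normalized quality (smooth, no tier jumps)."""
    return quality / max_quality


def tier_reward(quality: int,
                thresholds: Sequence[int],
                tier_values: Sequence[float]) -> float:
    """Option B: reward depends only on which tier the quality lands in.

    `thresholds` must be sorted ascending; `tier_values` needs one more entry
    than `thresholds` (index 0 is "below the first tier").
    """
    tier = bisect_right(thresholds, quality)
    return tier_values[tier]


def terminal_reward(quality: int, failed: bool,
                    thresholds: Sequence[int],
                    tier_values: Sequence[float],
                    failure_penalty: float = -10.0) -> float:
    """Failed crafts get a large negative penalty, otherwise the tier reward."""
    if failed:
        return failure_penalty
    return tier_reward(quality, thresholds, tier_values)


# Placeholder numbers, for illustration only.
THRESHOLDS = [400, 600, 800]
TIER_VALUES = [0.0, 1.0, 2.0, 4.0]
```

With these placeholder numbers, going from 599 to 600 quality changes the tier reward from 1.0 to 2.0, while the normalized quality reward barely moves. That gap is the trade-off to evaluate.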

Minimum criteria for success:

[ ] It completes running (at this stage it is okay if it takes a long time)
[ ] The policy generated can consistently HQ/tier 3 recipes that should be easy
[ ] Policy rollout roughly matches the results of tests in the simulator
[ ] Simulator must be able to take a problem description as input, and at each "step" it will request feedback on the current condition, the success of random abilities, etc.
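
One possible shape for the last criterion, sketched under heavy assumptions: the simulator takes a problem description and asks a feedback object, at every step, for the current condition and for whether a chance-based ability succeeded. `Problem`, `Feedback`, the `ACTIONS` table, and the condition names are all hypothetical stand-ins, and the toy rules ignore most of the real mechanics.

```python
import random
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass(frozen=True)
class Problem:
    """Hypothetical problem description handed to the simulator."""
    max_steps: int
    max_quality: int


class Feedback(Protocol):
    """What the simulator asks for at each step."""
    def condition(self, step: int) -> str: ...
    def succeeds(self, action: str, chance: float) -> bool: ...


class RandomFeedback:
    """Samples conditions and successes from a seeded RNG (policy rollouts)."""
    def __init__(self, seed: int) -> None:
        self.rng = random.Random(seed)

    def condition(self, step: int) -> str:
        return self.rng.choice(["normal", "good"])

    def succeeds(self, action: str, chance: float) -> bool:
        return self.rng.random() < chance


class ScriptedFeedback:
    """Deterministic feedback, e.g. for replaying a test from a real craft."""
    def __init__(self, always_succeed: bool) -> None:
        self.always_succeed = always_succeed

    def condition(self, step: int) -> str:
        return "normal"

    def succeeds(self, action: str, chance: float) -> bool:
        return self.always_succeed


# Toy action table: name -> (quality gain, success chance). Placeholder values.
ACTIONS = {"careful": (1, 1.0), "risky": (2, 0.6)}

# policy(steps_left, quality, condition) -> action name
Policy = Callable[[int, int, str], str]


def simulate(problem: Problem, policy: Policy, feedback: Feedback) -> int:
    """Run one craft and return the final quality."""
    quality = 0
    for step in range(problem.max_steps):
        condition = feedback.condition(step)  # observed, unused by toy rules
        action = policy(problem.max_steps - step, quality, condition)
        gain, chance = ACTIONS[action]
        if feedback.succeeds(action, chance):
            quality = min(quality + gain, problem.max_quality)
    return quality
```

Swapping `RandomFeedback` for `ScriptedFeedback` is what would let a policy rollout be compared against a recorded test run step by step.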

Non-issues for the moment:

The policy need not look like a human policy
Policy is allowed to make strange choices (e.g. choosing a higher CP cost ability over a lower one because you have enough CP to use either comfortably)
DO NOT MAKE THE OUTPUT TOO FANCY. DO NOT INVEST IN SOME FANCY TUI THING OR SOMETHING, I KNOW YOU'LL WANT TO. We can focus on good input once we finish milestone 2.

ZoopOTheGoop / ffxiv-crafting-solver #5