Self-explanatory: attempt basic value iteration on the simulator. Failing that, we will re-evaluate whether heuristic state-space pruning or sample-based Q-learning is more prudent.
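For reference, a minimal sketch of tabular value iteration, assuming the simulator can be wrapped so that it enumerates states, legal actions, and `(probability, next_state, reward)` outcomes. The toy craft below (the `(durability, quality)` state, the `careful`/`risky` actions, and every number) is hypothetical and not taken from the simulator.

```python
from itertools import product


def value_iteration(states, actions, transitions, gamma=1.0, tol=1e-9, max_sweeps=10_000):
    """Tabular value iteration; returns (values, greedy policy)."""
    values = {s: 0.0 for s in states}
    policy = {}
    for _ in range(max_sweeps):
        delta = 0.0
        for s in states:
            legal = actions(s)
            if not legal:  # terminal state: value stays 0
                continue
            # Expected return of each legal action under the current estimates.
            q = {
                a: sum(p * (r + gamma * values[s2]) for p, s2, r in transitions(s, a))
                for a in legal
            }
            best = max(q, key=q.get)
            delta = max(delta, abs(q[best] - values[s]))
            values[s], policy[s] = q[best], best
        if delta < tol:
            break
    return values, policy


# Hypothetical toy craft: state = (durability, quality), both in 0..2.
MAX_QUALITY = 2
STATES = list(product(range(3), range(MAX_QUALITY + 1)))


def actions(state):
    durability, _ = state
    return ["careful", "risky"] if durability > 0 else []


def transitions(state, action):
    durability, quality = state

    def outcome(prob, gain):
        q2 = min(MAX_QUALITY, quality + gain)
        d2 = durability - 1
        # Reward 1 only when the craft ends at max quality ("HQ" in this toy).
        reward = 1.0 if d2 == 0 and q2 == MAX_QUALITY else 0.0
        return (prob, (d2, q2), reward)

    if action == "careful":
        return [outcome(1.0, 1)]  # guaranteed +1 quality
    return [outcome(0.5, 2), outcome(0.5, 0)]  # 50% chance of +2 quality


values, policy = value_iteration(STATES, actions, transitions)
print(policy[(2, 0)], values[(2, 0)])
```

In this toy the greedy policy plays `careful` from the start state and only gambles on `risky` when one step is left and no quality has been gained. The real state space will be far larger, which is where pruning or sampling would come in.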
This may be split into multiple issues down the line. It may be worth trying both structured RL and simpler real-number-based RL. For HQ/NQ this should be very similar, except for difficult recipes that can be failed (large negative penalty). For collectibles it may be more interesting to evaluate the pros and cons of how the tiers are rewarded: is simply maximizing quality good enough, or should we reward each tier separately, since a single point of quality can make a huge difference in reward?
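To make the collectible question concrete, here is a rough sketch of the two reward schemes side by side. The tier thresholds, tier rewards, and failure penalty are placeholder values chosen for illustration, not real recipe data, and the function names are hypothetical.

```python
from bisect import bisect_right

# Placeholder numbers for illustration only.
TIER_THRESHOLDS = [400, 700, 1000]   # minimum quality for tiers 1, 2, 3
TIER_REWARDS = [0.0, 1.0, 2.0, 4.0]  # reward for reaching tier 0, 1, 2, 3
FAILURE_PENALTY = -10.0              # large negative reward for a failed craft


def linear_quality_reward(quality, max_quality=1000):
    """Scheme A: reward is proportional to quality, capped at max_quality."""
    return min(quality, max_quality) / max_quality


def tier_reward(quality):
    """Scheme B: reward depends only on which tier the quality reaches."""
    return TIER_REWARDS[bisect_right(TIER_THRESHOLDS, quality)]


def terminal_reward(completed, quality, scheme):
    """Reward at the end of a craft; a failed craft always gets the penalty."""
    return scheme(quality) if completed else FAILURE_PENALTY


# One point of quality around a threshold: tiny change in A, a jump in B.
for q in (699, 700):
    print(q, linear_quality_reward(q), tier_reward(q))
```

Scheme A gives a smooth signal everywhere but barely distinguishes 699 from 700. Scheme B matches the payoff that actually matters but is flat between thresholds, so it gives no credit for partial progress toward the next tier.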
Minimum criteria for success:
[ ] It completes running (at this stage it is okay if it takes a long time)
[ ] The generated policy can consistently reach HQ/tier 3 on recipes that should be easy
[ ] Policy rollout roughly matches tests in a simulator
[ ] Simulator must be able to take a problem description as input and, at each "step", request feedback on the current condition, the success of random abilities, etc.
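One possible shape for the feedback-driven rollout in the last item, as a sketch only. `Problem`, `CraftState`, `Feedback`, `ScriptedFeedback`, the `touch`/`synthesize` actions, and all numeric effects are hypothetical names and values for illustration, not the simulator's actual API or game data.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol


@dataclass(frozen=True)
class Problem:  # hypothetical problem description
    max_progress: int
    max_quality: int
    durability: int


@dataclass
class CraftState:
    progress: int = 0
    quality: int = 0
    durability: int = 0


class Feedback(Protocol):
    """Answers the questions the rollout asks at each step."""

    def condition(self, step: int) -> str: ...
    def succeeded(self, step: int, action: str) -> bool: ...


class ScriptedFeedback:
    """Replays fixed answers, e.g. ones recorded from a manual test."""

    def __init__(self, conditions: List[str], successes: List[bool]):
        self.conditions, self.successes = conditions, successes

    def condition(self, step: int) -> str:
        return self.conditions[step]

    def succeeded(self, step: int, action: str) -> bool:
        return self.successes[step]


def rollout(problem: Problem, policy: Callable[[CraftState, str], str],
            feedback: Feedback) -> CraftState:
    state = CraftState(durability=problem.durability)
    step = 0
    while state.durability > 0 and state.progress < problem.max_progress:
        cond = feedback.condition(step)       # ask for the current condition
        action = policy(state, cond)
        ok = feedback.succeeded(step, action)  # ask whether the ability landed
        state.durability -= 10                 # placeholder cost per action
        if ok and action == "synthesize":
            state.progress = min(problem.max_progress, state.progress + 50)
        elif ok and action == "touch":
            gain = 150 if cond == "good" else 100  # placeholder quality gains
            state.quality = min(problem.max_quality, state.quality + gain)
        step += 1
    return state


problem = Problem(max_progress=100, max_quality=300, durability=40)
feedback = ScriptedFeedback(["normal", "good", "normal", "normal"], [True] * 4)
final = rollout(problem, lambda s, c: "touch" if s.durability > 20 else "synthesize", feedback)
print(final)
```

Keeping the feedback source behind a small interface means the same rollout could be driven by a random sampler, a recorded script, or a person typing in what happened in game.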
Non-issues for the moment:
The policy need not look like a human policy
Policy is allowed to make strange choices (e.g. choosing a higher CP cost ability over a lower one because you have enough CP to use either comfortably)
DO NOT MAKE THE OUTPUT TOO FANCY. DO NOT INVEST IN SOME FANCY TUI THING OR SOMETHING, I KNOW YOU'LL WANT TO. We can focus on good input once we finish milestone 2.