HumanCompatibleAI / overcooked_ai

A benchmark environment for fully cooperative human-AI performance.
https://arxiv.org/abs/1910.05789
MIT License
685 stars 145 forks

Dynamic orders #57

Closed bmielnicki closed 2 years ago

bmielnicki commented 3 years ago

List of changes:

  1. Orders are now a class separate from recipes, so they can carry information that does not make sense for recipes (expire time, expire penalty, more complex reward calculation, etc.). This allows orders that:

    • expire with time
    • have their own rewards (that can depend on time before expiring)
    • disappear when fulfilled (or not, depends on the choice)
  2. Orders are held in an OrdersList ("list" in the name refers to the concept of a list, not the Python class, although it is currently implemented with a Python list).

    • It is mostly a list of orders plus methods that support operations on them. It can also add new orders every n timesteps.
    • OrdersList is now part of overcooked_state and env. It replaces all_orders and bonus_orders with seemingly full backward compatibility (both are still accepted as input and remain available in their previous form as object properties). This makes some things a bit confusing: what the code previously called an order was just a Recipe, so the OvercookedState property "all_orders" now returns recipes. I have not found a better solution.
  3. Calculating the possible rewards and potting adjectives in get_recipe_value has changed a bit and is now difficult to get exactly right. Currently, the adjectives do not account for the possibility of new recipes appearing.

    • Calculating the optimal recipe while accounting for new orders is quite complex: it would require enumerating all possible combinations of new orders from the start to the end of the episode and choosing the move with the highest expected value. In some settings (for example, when a temporary order's lifespan is too short to cook a soup from scratch after the order appears), optimal play can require considering the range of recipes that may appear in future temporary orders. A truly optimal recipe calculation also has to consider where the players are and the distance between the pot and the delivery location, which can make the difference between missing the order timer and a successful soup delivery.
    • The viable and catastrophic adjectives are less problematic but still unclear: should all recipes that can appear as an order be considered, or only those that have already been ordered? Cooking a recipe that has not yet appeared in an order ranges from optimal play to wasting time and a soup on a negligible chance of some reward.
    • In most settings myopic play (fulfilling orders that are already on the orders list) is close to optimal, so reward shaping can be used here. Non-myopic play is hard to get working, so early in training (where reward shaping shines) it is almost always wrong, considering how bad the AI is at that stage.
  4. The overcooked env/gridworld now reports "sparse_env_rewards" (rewards coming from the env; currently only punishments for expired recipes fall into this category) and "sparse_rewards_sum" (the sum of sparse_reward_by_agent and sparse_env_rewards) in the same places where "sparse_reward_by_agent" appears.

  5. get_recipe_value and get_optimal_possible_recipe have some changes to work with orders that can expire.
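
To make the list above more concrete, here is a minimal sketch of how expiring orders and the combined sparse reward could fit together. It is not the code from this PR: the field names (base_reward, time_bonus, remove_when_fulfilled, orders_to_add), the step and fulfill methods, and the way sparse_rewards_sum combines its inputs are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Order:
    """Hypothetical order: a recipe plus order-only data (not the PR's class)."""
    recipe: Tuple[str, ...]
    base_reward: int = 20
    time_to_expire: Optional[int] = None  # None means the order never expires
    expire_penalty: int = 0
    time_bonus: int = 0                   # assumed: extra reward per timestep left
    remove_when_fulfilled: bool = True

    def reward(self) -> int:
        # Assumed rule: the reward grows with the time left before expiring.
        time_left = self.time_to_expire or 0
        return self.base_reward + self.time_bonus * time_left


class OrdersList:
    """Hypothetical container: active orders plus a queue of orders to add later."""

    def __init__(self, orders: Sequence[Order],
                 add_new_order_every: Optional[int] = None,
                 orders_to_add: Sequence[Order] = ()):
        self.orders: List[Order] = list(orders)
        self.add_new_order_every = add_new_order_every
        self.orders_to_add: List[Order] = list(orders_to_add)

    def step(self, timestep: int) -> int:
        """Advance one timestep and return the env reward (penalties are negative)."""
        env_reward = 0
        remaining = []
        for order in self.orders:
            if order.time_to_expire is not None:
                order.time_to_expire -= 1
                if order.time_to_expire <= 0:
                    env_reward -= order.expire_penalty  # expired: drop and punish
                    continue
            remaining.append(order)
        self.orders = remaining
        # Add one queued order every n timesteps, if any are left.
        n = self.add_new_order_every
        if n and timestep % n == 0 and self.orders_to_add:
            self.orders.append(self.orders_to_add.pop(0))
        return env_reward

    def fulfill(self, recipe: Sequence[str]) -> int:
        """Return the reward of the first matching order, or 0 if none matches."""
        for i, order in enumerate(self.orders):
            if sorted(order.recipe) == sorted(recipe):
                reward = order.reward()
                if order.remove_when_fulfilled:
                    del self.orders[i]
                return reward
        return 0


def sparse_rewards_sum(sparse_reward_by_agent: Sequence[int],
                       sparse_env_rewards: int) -> int:
    # Assumed combination: total agent reward plus the env reward.
    return sum(sparse_reward_by_agent) + sparse_env_rewards


orders = OrdersList(
    [Order(("onion", "onion", "onion"), time_to_expire=2, expire_penalty=5)],
    add_new_order_every=3,
    orders_to_add=[Order(("tomato", "tomato", "tomato"), base_reward=15)],
)
for t in range(1, 4):
    env_reward = orders.step(t)
    print(t, env_reward, sparse_rewards_sum([0, 0], env_reward), len(orders.orders))
```

In this sketch the onion order expires on the second step and produces a negative env reward, and a queued tomato order is added on the third step. A myopic agent, in the sense used above, would only ever look at orders.orders and ignore orders_to_add.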

As a next step, I will update overcooked-demo (and then the Python state visualizations, as they are not merged yet) to show temporary orders visually (e.g. add the time remaining before each order expires). It is better to hold off merging this PR into master until the overcooked-demo change has been made and reviewed.

codecov[bot] commented 3 years ago

Codecov Report

Merging #57 (27d16be) into master (b0d6997) will increase coverage by 3.12%. The diff coverage is n/a.

@@            Coverage Diff             @@
##           master      #57      +/-   ##
==========================================
+ Coverage   80.76%   83.88%   +3.12%     
==========================================
  Files          10       10              
  Lines        3077     3313     +236     
==========================================
+ Hits         2485     2779     +294     
+ Misses        592      534      -58     
| Flag | Coverage Δ |
| no-planners | 83.88% <ø> (+3.12%) |

| Impacted Files | Coverage Δ |
| overcooked_ai_py/mdp/layout_generator.py | 82.40% <0.00%> (+0.05%) |
| overcooked_ai_py/mdp/overcooked_env.py | 69.62% <0.00%> (+0.57%) |
| overcooked_ai_py/planning/planners.py | 86.19% <0.00%> (+0.70%) |
| overcooked_ai_py/mdp/overcooked_mdp.py | 93.58% <0.00%> (+1.58%) |
| overcooked_ai_py/agents/agent.py | 72.31% <0.00%> (+4.11%) |
| overcooked_ai_py/agents/benchmarking.py | 65.19% <0.00%> (+11.68%) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update b0d6997...6b754fb.

bmielnicki commented 3 years ago

Added a pull request for overcooked-demo: https://github.com/HumanCompatibleAI/overcooked-demo/pull/23. Merge this PR only together with the overcooked-demo changes (the overcooked-demo PR can be merged first, as it is compatible with current master).

bmielnicki commented 3 years ago

New changes:

micahcarroll commented 2 years ago

Given that this is a lower-priority issue right now, I'll temporarily close this PR for bookkeeping – we can re-open it in the future if we are interested in exploring this direction further.