bmielnicki commented 3 years ago

List of changes:

Orders are now a separate class from recipes, so they can carry information that does not make sense for recipes (expire time, expire penalty, more complex reward calculation, etc.). These changes allow orders that:
- expire with time
- have their own rewards (that can depend on time before expiring)
- disappear when fulfilled (or not, depends on the choice)
Orders are part of an OrdersList ("list" in the name refers to a list of orders in the everyday sense, not to the Python class, although it is currently implemented with a Python list).
- It is mostly a list of orders plus methods supporting operations on them. It can also add new orders every n timesteps.
- OrdersList is now part of overcooked_state and env. It replaces all_orders and bonus_orders with seemingly full backward compatibility (they are still accepted as input, and they are still available in their previous form as object properties). This makes some things a bit confusing (what the code previously called an order was just a Recipe, so the OvercookedState property "all_orders" now returns recipes), but I have not found a better solution.
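A minimal sketch of the structure described above, to make the moving parts concrete. It is not the actual code of this PR: the class bodies, the `step` method, and the rule of always instantiating the first entry of `orders_to_add` are assumptions for illustration. Only the field names that also appear in the layout format (`time_to_expire`, `expire_penalty`, `add_new_order_every`, `time_to_next_order`, `orders_to_add`) come from this PR.

```python
from dataclasses import dataclass, field, replace
from typing import List, Optional, Tuple


@dataclass
class Order:
    # Hypothetical stand-in for the order class; the real one holds a Recipe.
    ingredients: Tuple[str, ...]
    time_to_expire: Optional[int] = None  # None means the order never expires
    expire_penalty: int = 0
    is_bonus: bool = False

    @property
    def is_temporary(self) -> bool:
        return self.time_to_expire is not None

    @property
    def is_expired(self) -> bool:
        return self.is_temporary and self.time_to_expire <= 0


@dataclass
class OrdersList:
    orders: List[Order] = field(default_factory=list)
    orders_to_add: List[Order] = field(default_factory=list)
    add_new_order_every: Optional[int] = None
    time_to_next_order: Optional[int] = None

    def step(self) -> int:
        """Advance one timestep; return the (non-positive) penalty for expired orders."""
        penalty = 0
        remaining = []
        for order in self.orders:
            if order.is_temporary:
                order.time_to_expire -= 1
            if order.is_expired:
                penalty -= order.expire_penalty
            else:
                remaining.append(order)
        self.orders = remaining

        if self.add_new_order_every is not None and self.orders_to_add:
            self.time_to_next_order -= 1
            if self.time_to_next_order <= 0:
                # Assumption: always copy the first template order.
                self.orders.append(replace(self.orders_to_add[0]))
                self.time_to_next_order = self.add_new_order_every
        return penalty
```

Copying the template with `dataclasses.replace` keeps the entries of `orders_to_add` untouched, so the same template can be instantiated repeatedly.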

Most uses of orders inside OrdersList sort them by urgency (not perfect, but better than nothing):

def _order_urgency_sort_key(order, n_timesteps_into_future=0):
    # higher -> more urgent
    return (
        -order.will_be_expired_in(n_timesteps_into_future),
        order.is_temporary,
        -(order.time_to_expire or 0),
        order.is_bonus,
        order.calculate_reward() + order.expire_penalty,
    )

Now it is possible to add temporary orders in layouts using the new format.

# layout file
"orders_list": {
    "orders": [{"recipe": {"ingredients": ["onion", "onion", "onion"]}, ...}],
    "add_new_order_every": 10,
    "time_to_next_order": 10,
    "orders_to_add": [{"recipe": {"ingredients": ["onion", "tomato", "tomato"]}, "time_to_expire": 100, "expire_penalty": 20, "basic_reward": 25, "linear_time_bonus_reward": 1}], ...
}

A time_to_expire that is not None indicates that the order is temporary and will disappear after some time, giving a negative reward (expire_penalty).
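As a rough illustration of how the reward fields in the layout example might combine, the sketch below assumes the reward is `basic_reward` plus `linear_time_bonus_reward` times the remaining `time_to_expire`. That formula is an assumption made for this example (the description above only says rewards can depend on the time before expiring), and `order_reward` is a hypothetical helper, not a function from this PR.

```python
def order_reward(basic_reward, linear_time_bonus_reward=0, time_to_expire=None):
    """Reward for delivering an order now, under the assumed linear formula."""
    if time_to_expire is None:
        # Permanent order: no time bonus applies.
        return basic_reward
    return basic_reward + linear_time_bonus_reward * max(time_to_expire, 0)


# Order spec copied from the "orders_to_add" entry in the layout example.
spec = {
    "recipe": {"ingredients": ["onion", "tomato", "tomato"]},
    "time_to_expire": 100,
    "expire_penalty": 20,
    "basic_reward": 25,
    "linear_time_bonus_reward": 1,
}

reward_when_fresh = order_reward(
    spec["basic_reward"], spec["linear_time_bonus_reward"], spec["time_to_expire"]
)
reward_when_late = order_reward(
    spec["basic_reward"], spec["linear_time_bonus_reward"], 10
)
```

Under this assumed formula, delivering right away is worth 125 and delivering with 10 timesteps left is worth 35, while letting the order expire costs the `expire_penalty` of 20.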

Calculating possible rewards and potting adjectives in get_recipe_value changed a bit, and it is now difficult to make it perfect. Currently, the adjectives do not account for the possibility of new recipes appearing.
- Calculating the optimal recipe while accounting for new orders is quite complex: it would require enumerating all possible combinations of new orders from the start to the end of the episode and choosing the move with the highest expected value. In some settings (for example, when a temporary order's lifespan is too short to cook a soup from scratch once the order appears), optimal play can require considering the range of recipes that may appear in future temporary orders. A truly optimal recipe also has to consider where the players are and the distance between the pot and the delivery location, since that can make the difference between missing the order timer and a successful soup delivery.
- The viable and catastrophic adjectives are less problematic but still unclear: should all recipes that can appear as an order be considered, or only the ones that have already been ordered? (Cooking a recipe that has not yet appeared in an order ranges from optimal play to wasting time and soup on a negligible chance of some reward.)
- In most settings myopic play (fulfilling orders that are already in the orders list) is close to optimal, so reward shaping can be used here. Non-myopic play is difficult to get working, so early in training (where reward shaping shines) it is almost always wrong, given how bad the AI is at that stage.
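The myopic choice described in the last point can be sketched as follows. This is an illustration only, not the logic of get_recipe_value or get_optimal_possible_recipe: `pick_myopic_order`, the dict-based order representation, and the `timesteps_to_deliver` estimate are all hypothetical.

```python
def pick_myopic_order(orders, timesteps_to_deliver):
    """Return the most valuable current order that can still be delivered
    in time, or None if no order qualifies.

    orders: list of dicts with "reward" and "time_to_expire" (None = permanent).
    timesteps_to_deliver: assumed estimate of cooking plus walking time.
    """
    feasible = [
        o for o in orders
        if o["time_to_expire"] is None or o["time_to_expire"] >= timesteps_to_deliver
    ]
    # Only orders already in the list are considered; future orders are ignored.
    return max(feasible, key=lambda o: o["reward"], default=None)


current_orders = [
    {"name": "permanent", "reward": 20, "time_to_expire": None},
    {"name": "temporary", "reward": 45, "time_to_expire": 15},
    {"name": "nearly_gone", "reward": 60, "time_to_expire": 5},
]

choice = pick_myopic_order(current_orders, timesteps_to_deliver=10)
```

Here `choice` is the "temporary" order: the higher-valued "nearly_gone" order is skipped because it would expire before the soup could be delivered.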
"sparse_env_rewards" (rewards from the env; currently only penalties for expired orders fall into this category) and "sparse_rewards_sum" (the sum of sparse_reward_by_agent and sparse_env_rewards) were added to the overcooked env/gridworld in the same places where "sparse_reward_by_agent" already appears.
get_recipe_value and get_optimal_possible_recipe have some changes to work with orders that can expire.
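For the reward bookkeeping mentioned above, a minimal sketch of how the three quantities relate. The container shapes are assumptions for illustration: here `sparse_reward_by_agent` is a per-agent list and `sparse_env_rewards` is a list of env-level rewards for the timestep; the actual types in the env may differ.

```python
def sparse_rewards_sum(sparse_reward_by_agent, sparse_env_rewards):
    """Total sparse reward for one timestep (assumed: both inputs are lists)."""
    return sum(sparse_reward_by_agent) + sum(sparse_env_rewards)


# One timestep: agent 0 delivers a soup worth 20 while a temporary order
# expires with a penalty of 5.
step_info = {
    "sparse_reward_by_agent": [20, 0],
    "sparse_env_rewards": [-5],
}
step_info["sparse_rewards_sum"] = sparse_rewards_sum(
    step_info["sparse_reward_by_agent"], step_info["sparse_env_rewards"]
)
```

In this example `step_info["sparse_rewards_sum"]` is 15.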

As the next step, I will change overcooked-demo (and then the python state visualizations, as they are not merged yet) to visually represent temporary orders (e.g. show the time left before an order expires). It is better to wait with merging this PR into master until the change in overcooked-demo has been made and reviewed.

codecov[bot] commented 3 years ago

Codecov Report

Merging #57 (27d16be) into master (b0d6997) will increase coverage by 3.12%. The diff coverage is n/a.

@@            Coverage Diff             @@
##           master      #57      +/-   ##
==========================================
+ Coverage   80.76%   83.88%   +3.12%     
==========================================
  Files          10       10              
  Lines        3077     3313     +236     
==========================================
+ Hits         2485     2779     +294     
+ Misses        592      534      -58

Flag	Coverage Δ
no-planners	`83.88% <ø> (+3.12%)`	:arrow_up:


Impacted Files	Coverage Δ
overcooked_ai_py/mdp/layout_generator.py	`82.40% <0.00%> (+0.05%)`	:arrow_up:
overcooked_ai_py/mdp/overcooked_env.py	`69.62% <0.00%> (+0.57%)`	:arrow_up:
overcooked_ai_py/planning/planners.py	`86.19% <0.00%> (+0.70%)`	:arrow_up:
overcooked_ai_py/mdp/overcooked_mdp.py	`93.58% <0.00%> (+1.58%)`	:arrow_up:
overcooked_ai_py/agents/agent.py	`72.31% <0.00%> (+4.11%)`	:arrow_up:
overcooked_ai_py/agents/benchmarking.py	`65.19% <0.00%> (+11.68%)`	:arrow_up:


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update b0d6997...6b754fb.

bmielnicki commented 3 years ago

Added a pull request for overcooked-demo: https://github.com/HumanCompatibleAI/overcooked-demo/pull/23. Merge this PR only along with the changes in overcooked-demo (the overcooked-demo PR can be merged first, as it is compatible with current master).

bmielnicki commented 3 years ago

New changes:

GreedyHumanModel now works with temporary (and multiple) orders
- the older version of GreedyHumanModel is now SimpleGreedyHumanModel
Merged python state visualization into this branch; it now also works with dynamic orders
Sample Agent now also automatically sets the agent_index and MDP variables on the Agents it contains, and resets them when it is reset

EDIT: I'm turning off asserts in the visualization tests because, for some unknown reason, they pass locally but not on the server. The pixel color differences are very small, e.g. [204 154 52] instead of [204 153 51]. I suspect this is because I work on a different operating system that renders slightly differently: the previous visualization test cases were made on Ubuntu and passed on Ubuntu (and on OSX, excluding some similar issues with font rendering), while the current test cases were made on Manjaro.
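For reference, a tolerance-based comparison is one possible alternative to disabling the asserts. The sketch below is not something this PR implements: `images_close` and its `max_channel_diff` parameter are hypothetical, and images are represented as plain nested lists ([row][column][channel]) to keep the example dependency-free.

```python
def images_close(actual, expected, max_channel_diff=2):
    """Return True if two images have the same shape and every color channel
    differs by at most max_channel_diff."""
    if len(actual) != len(expected):
        return False
    for row_a, row_e in zip(actual, expected):
        if len(row_a) != len(row_e):
            return False
        for px_a, px_e in zip(row_a, row_e):
            if len(px_a) != len(px_e):
                return False
            if any(abs(a - e) > max_channel_diff for a, e in zip(px_a, px_e)):
                return False
    return True


# The one-pixel example from the comment above: off by one in two channels.
rendered = [[[204, 154, 52]]]
reference = [[[204, 153, 51]]]
small_difference_ok = images_close(rendered, reference)

# A clearly different color is still rejected.
large_difference_ok = images_close([[[204, 140, 51]]], reference)
```

With the default tolerance, `small_difference_ok` is True and `large_difference_ok` is False, so rendering noise of the size reported here would pass while real regressions would still fail.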

micahcarroll commented 2 years ago

Given that this is a lower-priority issue right now, I'll temporarily close this PR for bookkeeping; we can re-open it in the future if we are interested in exploring this direction further.

HumanCompatibleAI / overcooked_ai

Dynamic orders #57
