abhisheknaik96 / differential-value-iteration

Experiments in creating the ultimate average-reward planning algorithm
Apache License 2.0

Adds quick port of MDVI Control 2. #38

Closed: btanner closed this PR 3 years ago

btanner commented 3 years ago

Tests are updated to run with MDVI Control 2.

Control 1 and 2 converge to something on all of the same problems with the same hyperparameters.

@abhisheknaik96 @yiwan-rl

However, in policy_test.py, MDVI Control 1, RVI, and DVI converge to the SAME policies on MDP1, MDP2, and GARET 1/2/3.

MDVI Control 2 DOES NOT converge to the same policy as MDVI Control 1 on 2 of the GARET tasks.

I have not dug into this at all yet. Just trying to get some progress updates to you folks at the end of my day!

btanner commented 3 years ago

This updated version fixes the bug with the previous test. MDVI Control 1 and 2 now reach the same policies on all our test problems.

btanner commented 3 years ago

@yiwan-rl This is an example of what I mean by vectorizing. I really like MDVI Control 2 :)
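For readers outside the thread, here is a minimal sketch of the kind of vectorization being discussed: replacing a per-state planning loop with a whole-sweep NumPy update. The MDP, the RVI-style reference-state offset, and all names here are illustrative assumptions, not the repository's actual MDVI Control 2 code.

```python
import numpy as np

# Hypothetical small MDP: transition tensor P[a, s, s'] (rows sum to 1)
# and rewards r[a, s]. Purely illustrative, not the repo's test problems.
rng = np.random.default_rng(0)
num_states, num_actions = 4, 2
P = rng.dirichlet(np.ones(num_states), size=(num_actions, num_states))
r = rng.standard_normal((num_actions, num_states))

def sweep_loop(v):
    """Per-state loop: one greedy backup at a time."""
    v_new = np.empty_like(v)
    for s in range(num_states):
        q = [r[a, s] + P[a, s] @ v for a in range(num_actions)]
        v_new[s] = max(q)
    return v_new - v_new[0]  # subtract a reference state, RVI-style

def sweep_vectorized(v):
    """Same update for every state at once: P @ v broadcasts over actions."""
    q = r + P @ v            # shape (num_actions, num_states)
    v_new = q.max(axis=0)    # greedy over actions for all states in one step
    return v_new - v_new[0]

v = np.zeros(num_states)
assert np.allclose(sweep_loop(v), sweep_vectorized(v))
```

The vectorized sweep computes the identical update; it just pushes the state loop into a single batched matmul and max, which is both shorter and much faster in NumPy.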

btanner commented 3 years ago

@abhisheknaik96 @yiwan-rl This PR has been updated to include vectorized sync/async MDVI Control 2, and it passes all tests (including a new async policy check vs RVI and DVI).
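As a hedged sketch of what a cross-planner policy check like this can look like (names and the toy MDP below are illustrative, not the repo's policy_test.py): extract each planner's greedy policy from its value estimate and require state-by-state agreement. Because transition rows sum to 1, shifting all values by a constant cannot change the greedy policy, which is why planners that agree only up to an offset (as average-reward methods like RVI and DVI typically do) can still be compared this way.

```python
import numpy as np

# Illustrative MDP: P[a, s, s'] with stochastic rows, rewards r[a, s].
rng = np.random.default_rng(1)
num_states, num_actions = 4, 2
P = rng.dirichlet(np.ones(num_states), size=(num_actions, num_states))
r = rng.standard_normal((num_actions, num_states))

def greedy_policy(v):
    """Greedy action per state: argmax over the action axis of q(a, s)."""
    return (r + P @ v).argmax(axis=0)

# Value estimates that differ only by a constant offset induce the same
# greedy policy: P @ (v + c) = P @ v + c, so q shifts uniformly per state.
v = rng.standard_normal(num_states)
assert np.array_equal(greedy_policy(v), greedy_policy(v + 3.0))
```

In a test, one would run each planner to convergence and assert `np.array_equal` on the resulting policy arrays rather than on the value vectors themselves.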