apprenticelearner / AL_Train

A repository for the CTAT HTML based training harness for Apprentice Learner agents.
MIT License

Need to revise definitions/implementation of examples_only, test_mode, etc. #12

Open DannyWeitekamp opened 3 years ago

DannyWeitekamp commented 3 years ago

There are various situations where we don't want AL to go through a full feedback loop because we don't want AL to produce actions, receive feedback, etc.

From the perspective of AL_Train, typically:
1(A). An act() request is sent to AL for an action <- (i.e. AL needs to self-explain the next step)
2(F). AL's next action(s) is received and feedback is sent back to AL
or 3(D). AL has no next actions and an example is sent back to AL

Right now:
-examples_only: causes A. to happen, and D. to happen regardless of AL's response
-test_mode: causes A. to happen, but not F. or D.

But we would also like a way for D. to happen without A.
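To make the gap concrete, here is a minimal sketch of how the two existing flags map onto the steps A, F and D as described above. The function name `steps_from_flags` is hypothetical and this is not AL_Train's actual code; treating both flags set at once as an error is also an assumption, since the issue does not say how that combination behaves.

```python
def steps_from_flags(examples_only: bool = False, test_mode: bool = False) -> frozenset:
    """Return which of the steps A, F, D occur for a flag combination.

    Hypothetical helper that mirrors the behaviour described in this issue.
    """
    if examples_only and test_mode:
        # Assumption: the issue does not define this combination.
        raise ValueError("examples_only and test_mode are treated as conflicting here")
    if test_mode:
        return frozenset("A")    # A happens, but not F or D
    if examples_only:
        return frozenset("AD")   # A happens, then D regardless of AL's response
    return frozenset("AFD")      # default: full feedback loop


# Every step set reachable with the current flags.
reachable = {
    steps_from_flags(),
    steps_from_flags(examples_only=True),
    steps_from_flags(test_mode=True),
}
```

Under these assumptions `frozenset("D")` is not in `reachable`, which is the missing "D without A" case.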

These cases (at least) should be possible; let's call them "feedback_modes". Potentially the user could just choose among these mutually exclusive options instead of setting flags, which would prevent the three illegal ones:
-full/default: A, F, D <- Normal ITS training loop
-nohints: A, F, _ <- Warning: Infinite Loop (Really only works if AL has finite action space + tries random things)
-predictobserve: A, _, D <- Demonstrations are always given
-observeonly: _, _, D <- Demonstrations are always given
-test: A, _, _ <- Moves to next item on first incorrect
-stepwisetest: A, _, _ <- Moves to next step (without sending demonstration) on incorrect

(_, _, _), (_, F, D), (_, F, _) are impossible
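One way the proposed modes could be represented is a lookup table from mode name to the steps it enables. This is only a sketch of the proposal, not an existing AL_Train API: `Steps`, `FEEDBACK_MODES`, `is_legal` and `resolve_mode` are hypothetical names. Note that test and stepwisetest enable the same steps and differ only in what happens after an incorrect action, so the steps alone do not distinguish them.

```python
from itertools import product
from typing import NamedTuple


class Steps(NamedTuple):
    act: bool          # A: an act() request is sent to AL
    feedback: bool     # F: feedback is sent back on AL's action(s)
    demonstrate: bool  # D: an example is sent back to AL


# Hypothetical table of the mutually exclusive modes proposed above.
FEEDBACK_MODES = {
    "full":           Steps(True,  True,  True),
    "nohints":        Steps(True,  True,  False),
    "predictobserve": Steps(True,  False, True),
    "observeonly":    Steps(False, False, True),
    "test":           Steps(True,  False, False),
    "stepwisetest":   Steps(True,  False, False),
}


def is_legal(steps: Steps) -> bool:
    """Feedback needs an action to respond to, and something must happen."""
    return steps.act or (steps.demonstrate and not steps.feedback)


def resolve_mode(name: str) -> Steps:
    """Look up a mode by name, rejecting anything outside the table."""
    try:
        return FEEDBACK_MODES[name]
    except KeyError:
        raise ValueError(f"unknown feedback mode: {name!r}") from None


# Enumerate all 8 combinations and collect the ones that are not legal.
illegal = [s for s in map(Steps._make, product([False, True], repeat=3))
           if not is_legal(s)]
```

Because the user picks a single name rather than three independent flags, the combinations collected in `illegal` cannot be requested at all.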

There is an added complexity if we incorporate other levels of hint beyond bottom-out. Additionally, nohints would probably require some kind of empty hint response to be given to AL that would prompt the agent to guess.

eharpste commented 3 years ago

I like this idea as a solution to specifying feedback modes.

Not sure this goes here, but another thing to keep in mind: whatever this mode system ends up being, it should show up in the logs so that people can use it for pre/post-test analysis and other things.
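As a rough illustration of that suggestion, the mode could be written as a column on every logged row. The column names and the `write_log` helper below are made up for the example and are not AL_Train's actual log format.

```python
import csv
import io


def write_log(rows, feedback_mode, out):
    """Write rows as CSV, stamping each one with the active feedback mode.

    Hypothetical helper; the column names are placeholders.
    """
    fieldnames = ["feedback_mode", "problem", "selection", "outcome"]
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow({"feedback_mode": feedback_mode, **row})


buf = io.StringIO()
write_log(
    [{"problem": "p1", "selection": "out1", "outcome": "CORRECT"}],
    "test",
    buf,
)
log_lines = buf.getvalue().splitlines()
```

With the mode on each row, an analysis script can filter or group by `feedback_mode`, for instance to separate rows logged under test from training rows.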