da-luce / cornell-autobike

Codebase of the Cornell Autonomous Bicycle Team
https://www.cuautobike.org/

Adding time dependence to reward function #14

Closed AriMirsky closed 1 year ago

AriMirsky commented 1 year ago

When we use the Q-learning algorithm on the physical bike, the environment will change over time. There are two ways to deal with this:

  1. While updating the values in the Q-matrix, always use the most recent reward function available. This makes the most intuitive sense, but it is not guaranteed to converge to a physically feasible path.
  2. Add time as another dimension of the state. Explicitly estimate the reward function at future time steps and use those estimates to fill in future values before we have real measurements for them. The drawback is that some of Q-learning's benefits come from states whose optimal actions lead to other states and eventually loop back to themselves; this approach eliminates all loops between states, because you can't travel backwards (or sideways) in time.

Because it is unclear (at least to me) which solution is more promising, it would be nice to be able to easily toggle between the two methods, roughly as sketched below.
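
A minimal sketch of what the two update rules might look like side by side. None of this code is from the repo; the array shapes, the `reward_fn` / `reward_estimate` callables, and the hyperparameters are all assumptions made for illustration.

```python
import numpy as np


def q_update_latest_reward(Q, state, action, next_state, reward_fn,
                           alpha=0.1, gamma=0.9):
    """Option 1: always query the most recent reward function.

    Q has its usual (n_states, n_actions) shape; only the reward sample
    changes as the environment evolves. (Hypothetical helper, not repo code.)
    """
    target = reward_fn(state, action) + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])
    return Q


def q_update_time_indexed(Q, t, state, action, next_state, reward_estimate,
                          alpha=0.1, gamma=0.9):
    """Option 2: time is part of the state, so Q has shape (T, n_states, n_actions).

    reward_estimate(t, s, a) is a forecast of the reward at time step t.
    Transitions only move forward in time, so no loops between
    (time, state) pairs are possible. (Hypothetical helper, not repo code.)
    """
    target = (reward_estimate(t, state, action)
              + gamma * np.max(Q[t + 1, next_state]))
    Q[t, state, action] += alpha * (target - Q[t, state, action])
    return Q
```

With a wrapper that dispatches to one of these two functions based on a config flag, swapping methods during testing would only require changing that flag rather than rewriting the update loop.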