Hello, @SaifAlDilaimi,
I am glad the repository has been helpful to you. Because the successor target is bootstrapped during the update, and because the policy used to update the SR is random, you should expect the error to increase as values continue to propagate throughout the environment. I believe that if you ran the algorithm long enough, with a sufficiently small learning rate and a deterministic policy, the TD-error should eventually converge to zero.
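For reference, here is a minimal sketch of the tabular SR TD update being described, showing where the bootstrapped successor target enters. The names (`M`, `n_states`, `alpha`, `gamma`) are illustrative and are not taken from the repository:

```python
import numpy as np

n_states, gamma, alpha = 25, 0.95, 0.1   # illustrative gridworld size and hyperparameters
M = np.zeros((n_states, n_states))       # tabular successor representation

def sr_td_update(s, s_next, done):
    """One TD update of the SR row for state s on the transition s -> s_next."""
    one_hot = np.eye(n_states)[s]
    # Bootstrapped successor target: current occupancy plus the discounted
    # successor features of the next state (dropped at terminal states).
    target = one_hot + (0.0 if done else gamma) * M[s_next]
    td_error = target - M[s]
    M[s] += alpha * td_error
    # Under a random policy this error can grow for a while, since credit
    # keeps propagating through the state space as M fills in.
    return np.mean(np.abs(td_error))
```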
Hey @awjuliani, thank you very much for your answer. It helped me a lot!
However, I have another question. I'm trying to implement an aiming task where the agent should always know where the goal is. I was wondering whether that is somehow beneficial and whether you have an idea how to implement it. Would it be enough to represent the goal as a vector and multiply it with the SR matrix when sampling an action?
You definitely have the right intuition. One thing to ensure is that whatever you use as your goal vector allows the reward/value function to be a linear function of the successor matrix and the goal vector. This is typically accomplished either by learning the goal vector online or by hand-crafting a goal vector that satisfies that criterion.
Oh, that was fast! Am I right that a "hand-crafted" goal vector would be the state representation of the goal's position in the environment? At least in your example?
That is correct.
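To make that concrete, here is a minimal sketch, assuming a tabular gridworld with a learned SR matrix `M` and a one-hot goal vector at the goal's state index. The names (`n_states`, `goal_state`, `next_state_fn`) are illustrative, and `next_state_fn` is a hypothetical deterministic transition model rather than anything from the repository:

```python
import numpy as np

n_states, goal_state, n_actions = 25, 24, 4   # illustrative gridworld setup
M = np.random.rand(n_states, n_states)        # stand-in for a learned SR matrix

# Hand-crafted goal (reward) vector: one-hot at the goal's state index,
# so the value function is linear in the SR matrix and the goal vector.
w = np.eye(n_states)[goal_state]

# V(s) = M[s] . w  -- with a one-hot w this is just the SR column for the goal.
V = M @ w

def greedy_action(s, next_state_fn):
    """Pick the action whose successor state has the highest SR-derived value;
    next_state_fn(s, a) is a hypothetical deterministic transition model."""
    return int(np.argmax([V[next_state_fn(s, a)] for a in range(n_actions)]))
```

Because `w` is one-hot, `M @ w` simply selects the column of the SR corresponding to the goal state, which is exactly the linearity requirement mentioned above.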
Hey @awjuliani, thank you for your GitHub repo about successor representations. I'm currently working on my master's thesis and it has helped me a lot! However, I have one question: why is your TD-error increasing? Shouldn't it be decreasing?