Hello, @SaifAlDilaimi,
I am glad the repository has been helpful to you. Because the successor target is bootstrapped during the update, and because the policy used to update the SR is random, you should expect the error to increase as values continue to propagate throughout the environment. I believe that if you ran the algorithm long enough, with a sufficiently small learning rate and a deterministic policy, the TD-error should eventually converge to zero.
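For reference, here is a minimal sketch of the tabular SR TD update being described, showing where the bootstrapped successor target enters. The names (`M`, `n_states`, `alpha`, `gamma`) are illustrative and are not taken from the repository:

```python
import numpy as np

n_states, gamma, alpha = 25, 0.95, 0.1   # illustrative gridworld size and hyperparameters
M = np.zeros((n_states, n_states))       # tabular successor representation

def sr_td_update(s, s_next, done):
    """One TD update of the SR row for state s on the transition s -> s_next."""
    one_hot = np.eye(n_states)[s]
    # Bootstrapped successor target: current occupancy plus the discounted
    # successor features of the next state (dropped at terminal states).
    target = one_hot + (0.0 if done else gamma) * M[s_next]
    td_error = target - M[s]
    M[s] += alpha * td_error
    # Under a random policy this error can grow for a while, since credit
    # keeps propagating through the state space as M fills in.
    return np.mean(np.abs(td_error))
```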
Hey @awjuliani, thank you very much for your answer. It helped me a lot!
However, I have another question. I'm trying to implement an aiming task where the agent should always know where the goal is. I was wondering whether that is somehow beneficial and whether you have an idea how to implement it. Would it be enough to represent the goal as a vector and multiply it with the SR matrix when sampling an action?
You definitely have the right intuition. One thing to ensure is that whatever you use as your goal vector allows the reward/value function to be a linear function of the successor matrix and the goal vector. This is typically accomplished either by learning the goal vector online or by hand-crafting a goal vector that satisfies that criterion.
Oh, that was fast! Am I right that a "hand-crafted" goal vector would be the state representation of the goal's position in the environment? At least in your example?
That is correct.
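To make that concrete, here is a minimal sketch, assuming a tabular gridworld with a learned SR matrix `M` and a one-hot goal vector at the goal's state index. The names (`n_states`, `goal_state`, `next_state_fn`) are illustrative, and `next_state_fn` is a hypothetical deterministic transition model rather than anything from the repository:

```python
import numpy as np

n_states, goal_state, n_actions = 25, 24, 4   # illustrative gridworld setup
M = np.random.rand(n_states, n_states)        # stand-in for a learned SR matrix

# Hand-crafted goal (reward) vector: one-hot at the goal's state index,
# so the value function is linear in the SR matrix and the goal vector.
w = np.eye(n_states)[goal_state]

# V(s) = M[s] . w  -- with a one-hot w this is just the SR column for the goal.
V = M @ w

def greedy_action(s, next_state_fn):
    """Pick the action whose successor state has the highest SR-derived value;
    next_state_fn(s, a) is a hypothetical deterministic transition model."""
    return int(np.argmax([V[next_state_fn(s, a)] for a in range(n_actions)]))
```

Because `w` is one-hot, `M @ w` simply selects the column of the SR corresponding to the goal state, which is exactly the linearity requirement mentioned above.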
Hey @awjuliani, thank you for your GitHub repo about successor representations. I'm currently working on my master's thesis and it has helped me a lot! However, I have one question: why is your TD-error increasing? Shouldn't it be decreasing?