MykolaHerasymovych / Optimizing-Acceptance-Threshold-in-Credit-Scoring-using-Reinforcement-Learning

Research project

Delayed reward #1

Open sramirez opened 5 years ago

sramirez commented 5 years ago

Hi there!

Project looks quite interesting, nice job. I was wondering if you measured the effect of delay in your experiments. I'm facing a similar problem where rewards are quite delayed, maybe too much for an effective solution to be applied. I've thought about adding some kind of time-based features to improve adaptation. The idea is to learn different state representations according to the rewards observed in other years/months, etc. What do you think?
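
For example, something like this (just a minimal sketch of what I mean; `month_of_year` and `months_since_decision` are placeholder features, not anything from your repo):

```python
import numpy as np

def augment_state(base_features, month_of_year, months_since_decision):
    """Append time-based features so the agent can learn different
    state representations for different points in time."""
    time_features = np.array([
        np.sin(2 * np.pi * month_of_year / 12),  # cyclic month encoding
        np.cos(2 * np.pi * month_of_year / 12),
        months_since_decision / 12.0,            # elapsed time since the action
    ])
    return np.concatenate([base_features, time_features])
```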

MykolaHerasymovych commented 5 years ago

Hello!

Sorry for the delayed reply, and thank you for the kind words. You are right, in my use case the reward was delayed by up to 6 months, but I accounted for that by slightly changing the RL problem definition. In my case, I know for sure which actions are related to which part of the delayed reward, so I update the value function for each action based on the relevant part of the reward, even though there are multiple updates per action and updates happen with a couple of months' delay. It seems to work quite well: the more distributed the reward is across time and the more actions you take, the more smoothly the value function converges (of course, the first updates, based on partial rewards, are usually far from representative of the final reward).

I guess the idea with delayed rewards is to start partial updates as soon as you have at least some estimate of the future reward and to correct for new rewards on the go. As far as I remember, that was the main idea behind the Q-learning algorithms.
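
For illustration, here is a minimal sketch of the kind of update I mean, assuming a bandit-style setup where each decision's delayed reward arrives in installments (all the names here, like `observe` and `reward_so_far`, are just placeholders, not code from this repo):

```python
import numpy as np

n_actions = 10             # e.g. candidate acceptance thresholds
q = np.zeros(n_actions)    # running value estimate per action
reward_so_far = {}         # cumulative observed reward per decision
alpha = 0.1                # step size

def observe(decision_id, action, partial_reward):
    """Apply one installment of a delayed reward.

    Early calls update toward a partial (and biased) reward-to-date;
    later installments correct the estimate on the go."""
    reward_so_far[decision_id] = reward_so_far.get(decision_id, 0.0) + partial_reward
    q[action] += alpha * (reward_so_far[decision_id] - q[action])

# Example: the reward for one decision (id 42, action 3) is observed
# in three monthly installments, each triggering a partial update.
for monthly_part in [0.2, -0.1, 0.4]:
    observe(42, 3, monthly_part)
```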

Not sure if this answers your question, but if you could explain your RL problem and the nature of the reward delay in a bit more detail, I could come up with more relevant suggestions.