MykolaHerasymovych / Optimizing-Acceptance-Threshold-in-Credit-Scoring-using-Reinforcement-Learning

Research project

Delayed reward #1

Open sramirez opened 5 years ago

sramirez commented 5 years ago

Hi there!

Project looks quite interesting, nice job. I was wondering if you measured the effect of delay in your experiments. I'm facing a similar problem where rewards are quite delayed, maybe too much for an effective solution to be applied. I've thought about adding some kind of time-based features to improve adaptation. The idea is to learn different state representations according to the rewards observed in other years/months, etc. What do you think?
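
For example, something like this (just a minimal sketch of what I mean; `month_of_year` and `months_since_decision` are placeholder features, not anything from your repo):

```python
import numpy as np

def augment_state(base_features, month_of_year, months_since_decision):
    """Append time-based features so the agent can learn different
    state representations for different points in time."""
    time_features = np.array([
        np.sin(2 * np.pi * month_of_year / 12),  # cyclic month encoding
        np.cos(2 * np.pi * month_of_year / 12),
        months_since_decision / 12.0,            # elapsed time since the action
    ])
    return np.concatenate([base_features, time_features])
```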

MykolaHerasymovych commented 5 years ago

Hello!

Sorry for the delayed reply, and thank you for the kind words. You are right, in my use case the reward was delayed by up to 6 months, but I accounted for that by slightly changing the RL problem definition. In my case, I know for sure which actions are related to which part of the delayed reward, so I update the value function for each action based on the relevant part of the reward, even though there are multiple updates per action and updates happen with a couple of months' delay. It seems to work quite well: the more distributed the reward is across time and the more actions you take, the more smoothly the value function converges (of course, the first updates, based on partial rewards, are usually far from representative of the final reward).

I guess the idea with delayed rewards is to start partial updates as soon as you have at least some estimate of the future reward and to correct for new rewards on the go. As far as I remember, that was the main idea behind the Q-learning algorithms.
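
For illustration, here is a minimal sketch of the kind of update I mean, assuming a bandit-style setup where each decision's delayed reward arrives in installments (all the names here, like `observe` and `reward_so_far`, are just placeholders, not code from this repo):

```python
import numpy as np

n_actions = 10             # e.g. candidate acceptance thresholds
q = np.zeros(n_actions)    # running value estimate per action
reward_so_far = {}         # cumulative observed reward per decision
alpha = 0.1                # step size

def observe(decision_id, action, partial_reward):
    """Apply one installment of a delayed reward.

    Early calls update toward a partial (and biased) reward-to-date;
    later installments correct the estimate on the go."""
    reward_so_far[decision_id] = reward_so_far.get(decision_id, 0.0) + partial_reward
    q[action] += alpha * (reward_so_far[decision_id] - q[action])

# Example: the reward for one decision (id 42, action 3) is observed
# in three monthly installments, each triggering a partial update.
for monthly_part in [0.2, -0.1, 0.4]:
    observe(42, 3, monthly_part)
```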

Not sure if this answers your question, but if you could explain your RL problem and the nature of the reward delay in a bit more detail, I could come up with more relevant suggestions.