Kismuz / btgym

Scalable, event-driven, deep-learning-friendly backtesting library
https://kismuz.github.io/btgym/
GNU Lesser General Public License v3.0

Bug: _BTgymAnalyzer.next method, "Send response as <o, r, d, i> tuple" #84

Closed: Harry040 closed this issue 5 years ago

Harry040 commented 5 years ago

Hello @Kismuz, I need your help with a question.

https://github.com/Kismuz/btgym/blob/master/btgym/server.py#L172

A reinforcement learning interaction should be:

action_t = pi(state_t)
state_t+1, reward_t+1 = env.step(action_t)

i.e., reward_t+1 is received after being in state_t and taking action_t.
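For reference, a minimal sketch of that standard interaction loop with the classic gym API (CartPole is just a stand-in environment and the random policy a placeholder, nothing from btgym):

import gym

# Standard agent-environment loop; a random policy stands in for pi(state_t).
env = gym.make('CartPole-v1')
state = env.reset()
done = False
while not done:
    action = env.action_space.sample()            # action_t = pi(state_t)
    state, reward, done, info = env.step(action)  # state_t+1, reward_t+1 = env.step(action_t)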

But I don't think the code below is correct.

class _BTgymAnalyzer(bt.Analyzer):
    def next(self):
        state = self.strategy.get_state()        # state_t+1
        reward = self.strategy.get_reward()      # reward_t+1
        # get action
        self.message = self.socket.recv_pyobj()  # which action? action_t+1, action_t, or action_t-1?
        self.socket.send_pyobj((state, reward, is_done, info))

I really appreciate your effort in building this awesome project. Thank you in advance.

Kismuz commented 5 years ago

@hongguangguo, thank you for the good question;

self.message = self.socket.recv_pyobj()  # which action? action_t+1, action_t, or action_t-1?

Using your notation, it is action_t+1; it is stored and will be executed by the strategy during the next iteration. What can be confusing here is that it is actually reward_t-1 that is included in (state, reward, is_done, info). You can think of it as part of the reward function definition: it simply means that the effect of taking an action is delayed by one step due to domain-specific dynamics. This definition is correct because in this domain r(s, a, s') = r(s, s'), and formally redefining the MDP with r*(s', s'') := r(s, s') does not violate the Markov property of the process.
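To make the shift concrete, here is a minimal sketch (not btgym code; the wrapper and its env argument are hypothetical) of an environment wrapper that emits the previous step's reward together with the new state, which is exactly the r*(s', s'') = r(s, s') redefinition:

class DelayedRewardWrapper:
    """Illustrative only: returns reward_t-1 alongside state_t+1."""

    def __init__(self, env):
        self.env = env
        self.prev_reward = 0.0

    def reset(self):
        self.prev_reward = 0.0
        return self.env.reset()

    def step(self, action):
        state, reward, done, info = self.env.step(action)
        delayed_reward = self.prev_reward   # reward earned on the previous transition
        self.prev_reward = reward           # hold the current reward for the next step
        return state, delayed_reward, done, info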

Harry040 commented 5 years ago

Thank you for your detailed explanation. So the reward is actually delayed by one action step, but I don't get why that doesn't matter. Are there any references I can look at for the details?

Harry040 commented 5 years ago

In a matching order engine, another way would be (pseudocode):

while True:
    1. a_t = receive_action()
    2. broker.execute_order_logic(a_t)
    3. wait for state_t+1 while the submitted orders are matched
    4. compute the reward
    5. send(state_t+1, reward, info, is_done)

Kismuz commented 5 years ago

@hongguangguo,

begin
    set is_done = False
    set action = 'hold'   # do nothing
    while not is_done:
        1. broker.execute_order_logic(action)
        2. compute state, is_done, info
        3. compute reward
        4. receive action
        5. send (state, reward, info, is_done)
end
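A minimal runnable sketch of that ordering using a pyzmq REP socket (the recv_pyobj/send_pyobj calls above are pyzmq); the broker, state and reward helpers below are stubs and the address is hypothetical, so this is not btgym's actual server code:

import zmq

def execute_order_logic(action):
    pass                              # stub: apply the stored action via the broker

def compute_state():
    return {'price': 0.0}, False, {}  # stub: state, is_done, info

def compute_reward():
    return 0.0                        # stub: reward for the last executed action

context = zmq.Context()
socket = context.socket(zmq.REP)
socket.bind('tcp://127.0.0.1:5555')   # hypothetical address

is_done = False
action = 'hold'                       # do nothing until the first client action arrives
while not is_done:
    execute_order_logic(action)                        # 1. execute previously received action
    state, is_done, info = compute_state()             # 2. compute state, is_done, info
    reward = compute_reward()                          # 3. compute reward
    action = socket.recv_pyobj()                       # 4. receive and store the next action
    socket.send_pyobj((state, reward, is_done, info))  # 5. send response
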
Harry040 commented 5 years ago

I got it. Thank you @Kismuz.

Kismuz commented 5 years ago

@hongguangguo, after reviewing the discussion and staging some experiments, I agree that the issue mentioned is indeed a violation of the theoretical framework. From a practical point of view, it causes an unwanted delay in the environment response, leading to performance deterioration.

Fixed, thank you for pointing it out (better late than never:).

Harry040 commented 5 years ago

@Kismuz Aha! Inspired by your awesome project, I also developed a trading env with gym and pyalgotrade. Thank you.

Kismuz commented 5 years ago

@hongguangguo, I wish you good luck with your project! Pyalgotrade is an awesome library providing parallel execution right out of the box. That is fine for distributed training but, as in the case of BTGym, it has some limitations. I wrote a brief note on that: https://docs.google.com/document/d/1hNM-JvKwMVJJhP4oIs0Kqax3A2xpTpbJk7LryyWJstE/edit?usp=sharing