@hongguangguo, thank you for the good question;
self.message = self.socket.recv_pyobj() (which action? action_t+1, action_t, or action_t-1?)
Using your notation, it is action_t+1; it is stored and will be executed by the strategy during the next iteration.
What can be confusing here is that it is actually reward_t-1 that is included in (state, reward, is_done, info). You can think of it as part of the reward function definition: it just means that the effect of taking an action is delayed by one step due to domain-specific dynamics.
It is a correct definition because in this domain r(s, a, s') = r(s, s'), and formally redefining the MDP with r*(s', s'') := r(s, s') does not violate the Markov property of the process.
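A minimal, self-contained Python sketch of that point (not BTGym code; the toy price path and log-return reward are illustrative assumptions): because the reward below depends only on two consecutive states, shifting it one step and packing it with the next state just relabels it, and the process stays Markov.

import math

prices = [100.0, 101.0, 99.5, 102.0, 103.5]   # toy state sequence

def reward(prev_price, price):
    # r(s, a, s') = r(s, s'): the action enters only through its (delayed)
    # effect on the next state, so the reward is a function of consecutive states alone
    return math.log(price / prev_price)

for t in range(1, len(prices)):
    state_t = prices[t]
    # in the thread's notation this is reward_t-1, sent together with state_t
    delayed_reward = reward(prices[t - 1], prices[t])
    print(f"t={t}: state={state_t}, reward for previous transition={delayed_reward:+.4f}")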
Thank you for your detailed explanation. So it is actually delayed by one action step, but I don't get why that doesn't matter. Are there any references I could look into for details?
In the order-matching engine, another way would be (pseudocode, with a runnable sketch below):
while True:
    1. a_t = receive_action()
    2. broker.execute_order_logic(a_t)
    3. wait for state_t+1 while matching the submitted orders
    4. compute the reward
    5. send(state_t+1, reward, info, is_done)
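To make that proposed ordering concrete, here is a minimal runnable sketch under toy assumptions (ToyBroker, an in-process queue standing in for the socket, and a price-difference reward are all made up for illustration, not BTGym's API):

import queue

class ToyBroker:
    # hypothetical stub: just records the submitted order
    def __init__(self):
        self.last_order = None
    def execute_order_logic(self, action):
        self.last_order = action

def serve(prices, actions):
    # proposed ordering: receive a_t, execute it, advance to state_t+1,
    # compute the reward for that transition, then reply
    broker = ToyBroker()
    replies = []
    t = 0
    done = False
    while not done:
        a_t = actions.get()                        # 1. receive the action for state_t
        broker.execute_order_logic(a_t)            # 2. submit/execute the order
        t += 1                                     # 3. next bar arrives; submitted orders are matched
        r = prices[t] - prices[t - 1]              # 4. reward for the (state_t, a_t, state_t+1) transition
        done = t == len(prices) - 1
        replies.append((prices[t], r, {}, done))   # 5. send(state_t+1, reward, info, is_done)
    return replies

agent_actions = queue.Queue()
for a in ('buy', 'hold', 'sell'):
    agent_actions.put(a)
print(serve([100.0, 101.0, 99.5, 102.0], agent_actions))

Here the reward sent back always scores the transition caused by the action just received, so there is no one-step lag.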
@hongguangguo,
begin
    set is_done = False
    set action = 'hold'  # do nothing
    while not is_done:
        1. broker.execute_order_logic(action)
        2. compute state, is_done, info
        3. compute reward
        4. receive action
        5. send (state, reward, info, is_done)
end
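For contrast, the same toy setup with the loop above (again, ToyBroker and the queue are illustrative stand-ins, not the real server): the action received at step 4 is only executed at step 1 of the next iteration, which is the one-step delay discussed earlier.

import queue

class ToyBroker:
    # hypothetical stub: just records the submitted order
    def __init__(self):
        self.last_order = None
    def execute_order_logic(self, action):
        self.last_order = action

def serve(prices, actions):
    broker = ToyBroker()
    replies = []
    action = 'hold'                               # do nothing on the first pass
    t = 0
    done = False
    while not done:
        broker.execute_order_logic(action)        # 1. execute the previously received action
        t += 1
        state = prices[t]                         # 2. compute state, is_done, info
        done = t == len(prices) - 1
        r = state - prices[t - 1]                 # 3. compute reward
        action = actions.get()                    # 4. receive the next action (executed next iteration)
        replies.append((state, r, {}, done))      # 5. send(state, reward, info, is_done)
    return replies

agent_actions = queue.Queue()
for a in ('buy', 'hold', 'sell'):
    agent_actions.put(a)
print(serve([100.0, 101.0, 99.5, 102.0], agent_actions))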
I got it, thank you @Kismuz.
@hongguangguo, after reviewing the discussion and staging some experiments, I agree that the issue mentioned is indeed a violation of the theoretical framework. From a practical point of view, it causes an unwanted delay in the environment response, leading to performance deterioration.
Fixed, thank you for pointing it out (better late than never :).
@Kismuz Aha! Inspired by your awesome project, I also developed a trading env with gym and pyalgotrade. Thank you.
@hongguangguo, I wish you good luck with your project! Pyalgotrade is an awesome library providing parallel execution right out of the box. That is fine for distributed training but, as in the case of BTGym, it has some limitations. I wrote a brief note on that: https://docs.google.com/document/d/1hNM-JvKwMVJJhP4oIs0Kqax3A2xpTpbJk7LryyWJstE/edit?usp=sharing
Hello @Kismuz, I need your help checking my question.
https://github.com/Kismuz/btgym/blob/master/btgym/server.py#L172
A reinforcement learning loop should be:
action_t = pi(state_t)
state_t+1, reward_t+1 = env.step(action_t)
i.e. reward_t+1 is obtained after the state is state_t and action_t is taken.
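For reference, a minimal self-contained sketch of that standard interaction loop (ToyEnv and the toy policy are hypothetical stand-ins, not BTGym's interface):

class ToyEnv:
    # hypothetical gym-style environment over a fixed price series
    def __init__(self, prices):
        self.prices = prices
        self.t = 0
    def reset(self):
        self.t = 0
        return self.prices[self.t]
    def step(self, action):
        self.t += 1
        state_next = self.prices[self.t]
        reward_next = state_next - self.prices[self.t - 1] if action == 'buy' else 0.0
        done = self.t == len(self.prices) - 1
        return state_next, reward_next, done, {}

def pi(state):
    return 'buy' if state < 101.0 else 'hold'       # toy policy

env = ToyEnv([100.0, 101.0, 99.5, 102.0])
state = env.reset()
done = False
while not done:
    action_t = pi(state)                            # action_t = pi(state_t)
    state, reward, done, info = env.step(action_t)  # state_t+1, reward_t+1 = env.step(action_t)
    print(state, reward, done)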
But I don't think the code linked above follows this; it doesn't look correct to me.
I really appreciate your effort in building this awesome project. Thank you in advance.