Kismuz / btgym

Scalable, event-driven, deep-learning-friendly backtesting library
https://kismuz.github.io/btgym/
GNU Lesser General Public License v3.0

Bug: noisy-net layer #97

Open mysl opened 5 years ago

mysl commented 5 years ago

hi @Kismuz, I was reading the paper "Noisy Networks for Exploration" and have a question w.r.t. its usage in btgym. The paper says that "As A3C is an on-policy algorithm the gradients are unbiased when noise of the network is consistent for the whole roll-out. Consistency among action value functions is ensured by letting the noise be the same throughout each rollout".

It looks to me that the current implementation in btgym can't ensure "the noise is the same throughout each rollout", because the training steps and environment steps are executed in different threads and can be interleaved. Or am I missing anything? Thanks!
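
To make the concern concrete, here is a minimal, purely illustrative TF1 sketch (not the actual btgym layer): an in-graph noise op such as tf.random_normal is re-evaluated on every session.run, so consecutive policy steps within one rollout each see a fresh noise sample.

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 4])
w_mu = tf.Variable(tf.zeros([4, 2]))
w_sigma = tf.Variable(tf.ones([4, 2]) * 0.017)
eps = tf.random_normal([4, 2])          # re-sampled on every session.run
y = tf.matmul(x, w_mu + w_sigma * eps)  # "noisy" linear output

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    feed = {x: [[1.0, 1.0, 1.0, 1.0]]}
    print(sess.run(y, feed))  # policy step t: one noise sample
    print(sess.run(y, feed))  # policy step t + 1: a different noise sample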

Kismuz commented 5 years ago

@mysl, it seems you are right. Appendix A of the paper clearly states the noise should be fixed for the entire rollout. The layer was adapted from a DQN implementation without sufficient expertise, sorry for that. Do I understand it correctly: the noise is fixed for a train batch pass and gets resampled on every step when collecting experience?

Kismuz commented 5 years ago

As a quick fix, the noisy_net layer can be disabled by passing the policy kwarg explicitly; mind tuning the entropy regularisation:

from btgym.algorithms.nn.layers import linear

# Policy architecture setup:
policy_config = dict(
    class_ref=GuidedPolicy_0_0,
    kwargs={
        'lstm_layers': (256, 256),
        'state_encoder_class_ref': conv_1d_casual_encoder,
        'dropout_keep_prob': 0.5,
        'linear_layer_ref': linear,  # <------ plain linear layer instead of the noisy one
    }
)

# Algorithm config:
trainer_config = dict(
    ...,
    kwargs=dict(
        ...,
        model_beta=0.05,  # <------ entropy regularisation coefficient
        ...,
    )
)

mysl commented 5 years ago

Do I understand it correctly: the noise is fixed for a train batch pass and gets resampled on every step when collecting experience?

My understanding is that for NoisyNet-DQN (Appendix C.1) the noise is sampled on every environment step, while for NoisyNet-A3C (Appendix C.2) the noise is sampled once per rollout. So in this implementation, maybe we should use a placeholder for the noise and sample it outside of the network?
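
For illustration, a hedged sketch of that idea (the function and placeholder names below are hypothetical, not the existing btgym layer API): epsilon enters the graph through placeholders, so the caller decides when to resample it (once per rollout for NoisyNet-A3C) and can re-feed exactly the same values for the train pass over that rollout.

import numpy as np
import tensorflow as tf

def noisy_linear_ext_noise(x, size, name):
    """Independent-Gaussian noisy linear layer; noise is fed in via placeholders."""
    n_in = int(x.get_shape()[1])
    with tf.variable_scope(name):
        w_mu = tf.get_variable(
            'w_mu', [n_in, size],
            initializer=tf.random_uniform_initializer(-1.0 / np.sqrt(n_in), 1.0 / np.sqrt(n_in)))
        w_sigma = tf.get_variable('w_sigma', [n_in, size], initializer=tf.constant_initializer(0.017))
        b_mu = tf.get_variable('b_mu', [size], initializer=tf.constant_initializer(0.0))
        b_sigma = tf.get_variable('b_sigma', [size], initializer=tf.constant_initializer(0.017))
        # Noise placeholders: sampled by the caller, e.g. once per rollout:
        w_eps = tf.placeholder(tf.float32, [n_in, size], name='w_eps')
        b_eps = tf.placeholder(tf.float32, [size], name='b_eps')
        out = tf.matmul(x, w_mu + w_sigma * w_eps) + b_mu + b_sigma * b_eps
    return out, (w_eps, b_eps)

# Caller side: sample epsilon once at rollout start and reuse the same arrays for
# every policy step of that rollout and for the train pass over it, e.g.:
#   w_noise, b_noise = np.random.randn(n_in, size), np.random.randn(size)
#   feed_dict[w_eps], feed_dict[b_eps] = w_noise, b_noise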

Kismuz commented 5 years ago

the noise is sampled once per rollout

yes, but isn't that solely in the context of gradient estimation (the train pass)?

maybe we should use a placeholder for the noise and sample it outside of the network?

Yes, if we need to fix the noise at the time of data acquisition (see above); no, if the noise is to be fixed for the train batch only (the size can be inferred and the noise sampled in-graph).

mysl commented 5 years ago

Yes, if we need to fix the noise at the time of data acquisition (see above); no, if the noise is to be fixed for the train batch only (the size can be inferred and the noise sampled in-graph).

I think the noise should be fixed when collecting experience as well, since A3C is an on-policy algorithm. This also seems to agree with the pseudocode (line 7) in the paper:

[image: NoisyNet-A3C pseudocode from the paper]

Kismuz commented 5 years ago

Yes, indeed. As the pseudocode shows, it is the same noise for collecting and for training (of the same rollout); that means it should be a placeholder input, but it is also essential to keep the noise as part of the experience. Currently all rollout information packing/unpacking is handled by the btgym.algorithms.rollout.Rollout class, which is essentially a nested dictionary of lists; it may be optimal to extend it with a new key holding one noise tensor per rollout. The noise-emitting method could be one of the policy instance's (it knows the required shapes and properties), or even .get_initial_features() with dummy output when no noisy-net layers are present.
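
A hedged sketch of that idea (the method and key names below are hypothetical, not the existing btgym API): the policy knows the shapes of its noisy layers and emits one sample per rollout, and the rollout dictionary gets one extra key holding that sample alongside the per-step lists.

import numpy as np

class PolicyNoiseMixin:
    """Hypothetical policy-side helper; the shapes below are examples only."""
    noisy_layer_shapes = {'hidden/w_eps': (256, 64), 'hidden/b_eps': (64,)}

    def sample_rollout_noise(self):
        # Returns an empty dict (dummy output) when no noisy-net layers are present.
        return {name: np.random.randn(*shape)
                for name, shape in self.noisy_layer_shapes.items()}

# Runner side: sample once, reuse for every action of the rollout, then pack it
# so the trainer can re-feed exactly the same noise for the gradient pass:
rollout = {'state': [], 'action': [], 'reward': [], 'terminal': []}  # nested dict of lists
rollout['rollout_noise'] = PolicyNoiseMixin().sample_rollout_noise()  # one sample per rollout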

Kismuz commented 5 years ago

Due to time limitations, the expected time to fix the issue is four to five days. Until then it is best to use a linear layer (mentioned above). If anyone wants to contribute, it is highly appreciated.

Kismuz commented 5 years ago

TODO checklist (an illustrative end-to-end sketch follows the list):

btgym.algorithms.rollout.Rollout: extend with a new key holding one noise tensor per rollout

btgym.algorithms.policy.base.BaseAacPolicy: add a noise-emitting method that knows the required shapes and properties (dummy output when no noisy-net layers are present)

btgym.algorithms.runner: add the above field processing to runners via policy callback functions; separate policy step callbacks and rollout callbacks

btgym.algorithms.aac.BaseAac: feed the stored rollout noise back in on the train pass
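
For reference, a hedged end-to-end sketch of how these pieces could fit together (all names are illustrative, not the existing btgym API): the runner fixes one noise sample per rollout, feeds it on every policy step, stores it with the experience, and the trainer re-feeds the same sample for the gradient pass over that rollout.

def rollout_generator(env, policy, sess, rollout_length):
    """Illustrative runner loop: one noise sample per rollout, reused at every step."""
    state = env.reset()
    done = False
    while not done:
        noise = policy.sample_rollout_noise()              # fixed for the whole rollout
        rollout = {'state': [], 'action': [], 'reward': [], 'terminal': []}
        for _ in range(rollout_length):
            action = policy.act(sess, state, noise=noise)  # same noise at every step
            state, reward, done, info = env.step(action)
            rollout['state'].append(state)
            rollout['action'].append(action)
            rollout['reward'].append(reward)
            rollout['terminal'].append(done)
            if done:
                break
        rollout['rollout_noise'] = noise                   # packed with the experience
        yield rollout

# Trainer side (BaseAac analogue): when building the train feed_dict from a batch,
# include the stored noise so the gradient pass sees the noise that generated the data:
#   feed_dict.update(policy.noise_feed_dict(batch['rollout_noise']))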