mysl opened this issue 5 years ago
@mysl, seems you are right. Appendix A of the paper clearly states the noise should be fixed for the entire rollout. The layer was adapted from the DQN implementation without sufficient expertise, sorry for that.
Do I understand it correctly: noise is fixed for a train batch pass and gets resampled with every step when collecting experience?
As a quick fix, to disable the noisy_net layer one can pass the policy kwarg explicitly (mind tuning the entropy regularisation):
```python
from btgym.algorithms.nn.layers import linear

# Policy architecture setup:
policy_config = dict(
    class_ref=GuidedPolicy_0_0,
    kwargs={
        'lstm_layers': (256, 256),
        'state_encoder_class_ref': conv_1d_casual_encoder,
        'dropout_keep_prob': 0.5,
        'linear_layer_ref': linear,  # <------ plain linear layer instead of noisy_net
    }
)

# Algorithm config:
trainer_config = dict(
    ...,
    kwargs=dict(
        ...,
        model_beta=0.05,  # <------ entropy regularisation, mind tuning
        ...,
    )
)
```
> Do I understand it correctly: noise is fixed for a train batch pass and gets resampled with every step when collecting experience?
My understanding is that in the paper's algorithm for NoisyNet-DQN (Appendix C.1), noise is sampled on every environment step, while for NoisyNet-A3C (Appendix C.2), noise is sampled once per rollout batch. So in this implementation, maybe we should use a placeholder for the noise and sample it outside of the network?
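To illustrate the distinction, here is a minimal NumPy sketch of the factorised Gaussian noise from the paper (eps_w[i,j] = f(eps_in[i]) * f(eps_out[j]) with f(x) = sgn(x) * sqrt(|x|)), sampled outside the network: once per rollout for the A3C variant, once per step for the DQN variant. The function name and shapes are illustrative, not btgym API.

```python
import numpy as np

def factorised_noise(n_in, n_out, rng):
    """Factorised Gaussian noise as in the NoisyNet paper:
    eps_w[i, j] = f(eps_in[i]) * f(eps_out[j]), f(x) = sgn(x) * sqrt(|x|)."""
    f = lambda x: np.sign(x) * np.sqrt(np.abs(x))
    eps_in = f(rng.standard_normal(n_in))
    eps_out = f(rng.standard_normal(n_out))
    # Weight noise is the outer product; bias noise reuses eps_out:
    return np.outer(eps_in, eps_out), eps_out

rng = np.random.default_rng(0)
rollout_length = 20

# NoisyNet-A3C (Appendix C.2): sample once, reuse for every step of the rollout:
w_noise, b_noise = factorised_noise(4, 2, rng)
rollout_noise = [(w_noise, b_noise)] * rollout_length  # same sample each step

# NoisyNet-DQN (Appendix C.1): resample on every environment step instead:
per_step_noise = [factorised_noise(4, 2, rng) for _ in range(rollout_length)]
```

Sampling outside the graph like this is what makes a placeholder input natural: the same arrays can then be fed both at acting time and at train time for that rollout.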
> noise is sampled on every rollout
Yes, but isn't that solely in the context of gradient estimation (the train pass)?
> maybe we should use a placeholder for the noise, and sample outside of the network?
Yes, if we need to fix noise at the time of data acquisition (see above); no, if the noise is to be fixed for the train batch only (we can infer the size and sample in-graph).
> yes, if we need to fix noise at time of data acquisition (see above), no if noise to be fixed for train batch only (can infer size and sample in-graph)
I think the noise should be fixed when collecting experience as well, since A3C is an on-policy algorithm. And this seems to agree with the pseudocode (line 7) in the paper.
Yes, indeed.
As the pseudocode shows, it is the same noise for collecting and for training (of the same rollout); that means it should be a placeholder input, but it is also essential to keep the noise as part of the experience.
Currently all rollout information packing/unpacking is handled by the btgym.algorithms.rollout.Rollout class, which is essentially a nested dictionary of lists; it may be optimal to extend it with a new key holding one noise tensor per rollout. The noise-emitting method could be one of the policy instance's (it knows the required shape and properties), or even .get_initial_features() with dummy output when no noisy-net layers are present.
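A rough sketch of that idea, with a minimal stand-in for the Rollout class and a hypothetical policy-side noise emitter (names and shapes are illustrative, not the actual btgym API):

```python
import numpy as np

class Rollout(dict):
    """Minimal stand-in for btgym.algorithms.rollout.Rollout:
    essentially a nested dictionary of lists of per-step experience."""
    def add(self, values):
        for key, value in values.items():
            self.setdefault(key, []).append(value)

def get_noise_sample(layer_shapes, rng):
    """Hypothetical policy-instance method: the policy knows the required
    shapes of its noisy layers; could return a dummy (empty) sample when
    no noisy-net layers are present, as .get_initial_features() does."""
    return {name: rng.standard_normal(shape) for name, shape in layer_shapes.items()}

rng = np.random.default_rng(42)
rollout = Rollout()

# One noise sample per rollout, stored under its own new key:
rollout['noise'] = get_noise_sample({'noisy_fc_1': (256, 256)}, rng)

# Per-step experience is appended as usual:
for t in range(5):
    rollout.add({'state': np.zeros(4), 'reward': 0.0})
```

Keeping the noise inside the rollout means the train pass can feed exactly the sample that was used at acquisition time through a placeholder.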
Due to time limitations, the expected time to fix the issue is four to five days. Until then it is best to use a linear layer (mentioned above). If anyone wants to contribute, it is highly appreciated.
- btgym.algorithms.rollout.Rollout:
- btgym.algorithms.policy.base.BaseAacPolicy:
- btgym.algorithms.runner: add the above field processing to runners via policy callback functions; separate policy step callbacks and rollout callbacks
- btgym.algorithms.aac.BaseAac:
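One possible shape for the runner-side split between per-step and per-rollout policy callbacks, sketched with hypothetical names (this is a design sketch, not existing btgym code):

```python
class PolicyCallbacks:
    """Hypothetical callback container for the runner: step callbacks fire
    on every environment step, rollout callbacks fire once per finalised
    rollout (e.g. to emit the fixed noise sample for that rollout)."""

    def __init__(self, step_callbacks=None, rollout_callbacks=None):
        self.step_callbacks = step_callbacks or {}
        self.rollout_callbacks = rollout_callbacks or {}

    def on_step(self, context):
        # Collect per-step fields to append to the rollout:
        return {name: fn(context) for name, fn in self.step_callbacks.items()}

    def on_rollout(self, context):
        # Collect once-per-rollout fields (one entry per rollout):
        return {name: fn(context) for name, fn in self.rollout_callbacks.items()}

# Usage: the runner calls on_rollout() once per rollout and attaches the
# result to the Rollout under its own key:
callbacks = PolicyCallbacks(
    rollout_callbacks={'noise': lambda ctx: ctx.get('noise_sample')},
)
packed = callbacks.on_rollout({'noise_sample': [0.1, -0.2]})
```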
Hi @Kismuz, I was reading the paper "Noisy Networks for Exploration" and have a question w.r.t. its usage in btgym. The paper says: "As A3C is an on-policy algorithm the gradients are unbiased when noise of the network is consistent for the whole roll-out. Consistency among action value functions is ensured by letting the noise be the same throughout each rollout."
It looks to me that the current implementation in btgym can't ensure "the noise is the same throughout each rollout", because the training steps and environment steps are executed in different threads and could be interleaved. Or am I missing anything? Thanks!