Closed: IcarusWizard closed this issue 3 years ago.
Hi, sorry for the delay.
The neural network policy outputs the parameters of a Gaussian distribution, namely a mean and a standard deviation. An action sample can therefore be produced by sampling a noise variable from a unit Gaussian, scaling it by the standard deviation, and adding the mean.
In other words, the policy is structured as `action = noise * std(obs) + mean(obs)`, where `std` and `mean` depend on the network parameters and the observation.
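In code, that reparameterized sampling step looks roughly like the following (a minimal sketch; `policy_net` is a hypothetical callable standing in for the actual network, not the real d4rl interface):

```python
import numpy as np

def sample_action(policy_net, obs, rng):
    # Hypothetical policy_net: maps an observation to the (mean, std)
    # parameters of a Gaussian action distribution.
    mean, std = policy_net(obs)
    # Reparameterized sample: draw unit-Gaussian noise,
    # scale by the standard deviation, and shift by the mean.
    noise = rng.standard_normal(mean.shape)
    return noise * std + mean

# Example with a dummy policy that always outputs mean=0, std=1:
rng = np.random.default_rng(0)
action = sample_action(lambda o: (np.zeros(3), np.ones(3)),
                       obs=np.zeros(5), rng=rng)
```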
Hi, thank you for providing such great work. I have a question about the data collection method. I looked through the code and found that this may be relevant:
https://github.com/rail-berkeley/d4rl/blob/1ed16f94b74d9d7ee60fa399746ace754bc0b838/scripts/ope_rollout.py#L29-L30
However, since the model is provided as an ONNX model, I am confused about how the noise is used when generating actions. Is the noise added directly to a deterministic output of the actor, used as a latent code by a VAE-style actor, or something else?
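To make the question concrete, here is a hypothetical sketch of how the noise might be fed to the ONNX policy during a rollout. The model path, the input names `"observation"` and `"noise"`, the single-output assumption, and the use of the old Gym step API are all guesses for illustration; the real signature can be inspected with `sess.get_inputs()`:

```python
import gym
import d4rl  # registers the d4rl environments
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("policy.onnx")  # hypothetical model path
env = gym.make("halfcheetah-medium-v2")
rng = np.random.default_rng(0)

obs, done = env.reset(), False
while not done:
    # Guess: unit-Gaussian noise is passed to the model alongside the
    # observation -- this is exactly the part in question.
    noise = rng.standard_normal(env.action_space.shape).astype(np.float32)
    (action,) = sess.run(None, {"observation": obs[None].astype(np.float32),
                                "noise": noise[None]})
    obs, reward, done, _ = env.step(action[0])  # old Gym 4-tuple step API
```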