Farama-Foundation / D4RL

A collection of reference environments for offline reinforcement learning

Question about data collection #71

Closed IcarusWizard closed 3 years ago

IcarusWizard commented 3 years ago

Hi, thank you for this great work. I have some questions about the data collection method. I looked through the code and found that this may be relevant:

https://github.com/rail-berkeley/d4rl/blob/1ed16f94b74d9d7ee60fa399746ace754bc0b838/scripts/ope_rollout.py#L29-L30

However, since the model is referred to as an ONNX model, I am confused about how the noise is used when generating actions. Is the noise added directly to the deterministic output of the actor, used as a latent code by a VAE-style actor, or something else?

justinjfu commented 3 years ago

Hi, sorry for the delay.

The neural network policy outputs the parameters of a Gaussian distribution, namely a mean and a standard deviation. An action sample can therefore be produced by sampling a noise variable from a unit Gaussian, scaling it by the standard deviation, and adding the mean.

In other words, the policy is structured as `action = noise * std(obs) + mean(obs)`, where `std` and `mean` depend on the network parameters and the observation.
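
For concreteness, here is a minimal NumPy sketch of that sampling scheme (the function and variable names are illustrative, not taken from the D4RL codebase):

```python
import numpy as np

def sample_action(mean, std, noise):
    # Reparameterized Gaussian sample: action = noise * std(obs) + mean(obs)
    return noise * std + mean

rng = np.random.default_rng(0)
mean = np.array([0.1, -0.3])              # mean(obs) from the policy network
std = np.array([0.2, 0.5])                # std(obs) from the policy network
noise = rng.standard_normal(mean.shape)   # noise ~ N(0, I)
action = sample_action(mean, std, noise)
```

One note on the design: passing the noise in explicitly, rather than sampling it inside the model, makes the policy deterministic for a fixed (observation, noise) pair, which is presumably why the ONNX graph takes the noise as a separate input.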