Open tianyunzhe opened 4 months ago
Thank you for sharing your code. Could you please explain how the dataset for _maskdp_evalexpert is generated? Additionally, could you provide the code and commands for this part? Thank you.

We train a TD3 policy for 0.5M-1M environment interaction steps (depending on the task), store the checkpoint, and roll out the stored checkpoint with a small amount of Gaussian noise to collect the data, which is why we call it (near-)expert data. The train and eval expert datasets are generated in the same way; we simply split them for evaluation. The conclusions and relative performance of our method still hold if you use other data (e.g., sup, semi).

It should be easy to implement yourself, but I'll also add code support in the data collection branch.
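For concreteness, here is a minimal sketch of that collection procedure. This is not the authors' actual script: it assumes a TD3 actor exported as a TorchScript file (`td3_actor.pt`), a Gym-style environment (`HalfCheetah-v4` as a stand-in task), and placeholder values for the noise scale, episode count, train/eval split fraction, and output directory names, none of which come from this repo.

```python
"""Hedged sketch of (near-)expert data collection: roll out a stored TD3
checkpoint with small Gaussian action noise and split the episodes into
train/eval sets. All file names, hyperparameters, and the .npz layout are
placeholder assumptions, not the repo's actual conventions."""
import os

import gymnasium as gym
import numpy as np
import torch

NOISE_STD = 0.1        # "small Gaussian noise" added to the deterministic TD3 action (assumed value)
NUM_EPISODES = 1000    # total episodes to collect (assumed value)
EVAL_FRACTION = 0.1    # fraction of episodes held out as the eval-expert split (assumed value)


def collect_episode(env, actor, noise_std):
    """Roll out one episode from the stored checkpoint with Gaussian action noise."""
    obs_buf, act_buf, rew_buf = [], [], []
    obs, _ = env.reset()
    done = False
    while not done:
        with torch.no_grad():
            action = actor(torch.as_tensor(obs, dtype=torch.float32)).numpy()
        action = action + np.random.normal(0.0, noise_std, size=action.shape)
        action = np.clip(action, env.action_space.low, env.action_space.high)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        obs_buf.append(obs)
        act_buf.append(action)
        rew_buf.append(reward)
        obs = next_obs
        done = terminated or truncated
    return dict(observation=np.array(obs_buf),
                action=np.array(act_buf),
                reward=np.array(rew_buf))


def main():
    env = gym.make("HalfCheetah-v4")           # stand-in task; swap in the task you need
    actor = torch.jit.load("td3_actor.pt")     # TD3 actor trained for 0.5M-1M steps (assumed export format)
    episodes = [collect_episode(env, actor, NOISE_STD) for _ in range(NUM_EPISODES)]

    # Train and eval expert data are generated the same way; they are simply split.
    n_eval = int(len(episodes) * EVAL_FRACTION)
    for name, chunk in [("maskdp_evalexpert", episodes[:n_eval]),
                        ("maskdp_trainexpert", episodes[n_eval:])]:
        os.makedirs(name, exist_ok=True)
        for i, ep in enumerate(chunk):
            np.savez(os.path.join(name, f"episode_{i:05d}.npz"), **ep)


if __name__ == "__main__":
    main()
```

The stored keys and file layout above are illustrative only; in practice they should be matched to whatever the repo's data loader expects (e.g., additional fields such as discounts or physics state, if required).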