Open tianyunzhe opened 4 months ago
Thank you for sharing your code. Could you please explain how the dataset for _maskdp_evalexpert is generated? Additionally, could you provide the code and commands for this part? Thank you.

We train a TD3 policy for 0.5M-1M environment interaction steps (depending on the task), store the checkpoint, and roll out the stored checkpoint with a small amount of Gaussian noise to collect the data, which is why we call it (near-)expert data. The train and eval expert datasets are generated in the same way; we simply split them for evaluation. The conclusions and relative performance of our method still hold if you use other data (e.g., sup, semi).

It should be easy to implement yourself, but I'll also add code support in the data collection branch.
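For concreteness, here is a minimal sketch of that collection procedure. This is not the authors' actual script: it assumes a TD3 actor exported as a TorchScript file (`td3_actor.pt`), a Gym-style environment (`HalfCheetah-v4` as a stand-in task), and placeholder values for the noise scale, episode count, train/eval split fraction, and output directory names, none of which come from this repo.

```python
"""Hedged sketch of (near-)expert data collection: roll out a stored TD3
checkpoint with small Gaussian action noise and split the episodes into
train/eval sets. All file names, hyperparameters, and the .npz layout are
placeholder assumptions, not the repo's actual conventions."""
import os

import gymnasium as gym
import numpy as np
import torch

NOISE_STD = 0.1        # "small Gaussian noise" added to the deterministic TD3 action (assumed value)
NUM_EPISODES = 1000    # total episodes to collect (assumed value)
EVAL_FRACTION = 0.1    # fraction of episodes held out as the eval-expert split (assumed value)


def collect_episode(env, actor, noise_std):
    """Roll out one episode from the stored checkpoint with Gaussian action noise."""
    obs_buf, act_buf, rew_buf = [], [], []
    obs, _ = env.reset()
    done = False
    while not done:
        with torch.no_grad():
            action = actor(torch.as_tensor(obs, dtype=torch.float32)).numpy()
        action = action + np.random.normal(0.0, noise_std, size=action.shape)
        action = np.clip(action, env.action_space.low, env.action_space.high)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        obs_buf.append(obs)
        act_buf.append(action)
        rew_buf.append(reward)
        obs = next_obs
        done = terminated or truncated
    return dict(observation=np.array(obs_buf),
                action=np.array(act_buf),
                reward=np.array(rew_buf))


def main():
    env = gym.make("HalfCheetah-v4")           # stand-in task; swap in the task you need
    actor = torch.jit.load("td3_actor.pt")     # TD3 actor trained for 0.5M-1M steps (assumed export format)
    episodes = [collect_episode(env, actor, NOISE_STD) for _ in range(NUM_EPISODES)]

    # Train and eval expert data are generated the same way; they are simply split.
    n_eval = int(len(episodes) * EVAL_FRACTION)
    for name, chunk in [("maskdp_evalexpert", episodes[:n_eval]),
                        ("maskdp_trainexpert", episodes[n_eval:])]:
        os.makedirs(name, exist_ok=True)
        for i, ep in enumerate(chunk):
            np.savez(os.path.join(name, f"episode_{i:05d}.npz"), **ep)


if __name__ == "__main__":
    main()
```

The stored keys and file layout above are illustrative only; in practice they should be matched to whatever the repo's data loader expects (e.g., additional fields such as discounts or physics state, if required).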