Hello! I am reproducing your paper's results (training PPO + self-imitation with the MineCLIP reward), but I am missing a few details:
How are the agent's 89 discrete actions (mentioned in the paper) implemented? Currently your MineAgent uses a multi-discrete output of 3*3*4*25*25*3, which is much larger. Did you remove some action choices? My current guess is sketched below.
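For context, here is a minimal sketch of what I imagine so far: keep a restricted subset of each multi-discrete dimension (e.g. fewer camera bins) and flatten the Cartesian product into a single discrete index. The bin counts below are my own assumptions, not taken from the paper, and as the final count shows, a naive product does not land on 89:

```python
import itertools
import numpy as np

# Hypothetical reduced choices per multi-discrete dimension; the original
# space is 3*3*4*25*25*3. These particular bin choices are my assumption.
REDUCED_CHOICES = [
    range(3),             # forward / backward / no-op
    range(3),             # left / right / no-op
    range(4),             # jump / sneak / sprint / no-op
    [4, 10, 12, 14, 20],  # camera pitch: keep 5 of the 25 bins
    [4, 10, 12, 14, 20],  # camera yaw:   keep 5 of the 25 bins
    range(3),             # functional actions (use / attack / no-op)
]

# Flatten the Cartesian product into one discrete action table.
ACTION_TABLE = list(itertools.product(*REDUCED_CHOICES))

def discrete_to_multidiscrete(action_id: int) -> np.ndarray:
    """Map a flat discrete action id back to the env's multi-discrete action."""
    return np.array(ACTION_TABLE[action_id])

print(len(ACTION_TABLE))  # 3*3*4*5*5*3 = 2700, still far from 89 --
                          # which is exactly why I am asking how you pruned it.
```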
For computing the direct reward with the MineCLIP model, how did you sample the negative texts, and how many negatives did you use? My current guess at the formulation is sketched below.
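Here is what I am currently doing: softmax over the video embedding's cosine similarities to the positive prompt plus N sampled negative prompts, then subtract the 1/(N+1) chance baseline and clip at 0. This formulation and the temperature are my assumptions, and the dummy embeddings stand in for MineCLIP's video/text encoders (whose exact API I have not verified):

```python
import torch
import torch.nn.functional as F

def mineclip_direct_reward(video_emb, pos_text_emb, neg_text_embs, temperature=1.0):
    """Relative reward: probability that the video clip matches the task
    prompt among (1 + N) candidate prompts, minus the chance baseline.
    neg_text_embs: (N, D) tensor of embeddings for sampled negative prompts."""
    texts = torch.cat([pos_text_emb.unsqueeze(0), neg_text_embs], dim=0)  # (1+N, D)
    sims = F.cosine_similarity(video_emb.unsqueeze(0), texts, dim=-1)     # (1+N,)
    probs = F.softmax(sims / temperature, dim=0)
    n = 1 + neg_text_embs.shape[0]
    # Zero reward when the positive prompt is no likelier than chance.
    return torch.clamp(probs[0] - 1.0 / n, min=0.0)

# Usage with dummy normalized embeddings in place of MineCLIP outputs:
video = F.normalize(torch.randn(512), dim=0)
pos = F.normalize(torch.randn(512), dim=0)
negs = F.normalize(torch.randn(32, 512), dim=-1)
print(mineclip_direct_reward(video, pos, negs))
```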
I find that the timescale of one step in the MineDojo simulation is much smaller than one second in the YouTube videos. Did you use the last 16 consecutive RGB observations to compute the reward? My current frame-buffering approach is sketched below.
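Concretely, I keep a rolling buffer of the last 16 raw RGB observations and stack them into a clip for the video encoder at every step, padding with the first frame at episode start. The padding strategy and buffer size handling are my assumptions:

```python
from collections import deque

import numpy as np

class FrameBuffer:
    """Keep the last 16 RGB observations so a 16-frame clip is always
    available for MineCLIP; pad with the first frame at episode start."""

    def __init__(self, size: int = 16):
        self.size = size
        self.frames = deque(maxlen=size)

    def reset(self, first_frame: np.ndarray) -> None:
        # Fill the whole buffer with the initial observation.
        self.frames.clear()
        for _ in range(self.size):
            self.frames.append(first_frame)

    def push(self, frame: np.ndarray) -> None:
        self.frames.append(frame)  # oldest frame drops out automatically

    def clip(self) -> np.ndarray:
        # (16, H, W, 3) stack of consecutive frames, oldest first.
        return np.stack(self.frames, axis=0)
```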
Thank you!