MineDojo / MineCLIP

Foundation Model for MineDojo
MIT License

Training details about MineAgent #9

Open mansicer opened 1 year ago

mansicer commented 1 year ago

Hi. Thank you for releasing this valuable benchmark! I'm working on implementing the PPO agent reported in the paper, but I found some discrepancies between the code and the paper.

Trimmed action space

As mentioned in #4, the code below does not correspond to the 89 action dims described in Appendix G.2.

https://github.com/MineDojo/MineCLIP/blob/e6c06a0245fac63dceb38bc9bd4fecd033dae735/main/mineagent/run_env_in_loop.py#L75
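For context on what I mean by trimming: the exact composition of the 89 actions is what I'm asking about, but below is a sketch of one way to flatten a trimmed subset of MineDojo's MultiDiscrete action space into a single Discrete space. The component sizes and the no-op vector follow the MineDojo docs as I understand them; the particular subset is my own assumption, not the paper's layout.

```python
# Illustration only: flatten a trimmed subset of MineDojo's MultiDiscrete
# action space (MultiDiscrete([3, 3, 4, 25, 25, 8, 244, 36]) per the docs)
# into a single Discrete space. The subset below is an assumption, not the
# 89-dim layout from the paper.
import itertools
import numpy as np

# (index into the full 8-dim action vector, allowed values for that component)
TRIMMED_COMPONENTS = [
    (0, [0, 1, 2]),      # no-op / forward / backward
    (2, [0, 1]),         # no-op / jump
    (3, [10, 12, 14]),   # a few camera-pitch bins around the no-op bin (12)
    (4, [10, 12, 14]),   # a few camera-yaw bins around the no-op bin (12)
    (5, [0, 3]),         # functional action: no-op / attack
]

# Each tuple in the Cartesian product becomes one flat discrete action.
_FLAT_ACTIONS = list(itertools.product(*[values for _, values in TRIMMED_COMPONENTS]))
NUM_FLAT_ACTIONS = len(_FLAT_ACTIONS)  # 3 * 2 * 3 * 3 * 2 = 108 in this sketch


def flat_to_minedojo(flat_idx: int) -> np.ndarray:
    """Map a flat Discrete index back to a full 8-dim MineDojo action vector."""
    action = np.array([0, 0, 0, 12, 12, 0, 0, 0], dtype=np.int64)  # all no-ops
    for (component, _), value in zip(TRIMMED_COMPONENTS, _FLAT_ACTIONS[flat_idx]):
        action[component] = value
    return action
```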

About the compass observation

In the paper, the compass observation has shape (2,). However, the code below uses an input of shape (4,).

https://github.com/MineDojo/MineCLIP/blob/e6c06a0245fac63dceb38bc9bd4fecd033dae735/main/mineagent/run_env_in_loop.py#L25
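For what it's worth, a (4,) input would be consistent with a sin/cos encoding of the (pitch, yaw) compass pair, as sketched below. That is only my guess at why the shapes differ; the authors have not confirmed it.

```python
# Assumption: the (2,) compass observation is (pitch, yaw) in degrees and the
# (4,) network input is its sin/cos encoding. This is a guess, not confirmed.
import numpy as np

def encode_compass(compass: np.ndarray) -> np.ndarray:
    """Map (pitch, yaw) in degrees to (sin(pitch), sin(yaw), cos(pitch), cos(yaw))."""
    rad = np.deg2rad(compass)
    return np.concatenate([np.sin(rad), np.cos(rad)])  # shape (4,)

print(encode_compass(np.array([0.0, 90.0])))  # approximately [0. 1. 1. 0.]
```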

Training on MultiDiscrete action space

Is the 89-dimension action space in the paper a MultiDiscrete action space like the original MineDojo action space, or do you simply treat it as a Discrete action space?

In addition, could you release the training code for the three task groups in the paper (or share it via my GitHub email)? It would be very helpful for baseline comparisons!

iSach commented 1 year ago

Hello,

Did you manage to reimplement the training code for the agents with PPO?

I'm getting some issues with the nested dicts despite using the multi-input policy.

mansicer commented 1 year ago

@iSach Hi. I tried to reimplement PPO based on the CleanRL code. I use Gym's AsyncVectorEnv for sampling and manually preprocess the batched Dict observations. Feel free to raise any related issues you run into.
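Roughly, the structure looks like the sketch below (not my exact code; the task id and image size are just the README example, and the tensor conversion assumes numeric observation leaves):

```python
# Rough sketch (not my exact code): run several MineDojo envs in gym's
# AsyncVectorEnv and convert the batched Dict observation into torch tensors
# before the forward pass. Task id / image size are just the README example.
import gym
import numpy as np
import torch


def make_minedojo_env():
    import minedojo  # imported inside the worker process
    return minedojo.make(task_id="harvest_milk", image_size=(160, 256))


envs = gym.vector.AsyncVectorEnv([make_minedojo_env for _ in range(4)])


def batch_obs_to_tensors(obs, device="cpu"):
    """Recursively convert a batched Dict observation into torch tensors."""
    if isinstance(obs, dict):
        return {k: batch_obs_to_tensors(v, device) for k, v in obs.items()}
    return torch.as_tensor(np.asarray(obs), device=device)


obs = envs.reset()
tensor_obs = batch_obs_to_tensors(obs)  # feed this to the policy network
```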

iSach commented 1 year ago

> @iSach Hi. I tried to reimplement PPO based on the CleanRL code. I use Gym's AsyncVectorEnv for sampling and manually preprocess the batched Dict observations. Feel free to raise any related issues you run into.

I'm not extremely familiar with running more complex environments like these (I have only run very basic envs from Gym's tutorials). Do you have a repo or a gist to look at?

My main issue is dealing with the nested dicts in the env's observation space. I tried to implement a custom features extractor based on the SimpleFeatureFusion module, but I can't get anything running at all.
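(For reference, Stable-Baselines3's MultiInputPolicy expects a single-level Dict observation space, so one workaround is to flatten the nesting with an ObservationWrapper before handing the env to SB3. The sketch below is generic and untested on MineDojo; the wrapper name and key-joining scheme are my own.)

```python
# Sketch of a workaround (untested on MineDojo): SB3's MultiInputPolicy only
# handles a single-level Dict observation space, so flatten the nesting into
# "parent/child" keys before wrapping the env with SB3.
import gym
import gym.spaces as spaces


def _flatten_space(space, prefix=""):
    out = {}
    for key, sub in space.spaces.items():
        name = f"{prefix}{key}"
        if isinstance(sub, spaces.Dict):
            out.update(_flatten_space(sub, prefix=f"{name}/"))
        else:
            out[name] = sub
    return out


def _flatten_obs(obs, prefix=""):
    out = {}
    for key, value in obs.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(_flatten_obs(value, prefix=f"{name}/"))
        else:
            out[name] = value
    return out


class FlattenDictObs(gym.ObservationWrapper):
    """Turn a nested Dict observation space into a single-level Dict."""

    def __init__(self, env):
        super().__init__(env)
        self.observation_space = spaces.Dict(_flatten_space(env.observation_space))

    def observation(self, obs):
        return _flatten_obs(obs)
```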

mansicer commented 1 year ago

> Do you have a repo or a gist to look at?

Unfortunately, not at the moment. I don't think my previous code is bug-free or worth referencing. However, I suggest starting from their provided code, such as run_env_in_loop.py, and first trying to feed the environment observations into the network. That is how I started from their example code myself.
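Concretely, by "feed the environment observations into the network first" I mean a sanity check along these lines (a sketch; the task id and image size are just the README example):

```python
# Step the env with random actions and inspect the observation structure that
# the feature network will have to consume, before writing any PPO code.
import minedojo

env = minedojo.make(task_id="harvest_milk", image_size=(160, 256))
obs = env.reset()


def describe(obs, indent=0):
    """Print the nested observation structure with types and shapes."""
    for key, value in obs.items():
        if isinstance(value, dict):
            print(" " * indent + f"{key}:")
            describe(value, indent + 2)
        else:
            shape = getattr(value, "shape", None)
            print(" " * indent + f"{key}: {type(value).__name__} {shape}")


describe(obs)

for _ in range(5):
    obs, reward, done, info = env.step(env.action_space.sample())
    print(reward, done)

env.close()
```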

iSach commented 1 year ago

> Do you have a repo or a gist to look at?

> Unfortunately, not at the moment. I don't think my previous code is bug-free or worth referencing. However, I suggest starting from their provided code, such as run_env_in_loop.py, and first trying to feed the environment observations into the network. That is how I started from their example code myself.

I tried, but I'm running into so many problems getting PPO to work with this environment that I can't put together a clean training setup. I don't understand why they released everything except the code needed to reproduce the results, especially given how few tasks are demonstrated in the example code.

elcajas commented 1 year ago

About the policy algorithm training:

- Do you start the PPO update when the PPO buffer is full or after a certain number of env steps?
- Do you use a data loader in the PPO update? What is the batch size? What are the other hyper-parameters?
- Does the value function head also update the backbone model parameters?
- Since a trimmed version of the action space is used, does the agent still use the MulticategoricalActor?
- Using the MineCLIP reward, how do you store states with their corresponding rewards? How do you calculate the rewards for the first 15 steps of the episode?
- When adding successful trajectories to the SI buffer, when do you update the mean and std of the reward?

I would appreciate it if you could clarify the points above. It would also be helpful if you could release the policy training code in the future.

mansicer commented 1 year ago

Hi @elcajas,

Since the authors have not replied to this issue, I did not continue reimplementing PPO in MineDojo. Here is what I can share: I implemented PPO based on the CleanRL version and used a vectorized env to speed up sampling. The network backbone is similar to the FeatureFusion from this repo.

> Do you start the PPO update when the PPO buffer is full or after a certain number of env steps?

After a fixed number of env steps.

> Do you use a data loader in the PPO update? What is the batch size? What are the other hyper-parameters?

I follow the CleanRL code and Table A.3 from the MineDojo paper.

> Does the value function head also update the backbone model parameters?

Yes.

> Since a trimmed version of the action space is used, does the agent still use the MulticategoricalActor?

No. Using the default discrete version of PPO is fine.

> Using the MineCLIP reward, how do you store states with their corresponding rewards? How do you calculate the rewards for the first 15 steps of the episode?

Unfortunately, I haven't tried that.

> When adding successful trajectories to the SI buffer, when do you update the mean and std of the reward?

I'm not sure I understand this question. Can you provide some details?

In general, that's just some of my experience, though I haven't worked on this recently. I sincerely hope the authors and the community can open-source some RL approaches for this benchmark. A rough sketch of the setup described above follows.
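The dummy backbone, the 512-dim features, and `num_actions=89` below are placeholders for a shape check, not values from my code:

```python
# Rough sketch of the setup described above: a shared backbone feeding a
# discrete policy head and a value head (so the value loss also updates the
# backbone), trained CleanRL-style with a fixed number of env steps per PPO
# update and hyper-parameters from Table A.3 of the MineDojo paper.
import torch
import torch.nn as nn
from torch.distributions import Categorical


class Agent(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, num_actions: int):
        super().__init__()
        self.backbone = backbone                       # shared between both heads
        self.actor = nn.Linear(feat_dim, num_actions)  # discrete policy head
        self.critic = nn.Linear(feat_dim, 1)           # value head

    def get_action_and_value(self, obs, action=None):
        feat = self.backbone(obs)
        dist = Categorical(logits=self.actor(feat))
        if action is None:
            action = dist.sample()
        return action, dist.log_prob(action), dist.entropy(), self.critic(feat).squeeze(-1)


if __name__ == "__main__":
    # Quick shape check with a dummy MLP backbone on fake 512-dim features.
    # In the real loop, collect num_steps * num_envs transitions, then run the
    # usual PPO clipped-objective epochs over minibatches (CleanRL convention).
    backbone = nn.Sequential(nn.Linear(512, 256), nn.ReLU())
    agent = Agent(backbone, feat_dim=256, num_actions=89)
    fake_obs = torch.randn(8, 512)
    action, logprob, entropy, value = agent.get_action_and_value(fake_obs)
    print(action.shape, logprob.shape, value.shape)  # torch.Size([8]) each
```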

mansicer commented 1 year ago

Also found a bug in the example code. See #11.

AsWali commented 5 months ago

@elcajas I have the same questions as you. Did you get any further?