eugenevinitsky / sequential_social_dilemma_games

Repo for reproduction of sequential social dilemmas
MIT License

train_baseline might not be working in the cleanup environment #144

Open joonleesky opened 5 years ago

joonleesky commented 5 years ago

All of the agents' rewards saturate to 0 around episode ~390 when training with the default configuration of the cleanup environment in train_baseline.py, and they stay at 0 reward until the end of training.

eugenevinitsky commented 5 years ago

Sorry for the difficulties you've been having, and thank you for trying out the baselines! There are three aspects to this. One, 390 episodes is not a lot; if you look at the paper the training curves come from, they take many, many more iterations: https://arxiv.org/abs/1810.08647. Two, some of Ray's defaults, such as the way the underlying value functions are implemented, may differ from the paper, so simply taking the hyperparameters from the paper may not be sufficient. Three, in my understanding, most of the time the scores in Cleanup actually come out as zero, and a score above zero is the exception rather than the rule. Is this third point correct @natashamjaques?

joonleesky commented 5 years ago

Thank you so much for your kind reply ^_^ !! Still, I have some concerns: even though I have trained for around 20,000,000 time steps (20,000 episodes), all of my agents' rewards have stayed at 0 since episode 390 without any change.

For the second part, I will look through the default parameters in Ray and will notify you if I find any critical parameters that affect performance.

eugenevinitsky commented 5 years ago

One thing I would suggest trying is just disabling the hyperparameters in that file and using the default hyperparameters in Ray, with a possible sweep over the learning rate and the training batch size. To my mind those are usually the most critical hyperparameters.
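For anyone else trying this, a minimal sketch of that kind of sweep might look like the following. It assumes the older Ray/RLlib API that was current around the time of this thread; the environment name "cleanup_env" and the grid values are illustrative placeholders, not settings from train_baseline.py.

```python
import ray
from ray import tune

# Sketch: keep Ray's A3C defaults and sweep only the learning rate and
# training batch size, as suggested above. "cleanup_env" is a placeholder
# for however the Cleanup env is registered in your setup; the grid values
# are illustrative, not tuned.
ray.init()
tune.run(
    "A3C",
    stop={"timesteps_total": int(2e8)},
    config={
        "env": "cleanup_env",
        "num_workers": 4,
        "lr": tune.grid_search([1e-3, 1e-4, 1e-5]),
        "train_batch_size": tune.grid_search([500, 2000, 8000]),
    },
)
```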

eugenevinitsky commented 5 years ago

This isn't super unusual; without some initial luck in agents deleting enough of the initial waste cells, they never learn to get any apples. Increasing the magnitude of the entropy coefficient (making it more negative in this setup), which should encourage exploration, may also help with the 0 score.
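If you want to fold that into the sweep, something like the snippet below could work; the specific values are placeholders, and the sign follows the convention described above, where a more negative entropy_coeff gives a stronger exploration bonus.

```python
from ray import tune

# Placeholder values for the entropy bonus; per the comment above, in this
# setup a more negative entropy_coeff corresponds to stronger exploration.
entropy_sweep = {"entropy_coeff": tune.grid_search([-1e-4, -5e-4, -1e-3])}
```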

natashamjaques commented 5 years ago

Hey, just to chime in: actually, agents collectively scoring 0 reward in Cleanup is very typical. When I was using these environments at DeepMind, it was understood that 0 was the normal score for A3C agents. If you check out the Inequity Aversion paper (https://arxiv.org/pdf/1803.08884.pdf), they report an average collective return of near 0 for 5 A3C agents in Cleanup.

I know that my paper reports a higher score; this was pretty atypical and actually just because I did a super extensive hyperparameter sweep for the baseline. But you can consider 0 the expected score.

Agents have difficulty solving this task because of the partial observability (they can't see the apples appear when they clean the river), the delayed rewards (they have to clean the river, walk to the apple patch, and obtain apples before they can build that association), and because even if one agent learns to clean the river, the other agents will exploit it so much that they consume all the apples before it can harvest anything resulting from its cleaning. So it will eventually un-learn the cleaning behavior.

Hope this helps!

joonleesky commented 5 years ago

Thanks for your comments! They help a lot. I just found it a bit odd that the causal influence paper's A3C baseline in Cleanup was around 50~100 and the Inequity Aversion paper's baseline seems to be around 10~50, rather than 0 all the way.

I am a beginner in MARL and hoping to try some of my ideas on these social dilemmas. The phrase "super extensive hyperparameter sweep" sounds very scary to me, since I'm working on my personal computer.

By the way, I loved the way you encoded the intrinsic motivation, and thank you for open-sourcing the reproduction of the environments.

eugenevinitsky commented 5 years ago

Hi, we've found a few bugs that may be contributing to your difficulties reproducing the results and will ping you here as soon as they're resolved; apologies! Additionally, you may want to try focusing on Harvest; I've found it to be less sensitive to hyperparameters.

joonleesky commented 5 years ago

Actually, I've had some fun experimenting with the Cleanup and Harvest environments. With A3C, I was able to more or less reproduce the results. Below are my results from re-implementing the paper Inequity aversion improves cooperation in intertemporal social dilemmas.

CleanUp - Original Paper [image]

CleanUp - My Experiment [image]

However, one thing that confused me is that the collective returns soar much earlier than reported in the paper. I thought "timesteps" referred to the timesteps of a single agent. In this multi-agent setting, is "timesteps" equal to the single-agent timesteps multiplied by the number of agents, or might this be the result of one of the bugs? Or maybe I'm just good at hyperparameter tuning? LOL

Thank you for your kind replies!!

eugenevinitsky commented 5 years ago

Hi @natashamjaques, I think you might be able to answer this question best?

eugenevinitsky commented 5 years ago

Hi @joonleesky, I'm pretty sure that timesteps is the total number of environment steps. It's perfectly possible that you've just found a better set of hyperparams. Would you mind posting what those hyperparams are so that I can (1) investigate the issue and (2) put those hyperparams into the project?
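One quick way to check this yourself is to compare the trainer's reported step count against episodes times episode length. A rough sketch, assuming the standard RLlib result keys (timesteps_total, episodes_total, episode_len_mean), where result is one trainer.train() output:

```python
def step_counting_mode(result, num_agents):
    """Rough check: does timesteps_total track env steps or agent steps?

    `result` is a single trainer.train() result dict; assumes the standard
    RLlib keys timesteps_total, episodes_total, and episode_len_mean.
    """
    env_steps = result["episodes_total"] * result["episode_len_mean"]
    reported = result["timesteps_total"]
    if abs(reported - env_steps) <= abs(reported - env_steps * num_agents):
        return "env steps"
    return "agent steps (env steps * num_agents)"
```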

eugenevinitsky commented 5 years ago

Hi @joonleesky, I'd still love to know what hyperparams you wound up using!

joonleesky commented 5 years ago

Hello @eugenevinitsky, sorry for the late reply! I was in the middle of the semester. I've tested the inequity aversion model again with 5 random seeds and the same hyperparameters, but the soaring performance was observable in only 1 out of the 5.

Actually, I've had some fun experimenting with reward schemes along the lines of https://gist.github.com/dfarhi/66ec9d760ae0c49a5c492c9fae93984a and some other methods.

Even though the performance was somewhat unstable, I will commit the training scripts with hyperparameters, the model scripts, and the results soon, to help anyone who needs them.

It was really fun to watch how training progressed, as shown below. Thank you!

[GIF: training in progress]

LUCKYGT commented 3 years ago

Hello @joonleesky, I've run into the same problem as you, but I couldn't find your commits. Could you share a link to your contributions? I would be very grateful.