eugenevinitsky / sequential_social_dilemma_games

Repo for reproduction of sequential social dilemmas
MIT License

Causal influence agents do not learn #159

Closed internetcoffeephone closed 3 years ago

internetcoffeephone commented 4 years ago

When using the script in the visible_actions branch, combined with the ray branch of causal_ppo, the agents seem to keep hovering around 0 reward.

Some caveats:

- The entropy coefficient in the run script should be corrected to be negative (see the sketch below this list).
- The causal_a3c branch of ray is not used, because it contains an error where loss_weight is not applied in a3c_policy_graph_causal.
- KL 0/NaN handling is done differently, see this file.
- The ray 0.6.4 requirement should be removed from requirements.txt.
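For context on the entropy point, here is a minimal sketch of the sign convention I believe is at issue. This is illustrative only, not the repo's actual loss code, and it assumes the entropy term is added to the loss with its coefficient, as in older RLlib-style A3C losses:

```python
# Minimal, hypothetical sketch of the sign convention; names and structure are
# illustrative, not copied from a3c_policy_graph_causal.
import tensorflow as tf

def policy_loss(log_probs, advantages, entropy, entropy_coeff):
    # Policy-gradient term: maximize expected advantage.
    pg_loss = -tf.reduce_mean(log_probs * advantages)
    # Because the entropy term is *added* with its coefficient, the coefficient
    # must be negative for entropy to act as an exploration bonus; a positive
    # value penalizes entropy and can drive the policy to collapse early.
    return pg_loss + entropy_coeff * tf.reduce_mean(entropy)
```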

Can the results of the paper be reproduced like this, or am I incorrectly using the scripts?

eugenevinitsky commented 4 years ago

Hi, as we've hopefully warned in the setup instructions, that branch is untested. It's something I've set up for personal use but haven't finished debugging. It's plausible that there are bugs there! We wrote up the env but never got around to properly testing that code out. If you do manage to get it to work please let me know!

P.S. Thank you for pointing out the entropy sign and the other mistakes in the branch.

eugenevinitsky commented 4 years ago

Do you have a branch up with a more correct implementation?

internetcoffeephone commented 4 years ago

Sure! I just made the branches, Ray and SSD. These contain the fixes mentioned above, and some other fixes that stop deprecation warnings from flooding the debug log.

Note that I changed the run script names to enforce more consistency. The influence MOA model training is started by running train_influence_moa.py. (I also implemented a config file, but I understand that it's hard to safely merge in with all these name changes.)

eugenevinitsky commented 4 years ago

Thank you! I will also be testing these hopefully this week, I'll let you know if we get any good results.

internetcoffeephone commented 4 years ago

Please let me know if you get any bad/unexpected results too, not just good results. :)

I'd like to know in which direction I should be looking to debug the behavior. As @natashamjaques mentioned, the algorithm is very sensitive to hyperparameters. Increasing the entropy reward could help deal with the premature convergence, but I lack the computational resources to do a large sweep over the many possible combinations of learning rate, entropy, and MOA reward.
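For reference, a coarse sweep over those knobs could be expressed with Ray Tune's grid_search. The sketch below is hypothetical: the "A3C" trainable, the env name, and the moa_loss_weight key are placeholders, not the repo's actual config keys.

```python
# Hypothetical Ray Tune sweep; the trainable, env name, and config keys are placeholders.
import ray
from ray import tune

ray.init()
tune.run(
    "A3C",                                                 # placeholder trainable
    config={
        "env": "HarvestEnv",                               # placeholder env name
        "lr": tune.grid_search([1e-4, 5e-4, 1e-3]),
        "entropy_coeff": tune.grid_search([-1e-3, -1e-2]),
        "moa_loss_weight": tune.grid_search([0.1, 1.0]),   # hypothetical key
    },
)
```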

There could just be a bug in a3c_policy_graph_causal, but I wouldn't know where/how to start looking.

internetcoffeephone commented 4 years ago

Upon inspection, it appears to me that the model in conv_to_fc_net_actions.py does not have the structure that is explained in the paper. [image: model structure as described in the paper]

Instead, it looks more like this: [image: model structure as actually implemented]

The last_layer property of conv_to_fc_net_actions is fed to both LSTMs in the function get_double_lstm_model in catalog.py. last_layer has shape (?, 72): it is the concatenation of 32 (the last FC layer) and 40 (others_actions).

Am I interpreting this correctly, or does the model actually work according to the upper picture? I cannot find the separate double FC layers, colored red in the upper picture.
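To make the shape bookkeeping concrete, here is my reconstruction of that concatenation. The shapes come from the description above; the Keras code itself is only illustrative, not the repo's implementation:

```python
# Illustrative reconstruction: a single (?, 72) tensor, formed by concatenating
# the 32-unit FC output with the 40-dim others_actions vector, feeds both LSTMs.
import tensorflow as tf

fc_out = tf.keras.Input(shape=(32,), name="last_fc_layer")
others_actions = tf.keras.Input(shape=(40,), name="others_actions")

last_layer = tf.keras.layers.Concatenate()([fc_out, others_actions])
print(last_layer.shape)  # (None, 72) -- shared input to both LSTMs
```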

eugenevinitsky commented 4 years ago

Hi! I have computational resource limits too. One thing I'm thinking of trying to alleviate that is some of the hyperparameter tuning algorithms in Ray; you could try some of those? I will of course let you know how the project progresses, including any negative results.

My current plan is to port the code to the newest Ray version, which should let me use TensorFlow eager to set breakpoints in the graph, and hopefully to add some tests. Also, I think you are correct about the model structure. That is an error. Good catch!

internetcoffeephone commented 4 years ago

Cheers! I'll try to run some experiments as soon as the model has the correct structure. I've gotten started on fixing this.

Just in case you weren't aware: you can use ray.init(local_mode=True) to set breakpoints and debug the model/policy graph as well, although it's less powerful than tf eager.
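For anyone else reading along, the local-mode workflow is just:

```python
# Local mode runs Ray tasks serially in the driver process, so ordinary Python
# breakpoints (pdb, IDE) hit inside the policy/model code during training.
import ray
ray.init(local_mode=True)
# ...then start training as usual from the same process.
```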

I'm also wondering what the actual performance bottlenecks are - I tried profiling but couldn't make much sense of the results. Would it be neural net evaluation or stepping through the environment? Adding GPUs doesn't seem to make much of a difference, even when adding enough workers to reach high GPU/CPU utilization.
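One crude way to separate those two costs, assuming for illustration a single-agent Gym-style env factory make_env() and a callable model_forward(obs) (both hypothetical names, not the repo's API), is to time them independently:

```python
# Hypothetical timing harness; make_env() and model_forward() are stand-ins
# for whatever the repo actually provides.
import time

def time_env_steps(make_env, n=10_000):
    env = make_env()
    obs = env.reset()
    start = time.perf_counter()
    for _ in range(n):
        obs, _, done, _ = env.step(env.action_space.sample())
        if done:
            obs = env.reset()
    return (time.perf_counter() - start) / n  # seconds per env step

def time_model_forward(model_forward, obs, n=10_000):
    start = time.perf_counter()
    for _ in range(n):
        model_forward(obs)
    return (time.perf_counter() - start) / n  # seconds per forward pass
```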

internetcoffeephone commented 4 years ago

> I've gotten started on fixing this.

Disregard this, I see you've already gotten started. The branch seems to be missing the orientation refactor among other things though - is that intentional?

eugenevinitsky commented 4 years ago

I'm actually not sure what the performance bottlenecks are. The missing orientation refactor is an accident; I'll merge it in shortly (it shouldn't change anything, it just makes the env behave more like you'd expect with regard to commands like up or down). It's not 100% complete yet, but the current branch is almost there.

eugenevinitsky commented 4 years ago

If you take a look and spot any bugs let me know! The current model I have in the train_moa script contains the fix you brought up.

eugenevinitsky commented 4 years ago

@internetcoffeephone it's now in a working state. That's not to say the hyperparameters are tuned or that there aren't bugs, but I think it's pretty close. It's also much more self-contained now: you don't need to install a custom branch of Ray, just upgrade to ray 0.7.5.

natashamjaques commented 3 years ago

FYI there is now a working implementation here: https://github.com/eugenevinitsky/sequential_social_dilemma_games/pull/179/files