edbeeching / godot_rl_agents

An Open Source package that allows video game creators, AI researchers and hobbyists the opportunity to learn complex behaviors for their Non Player Characters or agents
MIT License
902 stars 63 forks

Ready for testing 🧪 Multi-policy training support #181

Closed: Ivan-267 closed this 3 months ago

Ivan-267 commented 5 months ago

Adds support for training multiple policies with RLlib.

Plugin PR: https://github.com/edbeeching/godot_rl_agents_plugin/pull/40 Example env PR: https://github.com/edbeeching/godot_rl_agents_examples/pull/30
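For context, multi-policy training in RLlib hinges on a policy-mapping function that routes each agent ID to the policy that should control (and train on) it. The sketch below illustrates the idea in plain Python; the agent and policy names are hypothetical examples, not taken from this PR:

```python
# Minimal illustration of RLlib-style multi-policy routing.
# Agent and policy names here are hypothetical examples.

def policy_mapping_fn(agent_id: str) -> str:
    """Map an agent ID like 'plane_0' or 'turret_1' to a policy name."""
    # Agents are assumed to be named '<policy>_<index>'.
    return agent_id.rsplit("_", 1)[0] + "_policy"

# Each policy trains only on experience from the agents mapped to it.
agents = ["plane_0", "plane_1", "turret_0"]
assignments = {aid: policy_mapping_fn(aid) for aid in agents}
```

With a mapping like this, two agent types in one Godot env can learn separate behaviors while sharing the same rollout workers.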

TODO:

Ivan-267 commented 5 months ago

I've done a little testing with some of my previous envs and Jumper Hard using an older plugin version. It seemed to work with multiagent set to false in the yaml config (the number of envs per worker may need to be adjusted manually), and also with it set to true (not intended for single-agent envs, since individual agents are deactivated in RLlib after done = true, but this should not cause errors thanks to the compatibility code in GDRLPettingZooWrapper). SB3 seems to work properly after these changes, but so far I've only tested it on a modified version of the multi-agent env (made into a single-agent-compatible version). Further testing is always welcome, especially on Linux and with Sample Factory.
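The compatibility behavior mentioned above (keeping things working when some agents finish before others) can be sketched roughly as padding results for inactive agents. This is an illustrative approximation only, not the actual GDRLPettingZooWrapper code:

```python
# Rough sketch: fill in zero observations, zero reward, and done=True
# for agents that are no longer active, so a multi-agent interface
# stays consistent. This approximates the idea; it is not plugin code.

def pad_step_results(obs, rewards, dones, all_agents, obs_size):
    """Pad per-agent step dicts so every known agent has an entry."""
    for agent in all_agents:
        if agent not in obs:
            obs[agent] = [0.0] * obs_size
            rewards[agent] = 0.0
            dones[agent] = True
    return obs, rewards, dones

obs, rew, done = pad_step_results(
    {"agent_0": [0.1, 0.2]}, {"agent_0": 1.0}, {"agent_0": False},
    all_agents=["agent_0", "agent_1"], obs_size=2,
)
```

Padding like this lets a single-agent env pass through a multi-agent code path without raising key errors.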

LSTM/Attention wrappers work (they show a deprecation warning, so accessing them may change in newer versions of RLlib), but we can't use them for exporting yet, since the state data wouldn't be fed in.

One thing I found that doesn't work well is enabling some exploration options with PPO; one that did work was RE3 with the framework set to Tensorflow rather than Torch. Curiosity needs discrete or multidiscrete actions, but it didn't seem to work even when I switched the env to discrete actions. I think it might be related to the tuple action space, which may not be supported by some of the exploration code.
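If the tuple action space is indeed the culprit, one possible workaround would be flattening a tuple of discrete sub-spaces into a single multi-discrete layout. A hypothetical pure-Python sketch of that mapping (not code from this PR or from gymnasium):

```python
# Hypothetical sketch: represent a Tuple(Discrete(n1), Discrete(n2), ...)
# as a single MultiDiscrete-like list of sizes, and split a flat action
# back into per-sub-space actions. Not actual godot_rl_agents code.

def flatten_tuple_space(sizes):
    """Tuple of discrete sizes -> one multi-discrete size list."""
    return list(sizes)

def split_action(flat_action, sizes):
    """Validate and split a flat multi-discrete action back into a tuple."""
    assert len(flat_action) == len(sizes)
    for a, n in zip(flat_action, sizes):
        assert 0 <= a < n, "action out of range for its sub-space"
    return tuple(flat_action)

print(split_action([1, 4], [3, 5]))  # -> (1, 4)
```

A multi-discrete space of this shape is exactly what curiosity-style exploration modules typically expect.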

[!WARNING] Edit: With the current script, the onnx exported from RLlib doesn't output just the action means like our SB3 setup does, so the output size is doubled, and an exported onnx with more than one action won't work correctly. Not yet sure how to solve this so that onnx export works from both SB3 and RLlib despite the different sizes.
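One way to handle the size mismatch would be to detect when the onnx output is twice the action size (means concatenated with log-stds) and slice off the means. This is a hedged sketch under the assumption that the means come first in the doubled output; it is not the plugin's actual handling:

```python
# Sketch: accept onnx outputs that are either just action means
# (SB3-style) or means concatenated with log-stds (doubled size).
# The layout (means first, then log-stds) is an assumption here.

def extract_action_means(output, num_actions):
    """Return the action means regardless of which export produced them."""
    if len(output) == num_actions:
        return list(output)                 # means only
    if len(output) == 2 * num_actions:
        return list(output[:num_actions])   # take the means half
    raise ValueError("unexpected output size: %d" % len(output))

print(extract_action_means([0.5, -0.2, 1.0, 0.1], num_actions=2))  # -> [0.5, -0.2]
```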

Edit2: I've just updated the plugin to handle the case above.