Closed slinlee closed 2 years ago
I am reopening #509 so that we can test it.
This shouldn't be merged yet. https://test.devpathmind.com/experiment/7277 shows that training with reward terms needs another operator:
    base_env.send_actions(actions_to_send)
  File "/app/conda/lib/python3.7/site-packages/ray/rllib/env/base_env.py", line 421, in send_actions
    obs, rewards, dones, infos = env.step(agent_dict)
  File "/app/work/pathmind_training/environments.py", line 281, in step
    self.term_contributions_dict.values()
TypeError: unsupported operand type(s) for +: 'int' and 'nativerl.Array'
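For context, this kind of TypeError is what Python raises when the built-in `sum()` is called on objects that only support addition with their own type. `sum()` starts from the int `0`, so the first operation is `0 + Array`. The sketch below reproduces that under two assumptions that the partial trace does not confirm: that line 281 sits inside a `sum(...)` call, and that `nativerl.Array` defines `__add__` but not `__radd__`. `FakeArray` and the term names are hypothetical stand-ins, not part of nativerl.

```python
from functools import reduce
import operator


class FakeArray:
    """Hypothetical stand-in for nativerl.Array: supports Array + Array only."""

    def __init__(self, values):
        self.values = list(values)

    def __add__(self, other):
        # Returning NotImplemented lets Python try the other operand's __radd__.
        if not isinstance(other, FakeArray):
            return NotImplemented
        return FakeArray(a + b for a, b in zip(self.values, other.values))


# Hypothetical per-term contributions, keyed by made-up term names.
term_contributions = {
    "term_0": FakeArray([1.0, 2.0]),
    "term_1": FakeArray([0.5, 0.5]),
}

# sum() begins with the int 0, so it evaluates `0 + FakeArray` first.
# int cannot add a FakeArray, and FakeArray has no __radd__, so this fails.
try:
    sum(term_contributions.values())
    failed = False
    message = ""
except TypeError as err:
    failed = True
    message = str(err)

# Possible workaround: fold the values pairwise so no int is ever involved.
total = reduce(operator.add, term_contributions.values())
```

If this is indeed the cause, the alternative to changing the call site would be adding a reflected-add operator (`__radd__`) on the array type, so that `0 + Array` is defined.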
@slinlee can you please post the full stack trace? I want to understand which line fails
@maxpumperla of course, here you go https://gist.github.com/slinlee/0511f0e981e47d6261cf3249c3aa6bcf
Including bg_multi_mini's reward balancing changes made training with reward functions perform worse: https://test.devpathmind.com/sharedExperiment/7273
Using reward terms does train, but the results are also poor: https://test.devpathmind.com/sharedExperiment/7274
Throughput is expected to be 60+, but it is in the 10-30 range.