IntelLabs / coach

Reinforcement Learning Coach by Intel AI Lab enables easy experimentation with state of the art Reinforcement Learning algorithms
https://intellabs.github.io/coach/
Apache License 2.0
2.32k stars 459 forks source link

integration test changes to reach the train part #254

Closed gal-leibovich closed 5 years ago

gal-leibovich commented 5 years ago

integration test changes to override heatup to 1000 steps + run each preset for 30 sec (to make sure we reach the train part)

scttl commented 5 years ago

this code change seems ok, but looks like it may have uncovered a couple of issues lurking in CartPole_DQN_BatchRL and Pendulum_HAC (see integration test output). Any idea what's going on there @galleibo-intel?

gal-leibovich commented 5 years ago

Yeah seems like it did. The reason for this PR was originally due to issues with the Rainbow DQN agent, which I didn't understand how come didn't the integration tests catch.

Now, there are two interesting things here:

  1. Those two presets did have issues (the BatchRL one only with its testing, and as for Pendulum_HAC it seems like it was broken for a while now). I have fixed both of those issues with the latest commit on this PR.
  2. One thing that I'm curious about is the BatchRL testing issue, which was not uncovered within its PR. The integration test wouldn't catch it, before this current PR, but the golden test that I have added should have failed. It is now failing on master. Any ideas what might be causing this? @scttl @anabwan
anabwan commented 5 years ago

regarding 1, looks OK to me.. regarding 2, I can investigate this issue - very interesting

gal-leibovich commented 5 years ago

While this issue is being debugged, could we approve this PR?

gal-leibovich commented 5 years ago

Yeah this seems to be the same kind of issue as in the BatchRL branch. Unfortunately, I don't know how to fish out the specific job where it was passing.

I have pushed a fix to the new issue that was revealed now, but there now seem to be new issues, this time with the build process (mujoco related?).

scttl commented 5 years ago

Yes, as discussed over on #264 I pushed updated Dockerfiles to fix the Mujoco build issue, and then kicked off a CI re-run on your latest commit. Looks like we're back in business for building but still more to fix on the unit/integration side? Bugfix whack-a-mole continues :)

scttl commented 5 years ago

looks like gym just modified some of their interfaces yesterday and we're now getting hit by: https://github.com/openai/gym/commit/f5d571a16d18d3a64d56819c76e803a89149397c

gal-leibovich commented 5 years ago

Yeah seems like that might be it. I have pushed (yet another) fix.

It is now still failing on golden_test_gym, as I'm guessing that the dockerfile is fixed to some older gym version.

scttl commented 5 years ago

That's right @galleibo-intel in the Dockerfile.gym_environment we were pinning to 0.10.5 gym release but I've just relaxed this given your most recent changes on this branch. I've manually pushed this updated dockerfile and kicked off another run on this branch.

gal-leibovich commented 5 years ago

CartPole_ClippedPPO still shows as failing due to bad address. This seems to be some CI hiccup, as the test passes for me locally. Merging this in.