PathmindAI / nativerl

Train reinforcement learning agents using AnyLogic or Python-based simulations
Apache License 2.0

WIP: Jd policy exports #380

Closed: johnnyL7 closed this 3 years ago

johnnyL7 commented 3 years ago

Callbacks will generate an agent that is restored from the nearest checkpoint and export it to the expected model format in each of the trials.
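For reference, a minimal sketch of what that restore-and-export step might look like, assuming an RLlib-era `Trainer` API; `export_policy_from_checkpoint` is the helper name mentioned later in this thread, everything else here is illustrative rather than the actual PR code.

```python
import os

def export_policy_from_checkpoint(trainer, checkpoint_path, export_dir):
    """Restore a trainer from a checkpoint and export its default policy.

    Sketch only: assumes RLlib's Trainer.restore() and
    Trainer.export_policy_model(), not the PR implementation.
    """
    trainer.restore(checkpoint_path)  # load weights from the checkpoint
    trainer.export_policy_model(os.path.join(export_dir, "model"))  # write the policy model
```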

ejunprung commented 3 years ago

Hm, have you had a chance to test this code btw?

It looks like it only grabs the best checkpoint in the entire experiment (not one for each trial and not the latest), and I don't think you need to restore the agent if the training hasn't been killed yet.

Also, when is export_policy_from_checkpoint called?

slinlee commented 3 years ago

I think it's fine if it uses get_best_checkpoint and only saves the 'best' from all the trials up until this point. We only need to save one and overwrite it if there's a new and better checkpoint as training continues.

Then the webapp can show the 'download policy' button when a policy is available, even while training continues.
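For context, a hedged sketch of that "best checkpoint so far" lookup, assuming Ray Tune's `ExperimentAnalysis` API; the results directory and metric name are assumptions, not code from this PR.

```python
from ray.tune import ExperimentAnalysis

# Illustrative only: find the best checkpoint across all trials so far.
analysis = ExperimentAnalysis("~/ray_results/my_experiment")  # assumed results dir
best_trial = analysis.get_best_trial(metric="episode_reward_mean", mode="max")
best_checkpoint = analysis.get_best_checkpoint(
    best_trial, metric="episode_reward_mean", mode="max"
)
# The single exported policy can simply be overwritten whenever this changes.
```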

Small things:

slinlee commented 3 years ago

@ejunprung

Also, when is export_policy_from_checkpoint called?

I think @johnnyL7 still needs to hook up the callback. Since he marked this as WIP, I'm converting it to a draft too.

ejunprung commented 3 years ago

Oh, I see. I'll check again later when Johnny is done.

slinlee commented 3 years ago

@johnnyL7 @brettskymind is this meant to go into #372 or #262 ?

johnnyL7 commented 3 years ago

@ejunprung I finished plugging it in, so it should overwrite the best checkpoint at each checkpoint_iteration (+1 iteration, just to make sure all checkpoints are properly saved before the agent is loaded).

Instead of reinstantiating the agent, I access the one available through the callbacks, which should save some time.
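Roughly what that hookup might look like, as a sketch only: it assumes RLlib's `DefaultCallbacks` API from that era, and the checkpoint frequency value and export path are made up for illustration.

```python
from ray.rllib.agents.callbacks import DefaultCallbacks

class ExportBestPolicyCallbacks(DefaultCallbacks):
    """Sketch: export the policy one iteration after each checkpoint."""

    def on_train_result(self, *, trainer, result, **kwargs):
        checkpoint_freq = 10  # illustrative; would come from the run config
        # Wait one extra iteration so the checkpoint files are fully written
        # before the agent's policy is read back and exported.
        if result["training_iteration"] % checkpoint_freq == 1:
            trainer.export_policy_model("/tmp/exported_policy")
```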

I tested the code and it works on the AGV model locally, but it runs out of memory even with 1 PBT trial. This shouldn't be a problem in the web app, where we can scale the memory, but it makes local testing harder.

As for connecting it to the webapp, we can make an experiment_progress_report.txt that summarizes the best trial, where the best checkpoint model so far can be found.
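Purely as an illustration of that report idea (the filename comes from this comment; the helper name and fields are assumptions):

```python
def write_progress_report(report_path, best_trial_dir, best_checkpoint_path):
    """Sketch: record where the best checkpoint so far can be found."""
    with open(report_path, "w") as f:
        f.write(f"best_trial: {best_trial_dir}\n")
        f.write(f"best_checkpoint: {best_checkpoint_path}\n")

# e.g. write_progress_report("experiment_progress_report.txt",
#                            best_trial.logdir, best_checkpoint)
```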

Let me know what you think, @slinlee

ejunprung commented 3 years ago

@slinlee I think it's unrelated to reward balancing. This is for grabbing policies at any checkpoint so you don't have to wait until training finishes to test the policy.

@johnnyL7 Nice! I'll try it tomorrow. I don't think there should be memory issues (even locally) so that is weird.

slinlee commented 3 years ago

@ejunprung yeah but the pull request is pointing at a really old balancing branch

ejunprung commented 3 years ago

Oh lol. Dude @johnnyL7 you should create a fresh branch.

johnnyL7 commented 3 years ago

@slinlee I am a bit confused now about which one is old: bg_balance or bg_reward_balancing. I thought bg_balance was the most recent branch Brett and I worked on to merge into test, so I opened this PR to go into bg_balance.

Please let me know where I've gone wrong along the way, and sorry for the mixup! lol

slinlee commented 3 years ago

@johnnyL7 - usually pull requests go to the default branch. So in nativerl, it will go to dev

From there we can merge or release it to 'test', staging, or prod.

If we base this PR on the rebalancing PR, then this change won't be released until the rebalancing one is done, reviewed, and then tested. I don't think these two features are dependent on each other, so basing the PR on dev will let us release this more quickly.

Let me know if that makes sense.

johnnyL7 commented 3 years ago

Ah yes, that makes sense. I guess my rationale was that bg_balance has the newest features that are going to be merged into TEST very soon, so I wanted to grab the "latest" from TEST to branch off of.

What you said makes sense because it's not guaranteed that bg_balance is ready yet, so it's better to branch from dev. I will rebase and push the commit.

slinlee commented 3 years ago

@johnnyL7 I created #383 which I think covers the changes for saving the best policy during training. If that looks good, we can replace this PR with that one.

cc @ejunprung

ejunprung commented 3 years ago

Closing b/c we separated it into a different PR. https://github.com/SkymindIO/nativerl/pull/383