Closed ohernpaul closed 2 years ago
Hi @ohernpaul, This is a lot to take in, so let me make sure I understand you correctly before moving forward. From what I've read, you are launching sequential training steps in one process without shutting down the previous training process. You are doing this by creating RunOptions instances with updated information you want to use for your next phase of training.
Does that sound accurate to you?
Chris, thanks for the quick response. Yes, sequential training sessions where each session uses initialize-from to start the new phase with knowledge from the previous phase. I hacked together this automation pipeline using the RunOptions and run_training suggestions from my previous post about hyperparameters. My guess is that the session termination is not cleaning up correctly.
Outline:
I can provide more info if needed!
I've been watching training progress from phase to phase; the agent seems to be learning correctly, so the problem is probably related only to TensorBoard (the StatsWriter?).
Update: my temporary fix is to copy the previous phase's result directory to a new location before the next phase starts training and overwrites the result values.
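For reference, the workaround is just a directory copy between phases. A minimal sketch (the paths and run_id here are placeholders standing in for the ones in my script):

```python
import shutil

# Placeholder paths: back up the finished phase's results before the next
# phase's stats writer can touch them.
results_dir = "results/reproduce_test/"
backup_dir = "results/reproduce_test_copy/"
run_id = "3dball_ppo_1337"

# dirs_exist_ok (Python 3.8+) lets repeated phases refresh an earlier backup.
shutil.copytree(results_dir + run_id, backup_dir + run_id, dirs_exist_ok=True)
```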
Cool, thanks for clarifying. I'm talking about this with the team. We will get back to you soon.
I reproduced the issue using 3DBall, if that helps.
I also stepped through with a debugger and found that env_manager.close() connects to a Unity environment and brain before closing. The full repro script:
```python
import shutil
import time
from typing import Any, Dict

from mlagents.trainers.cli_utils import load_config
from mlagents.trainers.learn import run_cli
from mlagents.trainers.settings import RunOptions


class PhaseLauncher:
    def __init__(self):
        ##########################################################
        # Workspace layout (Windows paths, adjust as needed)
        self.wkspace_dir = 'path\\to\\wkspace\\'
        self.mlagents_dir = self.wkspace_dir + 'ml-agents-master\\'
        self.config_dir = self.mlagents_dir + 'config\\'
        self.builds_dir = self.mlagents_dir + 'builds\\'
        self.results_dir = self.mlagents_dir + 'results\\reproduce_test\\'
        self.results_dir_cpy = self.mlagents_dir + 'results\\reproduce_test_copy\\'
        ##########################################################
        # Per-phase state
        self.phase = 0
        self.run_id = '3dball_ppo'
        self.quality = 1
        self.height = 300
        self.width = 300
        self.no_graphics = False
        self.use_init_from = False
        self.use_env = True
        self.nb_envs = 5
        self.do_inference = False
        self.init_from = ''
        self.loop_counter = 0
        self.seed = 0
        self.phase_config_dir = self.config_dir + 'debug_phases\\'
        self.runs_array = []

    ##########################################################
    def Start(self):
        # Loop to represent phases: each iteration builds a RunOptions
        # object and launches one full training session.
        for i in range(2):
            print("---Phase Start---")
            if self.loop_counter == 0:
                self.seed = 1337
                self.run_id = self.run_id + '_' + str(self.seed)
            else:
                self.seed = 101
                # Swap the seed suffix on the base run_id and initialize the
                # new phase from the previous run.
                prev_run_id = self.run_id
                self.run_id = self.run_id.split('_')[0] + '_' + str(self.seed)
                self.use_init_from = True
                self.init_from = prev_run_id
            self.GetRunOptions(self.phase_config_dir + '3DBall.yaml')
            self.loop_counter += 1

    ##########################################################
    def GetRunOptions(self, phase_config_path):
        """
        RUN OPTIONS SECTION
        (most pulled from settings.py in mlagents/trainers)
        """
        print("---Building Run Options---")
        # Define a config dict with the same top-level keys the CLI uses.
        configured_dict: Dict[str, Any] = {
            "checkpoint_settings": {},
            "env_settings": {},
            "engine_settings": {},
            "torch_settings": {},
        }
        # Fill the dict with params defined in the yaml file.
        configured_dict.update(load_config(phase_config_path))
        # Fill what would be CLI args with values defined in this script.
        configured_dict["checkpoint_settings"]['run_id'] = self.run_id
        configured_dict["checkpoint_settings"]['results_dir'] = self.results_dir
        configured_dict["checkpoint_settings"]['force'] = True
        configured_dict["checkpoint_settings"]['inference'] = self.do_inference
        if self.use_init_from:
            configured_dict["checkpoint_settings"]['initialize_from'] = self.init_from
        configured_dict["engine_settings"]['width'] = self.width
        configured_dict["engine_settings"]['height'] = self.height
        configured_dict["engine_settings"]['quality_level'] = self.quality
        configured_dict["engine_settings"]['no_graphics'] = self.no_graphics
        configured_dict["env_settings"]['env_path'] = self.builds_dir + '3DBall'
        configured_dict["env_settings"]['num_envs'] = self.nb_envs
        final_runoptions = RunOptions.from_dict(configured_dict)
        self.RunTraining(final_runoptions)

    ##########################################################
    def RunTraining(self, run_options):
        # run_training(self.seed, run_options)
        run_cli(run_options)
        # Back up this phase's results before the next phase can touch them.
        shutil.copytree(self.results_dir + self.run_id,
                        self.results_dir_cpy + self.run_id)
        time.sleep(2)


if __name__ == "__main__":
    pl = PhaseLauncher()
    pl.Start()
```
Hey @ohernpaul, We have logged this issue internally as MLA-812 and will update this thread once work is complete on it. Thank you for your very detailed feedback!
This issue has been automatically marked as stale because it has not had activity in the last 28 days. It will be closed in the next 14 days if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 42 days. If this issue is still valid, please ping a maintainer. Thank you for your contributions.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Sorry in advance for all the long logs. And thank you for any help/insights!
Describe the bug
Important: I am running a custom script patched together from mlagents Python functions. The script runs in an IDE (Spyder) and works without the issues described below when the phases are launched from different kernels (i.e. kernel 1 finishes phase 1 -> kill kernel 1 -> start kernel 2 on phase 2 ==> no TensorBoard graph issues).
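In case it's useful, the kernel-per-phase behavior can be emulated inside a single script by giving each phase its own OS process, so any module-level state in mlagents (stats writers, loggers) is rebuilt from scratch. A rough sketch of that idea, assuming each phase's config dict (the hypothetical `phase1_dict`/`phase2_dict` below) is built the same way as in the PhaseLauncher script and is picklable:

```python
import multiprocessing as mp

def run_phase(run_options_dict):
    # Import inside the child so each phase gets a fresh copy of mlagents'
    # module-level state (loggers, stats writers, etc.).
    from mlagents.trainers.settings import RunOptions
    from mlagents.trainers.learn import run_cli
    run_cli(RunOptions.from_dict(run_options_dict))

if __name__ == "__main__":
    for phase_options in [phase1_dict, phase2_dict]:  # hypothetical config dicts
        p = mp.Process(target=run_phase, args=(phase_options,))
        p.start()
        p.join()  # wait for this phase to finish before starting the next
```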
I have been building out what I call a "PhaseLauncher", an automated way of launching training sessions that use different config files (yaml) sequentially. It's primarily used so I don't have to manually launch runs with initialize-from.
The code works in general, but I am seeing some issues with how TensorBoard is updating the graphs. As the first phase (no initialize-from) completes, the graph looks perfectly normal, but as the new phase's graph is updated (through summary freq), the old graph gets overwritten. See picture below.
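To illustrate what I'd expect instead: TensorBoard keeps runs separate as long as each phase writes its events into its own subdirectory. A quick illustration with torch's SummaryWriter (just to show the directory convention, not mlagents internals; paths mirror my setup):

```python
from torch.utils.tensorboard import SummaryWriter

# Two phases writing under distinct run_id folders should show up as two
# separate curves in TensorBoard, not overwrite each other.
for run_id, bias in [("3dball_ppo_1337", 0.0), ("3dball_ppo_101", 1.0)]:
    writer = SummaryWriter(log_dir=f"results/{run_id}/3DBall")
    for step in range(100):
        writer.add_scalar("Environment/Cumulative Reward", bias + step * 0.01, step)
    writer.close()
```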
I have done some debugging by stepping through the entire process and believe that the issue is within SubprocessEnvManager.
Below are some logs from the training session. The first glaring issue: in the first run, the program establishes a connection to a brain based on num-envs, the hyperparameters from RunOptions are printed (once), and training starts. When training ends, it first says it connected to a new brain (again), then shuts down the env. Reconnecting before run_training or run_cli has been called again seems fishy.
Next, I create a new RunOptions object from a new config file, define initialize-from with the previous run_id, then start training again. The issues I see in these logs: connection to a Unity env, connection to a new brain, env shutdown, connection to a Unity env again, and then run_cli prints two duplicates of the hyperparameters from the phase-2 config file. The number of duplicates exactly matches the number of phases, so phase 3 prints three duplicates of the hyperparameters.
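One guess on my part for the duplicated printouts (an assumption, not something I've confirmed in the mlagents source): if each run_cli call installs another handler on the same module-level logger, every message gets emitted once per phase. Plain Python logging reproduces exactly that pattern:

```python
import logging
import sys

# Module-level logger: it survives across "phases" in the same process.
logger = logging.getLogger("mlagents.trainers")

def fake_phase():
    # Each phase adds one more handler to the same logger...
    logger.addHandler(logging.StreamHandler(sys.stdout))
    logger.setLevel(logging.INFO)
    logger.info("Hyperparameters for this phase")

fake_phase()  # printed once
fake_phase()  # printed twice
fake_phase()  # printed three times - matching the per-phase duplication I see
```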
In general it seems to work, but my fear is that the .onnx models are being overwritten too. I have not tested this yet.
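A quick way to check that worry without inspecting weights would be to hash each phase's exported model from the backup copies and compare. A sketch, assuming the usual results/<run_id>/<behavior>.onnx export layout:

```python
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

# Backup copies made between phases; filenames assumed to follow the
# <behavior>.onnx export layout.
phase1 = Path("results/reproduce_test_copy/3dball_ppo_1337/3DBall.onnx")
phase2 = Path("results/reproduce_test_copy/3dball_ppo_101/3DBall.onnx")

# Different hashes => phase 2 did not silently overwrite phase 1's model.
print(sha256(phase1) == sha256(phase2))
```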
Console logs / stack traces
PHASE 1 Launch:
PHASE 1 Termination:
PHASE 2 Launch:
Screenshots
Environment: