DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

Tensorboard log file not generated #108

Closed cisprague closed 4 years ago

cisprague commented 4 years ago

Describe the bug: In Stable Baselines, if I train sac.SAC with tensorboard_log='./logs/', I get a Tensorboard log in ./logs/SAC_1/. But in Stable Baselines 3, with the same keyword argument, a Tensorboard log is not generated.

Code example: If I run my training script with

    import stable_baselines as sb

    # soft actor critic
    agent = sb.sac.SAC(
        sb.sac.MlpPolicy,
        env,
        tensorboard_log='./logs/',
    )

I get a log at ./logs/SAC_1/events.out.tfevents.1594902758.machinename.

But if I run the equivalent in this updated version, like so

    import stable_baselines3 as sb

    # soft actor critic
    agent = sb.sac.SAC(
        sb.sac.MlpPolicy,
        env,
        tensorboard_log='./logs/',
    )

the previous log generation doesn't happen.

System Info

araffin commented 4 years ago

Hello,

But, in Stable Baselines 3, nothing happens.

What do you mean?

cisprague commented 4 years ago

A TF event log isn't created after starting learning.

araffin commented 4 years ago

Then please fill out the issue template completely.

araffin commented 4 years ago

Using latest (version in master) SB3 (0.8.0a4):

    from stable_baselines3 import SAC
    model = SAC('MlpPolicy', 'Pendulum-v0', tensorboard_log='/tmp/sb3/').learn(10000)

and then

    tensorboard --logdir /tmp/sb3/

I can visualize all the graphs... and the tf event file is there.

Note: in SB3, the tf event file is generated only when .learn() is called. The graph visualization was disabled because it was not easy to read (see #30).
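
As a quick sanity check after calling .learn(), one can look for event files under the log directory. The following is a minimal sketch using only the standard library. The helper name find_event_files is hypothetical (it is not part of SB3), and it assumes event files keep the events.out.tfevents prefix seen earlier in this thread.

```python
from pathlib import Path


def find_event_files(log_dir):
    """Return sorted paths of TensorBoard event files found under log_dir."""
    root = Path(log_dir)
    if not root.is_dir():
        # Nothing was written (or the path is wrong): report an empty result.
        return []
    # Search recursively, since runs are stored in subfolders such as SAC_1/.
    return sorted(str(p) for p in root.rglob("events.out.tfevents.*"))


if __name__ == "__main__":
    # Hypothetical usage: the directory passed as tensorboard_log above.
    print(find_event_files("/tmp/sb3/"))
```

An empty list after training suggests that no writer was active, rather than a problem with the TensorBoard viewer itself.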

cisprague commented 4 years ago

Hi, again. I'm now using PPO and I get the same result. Could there be any other reason that the log file is not generated? I am able to get callbacks working for evaluation and saving, but I still can't get the Tensorboard log to be generated.

araffin commented 4 years ago

Did you try with the code I wrote? (at least as a sanity check)

cisprague commented 4 years ago

Did you try with the code I wrote? (at least as a sanity check)

Yup, I don't get a Tensorboard file with that either. :/ Could it be something to do with other libraries?

cisprague commented 4 years ago

Got it! I needed to uninstall my current version, then reinstall with the [extra] tag. Thanks for your help.
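
For anyone hitting the same symptom, it may help to check whether the tensorboard package is importable in the environment that runs the training script. This is a minimal sketch; is_installed is a hypothetical helper name, and the install hint in the message simply repeats the [extra] option mentioned above.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("tensorboard"):
        print("tensorboard is available")
    else:
        print("tensorboard is missing; try: pip install stable-baselines3[extra]")
```

Running this with the same interpreter as the training script matters, since a different virtual environment can give a different answer.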

johanmonard commented 3 years ago

Hello, your solution does not sort the issue on my side.

I was running a training script (SB3/PPO) on Google Colab (installing SB3 without the [EXTRA] tag) and everything was working perfectly. Then I migrated to a Microsoft Azure VM, and it is now impossible to get the tensorboard logs written to the tensorboard_log directory, although my script is exactly the same. I tried installing SB3 with the [EXTRA] tag but it did not help.

Here are the relevant parts of the script which is located in the root_path folder:

    root_path = os.sys.path[0]
    logs_path = os.path.join(root_path, 'logs')
    logs_path_eval = os.path.join(root_path, 'logs_eval')

    env = make_vec_env('my_custom_env', n_envs=16, env_kwargs=env_kw)
    eval_env = gym.make('my_custom_env', **env_kw)

    model = PPO(
        policy=CnnPolicy,
        env=env,
        verbose=1,
        tensorboard_log=logs_path,
        create_eval_env=True,
        **ppo_kw,
    )
    model.learn(
        total_timesteps=100000,
        tb_log_name='agent_test',
        eval_env=eval_env,
        eval_freq=10000,
        n_eval_episodes=10,
        eval_log_path=logs_path_eval,
        reset_num_timesteps=False,
    )

Any idea?

Miffyli commented 3 years ago

@johanmonard

You are storing logs in a directory under os.sys.path[0], not the local directory. I guess it is common for . to be the first item in sys.path on Linux, but there is no rule that says it has to be so. On my Windows 10, os.sys.path[0] is C:\\Users\\Anssi\\AppData\\Local\\Programs\\Python\\Python37\\Scripts.

Change the first line to root_path = "." or root_path = os.getcwd() and that should do the trick.
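
To make the log location explicit regardless of how the script is launched, the path can be resolved to an absolute one and printed before training. This is a minimal sketch; resolve_log_dir is a hypothetical helper, not an SB3 function.

```python
import os


def resolve_log_dir(name="logs", base=None):
    """Build an absolute log directory path.

    base defaults to the current working directory (os.getcwd()).
    """
    base = os.getcwd() if base is None else base
    return os.path.abspath(os.path.join(base, name))


if __name__ == "__main__":
    # Print the path so it is obvious where the logs should appear.
    print(resolve_log_dir("logs"))
```

The printed value could then be passed as tensorboard_log, which removes any guesswork about where the event files are expected.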

johanmonard commented 3 years ago

@Miffyli

Still the same problem. Note that even with os.sys.path[0], the other folders and paths I am using work perfectly:

    agents_path = os.path.join(root_path, 'agents')
    logs_path = os.path.join(root_path, 'logs')
    logs_path_eval = os.path.join(root_path, 'logs_eval')

I save agents periodically in agents_path and it works, and I use logs_path_eval in model.learn(eval_log_path=logs_path_eval) and it does save the best model there, so there must be something else, but I don't get any error message...

Miffyli commented 3 years ago

Ah, sorry that I missed this!

Hmm I would confirm that the code first works on local machines (outside colab). If that does not work it might be a bug somewhere (or something else in the code interferes with things). If the directory were not writeable it should throw an exception. If code works locally then something in Azure is interfering, and that sadly goes beyond these issues :/.
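
One way to rule out a permissions problem directly is to try creating a temporary file in the log directory. This is a minimal sketch with a hypothetical helper name, is_writable; it only tests file creation and says nothing about what SB3 itself does.

```python
import os
import tempfile


def is_writable(directory):
    """Return True if a temporary file can be created inside directory."""
    try:
        os.makedirs(directory, exist_ok=True)
        # The file is deleted automatically when the with-block exits.
        with tempfile.TemporaryFile(dir=directory):
            pass
        return True
    except OSError:
        return False
```

If this returns True for the tensorboard_log path and no event file appears, the cause is likely elsewhere than file permissions.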

araffin commented 3 years ago

What version of tensorboard do you have?

nmerrill67 commented 3 years ago

The issue is here. SB3 tries to import tensorboard, and if it is not installed, it just silently ignores it. There should definitely be a log warning if the import failed and the user tries to write a tensorboard log. Just make sure tensorboard is actually installed on your system.

Miffyli commented 3 years ago

Good catch! User definitely should be informed about this (tensorboard_log_path is not None but SummaryWriter is None).

@araffin Maybe something you could fix in #469 as it touches the same parts of code?

araffin commented 3 years ago

@araffin Maybe something you could fix in #469 as it touches the same parts of code?

done ;) It will now throw an ImportError.
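
The difference between a silently ignored optional import and an explicit error can be sketched as follows. This is an illustration only, not SB3's actual code: make_writer and its arguments are hypothetical names, and the import path for SummaryWriter is an assumption about where the writer class comes from.

```python
try:
    # Assumed import location of the writer class; it may be absent.
    from torch.utils.tensorboard import SummaryWriter
except ImportError:
    # Optional dependency is missing; remember that instead of failing here.
    SummaryWriter = None


def make_writer(log_path, writer_cls=SummaryWriter):
    """Return a writer for log_path, or None when logging is not requested.

    Raises ImportError if a log path is given but no writer class is available.
    """
    if log_path is None:
        return None
    if writer_cls is None:
        raise ImportError(
            "A tensorboard log path was given but tensorboard is not installed."
        )
    return writer_cls(log_path)
```

Failing loudly only when a log path is actually requested keeps tensorboard optional for users who do not need it.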