Eclectic-Sheep / sheeprl

Distributed Reinforcement Learning accelerated by Lightning Fabric
https://eclecticsheep.ai
Apache License 2.0

SheepRL Dreamer v3 - ValueError #309

Open ogulcankertmen opened 2 months ago

ogulcankertmen commented 2 months ago

I ran "sheeprl exp=dreamer_v3 env=gym env.id=CartPole-v1" and got the following error:

"ValueError: you tried to log -1 which is currently not supported. Try a dict or a scalar/tensor.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace."

I couldn't solve this problem. Do you have any suggestions?

belerico commented 2 months ago

Hi @ogulcankertmen, I've tried the exact same command on the main branch on my machine and the training runs fine. Can you please share more info about the error, ideally the entire stack trace?

ogulcankertmen commented 2 months ago

@belerico Here is the stack trace:

C:\Users\Oğulcan\AppData\Local\Programs\Python\Python311\Lib\site-packages\sheeprl\utils\logger.py:22: UserWarning: The specified root directory for the TensorBoardLogger is different from the experiment one, so the logger one will be ignored and replaced with the experiment root directory
  warnings.warn(
2024-07-11 11:20:41.583052: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-07-11 11:20:43.817109: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
Error executing job with overrides: ['exp=dreamer_v3', 'env=gym', 'env.id=CartPole-v1']
Traceback (most recent call last):
  File "C:\Users\Oğulcan\AppData\Local\Programs\Python\Python311\Lib\site-packages\lightning\fabric\loggers\tensorboard.py", line 215, in log_metrics
    self.experiment.add_scalar(k, v, step)
    ^^^^^^^^^^^^^^^
  File "C:\Users\Oğulcan\AppData\Local\Programs\Python\Python311\Lib\site-packages\lightning\fabric\loggers\logger.py", line 118, in experiment
    return fn(self)
           ^^^^^^^^
  File "C:\Users\Oğulcan\AppData\Local\Programs\Python\Python311\Lib\site-packages\lightning\fabric\loggers\tensorboard.py", line 197, in experiment
    self._experiment = SummaryWriter(log_dir=self.log_dir, **self._kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Oğulcan\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\tensorboard\writer.py", line 249, in __init__
    self._get_file_writer()
  File "C:\Users\Oğulcan\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\tensorboard\writer.py", line 281, in _get_file_writer
    self.file_writer = FileWriter(
                       ^^^^^^^^^^^
  File "C:\Users\Oğulcan\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\tensorboard\writer.py", line 75, in __init__
    self.event_writer = EventFileWriter(
                        ^^^^^^^^^^^^^^^^
  File "C:\Users\Oğulcan\AppData\Local\Programs\Python\Python311\Lib\site-packages\tensorboard\summary\writer\event_file_writer.py", line 72, in __init__
    tf.io.gfile.makedirs(logdir)
  File "C:\Users\Oğulcan\AppData\Local\Programs\Python\Python311\Lib\site-packages\tensorflow\python\lib\io\file_io.py", line 513, in recursive_create_dir_v2
    _pywrap_file_io.RecursivelyCreateDir(compat.path_to_bytes(path))
tensorflow.python.framework.errors_impl.FailedPreconditionError: logs\runs\dreamer_v3/CartPole-v1\2024-07-11_11-20-40_dreamer_v3_CartPole-v1_42 is not a directory

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\Oğulcan\AppData\Local\Programs\Python\Python311\Lib\site-packages\sheeprl\cli.py", line 366, in run
    run_algorithm(cfg)
  File "C:\Users\Oğulcan\AppData\Local\Programs\Python\Python311\Lib\site-packages\sheeprl\cli.py", line 199, in run_algorithm
    fabric.launch(reproducible(command), cfg, **kwargs)
  File "C:\Users\Oğulcan\AppData\Local\Programs\Python\Python311\Lib\site-packages\lightning\fabric\fabric.py", line 845, in launch
    return self._wrap_and_launch(function, self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Oğulcan\AppData\Local\Programs\Python\Python311\Lib\site-packages\lightning\fabric\fabric.py", line 931, in _wrap_and_launch
    return to_run(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Oğulcan\AppData\Local\Programs\Python\Python311\Lib\site-packages\lightning\fabric\fabric.py", line 936, in _wrap_with_setup
    return to_run(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Oğulcan\AppData\Local\Programs\Python\Python311\Lib\site-packages\sheeprl\cli.py", line 195, in wrapper
    return func(fabric, cfg, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Oğulcan\AppData\Local\Programs\Python\Python311\Lib\site-packages\sheeprl\algos\dreamer_v3\dreamer_v3.py", line 379, in main
    fabric.logger.log_hyperparams(cfg)
  File "C:\Users\Oğulcan\AppData\Local\Programs\Python\Python311\Lib\site-packages\lightning\fabric\utilities\rank_zero.py", line 70, in wrapped_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Oğulcan\AppData\Local\Programs\Python\Python311\Lib\site-packages\lightning\fabric\loggers\tensorboard.py", line 249, in log_hyperparams
    self.log_metrics(metrics, 0)
  File "C:\Users\Oğulcan\AppData\Local\Programs\Python\Python311\Lib\site-packages\lightning\fabric\utilities\rank_zero.py", line 70, in wrapped_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Oğulcan\AppData\Local\Programs\Python\Python311\Lib\site-packages\lightning\fabric\loggers\tensorboard.py", line 218, in log_metrics
    raise ValueError(
ValueError:
 you tried to log -1 which is currently not supported. Try a dict or a scalar/tensor.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

The rest of the output contains the config information.

belerico commented 2 months ago

Hi @ogulcankertmen, I've tried on my Windows machine and I'm not able to reproduce the error. Could you also share your environment? I noticed from your error that the log_dir path has mixed separators, so I've created a branch where we normalize the separators on Windows. Can you try it? Also, why is torch's tensorboard writer calling tensorflow to create the log directories? I'm referring to these lines:

  File "C:\Users\Oğulcan\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\tensorboard\writer.py", line 75, in __init__
    self.event_writer = EventFileWriter(
                        ^^^^^^^^^^^^^^^^
  File "C:\Users\Oğulcan\AppData\Local\Programs\Python\Python311\Lib\site-packages\tensorboard\summary\writer\event_file_writer.py", line 72, in __init__
    tf.io.gfile.makedirs(logdir)
  File "C:\Users\Oğulcan\AppData\Local\Programs\Python\Python311\Lib\site-packages\tensorflow\python\lib\io\file_io.py", line 513, in recursive_create_dir_v2
    _pywrap_file_io.RecursivelyCreateDir(compat.path_to_bytes(path))

What happens if you remove tensorflow?
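Before removing anything, it may help to confirm which of these packages are installed in the environment. The traceback shows tensorboard's event_file_writer.py going through tf.io.gfile into the tensorflow package, so the presence of tensorflow looks relevant. The following is a minimal sketch using only the standard library; the helper name `has_module` is hypothetical and not part of sheeprl, torch, or tensorboard.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be located.

    find_spec only searches for the module; it does not import it,
    so this check is cheap and has no side effects.
    """
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Packages that appear in the traceback above.
    for name in ("tensorflow", "tensorboard", "torch"):
        status = "installed" if has_module(name) else "not installed"
        print(f"{name}: {status}")
```

If tensorflow is reported as installed, uninstalling it (or testing in a fresh virtual environment without it) is one way to check whether the directory-creation error goes away.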

A similar issue: https://github.com/tensorflow/tensorflow/issues/60682#issuecomment-1561899350
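To illustrate the mixed-separator point raised above, here is a minimal sketch of how a Windows path with both "\" and "/" can be normalized. This is not the code from the branch mentioned in the thread; the helper name `normalize_windows_path` is hypothetical, and the example uses the standard-library `ntpath` module so that it behaves the same on any operating system.

```python
import ntpath


def normalize_windows_path(path: str) -> str:
    """Normalize a path using Windows rules.

    ntpath.normpath converts forward slashes to backslashes and
    collapses redundant separators, regardless of the host OS.
    """
    return ntpath.normpath(path)


if __name__ == "__main__":
    # The log directory reported in the FailedPreconditionError above.
    mixed = r"logs\runs\dreamer_v3/CartPole-v1\2024-07-11_11-20-40_dreamer_v3_CartPole-v1_42"
    print(normalize_windows_path(mixed))
```

On Windows itself, `os.path.normpath` is the same function as `ntpath.normpath`, so the equivalent call there would be `os.path.normpath(log_dir)`.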