PaddlePaddle / Paddle

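PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (『飞桨』 core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning & machine learning)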
http://www.paddlepaddle.org/
Apache License 2.0
22.29k stars 5.61k forks

During Paddle distributed training, logging Loss/Acc with Visualdl / Tensorboardx crashes the code with a log directory error. #56200

Open MingkaiSheng opened 1 year ago

MingkaiSheng commented 1 year ago

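Describe the Bug

Version: paddlepaddle-gpu 2.5.1.post117, visualdl 2.4.2

Situation: I am running 4-GPU distributed training with the fleet API and recording training acc/loss with visualdl / tensorboardx. When execution reaches the line that constructs the LogWriter, it fails with FileExistsError: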

[2023-08-11 08:09:01,447] [ WARNING] fleet.py:290 - The dygraph parallel environment has been initialized.
[2023-08-11 08:09:01,448] [ WARNING] fleet.py:313 - The dygraph hybrid parallel environment has been initialized.
Traceback (most recent call last):
  File "main.py", line 23, in <module>
    logwriter = LogWriter(logdir='./runs/')
  File "/home/smk/anaconda3/envs/paddle/lib/python3.8/site-packages/visualdl/writer/writer.py", line 120, in __init__
    self._get_file_writer()
  File "/home/smk/anaconda3/envs/paddle/lib/python3.8/site-packages/visualdl/writer/writer.py", line 135, in _get_file_writer
    self._file_writer = RecordFileWriter(
  File "/home/smk/anaconda3/envs/paddle/lib/python3.8/site-packages/visualdl/writer/record_writer.py", line 90, in __init__
    bfile.makedirs(logdir)
  File "/home/smk/anaconda3/envs/paddle/lib/python3.8/site-packages/visualdl/io/bfile.py", line 695, in makedirs
    return default_file_factory.get_filesystem(path).makedirs(path)
  File "/home/smk/anaconda3/envs/paddle/lib/python3.8/site-packages/visualdl/io/bfile.py", line 97, in makedirs
    os.makedirs(path)
  File "/home/smk/anaconda3/envs/paddle/lib/python3.8/os.py", line 223, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: './runs/'
LAUNCH INFO 2023-08-11 08:09:04,712 Exit code -15

Additional Supplementary Information

No response

MingkaiSheng commented 1 year ago

Is nobody going to look into this issue?

w5688414 commented 1 year ago

What if you delete that directory? Or point the log to a different directory?

MingkaiSheng commented 1 year ago

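That doesn't help. I already tried it. Deleting the directory makes no difference, and the default arguments raise the same error.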
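
One plausible reading of why deleting the directory does not help, sketched with the standard library only: the traceback ends in `os.makedirs(path)` without `exist_ok`, so if several ranks reach that call for the same path, only the first can succeed, even when the directory did not exist at launch. The helper name `create_log_dir` is hypothetical, and the four ranks are simulated here by four sequential calls rather than real fleet processes.

```python
import os
import tempfile


def create_log_dir(path):
    """Hypothetical helper: create `path` like the call in the traceback (no exist_ok)."""
    try:
        os.makedirs(path)
        return "created"
    except FileExistsError:
        return "FileExistsError"


with tempfile.TemporaryDirectory() as tmp:
    # Fresh location: the directory does not exist yet, as if it had just been deleted.
    logdir = os.path.join(tmp, "runs")
    # Simulate 4 ranks reaching the same call one after another.
    results = [create_log_dir(logdir) for _ in range(4)]

print(results)
# ['created', 'FileExistsError', 'FileExistsError', 'FileExistsError']
```

Whether the real failure is exactly this ordering depends on how visualdl checks for the directory before creating it, which the traceback does not show.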

w5688414 commented 1 year ago

os.makedirs(path, exist_ok=True)

MingkaiSheng commented 1 year ago

os.makedirs(path, exist_ok=True)

OK, thanks.
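
For reference, a minimal sketch of what the suggested `exist_ok=True` changes, using only the standard library. The suggestion presumably targets the `os.makedirs(path)` call shown in the traceback (`visualdl/io/bfile.py`), though the thread does not say where the change was applied. The helper name `ensure_log_dir` is hypothetical, and threads stand in for the four training processes.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def ensure_log_dir(path):
    """Hypothetical helper: create `path`, tolerating a directory that already exists."""
    # With exist_ok=True, an already existing directory is not treated as an error,
    # so it does not matter which worker creates it first.
    os.makedirs(path, exist_ok=True)
    return os.path.isdir(path)


with tempfile.TemporaryDirectory() as tmp:
    logdir = os.path.join(tmp, "runs")
    # Four workers race to create the same directory (threads, not real fleet ranks).
    with ThreadPoolExecutor(max_workers=4) as pool:
        outcomes = list(pool.map(ensure_log_dir, [logdir] * 4))

print(outcomes)
# [True, True, True, True]
```

Note that `exist_ok=True` only suppresses the error when the existing path is a directory; a regular file at the same path still raises `FileExistsError`.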