Ascend / pytorch

Ascend PyTorch adapter (torch_npu). Mirror of https://gitee.com/ascend/pytorch
https://ascend.github.io/docs/

RuntimeError: Failed to make directory for profiling #22

Open AlbertZhangHIT opened 9 months ago

AlbertZhangHIT commented 9 months ago

When profiling NPUs in a multi-machine scenario, the profiler fails to create the directory for storing trace data.

Environment:

OS: ubuntu 20.04
Arch: aarch64
Python: 3.10
torch: 2.1.0
torch-npu: 2.1.0

Snippet:

        with torch_npu.profiler.profile(
            activities=[torch_npu.profiler.ProfilerActivity.CPU, torch_npu.profiler.ProfilerActivity.NPU],
            schedule=torch_npu.profiler.schedule(wait=1, warmup=2, active=5, skip_first=100),
            on_trace_ready=torch_npu.profiler.tensorboard_trace_handler(dir_name=os.path.join(self.args.output_ckpt_path, "profiling")),
            profile_memory=True,
            record_shapes=True,
            with_stack=True,
            experimental_config=torch_npu.profiler._ExperimentalConfig(
                profiler_level=torch_npu.profiler.ProfilerLevel.Level1,
                aic_metrics=torch_npu.profiler.AiCMetrics.PipeUtilization,
                l2_cache=False,
                data_simplification=False),
        ) as profiler:

Errors:

2024-02-23 12:31:32 [WARNING] [332] profiler.py: Incorrect schedule: WARMUP followed by NONE
Traceback (most recent call last):
  File "/home/HwHiAiUser/anaconda3/lib/python3.10/site-packages/torch_npu/utils/path_manager.py", line 134, in make_dir_safety
    os.makedirs(path, mode=cls.DATA_DIR_AUTHORITY)
  File "/home/HwHiAiUser/anaconda3/lib/python3.10/os.py", line 225, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '/job/output/profiling/Euler_332_20240223123131.626_ascend_pt'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/job/file/pretrain_profiling.py", line 256, in train
    profiler.step()
  File "/home/HwHiAiUser/anaconda3/lib/python3.10/site-packages/torch_npu/profiler/profiler.py", line 79, in step
    self._action_controller.transit_action()
  File "/home/HwHiAiUser/anaconda3/lib/python3.10/site-packages/torch_npu/profiler/profiler_action_controller.py", line 90, in transit_action
    action()
  File "/home/HwHiAiUser/anaconda3/lib/python3.10/site-packages/torch_npu/profiler/profiler_action_controller.py", line 96, in init
    path = self._on_trace_ready.create_prof_dir()
  File "/home/HwHiAiUser/anaconda3/lib/python3.10/site-packages/torch_npu/profiler/profiler_action_controller.py", line 36, in create_prof_dir
    PathManager.make_dir_safety(total_path)
  File "/home/HwHiAiUser/anaconda3/lib/python3.10/site-packages/torch_npu/utils/path_manager.py", line 136, in make_dir_safety
    raise RuntimeError(msg) from err
RuntimeError: Failed to make directory: /job/output/profiling/Euler_332_20240223123131.626_ascend_pt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/job/file/pretrain_profiling.py", line 354, in <module>
    main()
  File "/job/file/pretrain_profiling.py", line 350, in main
    train_and_validate(args)
  File "/job/file/pretrain_profiling.py", line 152, in train_and_validate
    trainer.run(args.epochs)
  File "/job/file/pretrain_profiling.py", line 186, in run
    self.train()
  File "/job/file/pretrain_profiling.py", line 189, in train
    with torch_npu.profiler.profile(
  File "/home/HwHiAiUser/anaconda3/lib/python3.10/site-packages/torch_npu/profiler/profiler.py", line 70, in __exit__
    self._action_controller.transit_action()
  File "/home/HwHiAiUser/anaconda3/lib/python3.10/site-packages/torch_npu/profiler/profiler_action_controller.py", line 90, in transit_action
    action()
  File "/home/HwHiAiUser/anaconda3/lib/python3.10/site-packages/torch_npu/profiler/profiler_action_controller.py", line 103, in start_prof
    self._msprofiler_interface.start_profiler()
  File "/home/HwHiAiUser/anaconda3/lib/python3.10/site-packages/torch_npu/profiler/msprofiler_c_interface.py", line 51, in start_profiler
    torch_npu._C._profiler._start_profiler(self.msprof_config, self.activities)
TypeError: _start_profiler(): incompatible function arguments. The following argument types are supported:
    1. (config: torch_npu._C._profiler.NpuProfilerConfig, activities: Set[torch_npu._C._profiler.ProfilerActivity], scopes: Set[torch._C._profiler.RecordScope] = set()) -> None

Oddly, the error disappears if I set skip_first to 0.

I also found what may be a bug in the directory creation here. The function make_dir_safety may not be safe, especially when several threads or processes try to create the same path at once. We should at least pass exist_ok=True to os.makedirs to avoid this kind of error.
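The exist_ok=True suggestion can be illustrated with a minimal sketch. It uses only the standard library and is not the torch_npu implementation: make_dir_tolerant and demo_concurrent_creation are hypothetical names, and the 0o750 mode is an assumed value rather than the real DATA_DIR_AUTHORITY.

```python
import os
import tempfile
import threading


def make_dir_tolerant(path: str, mode: int = 0o750) -> None:
    """Hypothetical stand-in for make_dir_safety.

    With exist_ok=True, os.makedirs does not raise FileExistsError
    when the target already exists as a directory.
    """
    os.makedirs(path, mode=mode, exist_ok=True)


def demo_concurrent_creation(num_threads: int = 8):
    """Have several threads create the same directory at the same moment."""
    errors = []
    with tempfile.TemporaryDirectory() as root:
        target = os.path.join(root, "profiling", "shared_ascend_pt")
        barrier = threading.Barrier(num_threads)

        def worker():
            barrier.wait()  # release all threads together to provoke the race
            try:
                make_dir_tolerant(target)
            except OSError as err:
                errors.append(err)

        threads = [threading.Thread(target=worker) for _ in range(num_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        created = os.path.isdir(target)
    return errors, created


if __name__ == "__main__":
    errors, created = demo_concurrent_creation()
    print("errors:", errors, "created:", created)
```

This only removes the FileExistsError. If two workers really do compute the same directory name, they would then write into the same directory, so unique names per worker are probably still needed.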

yunyiyun commented 8 months ago

You can pass worker_name to torch_npu.profiler.tensorboard_trace_handler so that each rank gets a distinct trace directory name: on_trace_ready=torch_npu.profiler.tensorboard_trace_handler(dir_name=os.path.join(self.args.output_ckpt_path, "profiling"), worker_name="rank" + str(torch.distributed.get_rank()))
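One way to build such a name without hard-coding it at each call site is sketched below. build_worker_name is a hypothetical helper, not part of torch_npu. Reading the RANK environment variable is an assumption that holds for launchers such as torchrun; other launchers may need the rank passed in explicitly.

```python
import os
import socket


def build_worker_name(rank=None) -> str:
    """Hypothetical helper: return a label such as 'node1_rank3'.

    If rank is not given, fall back to the RANK environment variable
    (assumed to be set by the launcher), defaulting to 0.
    """
    if rank is None:
        rank = int(os.environ.get("RANK", "0"))
    return f"{socket.gethostname()}_rank{rank}"


if __name__ == "__main__":
    print(build_worker_name(3))
```

The result would then be passed as worker_name=build_worker_name(torch.distributed.get_rank()). Including the global rank keeps names distinct even when several machines or containers report the same hostname and process ID, which the directory name in the traceback (Euler_332_...) suggests may be the case here.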