Open AlbertZhangHIT opened 9 months ago
You can pass a per-rank `worker_name` to `torch_npu.profiler.tensorboard_trace_handler`, for example: `on_trace_ready=torch_npu.profiler.tensorboard_trace_handler(dir_name=os.path.join(self.args.output_ckpt_path, "profiling"), worker_name="rank" + str(torch.distributed.get_rank()))`
When profiling NPUs in a multi-machine scenario, the profiler fails to create the directory used for storing the tracing data.
Environment:
Snippets:
Errors:
Strangely, if I set `skip_first` to 0, the error disappears.

I also found that there may be a bug in creating directories here. The function `make_dir_safety` may not be safe, especially in the multi-threaded case. We should at least add `exist_ok=True` when using `os.makedirs` to avoid potential errors.
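
To illustrate the suggestion, here is a minimal sketch in plain Python. It is not the actual `make_dir_safety` code from torch_npu. The helper names `make_dir_racy`, `make_dir_tolerant`, and `create_concurrently` are hypothetical and exist only for this example. The first helper shows the check-then-create pattern that can fail when several workers create the same directory at once, and the second shows the `exist_ok=True` variant.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def make_dir_racy(path):
    # Hypothetical helper: check-then-create. Another thread or process can
    # create `path` between the check and the call, in which case
    # os.makedirs raises FileExistsError.
    if not os.path.exists(path):
        os.makedirs(path)


def make_dir_tolerant(path):
    # Hypothetical helper: with exist_ok=True the call succeeds when the
    # directory already exists, so concurrent callers do not fail.
    # It still raises FileExistsError if `path` exists but is not a directory.
    os.makedirs(path, exist_ok=True)


def create_concurrently(path, workers=8):
    # Call make_dir_tolerant on the same path from several threads and
    # re-raise any exception that a worker hit.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(make_dir_tolerant, path) for _ in range(workers)]
        for future in futures:
            future.result()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        target = os.path.join(root, "profiling", "rank0")
        create_concurrently(target)
        print(os.path.isdir(target))
```

This only covers threads within one process on a local filesystem. Whether it is enough for the multi-machine case depends on how the output directory is shared between nodes, which I have not verified.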