Project-MONAI / model-zoo

MONAI Model Zoo that hosts models in the MONAI Bundle format.
Apache License 2.0

multi-gpu tensorboard handlers initialization #520

Open wyli opened 12 months ago

wyli commented 12 months ago

https://github.com/Project-MONAI/model-zoo/blob/cf5e0322ee25b178b6cf841f3bd81e0a8adf2b16/models/spleen_ct_segmentation/configs/multi_gpu_train.json#L18

The multi-GPU override essentially sets the trainer handlers to $@train#handlers[:-2] on the worker nodes. But because that expression still references @train#handlers, the config parser triggers the constructor calls for every handler on all nodes before the slice is applied.

For the TensorBoard handlers this is an issue, because each constructor call creates a new event log file. As a result, a multi-node run ends up with unnecessary event log files. https://github.com/Project-MONAI/MONAI/blob/e36982b87bf87fb9559fc4d124e132b67f177d23/monai/handlers/tensorboard_handlers.py#L52-L55
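
To illustrate the ordering problem without MONAI, here is a minimal pure-Python sketch. FakeTensorBoardHandler and build_handlers_then_slice are hypothetical stand-ins (not MONAI APIs), and the ranks are simulated in a loop rather than with torch.distributed. It only shows that slicing a list after its items have been constructed does not undo constructor side effects.

```python
import os
import tempfile


class FakeTensorBoardHandler:
    """Hypothetical stand-in: creates an event file as soon as it is constructed."""

    def __init__(self, log_dir, rank):
        path = os.path.join(log_dir, f"events.rank{rank}")
        with open(path, "w") as f:
            f.write("")


def build_handlers_then_slice(log_dir, rank):
    # Every handler is constructed first, as resolving a reference to the
    # full handler list would do ...
    handlers = [
        "checkpoint_loader",
        "stats",
        FakeTensorBoardHandler(log_dir, rank),
        "checkpoint_saver",
    ]
    # ... and only afterwards are the last two dropped on non-zero ranks.
    return handlers[:-2] if rank > 0 else handlers


with tempfile.TemporaryDirectory() as log_dir:
    kept = [build_handlers_then_slice(log_dir, rank) for rank in range(2)]
    event_files = sorted(os.listdir(log_dir))
```

In this sketch the simulated rank 1 keeps only two handlers, yet its event file is still written, because the constructor ran before the slice.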

wyli commented 11 months ago

a possible fix is to introduce a flag:

diff --git a/configs/multi_gpu_train.json b/configs/multi_gpu_train.json
index ea41b9f..f323b02 100644
--- a/configs/multi_gpu_train.json
+++ b/configs/multi_gpu_train.json
@@ -1,5 +1,6 @@
 {
     "device": "$torch.device('cuda:' + os.environ['LOCAL_RANK'])",
+    "use_tensorboard": "$dist.get_rank() == 0",
     "network": {
         "_target_": "torch.nn.parallel.DistributedDataParallel",
         "module": "$@network_def.to(@device)",
diff --git a/configs/train.json b/configs/train.json
index 7c866fe..80f15d3 100644
--- a/configs/train.json
+++ b/configs/train.json
@@ -10,6 +10,7 @@
     "output_dir": "$@bundle_root + '/eval'",
     "data_list_file_path": "$@bundle_root + '/msd_task09_spleen_folds.json'",
     "dataset_dir": "/data/Task09_Spleen",
+    "use_tensorboard": true,
     "finetune": false,
     "finetune_model_path": "$@bundle_root + '/models/model.pt'",
     "early_stop": false,
@@ -191,6 +192,7 @@
             },
             {
                 "_target_": "TensorBoardStatsHandler",
+                "_disabled_": "$not @use_tensorboard",
                 "log_dir": "@output_dir",
                 "tag_name": "train_loss",
                 "output_transform": "$monai.handlers.from_engine(['loss'], first=True)"
@@ -279,6 +281,7 @@
             },
             {
                 "_target_": "TensorBoardStatsHandler",
+                "_disabled_": "$not @use_tensorboard",
                 "log_dir": "@output_dir",
                 "iteration_log": false
             },
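
The idea behind the flag can be sketched in plain Python as follows. This is a toy model, not MONAI's config parser: FakeTensorBoardHandler, build_handlers, and the "target"/"disabled" keys are hypothetical names used only to mirror the roles of _target_ and _disabled_ in the diff above, and the ranks are simulated in a loop.

```python
constructed = []  # records which simulated ranks actually ran the constructor


class FakeTensorBoardHandler:
    """Hypothetical stand-in for a handler with a constructor side effect."""

    def __init__(self, rank):
        constructed.append(rank)


def build_handlers(rank):
    # Mirrors the role of "use_tensorboard": "$dist.get_rank() == 0".
    use_tensorboard = rank == 0
    components = [
        {"target": FakeTensorBoardHandler, "disabled": not use_tensorboard},
    ]
    # Disabled components are skipped before any constructor is called.
    return [c["target"](rank) for c in components if not c["disabled"]]


handlers_per_rank = {rank: build_handlers(rank) for rank in range(4)}
```

Because the check happens before construction, only the simulated rank 0 ever runs the constructor, which is the behaviour the flag is meant to achieve.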

yiheng-wang-nv commented 11 months ago

Thanks @wyli. I will take a look at this issue and your suggestion. Or @KumoLiu, if you have time, could you please help address it? We can check with the deepedit bundle first.

cc @Nic-Ma