comet-ml / issue-tracking

Questions, Help, and Issues for Comet ML
https://www.comet.ml

Logging Issue for multi GPUs #527

Closed by AlicanAKCA 9 months ago

AlicanAKCA commented 10 months ago

Describe the Bug

The Experiment is not shown if I run the code using 2 GPUs. The log is stuck at the beginning of the training, and the metrics for the graphs are not sent to Comet either.


Stack Trace

Ultralytics YOLOv8.0.209 🚀 Python-3.10.12 torch-2.0.0 CUDA:0 (Tesla T4, 15110MiB)
                                                       CUDA:1 (Tesla T4, 15110MiB)
engine/trainer: task=segment, mode=train, model=yolov8l-seg.yaml, data=/kaggle/working/dataset/data.yaml, epochs=40, patience=50, batch=16, imgsz=720, save=True, save_period=4, cache=False, device=[0, 1], workers=8, project=None, name=nodule-seg_LARGE_v1.0, exist_ok=False, pretrained=True, optimizer=AdamW, verbose=True, seed=0, deterministic=True, single_cls=True, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, overlap_mask=True, mask_ratio=1, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, show=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, vid_stride=1, stream_buffer=False, line_width=None, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, boxes=True, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=False, lr0=0.0001, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.0, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0, cfg=None, tracker=botsort.yaml, save_dir=runs/segment/nodule-seg_LARGE_v1.0
Downloading https://ultralytics.com/assets/Arial.ttf to '/root/.config/Ultralytics/Arial.ttf'...
100%|██████████| 755k/755k [00:00<00:00, 25.9MB/s]
2023-11-15 06:44:25,396 INFO util.py:129 -- Outdated packages:
  ipywidgets==7.7.1 found, needs ipywidgets>=8
Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.
2023-11-15 06:44:26,934 INFO util.py:129 -- Outdated packages:
  ipywidgets==7.7.1 found, needs ipywidgets>=8
Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.
/opt/conda/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.3
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
Overriding model.yaml nc=80 with nc=1

                   from  n    params  module                                       arguments                     
  0                  -1  1      1856  ultralytics.nn.modules.conv.Conv             [3, 64, 3, 2]                 
  1                  -1  1     73984  ultralytics.nn.modules.conv.Conv             [64, 128, 3, 2]               
  2                  -1  3    279808  ultralytics.nn.modules.block.C2f             [128, 128, 3, True]           
  3                  -1  1    295424  ultralytics.nn.modules.conv.Conv             [128, 256, 3, 2]              
  4                  -1  6   2101248  ultralytics.nn.modules.block.C2f             [256, 256, 6, True]           
  5                  -1  1   1180672  ultralytics.nn.modules.conv.Conv             [256, 512, 3, 2]              
  6                  -1  6   8396800  ultralytics.nn.modules.block.C2f             [512, 512, 6, True]           
  7                  -1  1   2360320  ultralytics.nn.modules.conv.Conv             [512, 512, 3, 2]              
  8                  -1  3   4461568  ultralytics.nn.modules.block.C2f             [512, 512, 3, True]           
  9                  -1  1    656896  ultralytics.nn.modules.block.SPPF            [512, 512, 5]                 
 10                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 11             [-1, 6]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 12                  -1  3   4723712  ultralytics.nn.modules.block.C2f             [1024, 512, 3]                
 13                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 14             [-1, 4]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 15                  -1  3   1247744  ultralytics.nn.modules.block.C2f             [768, 256, 3]                 
 16                  -1  1    590336  ultralytics.nn.modules.conv.Conv             [256, 256, 3, 2]              
 17            [-1, 12]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 18                  -1  3   4592640  ultralytics.nn.modules.block.C2f             [768, 512, 3]                 
 19                  -1  1   2360320  ultralytics.nn.modules.conv.Conv             [512, 512, 3, 2]              
 20             [-1, 9]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 21                  -1  3   4723712  ultralytics.nn.modules.block.C2f             [1024, 512, 3]                
 22        [15, 18, 21]  1   7889779  ultralytics.nn.modules.head.Segment          [1, 32, 256, [256, 512, 512]] 
YOLOv8l-seg summary: 401 layers, 45936819 parameters, 45936803 gradients, 220.8 GFLOPs

DDP: debug command /opt/conda/bin/python -m torch.distributed.run --nproc_per_node 2 --master_port 45285 /root/.config/Ultralytics/DDP/_temp_lxyig0id133554746831200.py

Code in the notebook:

!pip install -q gdown
!pip install -q ultralytics
!pip install -q comet_ml
!pip install -q wandb

%env COMET_API_KEY= xxxx
%env WANDB_API_KEY= xxxx

!gdown DATASET_LINK #DATASET.zip

import comet_ml
from comet_ml import Experiment

comet_ml.init()

from ultralytics import YOLO
MODEL_PATH = "yolov8l-seg.yaml"
model = YOLO(MODEL_PATH)

!mkdir dataset

!unzip /kaggle/working/DATASET.zip -d /kaggle/working/DATASET

import os
os.remove("/kaggle/working/DATASET.zip")

with open('/kaggle/working/dataset/data.yaml', 'w') as f:
    f.write("""train : /kaggle/working/dataset/autosplit_train.txt\nval : /kaggle/working/dataset/autosplit_val.txt\nnc : 1\nnames : ['object']""")

!ls /kaggle/working/dataset 

experiment_name  = 'nodule-seg_LARGE_v1.0'
project_name = '2SEGMENT'
experiment = Experiment(
    api_key="xxxx",
    project_name=project_name,
    workspace="segmentations"
)

experiment.set_name(experiment_name)

model.train(data="/kaggle/working/dataset/data.yaml", imgsz = 720, optimizer= 'AdamW',lr0= 0.0001, batch=16,
            name=experiment_name ,mask_ratio= 1,
            save_period=4, single_cls = True, device = [0,1] ,epochs=40, dfl = 1.0)
experiment.end()


dsblank commented 10 months ago

When you create an Experiment(), can you add the parameter auto_output_logging="simple" like:

experiment = Experiment(
    api_key="xxxx",
    project_name=project_name,
    workspace="segmentations",
    auto_output_logging="simple"
)

and report back if that works?

AlicanAKCA commented 10 months ago

First of all, thank you for your interest. I just tried it, but the logs and graphs are the same as before.

dsblank commented 10 months ago

It looks like the YOLO example will create the comet_ml Experiment automatically, so you don't need to make an Experiment. I suspect that you are creating two experiments, and the metrics are going to the second one.
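One plausible reason for a second experiment, assuming the multi-GPU run launches separate worker processes (as the "DDP: debug command ... torch.distributed.run" line in the log suggests): a Python object created in the notebook, such as the Experiment, does not exist in those processes, while environment variables are inherited. A minimal standard-library sketch of that behavior follows; NOTEBOOK_EXPERIMENT and DEMO_PROJECT_NAME are made-up names for illustration and are not part of comet_ml or ultralytics.

```python
import os
import subprocess
import sys

# Stand-in for an object created in the notebook process (hypothetical name).
NOTEBOOK_EXPERIMENT = {"name": "nodule-seg_LARGE_v1.0"}

# Environment variables, unlike Python objects, are inherited by child processes.
os.environ["DEMO_PROJECT_NAME"] = "2SEGMENT"

# Code run in a fresh interpreter, similar in spirit to a DDP worker process.
child_code = (
    "import os\n"
    "print('NOTEBOOK_EXPERIMENT' in globals())\n"
    "print(os.environ.get('DEMO_PROJECT_NAME'))\n"
)

result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,
    text=True,
    check=True,
)
object_visible, project_seen = result.stdout.split()

print(object_visible)  # False: the parent's object is not visible in the child
print(project_seen)    # 2SEGMENT: the environment variable was inherited
```

If the training workers behave like this child process, they would not see an Experiment created in the notebook and would have to create their own.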

AlicanAKCA commented 10 months ago

I do not think that the code creates another Experiment on its own in the background. GPU utilization metrics were plotted. The attached screenshots were taken from another Experiment that was also run with 2 GPUs.

dsblank commented 10 months ago

What happens if you run the following version:

%env COMET_API_KEY= xxxx
import comet_ml
from comet_ml import Experiment

comet_ml.init()

from ultralytics import YOLO
MODEL_PATH = "yolov8l-seg.yaml"
model = YOLO(MODEL_PATH)

experiment_name  = 'nodule-seg_LARGE_v1.0'
project_name = '2SEGMENT'

model.train(data="/kaggle/working/dataset/data.yaml", imgsz = 720, optimizer= 'AdamW',lr0= 0.0001, batch=16,
            name=experiment_name,mask_ratio= 1,
            save_period=4, single_cls = True, device = [0,1] ,epochs=40, dfl = 1.0)

Does it create a Comet experiment?

AlicanAKCA commented 10 months ago

Yes, it does. But it is located in the "General" tab. Also, the experiment name is set to comparative_lepton_7102. Lastly, the logs were sent to this experiment. So, I think my original code sent the logs to another experiment that is automatically created by Comet.

dsblank commented 10 months ago

This page of documentation might be useful to you as it shows how to set experiment name and other items: https://docs.ultralytics.com/yolov5/tutorials/comet_logging_integration/
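As a rough sketch of one such setting: assuming the integration creates its own Experiment and that Comet reads COMET_PROJECT_NAME and COMET_WORKSPACE from the environment (both are standard Comet configuration variables), the project and workspace could be exported before training instead of being passed to a manually created Experiment(). The helper configure_comet_env is hypothetical and is not part of comet_ml or ultralytics; whether the experiment then lands in the intended project has not been verified in this thread.

```python
import os


def configure_comet_env(project_name, workspace, api_key=None):
    """Export Comet settings so that processes started later inherit them.

    Hypothetical helper for illustration only.
    """
    os.environ["COMET_PROJECT_NAME"] = project_name
    os.environ["COMET_WORKSPACE"] = workspace
    if api_key is not None:
        # Leave an already exported key (e.g. set via %env) untouched otherwise.
        os.environ["COMET_API_KEY"] = api_key
    keys = ("COMET_PROJECT_NAME", "COMET_WORKSPACE")
    return {key: os.environ[key] for key in keys}


# Values taken from the notebook in the original report.
settings = configure_comet_env("2SEGMENT", "segmentations")
print(settings)

# Afterwards, train without creating Experiment() by hand, for example:
# model.train(data="/kaggle/working/dataset/data.yaml", device=[0, 1], ...)
```

In a notebook, the same effect can be had with %env lines placed before the training cell, in the same way COMET_API_KEY is set above.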