Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Exception: The wandb backend process has shutdown #11373

Closed · lzziuhh closed this issue 2 years ago

lzziuhh commented 2 years ago

๐Ÿ› Bug

Description

I am running YOLOv5 evolve to find the best hyperparameters. The error was raised at the third generation. I retried and it happened again, so I think it is a bug.

Environment

Colab Pro+

Error Log

wandb: Currently logged in as: lzziuhh (use `wandb login --relogin` to force relogin)
train: weights=yolov5s.pt, cfg=, data=data.yaml, hyp=/kaggle/working/custom.yaml, epochs=10, batch_size=20, imgsz=1280, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=300, bucket=, cache=ram, image_weights=False, device=, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: up to date with https://github.com/ultralytics/yolov5 ✅
YOLOv5 🚀 v6.0-184-g6865d19 torch 1.10.0+cu111 CUDA:0 (Tesla P100-PCIE-16GB, 16281MiB)

hyperparameters: anchor_t=4.0, box=0.05, cls=0.5, cls_pw=1.0, copy_paste=0.0, degrees=0.0, fl_gamma=0.0, fliplr=0.5, flipud=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.3, iou_t=0.2, lr0=0.01, lrf=0.1, mixup=0.5, momentum=0.937, mosaic=0.5, obj=1.0, obj_pw=1.0, perspective=0.0, scale=0.7, shear=0.0, translate=0.1, warmup_bias_lr=0.1, warmup_epochs=5.0, warmup_momentum=0.8, weight_decay=0.0005, anchors=3
wandb: Tracking run with wandb version 0.12.9
wandb: Syncing run floral-field-13
wandb: ⭐️ View project at https://wandb.ai/lzziuhh/evolve
wandb: 🚀 View run at https://wandb.ai/lzziuhh/evolve/runs/gp8zuslt
wandb: Run data is saved locally in /kaggle/working/yolov5/wandb/run-20220109_022359-gp8zuslt
wandb: Run `wandb offline` to turn off syncing.

Overriding model.yaml nc=80 with nc=1
Overriding model.yaml anchors with anchors=3

                 from  n    params  module                                  arguments                     
  0                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]              
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  2                -1  1     18816  models.common.C3                        [64, 64, 1]                   
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  4                -1  2    115712  models.common.C3                        [128, 128, 2]                 
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  6                -1  3    625152  models.common.C3                        [256, 256, 3]                 
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
  8                -1  1   1182720  models.common.C3                        [512, 512, 1]                 
  9                -1  1    656896  models.common.SPPF                      [512, 512, 5]                 
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]          
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]          
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]              
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]          
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]          
 24      [17, 20, 23]  1     16182  models.yolo.Detect                      [1, [[0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5]], [128, 256, 512]]
Model Summary: 270 layers, 7022326 parameters, 7022326 gradients, 15.8 GFLOPs

Transferred 342/349 items from yolov5s.pt
Scaled weight_decay = 0.00046875
optimizer: SGD with parameter groups 57 weight, 60 weight (no decay), 60 bias
albumentations: version 1.0.3 required by YOLOv5, but version 0.1.12 is currently installed
train: Scanning '/kaggle/working/COTS/labels/train.cache' images and labels... 510 found, 0 missing, 0 empty, 0 corrupted: 100% 510/510 [00:00<?, ?it/s]
train: Caching images (1.4GB ram): 100% 510/510 [00:02<00:00, 196.52it/s]
val: Scanning '/kaggle/working/COTS/labels/valid.cache' images and labels... 786 found, 0 missing, 0 empty, 0 corrupted: 100% 786/786 [00:00<?, ?it/s]

AutoAnchor: 0.00 anchors/target, 0.000 Best Possible Recall (BPR). Anchors are a poor fit to dataset ⚠️, attempting to improve...
AutoAnchor: Running kmeans for 9 anchors on 1522 points...
AutoAnchor: Evolving anchors with Genetic Algorithm: fitness = 0.8833: 100% 1000/1000 [00:08<00:00, 113.32it/s]
AutoAnchor: thr=0.25: 1.0000 best possible recall, 8.01 anchors past thr
AutoAnchor: n=9, img_size=1280, metric_all=0.560/0.883-mean/best, past_thr=0.606-mean: 24,26, 33,30, 33,38, 43,42, 55,37, 45,57, 69,61, 114,89, 196,156
AutoAnchor: New anchors saved to model. Update model *.yaml to use these anchors in the future.
Image sizes 1280 train, 1280 val
Using 8 dataloader workers
Logging results to runs/evolve/exp2
Starting training for 10 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       0/9     15.2G    0.1193   0.08516         0        49      1280: 100% 26/26 [00:41<00:00,  1.61s/it]

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       1/9     15.2G     0.115    0.0648         0        44      1280: 100% 26/26 [00:41<00:00,  1.59s/it]

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       2/9     15.2G    0.1081    0.0637         0        67      1280: 100% 26/26 [00:41<00:00,  1.59s/it]

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       3/9     15.2G    0.1055   0.07214         0        67      1280: 100% 26/26 [00:41<00:00,  1.59s/it]

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       4/9     15.2G    0.1013    0.0729         0        49      1280: 100% 26/26 [00:41<00:00,  1.59s/it]

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       5/9     15.2G    0.0985   0.07702         0        41      1280: 100% 26/26 [00:41<00:00,  1.59s/it]

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       6/9     15.2G   0.09669   0.07674         0        42      1280: 100% 26/26 [00:41<00:00,  1.59s/it]

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       7/9     15.2G   0.09477   0.07369         0        47      1280: 100% 26/26 [00:41<00:00,  1.59s/it]

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       8/9     15.2G    0.0935   0.07431         0        61      1280: 100% 26/26 [00:41<00:00,  1.58s/it]

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       9/9     15.2G   0.09261   0.07535         0        35      1280: 100% 26/26 [00:40<00:00,  1.58s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100% 20/20 [00:27<00:00,  1.37s/it]
                 all        786       4217     0.0326     0.0849     0.0141    0.00382

10 epochs completed in 0.123 hours.

wandb: Waiting for W&B process to finish, PID 20866... (success).
wandb:                                                                                
wandb: Run history:
wandb:        metrics/mAP_0.5 ▁▁▁▁▁▁▁▁▁██
wandb:   metrics/mAP_0.5:0.95 ▁▁▁▁▁▁▁▁▁██
wandb:      metrics/precision ▁▁▁▁▁▁▁▁▁██
wandb:         metrics/recall ▁▁▁▁▁▁▁▁▁██
wandb:         train/box_loss █▇▅▄▃▃▂▂▁▁
wandb:         train/cls_loss ▁▁▁▁▁▁▁▁▁▁
wandb:         train/obj_loss █▁▁▄▄▅▅▄▄▅
wandb:           val/box_loss ▁▁▁▁▁▁▁▁▁██
wandb:           val/cls_loss ▁▁▁▁▁▁▁▁▁▁▁
wandb:           val/obj_loss ▁▁▁▁▁▁▁▁▁██
wandb:                  x/lr0 ▁▄▆▇██▆▅▃▂
wandb:                  x/lr1 ▁▄▆▇██▆▅▃▂
wandb:                  x/lr2 █▇▇▆▅▄▃▃▂▁
wandb: 
wandb: Run summary:
wandb:             best/epoch 9
wandb:           best/mAP_0.5 0.01409
wandb:      best/mAP_0.5:0.95 0.00382
wandb:         best/precision 0.03257
wandb:            best/recall 0.08489
wandb:        metrics/mAP_0.5 0.01409
wandb:   metrics/mAP_0.5:0.95 0.00382
wandb:      metrics/precision 0.03257
wandb:         metrics/recall 0.08489
wandb:         train/box_loss 0.09261
wandb:         train/cls_loss 0.0
wandb:         train/obj_loss 0.07535
wandb:           val/box_loss 0.08265
wandb:           val/cls_loss 0.0
wandb:           val/obj_loss 0.16213
wandb:                  x/lr0 0.00032
wandb:                  x/lr1 0.00032
wandb:                  x/lr2 0.07442
wandb: 
wandb: Synced 5 W&B file(s), 32 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Synced floral-field-13: https://wandb.ai/lzziuhh/evolve/runs/gp8zuslt
wandb: Find logs at: ./wandb/run-20220109_022359-gp8zuslt/logs/debug.log
wandb: 
wandb: Tracking run with wandb version 0.12.9
wandb: Syncing run happy-wave-17
wandb: ⭐️ View project at https://wandb.ai/lzziuhh/evolve
wandb: 🚀 View run at https://wandb.ai/lzziuhh/evolve/runs/grrpvh1q
wandb: Run data is saved locally in /kaggle/working/yolov5/wandb/run-20220109_023148-grrpvh1q
wandb: Run `wandb offline` to turn off syncing.

Results saved to runs/evolve/exp2
evolve:    metrics/precision,       metrics/recall,      metrics/mAP_0.5, metrics/mAP_0.5:0.95,         val/box_loss,         val/obj_loss,         val/cls_loss,             anchor_t,                  box,                  cls,               cls_pw,           copy_paste,              degrees,             fl_gamma,               fliplr,               flipud,                hsv_h,                hsv_s,                hsv_v,                iou_t,                  lr0,                  lrf,                mixup,             momentum,               mosaic,                  obj,               obj_pw,          perspective,                scale,                shear,            translate,       warmup_bias_lr,        warmup_epochs,      warmup_momentum,         weight_decay,              anchors
evolve:             0.032571,             0.084894,              0.01409,            0.0038155,             0.082646,              0.16213,                    0,                    4,                 0.05,                  0.5,                    1,                    0,                    0,                    0,                  0.5,                    0,                0.015,                  0.7,                  0.3,                  0.2,                 0.01,                  0.1,                  0.5,                0.937,                  0.5,                    1,                    1,                    0,                  0.7,                    0,                  0.1,                  0.1,                    5,                  0.8,               0.0005,                    3

hyperparameters: anchor_t=4.0, box=0.04921, cls=0.5, cls_pw=1.0, copy_paste=0.0, degrees=0.0, fl_gamma=0.0, fliplr=0.5, flipud=0.0, hsv_h=0.01545, hsv_s=0.68469, hsv_v=0.3, iou_t=0.2, lr0=0.01, lrf=0.0972, mixup=0.45586, momentum=0.937, mosaic=0.52414, obj=0.90521, obj_pw=1.0, perspective=0.0, scale=0.7, shear=0.0, translate=0.10506, warmup_bias_lr=0.10589, warmup_epochs=4.91934, warmup_momentum=0.7835, weight_decay=0.00051, anchors=3.0
Overriding model.yaml nc=80 with nc=1
Overriding model.yaml anchors with anchors=3.0

                 from  n    params  module                                  arguments                     
  0                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]              
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  2                -1  1     18816  models.common.C3                        [64, 64, 1]                   
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  4                -1  2    115712  models.common.C3                        [128, 128, 2]                 
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  6                -1  3    625152  models.common.C3                        [256, 256, 3]                 
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
  8                -1  1   1182720  models.common.C3                        [512, 512, 1]                 
  9                -1  1    656896  models.common.SPPF                      [512, 512, 5]                 
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]          
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]          
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]              
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]          
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]          
 24      [17, 20, 23]  1     16182  models.yolo.Detect                      [1, [[0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5]], [128, 256, 512]]
Model Summary: 270 layers, 7022326 parameters, 7022326 gradients, 15.8 GFLOPs

Transferred 342/349 items from yolov5s.pt
Scaled weight_decay = 0.00047812500000000003
optimizer: SGD with parameter groups 57 weight, 60 weight (no decay), 60 bias
albumentations: version 1.0.3 required by YOLOv5, but version 0.1.12 is currently installed
train: Scanning '/kaggle/working/COTS/labels/train.cache' images and labels... 510 found, 0 missing, 0 empty, 0 corrupted: 100% 510/510 [00:00<?, ?it/s]
train: Caching images (1.4GB ram): 100% 510/510 [00:02<00:00, 200.78it/s]
val: Scanning '/kaggle/working/COTS/labels/valid.cache' images and labels... 786 found, 0 missing, 0 empty, 0 corrupted: 100% 786/786 [00:00<?, ?it/s]

AutoAnchor: 0.00 anchors/target, 0.000 Best Possible Recall (BPR). Anchors are a poor fit to dataset ⚠️, attempting to improve...
AutoAnchor: Running kmeans for 9 anchors on 1522 points...
AutoAnchor: Evolving anchors with Genetic Algorithm: fitness = 0.8833: 100% 1000/1000 [00:08<00:00, 117.02it/s]
AutoAnchor: thr=0.25: 1.0000 best possible recall, 8.01 anchors past thr
AutoAnchor: n=9, img_size=1280, metric_all=0.560/0.883-mean/best, past_thr=0.606-mean: 24,26, 33,30, 33,38, 43,42, 55,37, 45,57, 69,61, 114,89, 196,156
AutoAnchor: New anchors saved to model. Update model *.yaml to use these anchors in the future.
Image sizes 1280 train, 1280 val
Using 8 dataloader workers
Logging results to runs/evolve/exp2
Starting training for 10 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       0/9     15.2G    0.1173   0.07811         0        24      1280: 100% 26/26 [00:41<00:00,  1.59s/it]

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       1/9     15.2G    0.1134   0.06076         0        41      1280: 100% 26/26 [00:41<00:00,  1.59s/it]

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       2/9     15.2G    0.1083     0.059         0        37      1280: 100% 26/26 [00:41<00:00,  1.59s/it]

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       3/9     15.2G    0.1042   0.06156         0        61      1280: 100% 26/26 [00:41<00:00,  1.59s/it]

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       4/9     15.2G    0.1004   0.06473         0        73      1280: 100% 26/26 [00:41<00:00,  1.59s/it]

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       5/9     15.2G   0.09586   0.06845         0        55      1280: 100% 26/26 [00:41<00:00,  1.59s/it]

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       6/9     15.2G   0.09533   0.07238         0        40      1280: 100% 26/26 [00:41<00:00,  1.59s/it]

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       7/9     15.2G   0.09381   0.07029         0        53      1280: 100% 26/26 [00:41<00:00,  1.59s/it]

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       8/9     15.2G    0.0932   0.06926         0        45      1280: 100% 26/26 [00:41<00:00,  1.58s/it]

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       9/9     15.2G   0.09265   0.07102         0        51      1280: 100% 26/26 [00:40<00:00,  1.57s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100% 20/20 [00:27<00:00,  1.36s/it]
                 all        786       4217     0.0254     0.0875      0.012    0.00281

10 epochs completed in 0.123 hours.

wandb: Waiting for W&B process to finish, PID 21130... (success).
wandb:                                                                                
wandb: Run history:
wandb:        metrics/mAP_0.5 ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁███
wandb:   metrics/mAP_0.5:0.95 ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁███
wandb:      metrics/precision ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁███
wandb:         metrics/recall ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁███
wandb:         train/box_loss ██▇▇▅▅▄▄▃▃▂▂▂▂▁▁▁▁▁▁
wandb:         train/cls_loss ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:         train/obj_loss ██▂▂▁▁▂▂▃▃▄▄▆▆▅▅▅▅▅▅
wandb:           val/box_loss ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁███
wandb:           val/cls_loss ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:           val/obj_loss ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁███
wandb:                  x/lr0 ▁▁▄▄▆▆▇▇████▆▆▅▅▃▃▂▂
wandb:                  x/lr1 ▁▁▄▄▆▆▇▇████▆▆▅▅▃▃▂▂
wandb:                  x/lr2 ██▇▇▇▇▆▆▅▅▄▄▃▃▃▃▂▂▁▁
wandb: 
wandb: Run summary:
wandb:             best/epoch 9
wandb:           best/mAP_0.5 0.01202
wandb:      best/mAP_0.5:0.95 0.00281
wandb:         best/precision 0.0254
wandb:            best/recall 0.0875
wandb:        metrics/mAP_0.5 0.01202
wandb:   metrics/mAP_0.5:0.95 0.00281
wandb:      metrics/precision 0.0254
wandb:         metrics/recall 0.0875
wandb:         train/box_loss 0.09265
wandb:         train/cls_loss 0.0
wandb:         train/obj_loss 0.07102
wandb:           val/box_loss 0.08335
wandb:           val/cls_loss 0.0
wandb:           val/obj_loss 0.14715
wandb:                  x/lr0 0.00031
wandb:                  x/lr1 0.00031
wandb:                  x/lr2 0.07877
wandb: 
wandb: Synced 5 W&B file(s), 64 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Synced happy-wave-17: https://wandb.ai/lzziuhh/evolve/runs/grrpvh1q
wandb: Find logs at: ./wandb/run-20220109_023148-grrpvh1q/logs/debug.log
wandb: 
wandb: Tracking run with wandb version 0.12.9
wandb: Syncing run noble-sponge-18
wandb: ⭐️ View project at https://wandb.ai/lzziuhh/evolve
wandb: 🚀 View run at https://wandb.ai/lzziuhh/evolve/runs/3w2ab7d3
wandb: Run data is saved locally in /kaggle/working/yolov5/wandb/run-20220109_023935-3w2ab7d3
wandb: Run `wandb offline` to turn off syncing.

wandb: Waiting for W&B process to finish, PID 21424... (success).
wandb:                                                                                
wandb: Run history:
wandb:        metrics/mAP_0.5 ▁
wandb:   metrics/mAP_0.5:0.95 ▁
wandb:      metrics/precision ▁
wandb:         metrics/recall ▁
wandb:           val/box_loss ▁
wandb:           val/cls_loss ▁
wandb:           val/obj_loss ▁
wandb: 
wandb: Run summary:
wandb:        metrics/mAP_0.5 0.01202
wandb:   metrics/mAP_0.5:0.95 0.00281
wandb:      metrics/precision 0.0254
wandb:         metrics/recall 0.0875
wandb:           val/box_loss 0.08335
wandb:           val/cls_loss 0.0
wandb:           val/obj_loss 0.14715
wandb: 
wandb: Synced 4 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Synced noble-sponge-18: https://wandb.ai/lzziuhh/evolve/runs/3w2ab7d3
wandb: Find logs at: ./wandb/run-20220109_023935-3w2ab7d3/logs/debug.log
wandb: 
wandb: Tracking run with wandb version 0.12.9
wandb: Syncing run deep-cherry-19
wandb: ⭐️ View project at https://wandb.ai/lzziuhh/evolve
wandb: 🚀 View run at https://wandb.ai/lzziuhh/evolve/runs/2t57yqlh
wandb: Run data is saved locally in /kaggle/working/yolov5/wandb/run-20220109_023946-2t57yqlh
wandb: Run `wandb offline` to turn off syncing.

Results saved to runs/evolve/exp2
evolve:    metrics/precision,       metrics/recall,      metrics/mAP_0.5, metrics/mAP_0.5:0.95,         val/box_loss,         val/obj_loss,         val/cls_loss,             anchor_t,                  box,                  cls,               cls_pw,           copy_paste,              degrees,             fl_gamma,               fliplr,               flipud,                hsv_h,                hsv_s,                hsv_v,                iou_t,                  lr0,                  lrf,                mixup,             momentum,               mosaic,                  obj,               obj_pw,          perspective,                scale,                shear,            translate,       warmup_bias_lr,        warmup_epochs,      warmup_momentum,         weight_decay,              anchors
evolve:             0.025397,             0.087503,             0.012024,            0.0028118,             0.083352,              0.14715,                    0,                    4,              0.04921,                  0.5,                    1,                    0,                    0,                    0,                  0.5,                    0,              0.01545,              0.68469,                  0.3,                  0.2,                 0.01,               0.0972,              0.45586,                0.937,              0.52414,              0.90521,                    1,                    0,                  0.7,                    0,              0.10506,              0.10589,               4.9193,               0.7835,              0.00051,                    3

hyperparameters: anchor_t=4.06295, box=0.04868, cls=0.5183, cls_pw=0.9943, copy_paste=0.0, degrees=0.0, fl_gamma=0.0, fliplr=0.5, flipud=0.0, hsv_h=0.01554, hsv_s=0.72562, hsv_v=0.3369, iou_t=0.2, lr0=0.01, lrf=0.1116, mixup=0.5, momentum=0.91729, mosaic=0.5072, obj=0.98705, obj_pw=0.97617, perspective=0.0, scale=0.69435, shear=0.0, translate=0.08572, warmup_bias_lr=0.09905, warmup_epochs=5.0, warmup_momentum=0.80387, weight_decay=0.0005, anchors=2.33706
Overriding model.yaml nc=80 with nc=1
Overriding model.yaml anchors with anchors=2.33706

                 from  n    params  module                                  arguments                     
  0                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]              
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  2                -1  1     18816  models.common.C3                        [64, 64, 1]                   
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  4                -1  2    115712  models.common.C3                        [128, 128, 2]                 
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  6                -1  3    625152  models.common.C3                        [256, 256, 3]                 
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
  8                -1  1   1182720  models.common.C3                        [512, 512, 1]                 
  9                -1  1    656896  models.common.SPPF                      [512, 512, 5]                 
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]          
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]          
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]              
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]          
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]          
 24      [17, 20, 23]  1     10788  models.yolo.Detect                      [1, [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]], [128, 256, 512]]
Model Summary: 270 layers, 7016932 parameters, 7016932 gradients, 15.8 GFLOPs

Transferred 342/349 items from yolov5s.pt
Scaled weight_decay = 0.00046875
optimizer: SGD with parameter groups 57 weight, 60 weight (no decay), 60 bias
albumentations: version 1.0.3 required by YOLOv5, but version 0.1.12 is currently installed
train: Scanning '/kaggle/working/COTS/labels/train.cache' images and labels... 510 found, 0 missing, 0 empty, 0 corrupted: 100% 510/510 [00:00<?, ?it/s]
train: Caching images (1.4GB ram): 100% 510/510 [00:02<00:00, 200.24it/s]
val: Scanning '/kaggle/working/COTS/labels/valid.cache' images and labels... 786 found, 0 missing, 0 empty, 0 corrupted: 100% 786/786 [00:00<?, ?it/s]

AutoAnchor: 0.00 anchors/target, 0.000 Best Possible Recall (BPR). Anchors are a poor fit to dataset ⚠️, attempting to improve...
AutoAnchor: Running kmeans for 6 anchors on 1522 points...
AutoAnchor: Evolving anchors with Genetic Algorithm: fitness = 0.8613: 100% 1000/1000 [00:08<00:00, 114.67it/s]
AutoAnchor: thr=0.25: 1.0000 best possible recall, 5.42 anchors past thr
AutoAnchor: n=6, img_size=1280, metric_all=0.570/0.861-mean/best, past_thr=0.611-mean: 25,26, 34,34, 46,40, 45,56, 71,58, 157,134
AutoAnchor: New anchors saved to model. Update model *.yaml to use these anchors in the future.
Image sizes 1280 train, 1280 val
Using 8 dataloader workers
Logging results to runs/evolve/exp2
Starting training for 10 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       0/9     15.2G    0.1147   0.07954         0        64      1280: 100% 26/26 [00:41<00:00,  1.59s/it]
Traceback (most recent call last):
  File "train.py", line 636, in <module>
    main(opt)
  File "train.py", line 614, in main
    results = train(hyp.copy(), opt, device, callbacks)
  File "train.py", line 383, in train
    callbacks.run('on_fit_epoch_end', log_vals, epoch, best_fitness, fi)
  File "/kaggle/working/yolov5/utils/callbacks.py", line 77, in run
    logger['callback'](*args, **kwargs)
  File "/kaggle/working/yolov5/utils/loggers/__init__.py", line 132, in on_fit_epoch_end
    self.wandb.wandb_run.summary[name] = best_results[i]  # log best results in the summary
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/wandb_summary.py", line 54, in __setitem__
    self.update({key: val})
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/wandb_summary.py", line 76, in update
    self._update(record)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/wandb_summary.py", line 131, in _update
    self._update_callback(record)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/wandb_run.py", line 966, in _summary_update_callback
    self._backend.interface.publish_summary(summary_record)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface.py", line 313, in publish_summary
    self._publish_summary(pb_summary_record)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface_shared.py", line 282, in _publish_summary
    self._publish(rec)
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/interface/interface_queue.py", line 45, in _publish
    raise Exception("The wandb backend process has shutdown")
Exception: The wandb backend process has shutdown
[The identical traceback above is repeated three more times.]

wandb: Waiting for W&B process to finish, PID 21479... (failed 1). Press ctrl-c to abort syncing.
wandb:                                                                                
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Synced deep-cherry-19: https://wandb.ai/lzziuhh/evolve/runs/2t57yqlh
wandb: Find logs at: ./wandb/run-20220109_023946-2t57yqlh/logs/debug.log
wandb: 

cc @awaelchli @morganmcg1 @AyushExel @borisdayma @scottire

carmocca commented 2 years ago

Hi! Can you share a reproducible script?

What are your Lightning and wandb versions?

From the stack trace, it looks like it's more of a wandb problem than a Lightning one.

lzziuhh commented 2 years ago

Yes, it looks like a wandb problem. I will check with them.

AyushExel commented 2 years ago

@lzziuhh Hey! A member of the W&B team here. Thanks for reporting this. Is this reproducible for you? Is the script always crashing after 10 epochs? If so, this might be a problem with the YOLOv5 evolve integration. I'll take a look and get back to you.
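
My current suspicion, from the traceback: evolve starts a fresh W&B run each generation, but the logger keeps a handle to the previous, already-finished run and then writes to its summary. A minimal sketch of that failure mode (unverified; the project name is made up, and depending on the wandb version this may surface as the same exception or a similar stale-handle error):

```python
import wandb

# Suspected failure mode (unverified sketch): keep a handle to a run,
# finish it, then write to its summary after the backend has shut down.
run = wandb.init(project="evolve-debug", mode="offline")  # offline, so no login is needed
run.finish()                       # the backend process exits here
run.summary["best/mAP_0.5"] = 0.5  # expected to raise:
                                   # Exception: The wandb backend process has shutdown
```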

lzziuhh commented 2 years ago

Thanks for the reply. Yes, this is reproducible. It always crashes after 20 epochs, at the beginning of the 3rd round of the evolve. It does not happen when I use evolve without W&B.
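
For now I work around it by disabling W&B before launching evolve. A sketch of what I do (assuming the `WANDB_MODE` environment variable; running `wandb disabled` in the shell should be equivalent):

```python
import os

# Turn W&B off for this process *before* train.py imports wandb,
# so every evolve generation runs with the W&B logger disabled.
os.environ["WANDB_MODE"] = "disabled"
```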

Thanks & regards, Liu

AyushExel commented 2 years ago

@lzziuhh Thanks. I'll look into it this week.
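
If the stale-handle theory above is right, the fix on the YOLOv5 side is probably to guard the summary update so it only touches the live run. An illustrative sketch, not the actual patch (`best_keys` and `best_results` just mirror the names in the traceback):

```python
import wandb

def update_best_summary(wandb_run, best_keys, best_results):
    """Write best results to the run summary only while the stored run
    is still the live one. Between evolve generations the previous run
    has finished and its backend process is gone, so writing to its
    summary raises "The wandb backend process has shutdown"."""
    if wandb_run is not None and wandb_run is wandb.run:
        for key, value in zip(best_keys, best_results):
            wandb_run.summary[key] = value
```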