train wandb error - Githubissues

Deep-learning999 commented 1 year ago

!python train.py --data data/coco_kpts.yaml --cfg cfg/yolov7-w6-pose.yaml --weights weights/yolov7-w6-person.pt --batch-size 128 --img 960 --kpt-label --sync-bn --device 0 --name yolov7-w6-pose --hyp data/hyp.pose.yaml

Namespace(adam=False, artifact_alias='latest', batch_size=128, bbox_interval=-1, bucket='', cache_images=False, cfg='cfg/yolov7-w6-pose.yaml', data='data/coco_kpts.yaml', device='0', entity=None, epochs=300, evolve=False, exist_ok=False, global_rank=-1, hyp='data/hyp.pose.yaml', image_weights=False, img_size=[960, 960], kpt_label=True, label_smoothing=0.0, linear_lr=False, local_rank=-1, multi_scale=False, name='yolov7-w6-pose', noautoanchor=False, nosave=False, notest=False, project='runs/train', quad=False, rect=False, resume=False, save_dir='runs\train\yolov7-w6-pose', save_period=-1, single_cls=False, sync_bn=True, total_batch_size=128, upload_dataset=False, weights='weights/yolov7-w6-person.pt', workers=8, world_size=1)
tensorboard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
hyperparameters: lr0=0.01, lrf=0.1, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, kpt=0.1, cls=0.3, cls_pw=1.0, obj=0.7, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0
Traceback (most recent call last):
  File "E:\anaconda3\envs\sleap\lib\site-packages\wandb\sdk\wandb_init.py", line 1040, in init
    wi.setup(kwargs)
  File "E:\anaconda3\envs\sleap\lib\site-packages\wandb\sdk\wandb_init.py", line 151, in setup
    self._wl = wandb_setup.setup()
  File "E:\anaconda3\envs\sleap\lib\site-packages\wandb\sdk\wandb_setup.py", line 320, in setup
    ret = _setup(settings=settings)
  File "E:\anaconda3\envs\sleap\lib\site-packages\wandb\sdk\wandb_setup.py", line 315, in _setup
    wl = _WandbSetup(settings=settings)
  File "E:\anaconda3\envs\sleap\lib\site-packages\wandb\sdk\wandb_setup.py", line 301, in __init__
    _WandbSetup._instance = _WandbSetup__WandbSetup(settings=settings, pid=pid)
  File "E:\anaconda3\envs\sleap\lib\site-packages\wandb\sdk\wandb_setup.py", line 114, in __init__
    self._setup()
  File "E:\anaconda3\envs\sleap\lib\site-packages\wandb\sdk\wandb_setup.py", line 242, in _setup
    self._setup_manager()
  File "E:\anaconda3\envs\sleap\lib\site-packages\wandb\sdk\wandb_setup.py", line 273, in _setup_manager
    self._manager = wandb_manager._Manager(
  File "E:\anaconda3\envs\sleap\lib\site-packages\wandb\sdk\wandb_manager.py", line 106, in __init__
    self._service.start()
  File "E:\anaconda3\envs\sleap\lib\site-packages\wandb\sdk\service\service.py", line 106, in start
    self._launch_server()
  File "E:\anaconda3\envs\sleap\lib\site-packages\wandb\sdk\service\service.py", line 102, in _launch_server
    assert ports_found
AssertionError
wandb: ERROR Abnormal program exit
Traceback (most recent call last):
  File "E:\anaconda3\envs\sleap\lib\site-packages\wandb\sdk\wandb_init.py", line 1040, in init
    wi.setup(kwargs)
  File "E:\anaconda3\envs\sleap\lib\site-packages\wandb\sdk\wandb_init.py", line 151, in setup
    self._wl = wandb_setup.setup()
  File "E:\anaconda3\envs\sleap\lib\site-packages\wandb\sdk\wandb_setup.py", line 320, in setup
    ret = _setup(settings=settings)
  File "E:\anaconda3\envs\sleap\lib\site-packages\wandb\sdk\wandb_setup.py", line 315, in _setup
    wl = _WandbSetup(settings=settings)
  File "E:\anaconda3\envs\sleap\lib\site-packages\wandb\sdk\wandb_setup.py", line 301, in __init__
    _WandbSetup._instance = _WandbSetup__WandbSetup(settings=settings, pid=pid)
  File "E:\anaconda3\envs\sleap\lib\site-packages\wandb\sdk\wandb_setup.py", line 114, in __init__
    self._setup()
  File "E:\anaconda3\envs\sleap\lib\site-packages\wandb\sdk\wandb_setup.py", line 242, in _setup
    self._setup_manager()
  File "E:\anaconda3\envs\sleap\lib\site-packages\wandb\sdk\wandb_setup.py", line 273, in _setup_manager
    self._manager = wandb_manager._Manager(
  File "E:\anaconda3\envs\sleap\lib\site-packages\wandb\sdk\wandb_manager.py", line 106, in __init__
    self._service.start()
  File "E:\anaconda3\envs\sleap\lib\site-packages\wandb\sdk\service\service.py", line 106, in start
    self._launch_server()
  File "E:\anaconda3\envs\sleap\lib\site-packages\wandb\sdk\service\service.py", line 102, in _launch_server
    assert ports_found
AssertionError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "train.py", line 562, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 72, in train
    wandb_logger = WandbLogger(opt, save_dir.stem, run_id, data_dict)
  File "C:\Users\designlion\notebooks\0000098yolov7-pose\utils\wandb_logging\wandb_utils.py", line 95, in __init__
    self.wandb_run = wandb.init(config=opt,
  File "E:\anaconda3\envs\sleap\lib\site-packages\wandb\sdk\wandb_init.py", line 1081, in init
    raise Exception("problem") from error_seen
Exception: problem

Deep-learning999 commented 1 year ago

Does it need a VPN (to get past the firewall) to work?

akashAD98 commented 1 year ago

@Deep-learning999 Log in to wandb; otherwise, disable wandb.
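One way to act on the "disable wandb" suggestion is through the environment, before train.py initializes its logger. The sketch below assumes the installed wandb release honors the `WANDB_MODE` environment variable with the value `disabled`; whether that avoids the `assert ports_found` failure in this particular setup is not confirmed in the thread. The helper name `disable_wandb` is hypothetical and is not part of wandb or of this repository.

```python
import os


def disable_wandb(env=None):
    """Set WANDB_MODE=disabled in the given mapping (defaults to os.environ).

    Hypothetical helper for illustration; it only writes an environment
    variable and does not import or call wandb itself.
    """
    target = os.environ if env is None else env
    target["WANDB_MODE"] = "disabled"
    return target


if __name__ == "__main__":
    # Run this before launching train.py from the same process or notebook,
    # so that the child process inherits the setting.
    disable_wandb()
    print(os.environ["WANDB_MODE"])
```

In a notebook, calling `disable_wandb()` in a cell before the `!python train.py ...` cell should pass the variable on to the training process, since child processes inherit the kernel's environment.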

Deep-learning999 commented 1 year ago

This is the only output after running for 3 hours:

  0                -1  1      3520  models.common.Focus                     [3, 32, 3]
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]
  2                -1  1     18816  models.common.C3                        [64, 64, 1]
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
  4                -1  1    156928  models.common.C3                        [128, 128, 3]
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
  6                -1  1    625152  models.common.C3                        [256, 256, 3]
  7                -1  1    885504  models.common.Conv                      [256, 384, 3, 2]
  8                -1  1    665856  models.common.C3                        [384, 384, 1]
  9                -1  1   1770496  models.common.Conv                      [384, 512, 3, 2]
 10                -1  1    656896  models.common.SPP                       [512, 512, [3, 5, 7]]
 11                -1  1   1182720  models.common.C3                        [512, 512, 1, False]
 12                -1  1    197376  models.common.Conv                      [512, 384, 1, 1]
 13                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 14           [-1, 8]  1         0  models.common.Concat                    [1]
 15                -1  1    813312  models.common.C3                        [768, 384, 1, False]
 16                -1  1     98816  models.common.Conv                      [384, 256, 1, 1]
 17                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 18           [-1, 6]  1         0  models.common.Concat                    [1]
 19                -1  1    361984  models.common.C3                        [512, 256, 1, False]
 20                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]
 21                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 22           [-1, 4]  1         0  models.common.Concat                    [1]
 23                -1  1     90880  models.common.C3                        [256, 128, 1, False]
 24                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]
 25          [-1, 20]  1         0  models.common.Concat                    [1]
 26                -1  1    296448  models.common.C3                        [256, 256, 1, False]
 27                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]
 28          [-1, 16]  1         0  models.common.Concat                    [1]
 29                -1  1    715008  models.common.C3                        [512, 384, 1, False]
 30                -1  1   1327872  models.common.Conv                      [384, 384, 3, 2]
 31          [-1, 12]  1         0  models.common.Concat                    [1]
 32                -1  1   1313792  models.common.C3                        [768, 512, 1, False]
 33  [23, 26, 29, 32]  1   2774444  models.yolo.Detect                      [1, [[19, 27, 44, 40, 38, 94], [96, 68, 86, 152, 180, 137], [140, 301, 303, 264, 238, 542], [436, 615, 739, 380, 925, 792]], 17, [128, 256, 384, 512]]
E:\anaconda3\envs\sleap\lib\site-packages\torch\functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ..\aten\src\ATen\native\TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Model Summary: 557 layers, 15114860 parameters, 15114860 gradients, 20.5 GFLOPS
Transferred 470/744 items from Yolov5s6_person_640.pt
Scaled weight_decay = 0.0005
Optimizer groups: 129 .bias, 129 conv.weight, 121 other
train: Scanning 'coco_kpts\train2017' images and labels... 56599 found, 0 missing, 0 empty, 0 corrupted: 100%|██████████| 56599/56599 [27:10<00:00, 34.71it/s]
train: New cache created: coco_kpts\train2017.cache

Deep-learning999 commented 1 year ago

Epoch gpu_mem box obj cls kpt kptv total labels img_size
29/299 18G 0.0375 0.02693 0 0.2554 0.01235 0.3322
Class Images Labels P R mAP@.5
Traceback (most recent call last):
  File "train.py", line 550, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 357, in train
    results, maps, times = test.test(data_dict,
  File "/hy-nas/0010edgeai-yolov5-yolo-pose2/test.py", line 196, in test
    box_data = [{"position": {"minX": xyxy[0], "minY": xyxy[1], "maxX": xyxy[2], "maxY": xyxy[3]},
  File "/hy-nas/0010edgeai-yolov5-yolo-pose2/test.py", line 198, in <listcomp>
    "box_caption": "%s %.3f" % (names[cls], conf),
KeyError: 0.77587890625
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb:
wandb:
wandb:
wandb: Run history:
wandb: metrics/mAP_0.5 ▇▇▆▆█▃▅▅▇▅▇██▆▅▄▆▆█▁▂▄▅▃▅▂▁▂▆
wandb: metrics/mAP_0.5:0.95 ▇▇▇▇▇▇██▇▇▆▆▆▇▇▇▇▃▂▁▅▃▄▅▄▃▆▅▃
wandb: metrics/precision ██▇▇▇▇▆▆▆▅▅▅▄▄▄▃▃▃▃▂▂▂▂▂▂▂▁▁▁
wandb: metrics/recall █▇▇▇▇▇▇▇▇▆▆▅▅▅▅▄▄▄▃▃▃▂▂▂▂▂▁▁▁
wandb: train/box_loss █▆▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▁▁▁▁▁
wandb: train/cls_loss ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: train/obj_loss █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: val/box_loss █████▇█▇▇▇▇▇▆▆▆▆▆▅▄▁▅▄▄▄▅▄▅▃▃
wandb: val/cls_loss █▇▇█▇▆▃▃▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▁▂▂
wandb: val/obj_loss ▇██▇█▅▇▅▆▅▆▆▅▅▅▆▆▄▃▂▄▃▃▃▃▂▂▁▁
wandb: x/lr0 █▇▆▆▆▇▆▃▂▂▂▁▁▂▁▁▁▂▁▁▃▁▂▂▂▂▃▂▂
wandb: x/lr1 ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: x/lr2 ▁▁▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇███
wandb:
wandb: Run summary:
wandb: metrics/mAP_0.5 0.85741
wandb: metrics/mAP_0.5:0.95 0.73603
wandb: metrics/precision 0.25729
wandb: metrics/recall 0.01245
wandb: train/box_loss 0.03601
wandb: train/cls_loss 0.0
wandb: train/obj_loss 0.02597
wandb: val/box_loss 0.82034
wandb: val/cls_loss 0.03751
wandb: val/obj_loss 0.50831
wandb: x/lr0 0.01829
wandb: x/lr1 0.0
wandb: x/lr2 0.00683
wandb:
wandb: Synced exp7: https://wandb.ai/deeplearn888/YOLOv5/runs/te52ex3c
wandb: Synced 5 W&B file(s), 2 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20220904_021446-te52ex3c/logs

June1124 commented 1 year ago

Hello, did you solve this problem?

June1124 commented 1 year ago

I tried training several times, and each run fails with an error once training reaches epoch 69. (Attached screenshot: 微信图片_20230223085648)

qiaiqiai2 commented 1 year ago

line 198, "box_caption": "%s %.3f" % (names[cls], conf) -> "box_caption": "%s %.3f" % (names[int(cls)], conf)
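To see why the cast makes the `KeyError` go away, here is a minimal sketch of the lookup in isolation. It assumes `names` is a dict keyed by integer class index (a float index into a list would raise `TypeError`, not `KeyError`); the `names` contents, the `conf` value, and the two helper functions are made up for illustration, and only the float `0.77587890625` and the format string come from the traceback above.

```python
# Hypothetical stand-ins for the values in test.py; only the lookup
# pattern and the failing float are taken from the thread.
names = {0: "person"}    # assumed: class names stored as a dict keyed by int
cls = 0.77587890625      # the float reported in the KeyError
conf = 0.9               # made-up confidence, for illustration only


def caption_original(names, cls, conf):
    # Original line 198: the float is used directly as a dict key.
    return "%s %.3f" % (names[cls], conf)


def caption_fixed(names, cls, conf):
    # Proposed fix: cast to int before the lookup.
    return "%s %.3f" % (names[int(cls)], conf)


try:
    caption_original(names, cls, conf)
except KeyError as err:
    print("KeyError:", err)

print(caption_fixed(names, cls, conf))  # person 0.900
```

Note that `int()` truncates, so `0.77587890625` becomes `0`. The cast stops the crash, but the thread does not establish whether a value like this is really a class index, so the caption it produces may be worth checking.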

ChenxMa commented 5 months ago

line 198, "box_caption": "%s %.3f" % (names[cls], conf) -> "box_caption": "%s %.3f" % (names[int(cls)], conf)

This is a possible solution, but I still wonder why training stopped at epoch 29.

WongKinYiu / yolov7

train wandb error #689