Step in wandb charts does not match the actual training epochs

Wanghe1997 commented 3 years ago

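Hello. I set the --epochs argument in train.py to 1000, but in wandb the x-axis (Step) of every chart runs from 0 to 12K, which does not match. Why is that, and how can I fix it? The wandb Logs also show training running to epoch 1000, so why does Step max out at 12K? Thanks. [Screenshots: wandb1, wandb2, wandb3]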

WongKinYiu commented 3 years ago

My guess is that your batch size is set to 8.

Wanghe1997 commented 3 years ago

My batch size is set to 16. But what does that have to do with the wandb Step?

WongKinYiu commented 3 years ago

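total images ~= 93 * batch
total training samples = 1000 * total images
samples per update = 64
total training samples = samples per update * number of updates
1000 * 93 * batch = 64 * 12000
That works out to a batch size of about 8.25.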
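The arithmetic in this reply can be checked with a few lines of Python. This is only a sketch of the estimate as stated in the comment; the variable names are illustrative, and the inputs (93 iterations per epoch, 64 samples per update, roughly 12000 steps) are the figures quoted above, not values read from the code.

```python
# Figures quoted in the comment above (names are hypothetical).
epochs = 1000
iters_per_epoch = 93        # total images ~= 93 * batch
samples_per_update = 64     # samples consumed per optimizer update
observed_steps = 12_000     # approximate Step maximum seen in wandb

# Solve 1000 * 93 * batch = 64 * 12000 for batch.
implied_batch = samples_per_update * observed_steps / (epochs * iters_per_epoch)
print(f"implied batch size: {implied_batch:.2f}")
```

The estimate only holds if the wandb Step counts optimizer updates, which later replies in the thread show is not the case.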

Wanghe1997 commented 3 years ago

So where do you think the problem is, and how should I fix it? In the wandb-summary.json file, the _step value is 12899.

WongKinYiu commented 3 years ago

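Are you training on a single GPU? I don't use wandb for my own training, but if you give me your training command I can work it out from the code.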

Wanghe1997 commented 3 years ago

Yes, single-GPU training. Isn't wandb enabled by default in your code? How do I turn it off, or use plain TensorBoard instead? Your README doesn't say.

Wanghe1997 commented 3 years ago

I'm using the latest code from your repository page. Which commands or screenshots do you need? I can provide them.

WongKinYiu commented 3 years ago

Your training command and the number of training images.

Wanghe1997 commented 3 years ago

opt.yaml:

    weights: weights/yolov4.weights
    cfg: cfg/yolov4.cfg
    data: data/garbage.yaml
    hyp: data/hyp.scratch.yaml
    epochs: 1000
    batch_size: 16
    img_size:
    - 640
    - 640
    rect: false
    resume: false
    nosave: false
    notest: false
    noautoanchor: false
    evolve: false
    bucket: ''
    cache_images: false
    image_weights: false
    device: '0'
    multi_scale: false
    single_cls: false
    adam: false
    sync_bn: false
    local_rank: -1
    log_imgs: 16
    workers: 8
    project: runs/train
    name: yolov4-puxiang
    exist_ok: false
    total_batch_size: 16
    world_size: 1
    global_rank: -1
    save_dir: runs\train\yolov4-puxiang

train and val use the same images, 1480 in total.

garbage.yaml

    train: ./images/data0904/images/train/
    val: ./images/data0904/images/val/
    nc: 10
    names: ['closed', 'open', 'gripper','closestool','mattress','water','gascan','wood','algam','concreteblock']

Training command:

python train.py --device 0 --batch-size 16 --img 640 640 --data data/garbage.yaml --cfg cfg/yolov4.cfg --weights weight/yolov4.weights --name yolov4-puxiang

hyp.scratch.yaml

    lr0: 0.001  # initial learning rate (SGD=1E-2, Adam=1E-3)
    lrf: 0.2  # final OneCycleLR learning rate (lr0 * lrf)
    warmup_epochs: 3.0  # warmup epochs (fractions ok)
    warmup_momentum: 0.8  # warmup initial momentum
    warmup_bias_lr: 0.1  # warmup initial bias lr
    momentum: 0.937  # SGD momentum/Adam beta1
    weight_decay: 0.0005  # optimizer weight decay 5e-4
    giou: 0.05  # GIoU loss gain
    cls: 0.3  # cls loss gain
    cls_pw: 1.0  # cls BCELoss positive_weight
    obj: 0.7  # obj loss gain (scale with pixels)
    obj_pw: 1.0  # obj BCELoss positive_weight
    iou_t: 0.20  # IoU training threshold
    anchor_t: 4.0  # anchor-multiple threshold
    fl_gamma: 0.0  # focal loss gamma (efficientDet default gamma=1.5)
    hsv_h: 0.015  # image HSV-Hue augmentation (fraction)
    hsv_s: 0.7  # image HSV-Saturation augmentation (fraction)
    hsv_v: 0.4  # image HSV-Value augmentation (fraction)
    degrees: 0.0  # image rotation (+/- deg)
    translate: 0.1  # image translation (+/- fraction)
    scale: 0.9  # image scale (+/- gain)
    shear: 0.0  # image shear (+/- deg)
    perspective: 0.0  # image perspective (+/- fraction), range 0-0.001
    flipud: 0.0  # image flip up-down (probability)
    fliplr: 0.5  # image flip left-right (probability)
    mosaic: 1.0  # image mosaic (probability)
    mixup: 1.0  # image mixup (probability)

PS: I changed lr0 to 0.001 and added mixup: 1.0.

WongKinYiu commented 3 years ago

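Every call to wandb.log counts as one step. It looks like the for loop at this line iterates over 13 items, so you end up with nearly 13 times as many steps. https://github.com/WongKinYiu/PyTorch_YOLOv4/blob/master/train.py#L361 You can collect those 13 items into a dictionary and log them in a single call; that should fix it.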

https://github.com/WongKinYiu/PyTorch_YOLOv4/issues/364#issuecomment-923857145 That comment calculated the number of network update steps, which is unrelated to wandb's steps. wandb simply increments its step by 1 on every log call.
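To make the step counting concrete, here is a small simulation that does not need wandb installed. `StepCounter` is a hypothetical stand-in that mimics only the behavior described above (one step per log call); the 13 tag names are placeholders, not the real tags from train.py.

```python
class StepCounter:
    """Hypothetical logger: advances its step by one on every log() call."""

    def __init__(self):
        self.step = 0

    def log(self, metrics):
        # Only the call count matters here; the values are ignored.
        self.step += 1


EPOCHS = 1000
TAGS = [f"metric_{i}" for i in range(13)]  # placeholder names

# One log() call per metric: 13 calls per epoch.
per_metric = StepCounter()
for epoch in range(EPOCHS):
    for tag in TAGS:
        per_metric.log({tag: 0.0})

# One log() call per epoch with all metrics in a single dict.
batched = StepCounter()
for epoch in range(EPOCHS):
    batched.log({tag: 0.0 for tag in TAGS})

print(per_metric.step, batched.step)  # 13000 1000
```

The per-metric count of 13000 is in the same range as the _step value of 12899 reported earlier in the thread, while batching gives one step per epoch.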

Wanghe1997 commented 3 years ago

Isn't this your own code? Why wasn't this problem noticed before? 😂 I just git cloned the code from your repository page.

Wanghe1997 commented 3 years ago

I don't know how to change it 😂

WongKinYiu commented 3 years ago

Something like this, I think. I don't have wandb installed, so you will have to run it and see.

https://github.com/WongKinYiu/PyTorch_YOLOv4/blob/master/train.py#L357-L361

            for x, tag in zip(list(mloss[:-1]) + list(results) + lr, tags):
                if tb_writer:
                    tb_writer.add_scalar(tag, x, epoch)  # tensorboard
                if wandb:
                    wandb.log({tag: x})  # W&B

change it to

            if wandb:
                wandb_log_dict = {}
            for x, tag in zip(list(mloss[:-1]) + list(results) + lr, tags):
                if tb_writer:
                    tb_writer.add_scalar(tag, x, epoch)  # tensorboard
                if wandb:
                    wandb_log_dict[tag] = x
            if wandb:
                wandb.log(wandb_log_dict)  # W&B

Wanghe1997 commented 3 years ago

OK, thanks. I also found that running test.py produces the following error:

Traceback (most recent call last):
  File "test.py", line 319, in <module>
    test(opt.data,
  File "test.py", line 241, in test
    wandb.log({"Validation": [wandb.Image(str(x), caption=x.name) for x in sorted(save_dir.glob('test*.jpg'))]})
  File "E:\ProgramData\Anaconda3\envs\wanghe\lib\site-packages\wandb\sdk\lib\preinit.py", line 38, in preinit_wrapper
    raise wandb.Error("You must call wandb.init() before {}()".format(name))
wandb.errors.Error: You must call wandb.init() before wandb.log()

It's the wandb code again. I had to comment out lines 239 to 241 of test.py before test.py would run normally.
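An alternative to commenting the lines out is to log only when a W&B run is actually active. The sketch below is an assumption, not the repository's fix: `log_validation_images` is a hypothetical helper, and it relies on `wandb.run` being `None` until `wandb.init()` has been called. The module is passed in as an argument so the example also runs where wandb is not installed.

```python
from pathlib import Path
from types import SimpleNamespace


def log_validation_images(wandb_module, save_dir):
    """Log test*.jpg files to W&B only if a run is active; return how many were logged."""
    # Skip when wandb is not installed (None) or wandb.init() was never called.
    if wandb_module is None or getattr(wandb_module, "run", None) is None:
        return 0
    images = [
        wandb_module.Image(str(path), caption=path.name)
        for path in sorted(Path(save_dir).glob("test*.jpg"))
    ]
    wandb_module.log({"Validation": images})
    return len(images)


# Stand-in for a wandb module on which init() has not been called.
no_run = SimpleNamespace(run=None)
print(log_validation_images(no_run, "runs/test"))  # 0, nothing is logged
```

In test.py the call would pass the imported `wandb` module (or `None` if the import failed) together with `save_dir`.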

Wanghe1997 commented 3 years ago

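Another question: since you don't have wandb installed, how do you monitor the curves for loss, accuracy, and other metrics during training? Do you use TensorBoard to read the events logs produced during training?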

WongKinYiu commented 3 years ago

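I just read results.txt and rarely use a graphical interface. That said, the algorithm in metrics.py is not entirely accurate either. The loss values in results.txt are basically correct; for AP and recall I always use pycocotools.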

Wanghe1997 commented 3 years ago

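Doesn't results.txt record the metrics for every epoch? How do you compute them with pycocotools? Could you show me? Thanks. Is your method to run test.py and get results like the ones from my run? [Screenshot: test]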

WongKinYiu commented 3 years ago

You need to convert the labels of your custom data to COCO JSON format as well before pycocotools can compute them.
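As a rough illustration of that conversion, the sketch below turns YOLO-style txt labels into a COCO-style ground-truth dictionary. It assumes each label line is `class x_center y_center width height` with coordinates normalized to the image size; the function names and the input layout (`labels`, `sizes`) are hypothetical. Annotation and image ids start at 1 here, since an annotation id of 0 is commonly reported to confuse the matching in pycocotools.

```python
import json


def yolo_line_to_coco_bbox(line, img_w, img_h):
    """Convert 'cls cx cy w h' (normalized) to (cls, [x_min, y_min, width, height]) in pixels."""
    cls, cx, cy, w, h = line.split()
    w_px, h_px = float(w) * img_w, float(h) * img_h
    x_min = float(cx) * img_w - w_px / 2
    y_min = float(cy) * img_h - h_px / 2
    return int(cls), [x_min, y_min, w_px, h_px]


def build_coco_dataset(labels, sizes, names):
    """labels: {file_name: [label lines]}, sizes: {file_name: (width, height)}."""
    images, annotations = [], []
    for image_id, (file_name, lines) in enumerate(sorted(labels.items()), start=1):
        img_w, img_h = sizes[file_name]
        images.append({"id": image_id, "file_name": file_name,
                       "width": img_w, "height": img_h})
        for line in lines:
            cls, bbox = yolo_line_to_coco_bbox(line, img_w, img_h)
            annotations.append({
                "id": len(annotations) + 1,   # ids start at 1
                "image_id": image_id,
                "category_id": cls,           # must match the ids used in predictions
                "bbox": bbox,
                "area": bbox[2] * bbox[3],
                "iscrowd": 0,
            })
    categories = [{"id": i, "name": name} for i, name in enumerate(names)]
    return {"images": images, "annotations": annotations, "categories": categories}


# Tiny made-up example: one 640x480 image with one box of class 2.
dataset = build_coco_dataset(
    labels={"img_0001.jpg": ["2 0.5 0.5 0.25 0.5"]},
    sizes={"img_0001.jpg": (640, 480)},
    names=["closed", "open", "gripper"],
)
print(json.dumps(dataset["annotations"][0]))
```

The resulting dictionary can be written out with `json.dump` and used as the ground-truth file for pycocotools.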

Wanghe1997 commented 3 years ago

I see. If the labels are txt files, is there any other way to compute AP and recall? Is there a tutorial anywhere for the pycocotools method you mentioned?

WongKinYiu commented 3 years ago

https://medium.datadriveninvestor.com/how-to-create-custom-coco-data-set-for-object-detection-96ec91958f36

Wanghe1997 commented 3 years ago

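Thanks. That article covers converting VOC to the JSON format. Once I have the JSON file, how do I use pycocotools to compute AP and the other metrics? Is there a tutorial for that second part?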

WongKinYiu commented 3 years ago

https://github.com/WongKinYiu/PyTorch_YOLOv4/blob/master/test.py#L266-L280
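For readers who cannot open the link, the usual pycocotools workflow looks roughly like the sketch below. It is not a copy of the linked test.py lines; the helper names and the file paths in the usage note are hypothetical. The pycocotools import sits inside the function so the helper above it can be used without the package installed.

```python
def xyxy_to_coco_result(image_id, category_id, box_xyxy, score):
    """Format one prediction as a COCO result entry (bbox is [x_min, y_min, width, height])."""
    x1, y1, x2, y2 = box_xyxy
    return {
        "image_id": image_id,
        "category_id": category_id,
        "bbox": [x1, y1, x2 - x1, y2 - y1],
        "score": float(score),
    }


def evaluate_bbox(gt_json, pred_json):
    """Run COCO bbox evaluation and return (AP@[.5:.95], AP@.5)."""
    from pycocotools.coco import COCO
    from pycocotools.cocoeval import COCOeval

    coco_gt = COCO(gt_json)               # ground-truth annotations
    coco_dt = coco_gt.loadRes(pred_json)  # predictions: a JSON list of result entries
    evaluator = COCOeval(coco_gt, coco_dt, "bbox")
    evaluator.evaluate()
    evaluator.accumulate()
    evaluator.summarize()                 # prints the standard 12-line summary
    return evaluator.stats[0], evaluator.stats[1]
```

A call such as `evaluate_bbox("val_gt.json", "predictions.json")` (both paths are placeholders) prints the summary table; the average recall values are also in `evaluator.stats`, for example index 8 for AR with up to 100 detections per image.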

shyhyawJou commented 3 years ago

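Hello. Hmm... metrics.py is not necessarily accurate? So the validation mAP, best fitness, and similar values shown during training are wrong too? But as far as I can see, your code saves best.pt based on those metrics... And isn't results.txt just those same values printed to a file?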

WongKinYiu commented 3 years ago

The relative ranking under the same threshold is fine. But raising the threshold sometimes makes AP go up, which should not happen. Something probably went wrong in one of the three pieces of source code used, or in how they were combined.
https://github.com/WongKinYiu/PyTorch_YOLOv4/blob/master/utils/metrics.py#L45
https://github.com/WongKinYiu/PyTorch_YOLOv4/blob/master/utils/metrics.py#L66
https://github.com/WongKinYiu/PyTorch_YOLOv4/blob/master/utils/metrics.py#L116
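The reason AP should not rise with the threshold can be seen with a textbook all-point interpolated AP. The sketch below is not the code in utils/metrics.py, and the detections are made up. Raising the confidence threshold keeps only a prefix of the score-ranked detections, so it can remove points from the precision-recall curve but never add area under it.

```python
def average_precision(scores, is_tp, n_gt, conf_thres=0.0):
    """All-point interpolated AP over detections with score >= conf_thres."""
    dets = sorted(
        ((s, hit) for s, hit in zip(scores, is_tp) if s >= conf_thres),
        key=lambda d: -d[0],
    )
    recall, precision = [], []
    tp = fp = 0
    for _, hit in dets:
        if hit:
            tp += 1
        else:
            fp += 1
        recall.append(tp / n_gt)
        precision.append(tp / (tp + fp))
    # Sentinels: the curve starts at recall 0 and drops to precision 0 at recall 1.
    mrec = [0.0] + recall + [1.0]
    mpre = [0.0] + precision + [0.0]
    # Precision envelope: make precision non-increasing from right to left.
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Sum rectangle areas wherever recall changes.
    return sum((mrec[i + 1] - mrec[i]) * mpre[i + 1] for i in range(len(mrec) - 1))


# Made-up detections: confidence scores and whether each one matched a ground truth.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
hits = [True, False, True, True, False]

ap_all = average_precision(scores, hits, n_gt=4)
ap_high = average_precision(scores, hits, n_gt=4, conf_thres=0.75)
print(ap_all, ap_high)  # 0.625 0.25
```

An implementation that interpolates the curve differently past the last recall point, or chooses different sentinel values, can behave otherwise, which is one possible source of the effect described above.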

shyhyawJou commented 3 years ago

I think I understand now. You mean the metric-computation code was taken from code written by others (from R-CNN?), and you have found that this code seems to have a bug?

WongKinYiu commented 3 years ago

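The PyTorch version is based on the Ultralytics ("u-version") YOLO, and each release of that codebase computes different AP and related values, so I still recommend always using pycocotools to measure accuracy.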

shyhyawJou commented 3 years ago

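Understood. Are there any other parts that might have problems? Also, looking at your test.py, it actually calls COCO API functions or classes to compute mAP@0.5 and mAP@0.5:0.95. So do you mean that those two metrics are fine and the others may be off?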
