beomgonyu opened 3 years ago
I have the same issue. It progressively gets slower for some reason.
Yes, I am also experiencing this, stuck at batch: 0
100%|██████████████████████████████████████████████████████████████████████████| 26/26 [00:26<00:00, 1.03s/it, loss=55]
100%|████████████████████████████████████████████████████████████████████████| 26/26 [00:22<00:00, 1.14it/s, loss=51.8]
100%|████████████████████████████████████████████████████████████████████████| 26/26 [00:23<00:00, 1.13it/s, loss=49.9]
100%|████████████████████████████████████████████████████████████████████████| 26/26 [00:22<00:00, 1.16it/s, loss=49.2]
100%|███████████████████████████████████████████████████████████████████████████████████| 26/26 [00:07<00:00, 3.64it/s]
Class accuracy is: 9.126985%
No obj accuracy is: 0.085230%
Obj accuracy is: 99.735451%
0%| | 0/26 [00:00<?, ?it/s]eval batch : 0
The way I interpret this is that all candidate boxes are over the threshold, so the evaluation takes forever. This can happen because of a very low threshold, or because at the beginning the objectness score is very high. If you look, the No obj accuracy is very low, which means that nearly all boxes are passed as containing an object. I don't know whether proper bias/weight initialization can fix this, or whether increasing the threshold would. One thing I tried is to run the evaluation only after 10 epochs, when the values have stabilized and no longer lead to so many positive boxes.
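To illustrate the effect (a hypothetical sketch, not the repo's code; the `[class_pred, score, ...]` box layout and the 10647 count follow this thread): with a low confidence threshold nearly every candidate box survives the filter, and the pairwise-IoU loop inside NMS then has to compare thousands of boxes.

```python
import random

# Hypothetical sketch (not the repo's code): how many of the 10647 YOLOv3
# candidate boxes survive the confidence filter before NMS runs.
random.seed(0)

# Early in training objectness scores are poorly calibrated; assume they
# are spread uniformly in [0, 1] purely for illustration.
predictions = [[0, random.random(), 0.5, 0.5, 0.1, 0.1] for _ in range(10647)]

def surviving_boxes(preds, conf_threshold):
    """Boxes that pass the confidence filter and enter the O(n^2) NMS loop."""
    return [p for p in preds if p[1] > conf_threshold]

print(len(surviving_boxes(predictions, 0.05)))  # nearly all 10647 boxes survive
print(len(surviving_boxes(predictions, 0.6)))   # far fewer boxes to compare
```

With ~10k survivors the quadratic suppression loop dominates evaluation time, which matches the "stuck at batch: 0" symptom.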
@ckyrkou Thank you for the reply. Currently, after 20 epochs and with NMS_IOU_THRESH set to 0.75, I am still getting 10647 bounding boxes, as below:
Class accuracy is: 35.317459%
No obj accuracy is: 6.079705%
Obj accuracy is: 69.444443%
0%| | 0/26 [00:00<?, ?it/s]
nme 0
bboxes , 10647
Any thoughts?
The No obj accuracy is still very low. You need to change CONF_THRESHOLD for that. In the original config it is set to 0.05. I used CONF_THRESHOLD = 0.4. You can try that.
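For reference, the change is a one-liner in the repo's config.py (the variable name is from the repository; the 0.4 value is the suggestion above, not the repo default):

```python
# config.py (relevant line only)
CONF_THRESHOLD = 0.4  # repo default is 0.05; raising it cuts the boxes entering NMS
```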
@ckyrkou Thank you, I tried with CONF_THRESHOLD = 0.6 and it was working alright.
@beomgonyu can you please try this and see if it works? :-)
@guruprasaad123 Good to hear! Did you manage to reproduce the accuracies reported in the repo for pascal_voc?
@ckyrkou I tried to reproduce the accuracy that is > 78 for pascal_voc, but I couldn't get to that level as of now. This is what I am getting after 20 epochs:
Class accuracy is: 54.754784%
No obj accuracy is: 100.000000%
Obj accuracy is: 0.000000%
MAP: 0.0
and I am still running the script; if I get any improvement in accuracy I will let you know for sure.
Thanks. I tried running it for 100 epochs, achieving up to 46 mAP. I was wondering if running for more would increase performance. I noticed that the parameters in the video are different from what is actually in the repo.
@ckyrkou Cool, of course the parameters are different, I noticed that too. I was also wondering what the ideal parameters would be to get max mAP > 78. I am also running for more than 100 epochs; if I get an improvement I will let you know for sure.
Try capping the number of boxes by amending the second line in the non_max_suppression function to bboxes = sorted(bboxes, key=lambda x: x[1], reverse=True)[:max_boxes]. Also, evaluate only after a couple of epochs, so that the model has had the chance to converge a little first. I was running evaluation every 20 epochs for the 100examples.csv file.
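For anyone searching later, the amended function would look roughly like this. This is a self-contained sketch, not the repo's exact code: the IoU helper is simplified to corner-format boxes, and each box is assumed to be [class_pred, score, x1, y1, x2, y2].

```python
def iou_corners(box1, box2):
    """Minimal IoU for boxes given as [x1, y1, x2, y2] (corners format)."""
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter / (area1 + area2 - inter + 1e-6)

def non_max_suppression(bboxes, iou_threshold, threshold, max_boxes=1024):
    """NMS with the cap suggested above: keep only the top max_boxes candidates."""
    bboxes = [b for b in bboxes if b[1] > threshold]
    # The amended line: sort best-first, then truncate to max_boxes.
    bboxes = sorted(bboxes, key=lambda x: x[1], reverse=True)[:max_boxes]
    kept = []
    while bboxes:
        chosen = bboxes.pop(0)
        bboxes = [
            b for b in bboxes
            if b[0] != chosen[0]  # keep boxes of other classes
            or iou_corners(chosen[2:], b[2:]) < iou_threshold
        ]
        kept.append(chosen)
    return kept
```

The truncation bounds the quadratic suppression loop to at most max_boxes entries, which is what makes early-epoch evaluation tolerable even when almost every box passes the confidence filter.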
@aningineer thank you, I will surely try that out!
Thanks, that works @aningineer. I had to set max_boxes = 1024.
Did you eventually manage to get the reported over 70% mAP?
Nope, not yet; I am still trying @ckyrkou
Same here. Haven't been able to reproduce 78% as described.
I solved this issue by skipping evaluation at the start of training. It seemed there were so many predicted boxes that evaluation took a very long time early on. After training is steady, evaluation works well.
@beomgonyu What final mAP did you get for Pascal VOC?
Hey there, I'm trying to train on 100 examples. I was stuck with the same problem and at least now it works. Still, every accuracy seems not to change, and mAP is fixed at 0. Any idea why?
@SimoDB90 ,
Did you notice that there are differences between the code in the video and the repository? For example in the config file.
As I read that there were problems, I trained for 50 epochs without checking the validation, and I saved the checkpoint file. Then I resumed training using that file as pre-trained weights, and now I can check accuracy (mAP).
At the moment, as long as my loss keeps going down, I keep training to try to reach mAP 0.78.
I also added:
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=3, verbose=True)
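One gotcha worth noting with this scheduler, shown as a minimal runnable sketch (the model and the constant "validation loss" are placeholders, not the repo's training loop): unlike most PyTorch schedulers, ReduceLROnPlateau's step() must be passed the metric you want it to monitor.

```python
import torch
import torch.nn as nn

# Minimal sketch of wiring ReduceLROnPlateau into a training loop.
model = nn.Linear(10, 1)  # placeholder model, not YOLOv3
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.1, patience=3
)

for epoch in range(10):
    val_loss = 1.0  # a plateau: the monitored metric never improves
    # Unlike most schedulers, step() takes the metric being monitored;
    # after `patience` epochs without improvement the LR is multiplied by `factor`.
    scheduler.step(val_loss)
    print(epoch, optimizer.param_groups[0]["lr"])
```

On a real plateau like the flat mAP reported in this thread, this drops the learning rate automatically instead of requiring a manual restart with a smaller lr.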
Yes, I've noticed, and I used the repository to clean up my code. But I didn't find any impactful difference aside from CONF_THRESHOLD. In the repository it is 0.05, but below 0.6 (even with max_boxes = 1024) the training is painfully slow. And after 10 epochs mAP is something like 0.0, or e-5. I'm rerunning on train.csv and test.csv, but I'm pretty sure that even on 100 examples or 8 examples mAP should go up to 0.9 after a few epochs. The fact that it doesn't is driving me insane, because I don't know why the sum of TP is always a tensor of 0. My loss is very often NaN, and I really can't understand why. I'm testing with a conf_threshold of 0.5.
I suspect there is some problem in the code, but if the same code is working for you, I don't know what else to try.
@SimoDB90, tonight I will upload a checkpoint where the loss is around 1.50 and share it via Gmail, so you can start from there.
Thank you a lot! Just one thing: are the utils functions right? Or is something wrong in the repository?
I haven't checked them in detail, but I think they are fine; at least the model is converging, despite taking its time:
* ~9 min per epoch * batch_size = 8 (I only have 6 GB of VRAM; more than that and it fails)
I'm working on Google Colab Pro... my GTX 960 can only run to test whether the code works :-) Just to know, how many epochs did you need before seeing some non-zero value of mAP?
I read in this thread that there were some problems at the beginning of training, so I trained without checking the accuracy, and after 50 epochs I started to check it.
I think by the time the loss was less than 20 I already had positive values; it's an estimate, because I didn't check at the beginning.
Fine, thank you. Maybe I stopped the training too early, then.
Well, I tried to train on 100examples.csv for 100 epochs. It never converged: always mAP 0.0 and obj_accuracy 0%, while noobj_accuracy is always 100%. I don't know where, but I suppose there is a problem in the code. I compared carefully against the repository and there are no differences. I tried with conf_threshold 0.6, map_iou_threshold 0.5, nms_iou_threshold 0.45, learning rate 1e-5 and 0 weight decay.
Good morning,
stay calm, everything will resolve itself; it must be one of those bugs that sometimes come up. I'm on vacation and couldn't upload the checkpoint; the hotel's Wi-Fi is slow and the file is big. I will try to upload it later.
Update: the checkpoint file is 740 MB. I can't upload it on the hotel Wi-Fi, too slow :( At the end of next week I will be home and will upload it.
By any chance, did you try training with a smaller batch size and in full precision (fp32)?
@SimoDB90, try using these pre-trained weights (loss = 1.4) to continue your training (batch_size = 8, lr = 1e-6):
https://drive.google.com/file/d/1utjhWJ-KB11MsWNhWsE_J3xsh9QDMsLL/view?usp=sharing
I did some tests and got the best mAP with CONF_THRESHOLD = 0.05, as it is in the config file in the GitHub repository.
I'm wondering if it's worth continuing training until I have a smaller loss. How cool would it be to do this with a vision transformer?
I'm on vacation. I'll be back next week and will try it! Thank you!
Hi! I'm back from holiday. I tried your weights, but I get this traceback:
RuntimeError: Error(s) in loading state_dict for YOLOv3: Missing key(s) in state_dict: "layers.0.conv.bias", "layers.1.conv.bias", "layers.2.layers.0.0.conv.bias", "layers.2.layers.0.1.conv.bias", "layers.3.conv.bias", "layers.4.layers.0.0.conv.bias", "layers.4.layers.0.1.conv.bias", "layers.4.layers.1.0.conv.bias", "layers.4.layers.1.1.conv.bias", "layers.5.conv.bias", "layers.6.layers.0.0.conv.bias", "layers.6.layers.0.1.conv.bias", "layers.6.layers.1.0.conv.bias", "layers.6.layers.1.1.conv.bias", "layers.6.layers.2.0.conv.bias", "layers.6.layers.2.1.conv.bias", "layers.6.layers.3.0.conv.bias", "layers.6.layers.3.1.conv.bias", "layers.6.layers.4.0.conv.bias", "layers.6.layers.4.1.conv.bias", "layers.6.layers.5.0.conv.bias", "layers.6.layers.5.1.conv.bias", "layers.6.layers.6.0.conv.bias", "layers.6.layers.6.1.conv.bias", "layers.6.layers.7.0.conv.bias", "layers.6.layers.7.1.conv.bias", "layers.7.conv.bias", "layers.8.layers.0.0.conv.bias", "layers.8.layers.0.1.conv.bias", "layers.8.layers.1.0.conv.bias", "layers.8.layers.1.1.conv.bias", "layers.8.layers.2.0.conv.bias", "layers.8.layers.2.1.conv.bias", "layers.8.layers.3.0.conv.bias", "layers.8.layers.3.1.conv.bias", "layers.8.layers.4.0.conv.bias", "layers.8.layers.4.1.conv.bias", "layers.8.layers.5.0.conv.bias", "layers.8.layers.5.1.conv.bias", "layers.8.layers.6.0.conv.bias", "layers.8.layers.6.1.conv.bias", "layers.8.layers.7.0.conv.bias", "layers.8.layers.7.1.conv.bias", "layers.9.conv.bias", "layers.10.layers.0.0.conv.bias", "layers.10.layers.0.1.conv.bias", "layers.10.layers.1.0.conv.bias", "layers.10.layers.1.1.conv.bias", "layers.10.layers.2.0.conv.bias", "layers.10.layers.2.1.conv.bias", "layers.10.layers.3.0.conv.bias", "layers.10.layers.3.1.conv.bias", "layers.11.conv.bias", "layers.12.conv.bias", "layers.13.layers.0.0.conv.bias", "layers.13.layers.0.1.conv.bias", "layers.14.conv.bias", "layers.15.pred.0.conv.bias", "layers.16.conv.bias", "layers.18.conv.bias", "layers.19.conv.bias", 
"layers.20.layers.0.0.conv.bias", "layers.20.layers.0.1.conv.bias", "layers.21.conv.bias", "layers.22.pred.0.conv.bias", "layers.23.conv.bias", "layers.25.conv.bias", "layers.26.conv.bias", "layers.27.layers.0.0.conv.bias", "layers.27.layers.0.1.conv.bias", "layers.28.conv.bias", "layers.29.pred.0.conv.bias".
I suspect my network is different from yours, but I used the same one as in the YouTube video (the same as the repository; I double-checked). Any idea?
I'm going to upload the code I used for debugging purposes
Maybe I can post the code of my network?
Good afternoon,
this is the code I used. I think it's the same as the repository; maybe I tweaked something, but nothing relevant, I would say.
Try using this code with my weights:
yolov3.zip https://github.com/aladdinpersson/Machine-Learning-Collection/files/7121616/yolov3.zip
Ok... I want to kill myself... after 2 hours I found the problem: in the ConvBlock I had set the default argument for the batch-norm activation to False instead of True.
Now your weights work, and at this point I think even the training will work well. You were so kind, and you can't imagine how much you helped me! Thank you so much!
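For anyone else hitting the same traceback: in the repo's CNNBlock the conv bias is tied to the batch-norm flag, so a model built with the flag defaulting to False has conv.bias parameters that a checkpoint trained with the correct default does not contain, which is exactly the "Missing key(s)" error above. The block looks roughly like this (paraphrased from the repository):

```python
import torch
import torch.nn as nn

class CNNBlock(nn.Module):
    # bn_act must default to True: with batch norm the conv bias is redundant,
    # so the conv is created with bias=not bn_act. A model built with
    # bn_act=False therefore expects conv.bias keys that a checkpoint trained
    # with bn_act=True does not contain.
    def __init__(self, in_channels, out_channels, bn_act=True, **kwargs):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, bias=not bn_act, **kwargs)
        self.bn = nn.BatchNorm2d(out_channels)
        self.leaky = nn.LeakyReLU(0.1)
        self.use_bn_act = bn_act

    def forward(self, x):
        if self.use_bn_act:
            return self.leaky(self.bn(self.conv(x)))
        return self.conv(x)
```

In YOLOv3 only the final prediction layers use bn_act=False, so flipping the default silently adds a bias to every conv in the backbone.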
No problem, it happens to everyone. Glad I could help.
Have you ever wondered whether CNNs are the way to go? If an architecture were robust enough, it wouldn't be sensitive to a one-pixel attack. CNNs lack context. Is the vision transformer the future of computer vision?
Sorry for my musings, maybe it's not the subject of this thread.
I'm not very into transformers, to be honest. I started doing serious DL study just a couple of months ago. In the future I'll go deeper, I hope. For now, I'm super happy that I understand the code and my net works.
I'm working on plotting some images with boxes right now, and I'm trying to implement a camera as frame input to make it real-time. Tough work 😕
best of luck
Thanks again! Best of luck to you too!
Bro, can you please share the config parameter values with which you get good No obj accuracy and mAP?
I tried to train on the 100 examples yesterday and no_obj_acc was painfully 0-2.5%, with mAP 0.02 or less, using a learning rate of 0.005. I don't know if it has to be changed to make the training do something; I tried different values for the other parameters, but it doesn't work.
The weights a couple of posts above do work, though. But if you find a way to train it on your own, please let me know; I've been struggling with YOLO for several weeks.
It takes a very long time. In get_evaluation_bboxes, the code below takes extremely long to run (more than 10 hours):

```python
for idx in range(batch_size):
    nms_boxes = non_max_suppression(
        bboxes[idx],
        iou_threshold=iou_threshold,
        threshold=threshold,
        box_format=box_format,
    )
```