Closed - MyVanitar closed this issue 7 years ago
Training with these new CFG files (with either thresh = 0.2 or thresh = 0.001) does not produce the same results as when I trained with the older repo and the older YOLO-VOC CFG file. I can say the results are way worse than before. I must say that training with the new CFG file gave zero IoU and Recall and no detections at all; only YOLO-VOC-2.0.CFG worked, but even then the calculated results were still not close to the older commits. With either thresh = 0.001 or thresh = 0.2, I could not get better than IoU = 71% on my own dataset, which is far below the IoU = 75% I got on the older commits with the older CFG file.
@VanitarNordic The main difference is in both the cfg file and the way training works in the original Linux repo. This repo does not support training yolo-voc.cfg; for example, it does not implement the burn_in=1000 parameter. Also, for a fair comparison you should use 100% identical cfg files, which have repeatedly changed a little bit over time. And you should compare the old commit and the new commit side by side right now rather than relying on your memory. Even a small change of parameters such as random, width, height, subdivisions, or batch can have an effect.
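For illustration, the burn-in warm-up sits in the [net] section of the newer yolo-voc.cfg, roughly like this (values quoted from memory of the upstream cfg, so treat them as an example rather than the exact file):

```
[net]
batch=64
subdivisions=8
width=416
height=416
learning_rate=0.001
# learning-rate warm-up over the first 1000 iterations;
# the parameter this repo did not implement at the time
burn_in=1000
max_batches=80200
policy=steps
steps=40000,60000
scales=.1,.1
```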
All parameters are identical: batch, random, resolution and the rest. I have both commits on my hard drive and I test based on the validation IoU and Recall results.

If nothing has changed in the code except the threshold, then training with yolo-voc-2.0.cfg and threshold = 0.001 should NOT produce worse validation IoU and Recall results; the intention of the new CFGs is to improve the model, not to make it worse. So what remains suspicious here? Maybe the code.
@VanitarNordic
Also, earlier the validation dataset was taken from data/voc.2007.test, but now it is taken from the valid= entry in the data file, and if that entry is absent it falls back to data/train.txt: https://github.com/AlexeyAB/darknet/commit/97ed11ca1503953199495e9b4386974ceba44687#diff-d77fa1db75cc45114696de9b1c005b26L371
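In code this corresponds roughly to the usual darknet pattern for reading the .data file (a sketch built from darknet's own helpers; the function name show_valid_source is made up for illustration, and the exact line is in the commit linked above):

```c
#include <stdio.h>
#include "option_list.h"   /* read_data_cfg(), option_find_str() */
#include "data.h"          /* get_paths() */

/* Sketch: how the recall validator resolves its image list in the new commit -
 * read the .data file, look up "valid", and fall back to data/train.txt
 * when that key is missing. */
void show_valid_source(char *datacfg)
{
    list *options = read_data_cfg(datacfg);
    char *valid_images = option_find_str(options, "valid", "data/train.txt");
    list *plist = get_paths(valid_images);   /* one image path per line */
    printf("validating on %d images from %s\n", plist->size, valid_images);
}
```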
So you can just copy the function validate_detector_recall() from the old commit into the new commit to be sure that the validation is identical, and call it there as validate_detector_recall(cfg, weights);
https://github.com/AlexeyAB/darknet/blame/master/src/detector.c#L552
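The call site would then look roughly like this (a fragment reconstructed from run_detector(), assuming the copied function keeps its old two-argument signature):

```c
/* Inside run_detector() in src/detector.c: cfg and weights come from argv;
 * the "recall" sub-command calls the function copied from the old commit. */
else if (0 == strcmp(argv[2], "recall")) {
    validate_detector_recall(cfg, weights);   /* old two-argument signature */
}
```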
Then show a table like this, with the same threshold = 0.001 and the same cfg file:
| IoU | Trained on old commit | Trained on new commit |
|---|---|---|
| Tested on old commit | | |
| Tested on new commit | | |
I replaced the function validate_detector_recall(). It compiles successfully, but it crashes when I try to validate the model. I have also copied voc.2007.test into place.
@AlexeyAB
You did not reply about the crashing problem, so in the meantime I trained and tested the old commit using yolo-voc-2.0.cfg. The result is a new record and very good: 76% IoU and good detection. I think you have made some modifications in the code but have not tested whether these modifications actually give good results.
> I replaced the function validate_detector_recall(). It compiles successfully, but it crashes when I try to validate the model. I have also copied voc.2007.test into place.
I don't know why it crashes in your case; it should work and give the same IoU. Only after you get such a table can I say something:
| IoU | Trained on old commit | Trained on new commit |
|---|---|---|
| Tested on old commit | | |
| Tested on new commit | | |
The cfg file in all training and testing experiments was yolo-voc-2.0.cfg, and the recall function was identical to the old commit (the earlier crash was caused by a leftover extra argument in the function call a few lines further down, which had to be modified).

- Trained and tested on the old commit: IoU = 76.20%
- Trained and tested on the last commit: IoU = 75.60%
- Recall on the last commit = Recall on the old commit = 90.38%
- Tested on the last commit with weights from the old commit: IoU = 76.22%
- Tested on the old commit with weights from the last commit: IoU = 75.34%

Therefore we can say that the weights from the old commit trained slightly better (around 1%). The recall function has no issue.
Now you have your table.
So now we can say that training on the old commit gives slightly better results:
| IoU | Trained on old commit | Trained on new commit |
|---|---|---|
| Tested on old commit | 76.20% | 75.34% |
| Tested on new commit | 76.22% | 75.60% |
Do you use the same CUDA in both cases, and did you compile both the new and old commits with cuDNN?
The files that were changed are data.c and blas_kernels.cu. So you can get these files from the old commit at these URLs, put them into the new commit, and then train and test again:

- data.c: https://raw.githubusercontent.com/AlexeyAB/darknet/a6cbaeecde40f91ddc3ea09aa26a03ab5bbf8ba8/src/data.c
- blas_kernels.cu: https://raw.githubusercontent.com/AlexeyAB/darknet/a6cbaeecde40f91ddc3ea09aa26a03ab5bbf8ba8/src/blas_kernels.cu

By replacing these files one by one and training & testing again, you can find out which of them has an effect on the IoU.
In data.c, 2 lines with srand(time(0)); were commented out: https://github.com/AlexeyAB/darknet/commit/815e7a127b062aa8bc4f4ba7af2cfd97c232f34c#diff-2ceac7e68fdac00b370188285ab286f7
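In effect the change is just this kind of line in the data-loading code (a sketch of the pattern, not the exact diff; the function name is made up for illustration):

```c
#include <stdlib.h>
#include <time.h>

/* Sketch: with the per-call reseed commented out, random augmentation
 * (crop/flip/hue jitter) draws from one rand() stream per run instead of
 * being reseeded with the wall clock every time a batch is loaded. */
void load_data_sketch(void)
{
    /* srand(time(0)); */      /* the kind of line that was commented out */
    int flip = rand() % 2;     /* augmentation decisions use the global RNG */
    (void)flip;
}
```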
In blas_kernels.cu, __syncthreads(); was added in 4 places: https://github.com/AlexeyAB/darknet/commit/9920410ba9cc756c46d6ee84f7b7a2a9fe941448#diff-14ecc558a5571a7cffc7b588b155d013
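For context, __syncthreads() is a barrier across the thread block; in a shared-memory reduction like the ones in blas_kernels.cu, omitting it lets thread 0 read partial sums before the other threads have written them. A generic sketch of the pattern (not the exact darknet kernel; the names are made up for illustration):

```c
#define BLOCK 512   /* threads per block; darknet uses a similar constant */

__global__ void mean_reduce_sketch(const float *x, int n, float *mean)
{
    __shared__ float part[BLOCK];              /* per-thread partial sums */
    int tid = threadIdx.x;

    float sum = 0;
    for (int i = tid; i < n; i += BLOCK) sum += x[i];   /* strided slice */
    part[tid] = sum;

    __syncthreads();   /* the added barrier: all partial sums must be visible
                          before thread 0 reads them */
    if (tid == 0) {
        float total = 0;
        for (int i = 0; i < BLOCK; ++i) total += part[i];
        *mean = total / n;
    }
}
```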
I'll test this and tell you the results, but before that I wanted to inform you about something else: during training I sometimes randomly see the -nan issue, in both the old and new commits.
Results:

- data.c from the older commit: IoU = 74.78%, Recall = 88.46%
- blas_kernels.cu from the older commit (data.c had already been replaced in the previous step): IoU = 75.87%, Recall = 90.38%
> Do you use the same CUDA in both cases, and did you compile both the new and old commits with cuDNN?
Yes, I use the same CUDA, and both were compiled with cuDNN. Actually cuDNN boosts training speed by at least 50%, but I saw no difference between 5.1 and 6.
Now I will keep data.c unchanged in the last commit and train again, replacing only blas_kernels.cu. This will be the final test.
I did the final test: IoU = 75.24%, Recall = 90.38%. I am confused; if you have more suggestions, they are welcome.
So, no ideas. Theoretically, a 1 percent difference can be a random fluctuation.
You know modern models fight for first place over differences of less than 1 percent. YOLO has many unclear aspects, such as the accuracy calculation, mAP and anchor calculation, which we just have to explore by trial and error.
What do you think of SSD-300 (7++12+COCO)? Is it better than YOLO?
Also, I want to clarify that these discussions are meant to learn something and make the model better, not a personal war - at least from my side. I have learned many, many things from you and I will not forget that. I hope you don't take these discussions personally; they are scientific debates and are normal between scientists.
SSD-300 (7++12+COCO) is faster and more accurate than Yolo v2 (7++12+COCO). So if you have a large dataset and small objects, SSD-300 is probably preferable. Did you manage to train SSD-300/512 successfully?
Why I use Yolo v2:
I perceive discussions as normal only if several conditions are met:
I trained again on the old repo and I reached IoU = 76.25%, like before. Therefore we can assume there is a minor issue somewhere. Actually I ran training on the old commit 3 times and the results were all equal. May I ask you to re-think what could cause this?
Regarding SSD: yes, I fine-tuned it some time ago and it showed good mAP (which it calculates itself), but it was not as easy to work with as YOLO. Besides, its original repo is the best implementation, and it is written only for Linux and Caffe. There is a Windows distribution of Caffe, but the SSD author modified the original Caffe. I'm not sure whether there is another good implementation of SSD in Keras, TensorFlow or elsewhere.

Actually I prefer to fine-tune a model rather than train from scratch, because somebody with many GPUs has already trained it and we can use those weights. That's why I believe we are not really fine-tuning YOLO; we train it from scratch starting from initial classification weights.

But you have actually made a very, very good YOLO repo; that's the reason I train day and night trying to make it better. I don't want to just show something for my coursework like many others; that's why I comment much more than other people :-). Your knowledge is extensive and correct and you are very professional in C/C++.

Besides, YOLO is very memory efficient. SSD-300 easily caused memory overflow on my 6 GB GPU and I had to reduce the image size too much.

Also, I have not tested YOLO9000; maybe it is more accurate than YOLO v2.
Honestly, I am getting tired of training and testing the last commit, and you don't even want to consider a little bit that it might be true; instead you think all the comments are false assertions, for what benefit I don't know. The last commit does not train as well as this one: https://github.com/AlexeyAB/darknet/tree/a71bdd7a83e33f28d91b88551b291627728ee3e7
False assertions only:
Based on your training tests, there is a slight decrease in IoU (~1%) due to some changes in the new commits - this problem is present, I agree. Do you know how to fix this problem?
No man, I have not changed the coordinates :-). That was a different story. The threshold is also set like the old one, 0.001.

The thing is, when I test the trained model I can also see the difference in detection; it is not huge, but you can see the effect of those percentages. Sometimes the difference goes higher, maybe 2 or even 3%, but it is not stable. In the commit I linked, however, if you train 10 times with the same settings, all results are equal without the slightest change.

I have tried many, many things. I actually work on this day and night, but I really have no idea. The point is that the new commit does not produce a consistent result each time, while the old commit does; therefore I suspect the issue is somewhere that affects training.

Also, I must give you a big thumbs up and all of my credit, because today I also trained on the original Darknet repo under Linux and the result was not as good as yours. It means your repo is better in all aspects. Therefore it is a shame to have this very small issue, and I'm trying to find a solution.
Thank you. Can you show the result when the cfg is based on yolo-voc.2.0.cfg (thresh = 0.001)?

I started training on Ubuntu, but it seems there is a problem with training and the random functions - you know, the count value; see the picture. Do you want me to continue? I used the latest commit (d3577a5).
Yes, I know; I will comment out this srand() again later.
I mean: what IoU do you get if you train using the original Linux repo (https://github.com/pjreddie/darknet) and then test the resulting weights?
And yes, you asked about the original Linux repo; the comment above was about your repo. With the original Linux repo I could not see the recall, because it was writing the results to a report and IoU was not calculated there, but from the detections I thought there was a significant difference in IoU between Linux and your repo - I mean Linux was worse.
And yes, I brought the weights trained on the original Linux repo to Windows to test them, but they showed 0 for both IoU and Recall. In both Linux and Windows I had yolo-voc.2.0.cfg in place for training and testing.
> With the original Linux repo I could not see the recall, because it was writing the results to a report and IoU was not calculated there

I.e. the original Linux repo doesn't show IoU when you call ./darknet detector recall ...?
I made fixes 5 minutes ago: https://github.com/AlexeyAB/darknet/commit/4d2fefd75a57dfd6e60680eaf7408c82e15a025d So you can try to train on Linux using this repo at the last commit: https://github.com/AlexeyAB/darknet/

Also, do you use the same dataset for validation as for training, or does the valid txt file differ from the train txt file in obj.data? https://github.com/AlexeyAB/darknet/blob/master/cfg/voc.data#L3
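For reference, the relevant entries of a .data file look roughly like this (the paths are examples):

```
classes = 20
train  = data/train.txt
valid  = data/voc.2007.test
names  = data/voc.names
backup = backup/
```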
> Also, do you use the same dataset for validation as for training, or does the valid txt file differ from the train txt file in obj.data? https://github.com/AlexeyAB/darknet/blob/master/cfg/voc.data#L3

Of course the validation dataset is different from the training dataset; I think that is the rule for testing any model. Besides, the same train.txt and valid.txt were used for all experiments, and voc.2007.test was identical to valid.txt.
> I.e. the original Linux repo doesn't show IoU when you call ./darknet detector recall ...?

I think I used the valid parameter. I have the repo; I'll run another test with the recall parameter.
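For clarity, the two invocations being compared are roughly these (the cfg/weights paths are examples):

```
# "valid" writes detection results to result files and prints no IoU
./darknet detector valid  cfg/voc.data cfg/yolo-voc.2.0.cfg backup/yolo-voc_final.weights

# "recall" prints per-image IoU and cumulative Recall to the console
./darknet detector recall cfg/voc.data cfg/yolo-voc.2.0.cfg backup/yolo-voc_final.weights
```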
I will train your Linux repo tomorrow and tell you the results.
Excuse me, the above result was on the Linux repo with thresh = 0.2; I just forgot to change it to match Windows before running it. Your Linux repo can achieve an even significantly higher IoU.
Your Linux repo outperforms the original Darknet Linux repo by 2.33%. Excellent.
Therefore I summarize your latest Linux repo (4d2fefd) as follows:

- thresh = 0.2: IoU = 72.38%, Recall = 84.62%
- thresh = 0.001: IoU = 76.33%, Recall = 88.46%

Therefore we can suspect that the issue is on the Windows side.
Also, to have all the results in one place:

- Original Darknet Linux repo: IoU = 74%, Recall = 88.4%
- Old Windows repo (https://github.com/AlexeyAB/darknet/tree/a71bdd7a83e33f28d91b88551b291627728ee3e7): IoU = 76.22%, Recall = 90.33%

Therefore we can say the results of this old Windows commit and the last Linux commit are almost identical, except that the old Windows repo is about 2% better in terms of Recall.
I think now you have the clue.
Yes, there is a difference.
> Also, to have all the results in one place:
> - Original Darknet Linux repo: IoU = 74%, Recall = 88.4%

Also, can you test on Windows the weights already trained on the original Darknet Linux repo - will it give the same IoU = 74%, Recall = 88.4%?
Let me gather it all in one place. Trained weights from the original Darknet repo, tested on the Alexey Windows repo (thresh = 0.001):

- Old Windows repo: IoU = 0%, Recall = 0%
- Latest Windows repo: IoU = 74%, Recall = 88.44%

Therefore: the results of the latest Windows repo (with weights from the original Darknet) = the results of the original Darknet repo.
You are right
Do you have any clue?
No, I haven't.
Tracing from this commit https://github.com/AlexeyAB/darknet/tree/a71bdd7a83e33f28d91b88551b291627728ee3e7 to the latest commit might give you a clue.
Hi Alex, I have good news for you.
I recompiled the old commit using OpenCV 2.4.13 (it was 2.4.9 before), CUDA 8.0.61 + Patch 2 (it was CUDA 8.0.61 without the patch before) and cuDNN 6 (it was cuDNN 5.1 before).
You know what? The same results as the latest commit!!!!!!!! (I have not tested 4d2fefd.)
Now I suspect CUDA or cuDNN - or, if OpenCV also influences training somehow, that could cause this effect.
The interesting thing is that I had used the same setup on Linux for your repo (CUDA 8.0.61 + Patch 2, cuDNN 6), but got good results there.
@VanitarNordic Hi, I think this is due to CUDA and cuDNN.
OpenCV loads the images, so it can affect image quality: https://github.com/AlexeyAB/darknet/blob/4d2fefd75a57dfd6e60680eaf7408c82e15a025d/src/image.c#L599
You can simply change the line image out = load_image_cv(filename, c); to image out = load_image_stb(filename, c); here to rule out OpenCV effects - but I think OpenCV is not to blame: https://github.com/AlexeyAB/darknet/blob/4d2fefd75a57dfd6e60680eaf7408c82e15a025d/src/image.c#L1312
Should I change both line 599 and line 1312? You only described the change for line 1312.
You should only change line 1312
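For context, the function around that line in src/image.c looks roughly like this (reconstructed from memory, so treat it as a sketch):

```c
image load_image(char *filename, int w, int h, int c)
{
#ifdef OPENCV
    image out = load_image_cv(filename, c);      /* the line to change ...            */
    /* image out = load_image_stb(filename, c);     ... to this, bypassing OpenCV */
#else
    image out = load_image_stb(filename, c);
#endif
    if ((h && w) && (h != out.h || w != out.w)) {
        image resized = resize_image(out, w, h);
        free_image(out);
        out = resized;
    }
    return out;
}
```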
I tested OpenCV; it is not the cause of the issue. The next step is to test cuDNN.
Is there anywhere in the code related to cuBLAS?
Yes, cuBLAS is used for GEMM if CUDNN=0: https://github.com/AlexeyAB/darknet/blob/4d2fefd75a57dfd6e60680eaf7408c82e15a025d/src/gemm.c#L180
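For reference, when CUDNN=0 the GPU GEMM path ends in a cuBLAS call roughly like this (a sketch from memory of darknet's gemm_ongpu; the wrapper name here is made up, so treat the details as approximate):

```c
#include <cublas_v2.h>

/* C = ALPHA * A*B + BETA * C on the GPU. Darknet stores matrices row-major,
 * so A and B are swapped to fit cuBLAS's column-major convention. */
void gemm_gpu_sketch(cublasHandle_t handle, int TA, int TB, int M, int N, int K,
                     float ALPHA, float *A_gpu, int lda, float *B_gpu, int ldb,
                     float BETA, float *C_gpu, int ldc)
{
    cublasSgemm(handle,
                TB ? CUBLAS_OP_T : CUBLAS_OP_N,
                TA ? CUBLAS_OP_T : CUBLAS_OP_N,
                N, M, K,
                &ALPHA, B_gpu, ldb, A_gpu, lda,
                &BETA,  C_gpu, ldc);
}
```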
Actually I tested it with cuDNN 5.1 and the results were the same, so the only thing remaining is the CUDA patch. I should remove CUDA and install it again without applying Patch 2 and see if that reproduces the issue.

Alright, I tested that, but it did not solve the case. If you know of any other issues that could affect the Visual Studio build, please let me know so I can test them; it happened after I recompiled it.
Could the NVIDIA display driver have an effect? (It is not the same one that comes with CUDA.)
No thoughts on this matter.
Alright, the story is finished. I downloaded OpenCV 2.4.9, recompiled, and trained with it. The same results appeared: IoU = 76.18% and Recall = 90.38%.

I was not actually amazed, because I had run into these OpenCV version-to-version issues before; one function showed different behaviour just by changing versions. Besides, it shows that OpenCV has more impact than we had imagined.
Also, the 2% higher Recall in your Windows repo is possibly because the rand_s() function produces better-quality random numbers than rand(), which is true.
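As a side note, on Windows rand_s() is exposed by the CRT when _CRT_RAND_S is defined before <stdlib.h>; a minimal sketch of wrapping it as a substitute for rand() (the actual wrapper in this repo may differ, and the function name here is made up):

```c
#define _CRT_RAND_S          /* must come before <stdlib.h> to expose rand_s() */
#include <stdlib.h>

/* Returns a random value from the OS CSPRNG instead of the LCG behind rand(). */
unsigned int random_u32(void)
{
    unsigned int r = 0;
    rand_s(&r);              /* fills r; returns 0 on success */
    return r;
}
```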
Now I'll update CUDA and cuDNN to their latest versions and retest.
I have a question: which is more important, the IoU value or Recall? For example, if a 1% increase in IoU causes a 2% decrease in Recall, which do you prefer?
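For reference, IoU here is the intersection-over-union of a predicted box with its ground-truth box, and Recall is the fraction of ground-truth boxes matched by a detection with sufficient overlap. A minimal sketch of the IoU computation for center/size boxes (plain C, not darknet's exact box_iou; the names are made up for illustration):

```c
#include <math.h>

typedef struct { float x, y, w, h; } box_t;   /* center x/y plus width/height */

static float overlap(float x1, float w1, float x2, float w2)
{
    float l = fmaxf(x1 - w1 / 2, x2 - w2 / 2);
    float r = fminf(x1 + w1 / 2, x2 + w2 / 2);
    return r - l;
}

/* IoU = intersection area / union area, in [0,1]; 0 if the boxes do not overlap. */
float box_iou_sketch(box_t a, box_t b)
{
    float iw = overlap(a.x, a.w, b.x, b.w);
    float ih = overlap(a.y, a.h, b.y, b.h);
    if (iw <= 0 || ih <= 0) return 0;
    float inter = iw * ih;
    float uni = a.w * a.h + b.w * b.h - inter;
    return inter / uni;
}
```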
Hello,

The Yolo authors have changed the YOLO-VOC CFG files on their original website, I mean here:

Training with these new CFG files (with either thresh = 0.2 or thresh = 0.001) does not produce the same results as when I trained with the older repo and the older YOLO-VOC CFG file. I can say the results are way worse than before. Besides, training with YOLO-VOC-2.0 gives slightly better results, but it still cannot compete with the older results (with either thresh = 0.2 or 0.001).

I have used the latest commit of the repo here (4892071). What is the problem: the new CFG files or the repo code?