Closed - MyVanitar closed this issue 7 years ago
Training with these new CFG files (with either thresh = 0.2 or thresh = 0.001) does not produce the same results as when I trained with the older repo and the older YOLO-VOC CFG file. I can say the results are way worse than before. I must say that training with the new CFG file gave zero IoU and Recall and no detections at all; only YOLO-VOC-2.0.CFG worked, but even then the calculated results were still not close to the older commits. With either thresh = 0.001 or thresh = 0.2, I could not get better than IoU = 71% on my own dataset, which is far below the IoU = 75% I got on the older commits with the older CFG file.
@VanitarNordic The main difference is in both the cfg file and the way training works in the original Linux repo. This repo does not support training yolo-voc.cfg; for example, it does not implement the burn_in=1000 parameter. Also, for a fair comparison you should use 100% identical cfg files, which have repeatedly changed a little bit over time. And you should compare the old commit and the new commit side by side right now rather than relying on your memory. Even a small change of parameters such as random, width, height, subdivisions, or batch can have an effect.
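For illustration, the burn-in warm-up sits in the [net] section of the newer yolo-voc.cfg, roughly like this (values quoted from memory of the upstream cfg, so treat them as an example rather than the exact file):

```
[net]
batch=64
subdivisions=8
width=416
height=416
learning_rate=0.001
# learning-rate warm-up over the first 1000 iterations;
# the parameter this repo did not implement at the time
burn_in=1000
max_batches=80200
policy=steps
steps=40000,60000
scales=.1,.1
```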
All parameters are identical: batch, random, resolution and the rest. I have both commits on my hard drive and I test based on the validation IoU and Recall results.

If nothing has changed in the code except the threshold, then training with yolo-voc-2.0.cfg and threshold = 0.001 should NOT produce worse validation IoU and Recall results; the intention of the new CFGs is to improve the model, not to make it worse. So what remains suspicious here? Maybe the code.
@VanitarNordic
Also, earlier the validation dataset was taken from data/voc.2007.test, but now it is taken from the valid= entry in the data file, and if that entry is absent it falls back to data/train.txt: https://github.com/AlexeyAB/darknet/commit/97ed11ca1503953199495e9b4386974ceba44687#diff-d77fa1db75cc45114696de9b1c005b26L371
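In code this corresponds roughly to the usual darknet pattern for reading the .data file (a sketch built from darknet's own helpers; the function name show_valid_source is made up for illustration, and the exact line is in the commit linked above):

```c
#include <stdio.h>
#include "option_list.h"   /* read_data_cfg(), option_find_str() */
#include "data.h"          /* get_paths() */

/* Sketch: how the recall validator resolves its image list in the new commit -
 * read the .data file, look up "valid", and fall back to data/train.txt
 * when that key is missing. */
void show_valid_source(char *datacfg)
{
    list *options = read_data_cfg(datacfg);
    char *valid_images = option_find_str(options, "valid", "data/train.txt");
    list *plist = get_paths(valid_images);   /* one image path per line */
    printf("validating on %d images from %s\n", plist->size, valid_images);
}
```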
So you can just copy the function validate_detector_recall() from the old commit into the new commit to be sure that the validation is identical, and call it there as validate_detector_recall(cfg, weights);
https://github.com/AlexeyAB/darknet/blame/master/src/detector.c#L552
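The call site would then look roughly like this (a fragment reconstructed from run_detector(), assuming the copied function keeps its old two-argument signature):

```c
/* Inside run_detector() in src/detector.c: cfg and weights come from argv;
 * the "recall" sub-command calls the function copied from the old commit. */
else if (0 == strcmp(argv[2], "recall")) {
    validate_detector_recall(cfg, weights);   /* old two-argument signature */
}
```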
Then show a table like this, with the same threshold = 0.001 and the same cfg file:
| IoU | Trained on old commit | Trained on new commit |
|---|---|---|
| Tested on old commit | | |
| Tested on new commit | | |
I replaced the function validate_detector_recall(). It compiles successfully, but it crashes when I try to validate the model. I have also copied voc.2007.test into place.
@AlexeyAB
You did not reply about the crashing problem, so in the meantime I trained and tested the old commit using yolo-voc-2.0.cfg. The result is a new record and very good: 76% IoU and good detection. I think you have made some modifications in the code but have not tested whether these modifications actually give good results.
> I replaced the function validate_detector_recall(). It compiles successfully, but it crashes when I try to validate the model. I have also copied voc.2007.test into place.
I don't know why it crashes in your case; it should work and give the same IoU. Only after you get such a table can I say something:
| IoU | Trained on old commit | Trained on new commit |
|---|---|---|
| Tested on old commit | | |
| Tested on new commit | | |
The cfg file in all training and testing experiments was yolo-voc-2.0.cfg, and the recall function was identical to the old commit (the earlier crash was caused by a leftover extra argument in the function call a few lines further down, which had to be modified).

- Trained and tested on the old commit: IoU = 76.20%
- Trained and tested on the last commit: IoU = 75.60%
- Recall on the last commit = Recall on the old commit = 90.38%
- Tested on the last commit with weights from the old commit: IoU = 76.22%
- Tested on the old commit with weights from the last commit: IoU = 75.34%

Therefore we can say that the weights from the old commit trained slightly better (around 1%). The recall function has no issue.
Now you have your table.
So now we can say that training on the old commit gives slightly better results:
| IoU | Trained on old commit | Trained on new commit |
|---|---|---|
| Tested on old commit | 76.20% | 75.34% |
| Tested on new commit | 76.22% | 75.60% |
Do you use the same CUDA in both cases, and did you compile both the new and old commits with cuDNN?
The files that were changed are data.c and blas_kernels.cu. So you can get these files from the old commit at these URLs, put them into the new commit, and then train and test again:

- data.c: https://raw.githubusercontent.com/AlexeyAB/darknet/a6cbaeecde40f91ddc3ea09aa26a03ab5bbf8ba8/src/data.c
- blas_kernels.cu: https://raw.githubusercontent.com/AlexeyAB/darknet/a6cbaeecde40f91ddc3ea09aa26a03ab5bbf8ba8/src/blas_kernels.cu

By replacing these files one by one and training & testing again, you can find out which of them has an effect on the IoU.
In data.c, 2 lines with srand(time(0)); were commented out: https://github.com/AlexeyAB/darknet/commit/815e7a127b062aa8bc4f4ba7af2cfd97c232f34c#diff-2ceac7e68fdac00b370188285ab286f7
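In effect the change is just this kind of line in the data-loading code (a sketch of the pattern, not the exact diff; the function name is made up for illustration):

```c
#include <stdlib.h>
#include <time.h>

/* Sketch: with the per-call reseed commented out, random augmentation
 * (crop/flip/hue jitter) draws from one rand() stream per run instead of
 * being reseeded with the wall clock every time a batch is loaded. */
void load_data_sketch(void)
{
    /* srand(time(0)); */      /* the kind of line that was commented out */
    int flip = rand() % 2;     /* augmentation decisions use the global RNG */
    (void)flip;
}
```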
In blas_kernels.cu, __syncthreads(); was added in 4 places: https://github.com/AlexeyAB/darknet/commit/9920410ba9cc756c46d6ee84f7b7a2a9fe941448#diff-14ecc558a5571a7cffc7b588b155d013
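For context, __syncthreads() is a barrier across the thread block; in a shared-memory reduction like the ones in blas_kernels.cu, omitting it lets thread 0 read partial sums before the other threads have written them. A generic sketch of the pattern (not the exact darknet kernel; the names are made up for illustration):

```c
#define BLOCK 512   /* threads per block; darknet uses a similar constant */

__global__ void mean_reduce_sketch(const float *x, int n, float *mean)
{
    __shared__ float part[BLOCK];              /* per-thread partial sums */
    int tid = threadIdx.x;

    float sum = 0;
    for (int i = tid; i < n; i += BLOCK) sum += x[i];   /* strided slice */
    part[tid] = sum;

    __syncthreads();   /* the added barrier: all partial sums must be visible
                          before thread 0 reads them */
    if (tid == 0) {
        float total = 0;
        for (int i = 0; i < BLOCK; ++i) total += part[i];
        *mean = total / n;
    }
}
```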
I'll test this and tell you the results, but before that I wanted to inform you about something else: during training I sometimes randomly see the -nan issue, in both the old and new commits.
Results:

- data.c from the older commit: IoU = 74.78%, Recall = 88.46%
- blas_kernels.cu from the older commit (data.c had already been replaced in the previous step): IoU = 75.87%, Recall = 90.38%
> Do you use the same CUDA in both cases, and did you compile both the new and old commits with cuDNN?
Yes, I use the same CUDA, and both were compiled with cuDNN. Actually cuDNN boosts training speed by at least 50%, but I saw no difference between 5.1 and 6.
Now I will keep data.c unchanged in the last commit and train again, replacing only blas_kernels.cu. This will be the final test.
I did the final test: IoU = 75.24%, Recall = 90.38%. I am confused; if you have more suggestions, they are welcome.
So, no ideas. Theoretically, a 1 percent difference can be a random fluctuation.
You know modern models fight for first place over differences of less than 1 percent. YOLO has many unclear aspects, such as the accuracy calculation, mAP and anchor calculation, which we just have to explore by trial and error.
What do you think of SSD-300 (7++12+COCO)? Is it better than YOLO?
Also, I want to clarify that these discussions are meant to learn something and make the model better, not a personal war - at least from my side. I have learned many, many things from you and I will not forget that. I hope you don't take these discussions personally; they are scientific debates and are normal between scientists.
SSD-300 (7++12+COCO) is faster and more accurate than Yolo v2 (7++12+COCO). So if you have a large dataset and small objects, SSD-300 is probably preferable. Did you manage to train SSD-300/512 successfully?
Why I use Yolo v2:
I perceive discussions as normal only if several conditions are met:
I trained again on the old repo and I reached IoU = 76.25%, like before. Therefore we can assume there is a minor issue somewhere. Actually I ran training on the old commit 3 times and the results were all equal. May I ask you to re-think what could cause this?
Regarding SSD: yes, I fine-tuned it some time ago and it showed good mAP (which it calculates itself), but it was not as easy to work with as YOLO. Besides, its original repo is the best implementation, and it is written only for Linux and Caffe. There is a Windows distribution of Caffe, but the SSD author modified the original Caffe. I'm not sure whether there is another good implementation of SSD in Keras, TensorFlow or elsewhere.

Actually I prefer to fine-tune a model rather than train from scratch, because somebody with many GPUs has already trained it and we can use those weights. That's why I believe we are not really fine-tuning YOLO; we train it from scratch starting from initial classification weights.

But you have actually made a very, very good YOLO repo; that's the reason I train day and night trying to make it better. I don't want to just show something for my coursework like many others; that's why I comment much more than other people :-). Your knowledge is extensive and correct and you are very professional in C/C++.

Besides, YOLO is very memory efficient. SSD-300 easily caused memory overflow on my 6 GB GPU and I had to reduce the image size too much.

Also, I have not tested YOLO9000; maybe it is more accurate than YOLO v2.
Honestly, I am getting tired of training and testing the last commit, and you don't even want to consider a little bit that it might be true; instead you think all the comments are false assertions, for what benefit I don't know. The last commit does not train as well as this one: https://github.com/AlexeyAB/darknet/tree/a71bdd7a83e33f28d91b88551b291627728ee3e7
False assertions only:
Based on your training tests, there is a slight decrease in IoU (~1%) due to some changes in the new commits - this problem is present, I agree. Do you know how to fix this problem?
No man, I have not changed the coordinates :-). That was a different story. The threshold is also set like the old one, 0.001.

The thing is, when I test the trained model I can also see the difference in detection; it is not huge, but you can see the effect of those percentages. Sometimes the difference goes higher, maybe 2 or even 3%, but it is not stable. In the commit I linked, however, if you train 10 times with the same settings, all results are equal without the slightest change.

I have tried many, many things. I actually work on this day and night, but I really have no idea. The point is that the new commit does not produce a consistent result each time, while the old commit does; therefore I suspect the issue is somewhere that affects training.

Also, I must give you a big thumbs up and all of my credit, because today I also trained on the original Darknet repo under Linux and the result was not as good as yours. It means your repo is better in all aspects. Therefore it is a shame to have this very small issue, and I'm trying to find a solution.
Thank you. Can you show the result when the cfg is based on yolo-voc.2.0.cfg (thresh = 0.001)?

I started training on Ubuntu, but it seems there is a problem with training and the random functions - you know, the count value; see the picture. Do you want me to continue? I used the latest commit (d3577a5).
Yes, I know; I will comment out this srand() again later.
I mean: what IoU do you get if you train using the original Linux repo (https://github.com/pjreddie/darknet) and then test the resulting weights?
And yes, you asked about the original Linux repo; the comment above was about your repo. With the original Linux repo I could not see the recall, because it was writing the results to a report and IoU was not calculated there, but from the detections I thought there was a significant difference in IoU between Linux and your repo - I mean Linux was worse.
And yes, I brought the weights trained on the original Linux repo to Windows to test them, but they showed 0 for both IoU and Recall. In both Linux and Windows I had yolo-voc.2.0.cfg in place for training and testing.
> With the original Linux repo I could not see the recall, because it was writing the results to a report and IoU was not calculated there

I.e. the original Linux repo doesn't show IoU when you call ./darknet detector recall ...?
I made fixes 5 minutes ago: https://github.com/AlexeyAB/darknet/commit/4d2fefd75a57dfd6e60680eaf7408c82e15a025d So you can try to train on Linux using this repo at the last commit: https://github.com/AlexeyAB/darknet/

Also, do you use the same dataset for validation as for training, or does the valid txt file differ from the train txt file in obj.data? https://github.com/AlexeyAB/darknet/blob/master/cfg/voc.data#L3
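For reference, the relevant entries of a .data file look roughly like this (the paths are examples):

```
classes = 20
train  = data/train.txt
valid  = data/voc.2007.test
names  = data/voc.names
backup = backup/
```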
> Also, do you use the same dataset for validation as for training, or does the valid txt file differ from the train txt file in obj.data? https://github.com/AlexeyAB/darknet/blob/master/cfg/voc.data#L3

Of course the validation dataset is different from the training dataset; I think that is the rule for testing any model. Besides, the same train.txt and valid.txt were used for all experiments, and voc.2007.test was identical to valid.txt.
> I.e. the original Linux repo doesn't show IoU when you call ./darknet detector recall ...?

I think I used the valid parameter. I have the repo; I'll run another test with the recall parameter.
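For clarity, the two invocations being compared are roughly these (the cfg/weights paths are examples):

```
# "valid" writes detection results to result files and prints no IoU
./darknet detector valid  cfg/voc.data cfg/yolo-voc.2.0.cfg backup/yolo-voc_final.weights

# "recall" prints per-image IoU and cumulative Recall to the console
./darknet detector recall cfg/voc.data cfg/yolo-voc.2.0.cfg backup/yolo-voc_final.weights
```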
I will train your Linux repo tomorrow and tell you the results.
Excuse me, the above result was on the Linux repo with thresh = 0.2; I just forgot to change it to match Windows before running it. Your Linux repo can achieve an even significantly higher IoU.
Your Linux repo outperforms the original Darknet Linux repo by 2.33%. Excellent.
Therefore I summarize your latest Linux repo (4d2fefd) as follows:

- thresh = 0.2: IoU = 72.38%, Recall = 84.62%
- thresh = 0.001: IoU = 76.33%, Recall = 88.46%

Therefore we can suspect that the issue is on the Windows side.
Also, to have all the results in one place:

- Original Darknet Linux repo: IoU = 74%, Recall = 88.4%
- Old Windows repo (https://github.com/AlexeyAB/darknet/tree/a71bdd7a83e33f28d91b88551b291627728ee3e7): IoU = 76.22%, Recall = 90.33%

Therefore we can say the results of this old Windows commit and the last Linux commit are almost identical, except that the old Windows repo is about 2% better in terms of Recall.
I think now you have the clue.
Yes, there is a difference.
> Also, to have all the results in one place:
> - Original Darknet Linux repo: IoU = 74%, Recall = 88.4%

Also, can you test on Windows the weights already trained on the original Darknet Linux repo - will it give the same IoU = 74%, Recall = 88.4%?
Let me gather it all in one place. Trained weights from the original Darknet repo, tested on the Alexey Windows repo (thresh = 0.001):

- Old Windows repo: IoU = 0%, Recall = 0%
- Latest Windows repo: IoU = 74%, Recall = 88.44%

Therefore: the results of the latest Windows repo (with weights from the original Darknet) = the results of the original Darknet repo.
You are right
Do you have any clue?
No, I haven't.
Tracing from this commit https://github.com/AlexeyAB/darknet/tree/a71bdd7a83e33f28d91b88551b291627728ee3e7 to the latest commit might give you a clue.
Hi Alex, I have good news for you.
I recompiled the old commit using OpenCV 2.4.13 (it was 2.4.9 before), CUDA 8.0.61 + Patch 2 (it was CUDA 8.0.61 without the patch before) and cuDNN 6 (it was cuDNN 5.1 before).
You know what? The same results as the latest commit!!!!!!!! (I have not tested 4d2fefd.)
Now I suspect CUDA or cuDNN - or, if OpenCV also influences training somehow, that could cause this effect.
The interesting thing is that I had used the same setup on Linux for your repo (CUDA 8.0.61 + Patch 2, cuDNN 6), but got good results there.
@VanitarNordic Hi, I think this is due to CUDA and cuDNN.
OpenCV loads the images, so it can affect image quality: https://github.com/AlexeyAB/darknet/blob/4d2fefd75a57dfd6e60680eaf7408c82e15a025d/src/image.c#L599
You can simply change the line image out = load_image_cv(filename, c); to image out = load_image_stb(filename, c); here to rule out OpenCV effects - but I think OpenCV is not to blame: https://github.com/AlexeyAB/darknet/blob/4d2fefd75a57dfd6e60680eaf7408c82e15a025d/src/image.c#L1312
Should I change both line 599 and line 1312? You only described the change for line 1312.
You should only change line 1312
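For context, the function around that line in src/image.c looks roughly like this (reconstructed from memory, so treat it as a sketch):

```c
image load_image(char *filename, int w, int h, int c)
{
#ifdef OPENCV
    image out = load_image_cv(filename, c);      /* the line to change ...            */
    /* image out = load_image_stb(filename, c);     ... to this, bypassing OpenCV */
#else
    image out = load_image_stb(filename, c);
#endif
    if ((h && w) && (h != out.h || w != out.w)) {
        image resized = resize_image(out, w, h);
        free_image(out);
        out = resized;
    }
    return out;
}
```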
I tested OpenCV; it is not the cause of the issue. The next step is to test cuDNN.
Is there anywhere in the code related to cuBLAS?
Yes, cuBLAS is used for GEMM if CUDNN=0: https://github.com/AlexeyAB/darknet/blob/4d2fefd75a57dfd6e60680eaf7408c82e15a025d/src/gemm.c#L180
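For reference, when CUDNN=0 the GPU GEMM path ends in a cuBLAS call roughly like this (a sketch from memory of darknet's gemm_ongpu; the wrapper name here is made up, so treat the details as approximate):

```c
#include <cublas_v2.h>

/* C = ALPHA * A*B + BETA * C on the GPU. Darknet stores matrices row-major,
 * so A and B are swapped to fit cuBLAS's column-major convention. */
void gemm_gpu_sketch(cublasHandle_t handle, int TA, int TB, int M, int N, int K,
                     float ALPHA, float *A_gpu, int lda, float *B_gpu, int ldb,
                     float BETA, float *C_gpu, int ldc)
{
    cublasSgemm(handle,
                TB ? CUBLAS_OP_T : CUBLAS_OP_N,
                TA ? CUBLAS_OP_T : CUBLAS_OP_N,
                N, M, K,
                &ALPHA, B_gpu, ldb, A_gpu, lda,
                &BETA,  C_gpu, ldc);
}
```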
Actually I tested it with cuDNN 5.1 and the results were the same, so the only thing remaining is the CUDA patch. I should remove CUDA and install it again without applying Patch 2 and see if that reproduces the issue.

Alright, I tested that, but it did not solve the case. If you know of any other issues that could affect the Visual Studio build, please let me know so I can test them; it happened after I recompiled it.
Could the NVIDIA display driver have an effect? (It is not the same one that comes with CUDA.)
No thoughts on this matter.
Alright, the story is finished. I downloaded OpenCV 2.4.9, recompiled, and trained with it. The same results appeared: IoU = 76.18% and Recall = 90.38%.

I was not actually amazed, because I had run into these OpenCV version-to-version issues before; one function showed different behaviour just by changing versions. Besides, it shows that OpenCV has more impact than we had imagined.
Also, the 2% higher Recall in your Windows repo is possibly because the rand_s() function produces better-quality random numbers than rand(), which is true.
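As a side note, on Windows rand_s() is exposed by the CRT when _CRT_RAND_S is defined before <stdlib.h>; a minimal sketch of wrapping it as a substitute for rand() (the actual wrapper in this repo may differ, and the function name here is made up):

```c
#define _CRT_RAND_S          /* must come before <stdlib.h> to expose rand_s() */
#include <stdlib.h>

/* Returns a random value from the OS CSPRNG instead of the LCG behind rand(). */
unsigned int random_u32(void)
{
    unsigned int r = 0;
    rand_s(&r);              /* fills r; returns 0 on success */
    return r;
}
```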
Now I'll update CUDA and cuDNN to their latest versions and retest.
I have a question: which is more important, the IoU value or Recall? For example, if a 1% increase in IoU causes a 2% decrease in Recall, which do you prefer?
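For reference, IoU here is the intersection-over-union of a predicted box with its ground-truth box, and Recall is the fraction of ground-truth boxes matched by a detection with sufficient overlap. A minimal sketch of the IoU computation for center/size boxes (plain C, not darknet's exact box_iou; the names are made up for illustration):

```c
#include <math.h>

typedef struct { float x, y, w, h; } box_t;   /* center x/y plus width/height */

static float overlap(float x1, float w1, float x2, float w2)
{
    float l = fmaxf(x1 - w1 / 2, x2 - w2 / 2);
    float r = fminf(x1 + w1 / 2, x2 + w2 / 2);
    return r - l;
}

/* IoU = intersection area / union area, in [0,1]; 0 if the boxes do not overlap. */
float box_iou_sketch(box_t a, box_t b)
{
    float iw = overlap(a.x, a.w, b.x, b.w);
    float ih = overlap(a.y, a.h, b.y, b.h);
    if (iw <= 0 || ih <= 0) return 0;
    float inter = iw * ih;
    float uni = a.w * a.h + b.w * b.h - inter;
    return inter / uni;
}
```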
Hello,

The Yolo authors have changed the YOLO-VOC CFG files on their original website, I mean here:

Training with these new CFG files (with either thresh = 0.2 or thresh = 0.001) does not produce the same results as when I trained with the older repo and the older YOLO-VOC CFG file. I can say the results are way worse than before. Besides, training with YOLO-VOC-2.0 gives slightly better results, but it still cannot compete with the older results (with either thresh = 0.2 or 0.001).

I have used the latest commit of the repo here (4892071). What is the problem: the new CFG files or the repo code?