yangulei opened this issue 5 years ago
@yangulei There are some things you can do in this situation. First, you can augment your images to get more data to train on. You can use weights from models trained on ImageNet, or any other dataset, which this repo and pjreddie.com provide; this is called transfer learning. You can also do hyperparameter tuning, i.e. making changes in the cfg file. Joseph Redmon outlines how he does this in his research papers, for instance increasing height and width. Also, ideally your dataset should have an equal distribution of the classes you want to train your model to recognize.
@yangulei
Can you show the output anchors and cloud.png that you get by using this command?
./darknet detector calc_anchors data/obj.data -num_of_clusters 9 -width 416 -height 416 -show
Can you attach your cfg-file (renamed to txt-file)?
Do you want to Detect objects on images or on video (file, camera, ...)?
Label the images in the selected dataset (540 images), which seriously hurts my eyes. Split the selected dataset into train/valid sets (435 vs. 105) and start training the model. I want to train a yolo detector to detect ‘bus’, ‘car’ and ‘truck’ in the videos recorded by a drone.
Did you use Yolo mark? https://github.com/AlexeyAB/Yolo_mark For good results you should have about 2000 images per class, i.e. about 6000 images :) https://github.com/AlexeyAB/darknet#how-to-improve-object-detection
So collect more images and do data augmentation, e.g. rotation, because rotation augmentation isn't implemented in Yolo yet.
For each object which you want to detect there must be at least 1 similar object in the training dataset with about the same: shape, side of object, relative size, angle of rotation, tilt, illumination. So it is desirable that your training dataset includes images with objects at different scales, rotations, lightings, from different sides, and on different backgrounds. You should preferably have 2000 different images for each class or more, and you should train for 2000*classes iterations or more.
@JakupGuven @AlexeyAB Thanks for your reply and suggestions.
Do you want to Detect objects on images or on video (file, camera, ...)?
My goal is to train a yolo detector to detect 'bus', 'car', and 'truck' in the frames taken by a camera mounted on a drone. The detector model will be deployed on the NVIDIA Jetson TX2(i) and fly with the drone. Regarding deployment of the model, I'll use NVIDIA TensorRT, which speeds up inference by about 2 to 3 times.
Can you attach your cfg-file (renamed to txt-file)?
Here is the cfg-file I used for training, which is an early prototype customized with several considerations in mind:
Can you show the output anchors and cloud.png that you get by using this command?
The input size of my model is 960*640, and there are only 2 "yolo" layers, so I calculated the anchors using this command:
darknet detector calc_anchors data\drone.data -num_of_clusters 6 -width 960 -height 640 -show
and I got this:
and this:
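For anyone following along: as far as I understand, calc_anchors essentially clusters the (width, height) of all ground-truth boxes, scaled to the network input size, with a k-means variant. A rough Python sketch of the idea (hypothetical label paths, plain Euclidean distance; darknet's exact distance metric may differ):

```python
# Sketch of anchor calculation: k-means over the width/height of all labeled
# boxes, scaled to the network input size. Label format is the YOLO one
# (class cx cy w h, normalized); the label directory is a hypothetical path.
import glob
import numpy as np

NET_W, NET_H = 960, 640   # -width / -height passed to calc_anchors
NUM_ANCHORS = 6           # -num_of_clusters

def load_box_sizes(label_dir):
    sizes = []
    for path in glob.glob(f"{label_dir}/*.txt"):
        for line in open(path):
            parts = line.split()
            if len(parts) == 5:
                _, _, _, w, h = map(float, parts)
                sizes.append((w * NET_W, h * NET_H))
    return np.array(sizes)

def kmeans(points, k, iters=300):
    # plain Lloyd's algorithm on (w, h); the distance metric in darknet may differ
    rng = np.random.default_rng(0)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    return centers[np.argsort(centers[:, 0] * centers[:, 1])]

anchors = kmeans(load_box_sizes("data/labels"), NUM_ANCHORS)
print(", ".join(f"{w:.0f},{h:.0f}" for w, h in anchors))
```

In practice the calc_anchors command above is what you should use; this is only to show why the anchors depend on the label statistics and the chosen -width/-height.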
Did you use Yolo mark? https://github.com/AlexeyAB/Yolo_mark
I use LabelImg. https://github.com/tzutalin/labelImg
For good results you should have about 2000 images per class, i.e. about 6000 images :)
This is the point that confuses me. Why does the number of images matter, instead of the number of objects in the images? There might be dozens or even hundreds of objects in a single frame:
It's really a huge amount of work to label 6000 images! I know 540 images are far from enough, and I will collect more, but my point is how to reduce the imbalance of the dataset at the same time.
do data augmentation, e.g. rotation, because rotation augmentation isn't implemented in Yolo yet.
That's a good idea, I'll check whether the rotated frames look similar to some real scenes. Does the "angle" parameter in the cfg-file mean rotation augmentation? I found a function that seems to do the job: https://github.com/AlexeyAB/darknet/blob/099b71d1de6b992ce8f9d7ff585c84efd0d4bf94/src/image.c#L1005-L1024
@yangulei
Set num=6 for both [yolo] layers in cfg-file, since you use only 6 anchors.
I'll check whether the rotated frames look similar to some real scenes.
Yes, maybe at your shooting angle the rotation-augmentation is possible only within ±15 degrees.
Does the "angle" parameter in the cfg-file mean rotation augmentation?
Yes. But currently it works only for training the Classifier.
This is the point that confuses me. Why does the number of images matter, instead of the number of objects in the images? There might be dozens or even hundreds of objects in a single frame:
What matters is the number of objects and the number of backgrounds. So you should collect more images (even if there are no objects) to get more backgrounds.
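If it helps, a quick way to check both of those numbers (objects per class and background-only images) is a small script over the YOLO-format label files; the label directory and the class-id-to-name mapping below are assumptions, adjust them to your .names file:

```python
# Count objects per class and images with no objects (pure backgrounds)
# across YOLO-format label files. Paths and class ids are assumptions.
import glob
from collections import Counter

CLASS_NAMES = {0: "bus", 1: "car", 2: "truck"}   # assumed mapping

counts = Counter()
background_images = 0
for path in glob.glob("data/labels/*.txt"):
    lines = [l for l in open(path) if l.strip()]
    if not lines:
        background_images += 1
    for line in lines:
        counts[int(line.split()[0])] += 1

for cls_id, n in sorted(counts.items()):
    print(f"{CLASS_NAMES.get(cls_id, cls_id)}: {n} objects")
print(f"background-only images: {background_images}")
```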
My goal is to train a yolo detector to detect 'bus', 'car', and 'truck' in the frames taken by a camera mounted on a drone. The detector model will be deployed on the NVIDIA Jetson TX2(i) and fly with the drone. Regarding deployment of the model, I'll use NVIDIA TensorRT, which speeds up inference by about 2 to 3 times.
What software (repository) do you use for detection with TensorRT?
Also you should compare different trained models by Accuracy/Detection_Time. There are LSTM-Convolutional networks which can detect on video much better than usual Convolutional networks, with ~1.5x higher mAP on video.
Did you compare Accuracy/Detection_Time for [maxpool] stride=2 instead of [convolutional] stride=2 in your small model?
In your small model, each final activation has a receptive field of about 160x160 pixels, so I think it should be enough for your small objects.
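As a side note, the receptive field of a plain sequential stack of conv/pool layers is easy to check by hand; the layer list below is only an illustrative example, not the exact cfg from this thread:

```python
# Receptive-field calculation for a sequential stack of layers, each given as
# (kernel_size, stride). Padding does not change the receptive-field size.
def receptive_field(layers):
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the field by (k-1) * current jump
        jump *= s              # stride multiplies the step between output pixels
    return rf

# hypothetical example: five blocks of [3x3 conv stride 1, 3x3 conv stride 2]
layers = [(3, 1), (3, 2)] * 5
print(receptive_field(layers))   # -> 125 input pixels
```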
@AlexeyAB
Set num=6 for both [yolo] layers in cfg-file, since you use only 6 anchors.
Oh, you are right, I just forgot that. I'll correct this and train my model again, thanks for pointing that out.
maybe at your shooting angle the rotation-augmentation is possible only within ±15 degrees.
If I do the rotation-augmentation myself, how do I calculate the labels in the augmented images? I'm afraid the bounding box of the rotated rectangle will be somewhat bigger than the ground truth. Do you have a better idea?
What software (repository) do you use for detection with TensorRT?
I wrote a simplified yolo parser referring to the deepstream reference apps. It doesn't use any plugins, only the layers officially supported by TensorRT.
Also you should compare different trained models by Accuracy/Detection_Time. There are LSTM-Convolutional networks which can detect on video much better than usual Convolutional networks, with ~1.5x higher mAP on video.
So far, I'm following the tracking-by-detection scheme. I agree that LSTM-Convolutional networks should be better, but I don't have a good enough understanding of LSTMs for now. I learned CNNs mainly through Stanford CS231n; do you have any suggestions for courses or books on LSTMs?
Did you compare Accuracy/Detection_Time for [maxpool] stride=2 instead of [convolutional] stride=2 in your small model?
Not yet. In fact, I combined a [maxpool] and the [convolutional] after it into a single [convolutional] with stride=2, because personally I don't like [maxpool]. In my opinion, the downsampling strategy should be learned during the training process, instead of being set artificially. I also noticed that the [maxpool] + [convolutional] in yolov2.cfg are replaced by [convolutional] with stride=2 in yolov3.cfg too.
Back to my original concern: how do I reduce the imbalance while enriching the dataset? Does this question have an answer, or will it go away once the dataset is rich enough (by augmentation and/or by labeling)?
You'll probably get better results with simple upsampling. This paper https://arxiv.org/pdf/1710.05381.pdf from October found consistently better results in visual object detection tasks by upsampling to parity. You could further enhance this by tweaking the class thresholds before the softmax to reflect the expected distribution in the population, i.e. usually you take whichever prediction is highest to be the most probable class, but if you know that cars are much more common than buses then you might set thresholds as [Car: 0.05, Bus: 0.2] and then interpret a probability vector [Car: 0.15, Bus: 0.17] as a prediction for a Car (also described in more detail in the afore-linked paper).
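A minimal sketch of that thresholding idea (the class names, threshold values and the ratio-based decision rule are my own illustration, not taken from the paper):

```python
# Prior-adjusted class selection at inference time: instead of a plain argmax,
# compare each class score against its own threshold and pick the class that
# exceeds its threshold by the largest factor. All numbers are illustrative.
def pick_class(probs, thresholds):
    ratios = {c: probs[c] / thresholds[c] for c in probs}
    best = max(ratios, key=ratios.get)
    return best if ratios[best] >= 1.0 else None   # None = nothing confident enough

thresholds = {"car": 0.05, "bus": 0.2, "truck": 0.1}
print(pick_class({"car": 0.15, "bus": 0.17, "truck": 0.05}, thresholds))  # -> car
```

With the numbers from the example above this picks "car" even though "bus" has the higher raw probability.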
@yangulei
If I do the rotation-augmentation myself, how do I calculate the labels in the augmented images? I'm afraid the bounding box of the rotated rectangle will be somewhat bigger than the ground truth. Do you have a better idea?
No, I don't have any ideas ) Just maybe, if the rotations are very small, the bbox will only be a little bit bigger.
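If someone wants to try it anyway, here is a hedged sketch of recomputing a YOLO label after rotating the image about its center; it assumes the rotated image keeps the original canvas size, and the new box is the axis-aligned hull of the rotated corners, so it is indeed a bit larger than the true object:

```python
# Recompute a YOLO label (cx, cy, w, h, all normalized to [0,1]) after the
# image is rotated by angle_deg about its center. Sketch only, not repo code.
import math

def rotate_yolo_box(cx, cy, w, h, angle_deg, img_w, img_h):
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    # work in pixel coordinates so the rotation matches what is done to the image
    px, py, pw, ph = cx * img_w, cy * img_h, w * img_w, h * img_h
    corners = [(px + dx * pw / 2, py + dy * ph / 2)
               for dx in (-1, 1) for dy in (-1, 1)]
    rotated = []
    for x, y in corners:
        rx = img_w / 2 + (x - img_w / 2) * cos_a - (y - img_h / 2) * sin_a
        ry = img_h / 2 + (x - img_w / 2) * sin_a + (y - img_h / 2) * cos_a
        rotated.append((rx, ry))
    xs, ys = zip(*rotated)
    x_min, x_max = max(min(xs), 0.0), min(max(xs), img_w)
    y_min, y_max = max(min(ys), 0.0), min(max(ys), img_h)
    return ((x_min + x_max) / 2 / img_w, (y_min + y_max) / 2 / img_h,
            (x_max - x_min) / img_w, (y_max - y_min) / img_h)

print(rotate_yolo_box(0.5, 0.5, 0.2, 0.1, 15, 960, 640))
```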
Not yet. In fact, I combined a [maxpool] and the [convolutional] after it into a single [convolutional] with stride=2, because personally I don't like [maxpool]. In my opinion, the downsampling strategy should be learned during the training process, instead of being set artificially. I also noticed that the [maxpool] + [convolutional] in yolov2.cfg are replaced by [convolutional] with stride=2 in yolov3.cfg too.
In the big full yolov3.cfg the conv-stride=2 is used, but in the small yolov3-tiny.cfg the maxpool-stride=2 is used.
I did only a few experiments with yolov3-tiny.cfg with conv-stride=2, and it seems you need to train it much longer.
Back to my original concern: how do I reduce the imbalance while enriching the dataset? Does this question have an answer, or will it go away once the dataset is rich enough (by augmentation and/or by labeling)?
In the optimizer, this is already solved as much as possible by using decay:
https://github.com/AlexeyAB/darknet/issues/1845#issuecomment-434079699
In most cases focal_loss=1 in the [yolo] layer (as used in RetinaNet) doesn't help to solve imbalance.
So you just should add more images, especially with buses and trucks.
So far, I'm following the tracking-by-detection scheme. I agree that LSTM-Convolutional networks should be better, but I don't have a good enough understanding of LSTMs for now. I learned CNNs mainly through Stanford CS231n; do you have any suggestions for courses or books on LSTMs?
What tracker do you use? No, I don't have a suggestion for a good book/course. You can try starting from https://en.wikipedia.org/wiki/Long_short-term_memory and, if you have enough time, https://arxiv.org/pdf/1506.04214v2.pdf In several days I will add Conv-LSTM layers and a model for Detection, with a description; currently I am testing it.
I am facing the same problem (unbalanced dataset) - here are some things I want to try out. I will try and see how this helps; I think these are not bad ideas.
@LukeAI Thanks for sharing your ideas.
You'll probably get better results with simple upsampling. ...
I think you mean oversampling in the paper. As it says in the paper:
The main idea is to ensure uniform class distribution of each mini-batch and control the selection of examples from each class.
This is applied by selecting more samples from minority classes during a mini-batch, which is straightforward for a classification task. But I can't figure out how to apply this in a detection task. The samples of all classes are embedded in the same image, and I don't have enough images in which there are more "bus" or "truck" than "car", so I don't know how to balance the class distribution for a mini-batch.
@AlexeyAB
What Tracker do you use?
I'm using a C++ implementation of SORT.
In several days I will add Conv-LSTM layers and model for Detection, with a description, currently I test it.
Amazing, looking forward to that. : )
@holger-prause Looking forward to your updates. : )
I think you mean oversampling in the paper. As it says in the paper:
yeah good point :)
I can't figure out how to apply this in a detection task. The samples of all classes are embedded in the same image, and I don't have enough images in which there are more "bus" or "truck" than "car", so I don't know how to balance the class distribution for a mini-batch.
I see what you mean... just throwing an idea out there but possibly you could write a script to crop out regions with lots of cars only and append that to the original dataset (and fill in with blackness) as a crude way to balance things out a bit more? Another approach mentioned in that paper is to tune the softmax thresholds to try to compensate for the bias in the model resulting from the imbalance? What do you think?
Wow, I think the idea to balance the !minibatch! is the way to go! This way you make sure the model won't "forget" about "irrelevant" samples. Well, I don't know how to do this in yolo (custom loss function? - hmm, I don't think so) - I guess you would need to adapt the code which reads in the samples?
So the idea is not to balance the dataset but to balance what your model sees during training? I think that again makes sense to me - thank you guys very much - this thread is good!
Good, this problem is solvable :-)
I'm not sure I can see why balancing every mini-batch is going to give a different result from the conventional, more straightforward approach of balancing the whole dataset and shuffling. I'm ready to be proved wrong though; if you get any results, please do update us.
@LukeAI
possibly you could write a script to crop out regions with lots of cars only and append that to the original dataset (and fill in with blackness) as a crude way to balance things out a bit more?
This is a good idea, just like @holger-prause said. But we need to find a better way to merge the cropped objects and the background; it's too artificial for now. And I have another idea: copy the sample images and blur the objects of the majority classes, maybe with additional augmentation, to make more samples for the minority classes.
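A rough Pillow-based sketch of that blur-the-majority-class idea (the class id to drop, the paths and the blur radius are all assumptions):

```python
# Copy an image, blur (or black out) all boxes of one majority class, and write
# a label file that keeps only the remaining objects. Sketch only.
from PIL import Image, ImageFilter

def suppress_class(img_path, label_path, out_img, out_label, drop_class=1, blur=True):
    img = Image.open(img_path).convert("RGB")
    W, H = img.size
    kept = []
    for line in open(label_path):
        cls, cx, cy, w, h = line.split()
        if int(cls) != drop_class:
            kept.append(line)
            continue
        # pixel box of the object to hide
        cx, cy, w, h = float(cx) * W, float(cy) * H, float(w) * W, float(h) * H
        box = (int(cx - w / 2), int(cy - h / 2), int(cx + w / 2), int(cy + h / 2))
        region = img.crop(box)
        patch = region.filter(ImageFilter.GaussianBlur(15)) if blur \
                else Image.new("RGB", region.size, (0, 0, 0))
        img.paste(patch, box[:2])
    img.save(out_img)
    with open(out_label, "w") as f:
        f.writelines(kept)
```

One caveat: the blurred patches still look vaguely car-like but are labeled as background, so it's worth checking whether this helps or hurts in practice.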
Another approach mentioned in that paper is to tune the softmax thresholds to try to compensate for the bias in the model resulting from the imbalance? What do you think?
I don't think this could affect the training process, unless we could find a way to add this logic to the loss function.
I don't think this could affect the training process, unless we could find a way to add this logic to the loss function.
No, it wouldn't affect the training - this is for inference time. The paper suggests that the best results come from a balanced dataset with inference-time thresholds adjusted to reflect the population distribution, but mentions that reasonable results have been obtained by others using an unbalanced training set and this technique to offset that bias at inference time.
Related question:
In my previous trainings, I have generally tried to balance the frequency of instances of each class, as opposed to the number of images containing each class. E.g. if I have 200 photos with a total of 1000 cars and 100 photos with a total of 200 dogs, then I will oversample the dog photos by a factor of 5, not by a factor of 2. Does this sound like the best approach?
@LukeAI
If I have 200 photos with a total of 1000 cars and 100 photos with a total of 200 dogs then I will oversample the dog photos by a factor of 5, not by a factor of 2.
What is "oversample"?
What is "oversample"?
Using duplicate copies of minority-class examples so that during training the network is exposed to roughly the same number of examples of each class, so as to avoid bias. A classic extreme example is training a network to identify fraudulent bank transactions: almost all transactions are not fraudulent, so if you use a representative sample of all bank transactions for training, your network will probably just learn to predict that all transactions are non-fraudulent, because that prediction was correct 99.9999% of the time during training.
I do this with darknet by putting multiple entries of images containing minority classes into train.txt and test.txt
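For reference, a small sketch of how such an oversampled train.txt could be generated; the repeat factors, class ids and the assumption that the label .txt sits next to the image are all mine, darknet itself just reads whatever paths are listed, duplicates included:

```python
# Build an oversampled train.txt by repeating entries for images that contain
# minority classes. Repeat factors are chosen by hand from the instance counts
# (e.g. 1000 cars vs 200 dogs -> repeat dog images 5x).
import os

TARGET_REPEATS = {0: 5, 1: 1, 2: 4}   # class id -> repeat factor (assumed ids)

def repeats_for(label_path):
    if not os.path.exists(label_path):
        return 1                       # no label file: background image, keep once
    classes = {int(l.split()[0]) for l in open(label_path) if l.strip()}
    return max((TARGET_REPEATS.get(c, 1) for c in classes), default=1)

with open("train_oversampled.txt", "w") as out:
    for img_path in open("train.txt"):
        img_path = img_path.strip()
        if not img_path:
            continue
        label_path = os.path.splitext(img_path)[0] + ".txt"
        out.write((img_path + "\n") * repeats_for(label_path))
```

Images that contain several classes simply take the largest factor of the classes they contain, which is one crude way of handling the "all classes in the same image" issue discussed above.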
@LukeAI Yes, it's a good solution.
@AlexeyAB I am trying to implement weapon detection and am facing rotation-related issues. I am thinking of rotating each image by multiples of 30 degrees, so 12 versions of the same image. If this could be handled in the cfg file, that would be wonderful.
I want to train a yolo detector to detect ‘bus’, ‘car’ and ‘truck’ in the videos recorded by a drone. Here is what I did so far:
Then I got this problem: the recall during training quickly reaches about 100%, but the mAP is only about 25% and decreases with more training steps. I guess it’s the so-called overfitting problem. I also noticed that the selected dataset is badly imbalanced: the numbers of ‘bus’, ‘car’ and ‘truck’ are 0.6k, 20k and 1.4k respectively. I’m going to select more images with relatively low confidences to enrich my dataset, following the concept of active learning. But I don’t know how to deal with the imbalance of the dataset. Does anyone have some ideas?