AlexeyAB opened this issue 5 years ago
Yes, I was looking into a similar thing a few weeks ago. You might find these papers interesting:
Looking Fast and Slow: Memory-Guided Mobile Video Object Detection https://arxiv.org/abs/1903.10172
Mobile Video Object Detection with Temporally-Aware Feature Maps https://arxiv.org/pdf/1711.06368.pdf
source code: https://github.com/tensorflow/models/tree/master/research/lstm_object_detection
@i-chaochen Thanks! It's an interesting new state-of-the-art solution that uses conv-LSTM not only to increase accuracy, but also for speedup.
Did you understand what they mean here?
Does it mean that there are two models:
- f0 (large model, 320x320, with depth 1.4x)
- f1 (small model, 160x160, with depth 0.35x)
and that the states c will be updated only by the f0 model during both Training and Detection (while f1 will not update the states c)?
We also observe that one inherent weakness of the LSTM is its inability to completely preserve its state across updates in practice. The sigmoid activations of the input and forget gates rarely saturate completely, resulting in a slow state decay where long-term dependencies are gradually lost. When compounded over many steps, predictions using the f1 degrade unless f0 is rerun. We propose a simple solution to this problem by simply skipping state updates when f1 is run, i.e. the output state from the last time f0 was run is always reused. This greatly improves the LSTM’s ability to propagate temporal information across long sequences, resulting in minimal loss of accuracy even when f1 is exclusively run for tens of steps.
Yes. You have a very sharp eye!
Based on their paper, f0 is for accuracy and f1 is for speed.
They use f0 occasionally to update the state, whilst f1 runs most of the time to speed up inference.
Thus, building on this "simple" intuition, part of the paper's contribution is using Reinforcement Learning to learn an optimized interleaving policy for f0 and f1.
We can try to have this interleaving first.
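To make the interleaving concrete, here is a minimal Python sketch of the idea (not the authors' code): run_f0 and run_f1 are hypothetical placeholders for the large and small extractors, and a fixed period stands in for the learned RL interleaving policy.

```python
# Minimal sketch of f0/f1 interleaving with frozen LSTM state (Looking Fast and Slow).
# run_f0 / run_f1 are hypothetical placeholders for the large (320x320, depth 1.4x)
# and small (160x160, depth 0.35x) extractors; a fixed period replaces the RL policy.

def run_f0(frame, state):
    """Large extractor: returns detections and a refreshed LSTM state (placeholder)."""
    new_state = {"keyframe_features": frame}  # dummy state
    return [], new_state

def run_f1(frame, state):
    """Small extractor: reuses the last f0 state, does NOT update it (placeholder)."""
    return []

def detect_video(frames, period=10):
    state = None
    all_detections = []
    for t, frame in enumerate(frames):
        if state is None or t % period == 0:
            dets, state = run_f0(frame, state)   # state c is updated only here
        else:
            dets = run_f1(frame, state)          # state update skipped: last f0 state reused
        all_detections.append(dets)
    return all_detections

if __name__ == "__main__":
    print(detect_video(list(range(25)), period=10))
```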
Comparison of different models on a very small custom dataset - 250 training and 250 validation images from video: https://drive.google.com/open?id=1QzXSCkl9wqr73GHFLIdJ2IIRMgP1OnXG
Validation video: https://drive.google.com/open?id=1rdxV1hYSQs6MNxBSIO9dNkAiBvb07aun
Ideas are based on:
LSTM object detection - model achieves state-of-the-art performance among mobile methods on the Imagenet VID 2015 dataset, while running at speeds of up to 70+ FPS on a Pixel 3 phone: https://arxiv.org/abs/1903.10172v1
PANet reaches the 1st place in the COCO 2017 Challenge Instance Segmentation task and the 2nd place in Object Detection task without large-batch training. It is also state-of-the-art on MVD and Cityscapes: https://arxiv.org/abs/1803.01534v4
There are implemented:
- convolutional-LSTM models for Training and Detection on Video, without the interleaved lightweight network - that may be implemented later
- PANet models:
  - _pan-networks - [reorg3d] + [convolutional] size=1 is used instead of Adaptive Feature Pooling (depth-maxpool) for the Path Aggregation - that may be implemented later
  - _pan2-networks - maxpooling across channels ([maxpool] maxpool_depth=1 out_channels=64) is used as in the original PAN paper, only the previous layers are [convolutional] instead of [connected] for resizability

Model (cfg & weights), network size = 544x544 | Training chart | Validation video | BFlops | Inference time RTX 2070, ms | mAP, % |
---|---|---|---|---|---|
yolo_v3_spp_pan_lstm.cfg.txt (must be trained using frames from the video) | - | - | - | - | - |
yolo_v3_tiny_pan3.cfg.txt and weights-file. Features: PAN3, AntiAliasing, Assisted Excitation, scale_x_y, Mixup, GIoU | - | video | 14 | 8.5 ms | 67.3% |
yolo_v3_tiny_pan5 matrix_gaussian_GIoU aa_ae_mixup_new.cfg.txt and weights-file. Features: MatrixNet, Gaussian-yolo + GIoU, PAN5, IoU_thresh, Deformation-block, Assisted Excitation, scale_x_y, Mixup, 512x512, use -thresh 0.6 | - | video | 30 | 31 ms | 64.6% |
yolo_v3_tiny_pan3 aa_ae_mixup_scale_giou blur dropblock_mosaic.cfg.txt and weights-file | - | video | 14 | 8.5 ms | 63.51% |
yolo_v3_spp_pan_scale.cfg.txt and weights-file | - | video | 137 | 33.8 ms | 60.4% |
yolo_v3_spp_pan.cfg.txt and weights-file | - | video | 137 | 33.8 ms | 58.5% |
yolo_v3_tiny_pan_lstm.cfg.txt and weights-file (must be trained using frames from the video) | - | video | 23 | 14.9 ms | 58.5% |
tiny_v3_pan3_CenterNet_Gaus ae_mosaic_scale_iouthresh mosaic.txt and weights-file | - | video | 25 | 14.5 ms | 57.9% |
yolo_v3_spp_lstm.cfg.txt and weights-file (must be trained using frames from the video) | - | video | 102 | 26.0 ms | 57.5% |
yolo_v3_tiny_pan3 matrix_gaussian aa_ae_mixup.cfg.txt and weights-file | - | video | 13 | 19.0 ms | 57.2% |
resnet152_trident.cfg.txt and weights-file (trained by using resnet152.201 pre-trained weights) | - | video | 193 | 110 ms | 56.6% |
yolo_v3_tiny_pan_mixup.cfg.txt and weights-file | - | video | 17 | 8.7 ms | 52.4% |
yolo_v3_spp.cfg.txt and weights-file (common old model) | - | video | 112 | 23.5 ms | 51.8% |
yolo_v3_tiny_lstm.cfg.txt and weights-file (must be trained using frames from the video) | - | video | 19 | 12.0 ms | 50.9% |
yolo_v3_tiny_pan2.cfg.txt and weights-file | - | video | 14 | 7.0 ms | 50.6% |
yolo_v3_tiny_pan.cfg.txt and weights-file | - | video | 17 | 8.7 ms | 49.7% |
yolov3-tiny_3l.cfg.txt (common old model) and weights-file | - | video | 12 | 5.6 ms | 46.8% |
yolo_v3_tiny_comparison.cfg.txt and weights-file (approximately the same conv-layers as conv+conv_lstm layers in yolo_v3_tiny_lstm.cfg) | - | video | 20 | 10.0 ms | 36.1% |
yolo_v3_tiny.cfg.txt (common old model) and weights-file | - | video | 9 | 5.0 ms | 32.3% |
Great work! Thank you very much for sharing this result.
LSTM indeed improves the results. I wonder whether you have evaluated the inference time with LSTM as well?
Thanks
How to train LSTM networks:
Use one of the cfg-files with lstm in the filename.
Use a pre-trained weights file:
- for Tiny: use yolov3-tiny.conv.14, which you can get from https://pjreddie.com/media/files/yolov3-tiny.weights by using the command ./darknet partial cfg/yolov3-tiny.cfg yolov3-tiny.weights yolov3-tiny.conv.14 14
- for Full: use http://pjreddie.com/media/files/darknet53.conv.74
You should train it on sequential frames from one or several videos:
./yolo_mark data/self_driving cap_video self_driving.mp4 1 - grabs every 1st frame from the video (you can vary this from 1 to 5)
./yolo_mark data/self_driving data/self_driving.txt data/self_driving.names - to mark bboxes, even if at some point the object is invisible (occluded/obscured by another object)
./darknet detector train data/self_driving.data yolo_v3_tiny_pan_lstm.cfg yolov3-tiny.conv.14 -map - to train the detector
./darknet detector demo data/self_driving.data yolo_v3_tiny_pan_lstm.cfg backup/yolo_v3_tiny_pan_lstm_last.weights forward.avi - to run detection
If you encounter a CUDA Out of memory error, then reduce the value of time_steps= in your cfg-file by half.
The only condition is that the frames from the video must go sequentially in the train.txt file.
You should validate results on a separate Validation dataset, for example, divide your dataset into 2:
- train.txt - the first 80% of frames (80% from video 1 + 80% from video 2, if you use frames from 2 videos)
- valid.txt - the last 20% of frames (20% from video 1 + 20% from video 2, if you use frames from 2 videos)
Or you can use, for example:
- train.txt - frames from some 8 videos
- valid.txt - frames from some 2 videos
LSTM:
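As a rough helper sketch for the 80/20 split described above (the directory layout and file naming are assumptions, not part of this repo), keeping frames sequential within each video:

```python
# Rough sketch: build train.txt / valid.txt with an 80/20 split per video,
# keeping the frames of every video in sequential order. Paths are assumptions.
import glob
import os

def split_sequences(video_dirs, train_path="train.txt", valid_path="valid.txt", ratio=0.8):
    train, valid = [], []
    for d in video_dirs:
        frames = sorted(glob.glob(os.path.join(d, "*.jpg")))  # sequential order by filename
        cut = int(len(frames) * ratio)
        train.extend(frames[:cut])   # first 80% of this video's frames
        valid.extend(frames[cut:])   # last 20% of this video's frames
    with open(train_path, "w") as f:
        f.write("\n".join(train) + "\n")
    with open(valid_path, "w") as f:
        f.write("\n".join(valid) + "\n")

if __name__ == "__main__":
    split_sequences(["data/self_driving/video1", "data/self_driving/video2"])
```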
@i-chaochen I added the inference time to the table. When I improve the inference time for LSTM-networks, I will change them.
Thanks for the updates! What do you mean by the inference time, is it in seconds? Is it for the whole video? How about the inference time per frame, or FPS?
@i-chaochen It is in milliseconds, I fixed it )
Interesting, it seems yolo_v3_spp_lstm has fewer BFLOPs (102) than yolo_v3_spp.cfg.txt (112), but it is still slower...
@i-chaochen
I removed some overhead (from calling many functions and from reading/writing GPU-RAM) - I replaced these several separate functions for f, i, g, o, c https://github.com/AlexeyAB/darknet/blob/b9ea49af250a3eab3b8775efa53db0f0ff063357/src/conv_lstm_layer.c#L866-L869
with the single fast function add_3_arrays_activate(float *a1, float *a2, float *a3, size_t size, ACTIVATION a, float *dst);
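Functionally (ignoring the CUDA kernel details), the fused call computes roughly the following; this NumPy snippet is only an illustration of what got merged into one pass, not the actual darknet code:

```python
# Illustration only: the fused function computes dst = activation(a1 + a2 + a3) in one
# pass, instead of several separate element-wise kernels each reading/writing GPU-RAM.
import numpy as np

def add_3_arrays_activate(a1, a2, a3, activation=np.tanh):
    return activation(a1 + a2 + a3)

a1, a2, a3 = (np.random.rand(5).astype(np.float32) for _ in range(3))
print(add_3_arrays_activate(a1, a2, a3))
```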
Hi @AlexeyAB
I am trying to use yolo_v3_tiny_lstm.cfg to improve small-object detection in videos. However, I am getting the following error:
14 Type not recognized: [conv_lstm] Unused field: 'batch_normalize = 1' Unused field: 'size = 3' Unused field: 'pad = 1' Unused field: 'output = 128' Unused field: 'peephole = 0' Unused field: 'activation = leaky'
15 Type not recognized: [conv_lstm] Unused field: 'batch_normalize = 1' Unused field: 'size = 3' Unused field: 'pad = 1' Unused field: 'output = 128' Unused field: 'peephole = 0' Unused field: 'activation = leaky'
Could you please advise me on this? Many thanks.
@NickiBD For these models you must use the latest version of this repository: https://github.com/AlexeyAB/darknet
@AlexeyAB
Thanks a lot for the help. I will update my repository.
@AlexeyAB Hi, how did you run yolov3-tiny on the Pixel smartphone? Could you give some tips? Thanks very much.
Hi @AlexeyAB, I have trained yolo_v3_tiny_lstm.cfg and I want to convert it to .h5 and then to .tflite for the smartphone. However, I am getting "Unsupported section header type: conv_lstm_0" and an unsupported-operation error while converting. I really need to solve this issue. Could you please advise me on this? Many thanks.
@NickiBD Hi,
Which repository and which script do you use for this conversion?
Hi @AlexeyAB, I am using the converter in Adamdad/keras-YOLOv3-mobilenet to convert to .h5, and it worked for other models, e.g. yolo-v3-tiny 3 layers, modified yolov3, ... Could you please tell me which converter to use?
Many thanks.
@NickiBD
[conv_lstm] is a new layer, so there is no converter yet that supports it.
You should ask the converter author to add a convLSTM-layer (with the peephole-connection disabled),
or to add a convLSTM-layer (with the peephole-connection) - but then you should train with peephole=1 in each [conv_lstm]-layer in yolo_v3_tiny_lstm.cfg.
It would then map to the Keras or TensorFlow implementation, so ask the authors of those converters. As I see, conv-LSTM is implemented in:
- keras.layers.ConvLSTM2D (without peephole - that's good): https://keras.io/layers/recurrent/
- tf.contrib.rnn.ConvLSTMCell: https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/ConvLSTMCell
The Conv-LSTM layer is based on this paper - Page 4: http://arxiv.org/abs/1506.04214v1
It can be used with peephole=1 or without (peephole=0).
Peephole-connection (red boxes):
In the peephole I use * (Convolution) instead of o (element-wise product, the Hadamard product), so the convLSTM is still resizable - it can be used with any network input resolution:
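For reference, the ConvLSTM update from that paper (Shi et al., 2015, page 4), which the [conv_lstm] layer follows; in the paper the peephole terms use the Hadamard product, while here they are convolutions so the layer stays resizable (and with peephole=0 they are dropped entirely):

$$
\begin{aligned}
i_t &= \sigma(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f) \\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c) \\
o_t &= \sigma(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + b_o) \\
H_t &= o_t \circ \tanh(C_t)
\end{aligned}
$$

where $*$ denotes convolution and $\circ$ the Hadamard product.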
@AlexeyAB Thank you so much for all the info and the guidance. I truly appreciate it.
So could yolov3_spp_pan.cfg be used with standard pre-trained weights, e.g. COCO?
@LukeAI You must train yolov3_spp_pan.cfg from the beginning by using one of these pre-trained weights:
- darknet53.conv.74, which you can get from: http://pjreddie.com/media/files/darknet53.conv.74 (trained on ImageNet)
- yolov3-spp.conv.85, which you can get from https://pjreddie.com/media/files/yolov3-spp.weights by using the command ./darknet partial cfg/yolov3-spp.cfg yolov3-spp.weights yolov3-spp.conv.85 85 (trained on MS COCO)
@AlexeyAB Sorry to disturb you again. I am now training yolo_v3_tiny_lstm.cfg with my custom dataset for 10000 iterations. I used the weights from 4000 iterations (mAP ~65%) for detection and the detection results were good. However, after 5000 iterations the mAP dropped to zero, and now, at iteration 6500, it is almost mAP ~2%. The frames from the video are sequentially ordered in the train.txt file and random=0. Could you please advise me on what might be the problem? Thanks.
@NickiBD
Can you show me chart.png with the Loss & mAP charts?
And can you show the output of the ./darknet detector map command?
Hi @AlexeyAB These is the output of ./darknet detector map: layer filters size input output 0 conv 16 3 x 3 / 1 416 x 416 x 3 -> 416 x 416 x 16 0.150 BF 1 max 2 x 2 / 2 416 x 416 x 16 -> 208 x 208 x 16 0.003 BF 2 conv 32 3 x 3 / 1 208 x 208 x 16 -> 208 x 208 x 32 0.399 BF 3 max 2 x 2 / 2 208 x 208 x 32 -> 104 x 104 x 32 0.001 BF 4 conv 64 3 x 3 / 1 104 x 104 x 32 -> 104 x 104 x 64 0.399 BF 5 max 2 x 2 / 2 104 x 104 x 64 -> 52 x 52 x 64 0.001 BF 6 conv 128 3 x 3 / 1 52 x 52 x 64 -> 52 x 52 x 128 0.399 BF 7 max 2 x 2 / 2 52 x 52 x 128 -> 26 x 26 x 128 0.000 BF 8 conv 256 3 x 3 / 1 26 x 26 x 128 -> 26 x 26 x 256 0.399 BF 9 max 2 x 2 / 2 26 x 26 x 256 -> 13 x 13 x 256 0.000 BF 10 conv 512 3 x 3 / 1 13 x 13 x 256 -> 13 x 13 x 512 0.399 BF 11 max 2 x 2 / 1 13 x 13 x 512 -> 13 x 13 x 512 0.000 BF 12 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BF 13 conv 256 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 256 0.089 BF 14 CONV_LSTM Layer: 13 x 13 x 256 image, 128 filters conv 128 3 x 3 / 1 13 x 13 x 256 -> 13 x 13 x 128 0.100 BF conv 128 3 x 3 / 1 13 x 13 x 256 -> 13 x 13 x 128 0.100 BF conv 128 3 x 3 / 1 13 x 13 x 256 -> 13 x 13 x 128 0.100 BF conv 128 3 x 3 / 1 13 x 13 x 256 -> 13 x 13 x 128 0.100 BF conv 128 3 x 3 / 1 13 x 13 x 128 -> 13 x 13 x 128 0.050 BF conv 128 3 x 3 / 1 13 x 13 x 128 -> 13 x 13 x 128 0.050 BF conv 128 3 x 3 / 1 13 x 13 x 128 -> 13 x 13 x 128 0.050 BF conv 128 3 x 3 / 1 13 x 13 x 128 -> 13 x 13 x 128 0.050 BF 15 CONV_LSTM Layer: 13 x 13 x 128 image, 128 filters conv 128 3 x 3 / 1 13 x 13 x 128 -> 13 x 13 x 128 0.050 BF conv 128 3 x 3 / 1 13 x 13 x 128 -> 13 x 13 x 128 0.050 BF conv 128 3 x 3 / 1 13 x 13 x 128 -> 13 x 13 x 128 0.050 BF conv 128 3 x 3 / 1 13 x 13 x 128 -> 13 x 13 x 128 0.050 BF conv 128 3 x 3 / 1 13 x 13 x 128 -> 13 x 13 x 128 0.050 BF conv 128 3 x 3 / 1 13 x 13 x 128 -> 13 x 13 x 128 0.050 BF conv 128 3 x 3 / 1 13 x 13 x 128 -> 13 x 13 x 128 0.050 BF conv 128 3 x 3 / 1 13 x 13 x 128 -> 13 x 13 x 128 0.050 BF 16 conv 256 1 x 1 / 1 13 x 13 x 128 -> 13 x 13 x 256 0.011 BF 17 conv 512 3 x 3 / 1 13 x 13 x 256 -> 13 x 13 x 512 0.399 BF 18 conv 128 1 x 1 / 1 13 x 13 x 512 -> 13 x 13 x 128 0.022 BF 19 upsample 2x 13 x 13 x 128 -> 26 x 26 x 128 20 route 19 8 21 conv 128 1 x 1 / 1 26 x 26 x 384 -> 26 x 26 x 128 0.066 BF 22 CONV_LSTM Layer: 26 x 26 x 128 image, 128 filters conv 128 3 x 3 / 1 26 x 26 x 128 -> 26 x 26 x 128 0.199 BF conv 128 3 x 3 / 1 26 x 26 x 128 -> 26 x 26 x 128 0.199 BF conv 128 3 x 3 / 1 26 x 26 x 128 -> 26 x 26 x 128 0.199 BF conv 128 3 x 3 / 1 26 x 26 x 128 -> 26 x 26 x 128 0.199 BF conv 128 3 x 3 / 1 26 x 26 x 128 -> 26 x 26 x 128 0.199 BF conv 128 3 x 3 / 1 26 x 26 x 128 -> 26 x 26 x 128 0.199 BF conv 128 3 x 3 / 1 26 x 26 x 128 -> 26 x 26 x 128 0.199 BF conv 128 3 x 3 / 1 26 x 26 x 128 -> 26 x 26 x 128 0.199 BF 23 conv 128 1 x 1 / 1 26 x 26 x 128 -> 26 x 26 x 128 0.022 BF 24 conv 256 3 x 3 / 1 26 x 26 x 128 -> 26 x 26 x 256 0.399 BF 25 conv 128 1 x 1 / 1 26 x 26 x 256 -> 26 x 26 x 128 0.044 BF 26 upsample 2x 26 x 26 x 128 -> 52 x 52 x 128 27 route 26 6 28 conv 64 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 64 0.089 BF 29 CONV_LSTM Layer: 52 x 52 x 64 image, 64 filters conv 64 3 x 3 / 1 52 x 52 x 64 -> 52 x 52 x 64 0.199 BF conv 64 3 x 3 / 1 52 x 52 x 64 -> 52 x 52 x 64 0.199 BF conv 64 3 x 3 / 1 52 x 52 x 64 -> 52 x 52 x 64 0.199 BF conv 64 3 x 3 / 1 52 x 52 x 64 -> 52 x 52 x 64 0.199 BF conv 64 3 x 3 / 1 52 x 52 x 64 -> 52 x 52 x 64 0.199 BF conv 64 3 x 3 / 1 52 x 52 x 64 -> 52 x 52 x 64 0.199 BF conv 64 3 x 3 / 1 52 x 52 x 64 -> 52 x 52 x 64 0.199 
BF conv 64 3 x 3 / 1 52 x 52 x 64 -> 52 x 52 x 64 0.199 BF 30 conv 64 1 x 1 / 1 52 x 52 x 64 -> 52 x 52 x 64 0.022 BF 31 conv 64 3 x 3 / 1 52 x 52 x 64 -> 52 x 52 x 64 0.199 BF 32 conv 128 3 x 3 / 1 52 x 52 x 64 -> 52 x 52 x 128 0.399 BF 33 conv 18 1 x 1 / 1 52 x 52 x 128 -> 52 x 52 x 18 0.012 BF 34 yolo 35 route 24 36 conv 256 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 256 0.797 BF 37 conv 18 1 x 1 / 1 26 x 26 x 256 -> 26 x 26 x 18 0.006 BF 38 yolo 39 route 17 40 conv 512 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x 512 0.797 BF 41 conv 18 1 x 1 / 1 13 x 13 x 512 -> 13 x 13 x 18 0.003 BF 42 yolo Total BFLOPS 11.311 Allocate additional workspace_size = 33.55 MB Loading weights from LSTM/yolo_v3_tiny_lstm_7000.weights... seen 64 Done!
calculation mAP (mean average precision)...
2376
detections_count = 886, unique_truth_count = 1409
class_id = 0, name = Person, ap = 0.81% (TP = 0, FP = 0)
for thresh = 0.25, precision = -nan, recall = 0.00, F1-score = -nan
for thresh = 0.25, TP = 0, FP = 0, FN = 1409, average IoU = 0.00 %
IoU threshold = 50 %, used Area-Under-Curve for each unique Recall
mean average precision (mAP@0.50) = 0.008104, or 0.81 %
Total Detection Time: 155.000000 Seconds
Set the -points flag:
- -points 101 for MS COCO
- -points 11 for PascalVOC 2007 (uncomment difficult in voc.data)
- -points 0 (AUC) for ImageNet, PascalVOC 2010-2012, and your custom dataset
Chart :
Many thanks
@NickiBD
The frames from the video are sequentially ordered in the train.txt file and random=0.
How many images do you have in train.txt?
How many different videos (parts of videos) did you use for Training dataset?
It seems something is still unstable in training LSTM, maybe due to SGDR, so try to change these lines:
policy=sgdr
sgdr_cycle=1000
sgdr_mult=2
steps=4000,6000,8000,9000
#scales=1, 1, 0.1, 0.1
seq_scales=0.5, 1, 0.5, 1
to these lines
policy=steps
steps=4000,6000,8000,9000
scales=1, 1, 0.1, 0.1
seq_scales=0.5, 1, 0.5, 1
And train again.
@AlexeyAB
Thank you so much for the advice. I will make the changes and train again. Regarding your questions: I have ~7500 images (including some augmented images) extracted from ~100 videos for training.
@NickiBD
I have ~7500 images (including some augmented images) extracted from ~100 videos for training.
So you have something like ~100 sequences of ~75 frames each.
Yes, you can use that, but it is better to use ~200 sequential frames.
All frames in one sequence must use the same augmentation (the same cropping, scaling, color, ...), so that you could still make a good video from these ~75 frames.
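A rough sketch of what "same augmentation per sequence" means in practice; the transform itself is a placeholder, only the draw-once-per-sequence idea matters:

```python
# Sketch: draw augmentation parameters once per sequence and apply the same
# parameters to every frame, so the ~75 augmented frames still form a coherent video.
import random

def apply_transform(frame, params):
    # placeholder for the real crop / scale / color operations
    return frame

def augment_sequence(frames, seed=None):
    rng = random.Random(seed)
    params = {
        "flip": rng.random() < 0.5,
        "scale": rng.uniform(0.8, 1.2),
        "hue_shift": rng.uniform(-0.1, 0.1),
    }
    return [apply_transform(f, params) for f in frames]  # identical params for the whole clip
```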
@AlexeyAB Many thanks for all the advice.
@AlexeyAB really looking forward to trying this out - very impressive results indeed and surely worth writing a paper on? Are you planning to do so? @NickiBD let us know how those .cfg changes work out :)
@NickiBD If it doesn't help, then also try to add the parameter state_constrain=75 for each [conv_lstm] layer in the cfg-file. This correlates with the maximum number of frames to remember.
Also, do you get a better result with the lstm-model yolo_v3_tiny_lstm.cfg than with https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov3-tiny_3l.cfg, and can you show chart.png for yolov3-tiny_3l.cfg (not lstm)?
@LukeAI Maybe yes, after several improvements.
Have you implemented yolo_v3_spp_pan_lstm.cfg ?
@AlexeyAB Thank you for the guidance. This is the chart for yolov3-tiny_3l.cfg. Based on the results I got in the iterations before training became unstable, the detection results of yolo_v3_tiny_lstm were better than those of yolov3-tiny_3l.cfg.
@NickiBD So do you get a higher mAP with yolov3-tiny_3l.cfg than with yolo_v3_tiny_lstm.cfg?
@AlexeyAB Yes, so far the mAP is higher than with yolo_v3_tiny_lstm.cfg.
Hi @AlexeyAB, I'm using yolo_v3_spp_pan.cfg and trying to modify it for my use case. I see that the filters parameter is set to 24 for classes=1 instead of 18. How did you calculate this?
@sawsenrezig filters = (classes + 5) * 4
@AlexeyAB what is the formula for number of filters in the conv layers before yolo layers for yolov3_tiny_3l ?
@LukeAI In any cfg-file, filters = (classes + 5) * num / number_of_yolo_layers, where you count the number of [yolo] layers in the cfg.
ok! Wait... what is 'num' ?
'num' means the number of anchors.
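A quick worked example of the two answers above, with classes=1: 9 anchors and 3 [yolo] layers for the tiny/tiny_3l case, and 4 masks per [yolo] layer for the _pan case (the 12-anchor value here is only illustrative):

```python
# Worked example of filters = (classes + 5) * num / number_of_yolo_layers for the
# [convolutional] layer right before each [yolo] layer. Anchor counts are examples.

def filters_before_yolo(classes, num_anchors, num_yolo_layers):
    # each [yolo] layer gets num_anchors / num_yolo_layers masks, and every mask
    # predicts (classes + 5) values: x, y, w, h, objectness and the class scores
    return (classes + 5) * num_anchors // num_yolo_layers

print(filters_before_yolo(classes=1, num_anchors=9, num_yolo_layers=3))   # 18
print(filters_before_yolo(classes=1, num_anchors=12, num_yolo_layers=3))  # 24, i.e. (1 + 5) * 4
```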
Hi @AlexeyAB
Once again, thank you for all your help. I tried to apply all your valuable suggestions, except that I don't have 200 frames in each video sequence at the moment. However, the training is still unstable in my case: the accuracy drops significantly after 6000 iterations (to almost 0) and only goes up a bit afterwards. Could you please advise me on this? Many thanks in advance.
@NickiBD Try to set state_constrain=10 for each [conv_lstm] layer in your cfg-file. And use the remaining default settings, except filters= and classes=.
Hi @AlexeyAB Many thanks for the advice. I will apply that and let you know the result.
Hi @AlexeyAB
I tried to apply the change, but it is still unstable in my case and drops to 0 after 7000 iterations. However, yolov3-tiny PAN-LSTM worked fine with almost the same settings as the original cfg file and was stable. Could you please give me advice on what might be the reason?
As I need a very fast and accurate model for very-small-object detection that works on a smartphone, I don't know whether yolov3-tiny PAN-LSTM is better than yolov3-tiny_3l or yolov3-tiny-lstm for very small objects. I would be really grateful if you could assist me on this as well. Many thanks for all the help.
@NickiBD Hi,
I tried to apply the change, but it is still unstable in my case and drops to 0 after 7000 iterations. However, yolov3-tiny PAN-LSTM worked fine with almost the same settings as the original cfg file and was stable. Could you please give me advice on what might be the reason?
Do you mean that yolo_v3_tiny_pan_lstm.cfg.txt works fine, but yolo_v3_tiny_lstm.cfg.txt drops after 7000 iterations?
What are the max, min and average sizes of your objects? Calculate anchors and show me.
What is the average sequence length (how many frames in one sequence) in your dataset?
As I need a very fast and accurate model for very-small-object detection that works on a smartphone, I don't know whether yolov3-tiny PAN-LSTM is better than yolov3-tiny_3l or yolov3-tiny-lstm for very small objects. I would be really grateful if you could assist me on this as well. Many thanks for all the help.
Theoretically, the best models for small objects should be these - use them with the latest version of this repository:
- yolo_v3_tiny_pan_mixup.cfg.txt
- yolo_v3_tiny_pan_lstm.cfg.txt
@AlexeyAB Hi, thanks a lot for the reply and all your advice.
Yes, yolo_v3_tiny_pan_lstm works fine and is stable, but the accuracy of yolo_v3_tiny_lstm.cfg drops to 0 after 7000 iterations.
These are the calculated anchors : 5, 11, 7, 29, 13, 20, 12, 52, 23, 59, 49, 71
The number of frames varies, as some videos are short and some are long; there are 75-100 frames for each video in the dataset.
Many thanks again for all the help.
@NickiBD
So use yolo_v3_tiny_pan_lstm.cfg.txt instead of yolo_v3_tiny_lstm.cfg.txt, since yolo_v3_tiny_pan_lstm.cfg.txt is better in any case, especially for small objects.
Use the default anchors.
Could you please give me advice on what might be the reason?
yolo_v3_tiny_lstm.cfg.txt uses longer sequences (time_steps=16 x augment_speed=3 = 48) than yolo_v3_tiny_pan_lstm.cfg.txt (time_steps=3 x augment_speed=3 = 9), so if you train yolo_v3_tiny_lstm.cfg.txt on short video-sequences, it can lead to unstable training.
Also, yolo_v3_tiny_lstm.cfg.txt isn't good for small objects. Since your dataset contains small objects, this can also lead to unstable training.
@AlexeyAB Thank you so much for all the advice .
@AlexeyAB I see that you added TridentNet already! Do you have any results / training graphs? Maybe on the same dataset and network size as your table above so that we can compare?
Implement a Yolo-LSTM detection network that will be trained on video frames to increase mAP and solve blinking issues.
Think about whether we can use a Transformer (Vaswani et al., 2017) / GPT-2 / BERT for frame-sequences instead of word-sequences: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf and https://arxiv.org/pdf/1706.03762.pdf
Or can we use Transformer-XL https://arxiv.org/abs/1901.02860v2 or Universal Transformers https://arxiv.org/abs/1807.03819v3 for long sequences?