AlexeyAB opened this issue 5 years ago
Yes, I was looking into a similar thing a few weeks ago. You might find these papers interesting:
Looking Fast and Slow: Memory-Guided Mobile Video Object Detection https://arxiv.org/abs/1903.10172
Mobile Video Object Detection with Temporally-Aware Feature Maps https://arxiv.org/pdf/1711.06368.pdf
source code: https://github.com/tensorflow/models/tree/master/research/lstm_object_detection
@i-chaochen Thanks! It's an interesting new state-of-the-art solution that uses conv-LSTM not only to increase accuracy, but also for speedup.
Did you understand what they mean here?
Does it mean that there are two models:
- f0 (large model, 320x320, with depth 1.4x)
- f1 (small model, 160x160, with depth 0.35x)
and that the states c will be updated only by the f0 model during both Training and Detection (while f1 will not update the states c)?
We also observe that one inherent weakness of the LSTM is its inability to completely preserve its state across updates in practice. The sigmoid activations of the input and forget gates rarely saturate completely, resulting in a slow state decay where long-term dependencies are gradually lost. When compounded over many steps, predictions using the f1 degrade unless f0 is rerun. We propose a simple solution to this problem by simply skipping state updates when f1 is run, i.e. the output state from the last time f0 was run is always reused. This greatly improves the LSTM’s ability to propagate temporal information across long sequences, resulting in minimal loss of accuracy even when f1 is exclusively run for tens of steps.
Yes. You have a very sharp eye!
Based on their paper, f0 is for accuracy and f1 is for speed.
They use f0 occasionally to update the state, whilst f1 runs most of the time to speed up inference.
Thus, building on this "simple" intuition, part of the paper's contribution is using Reinforcement Learning to learn an optimized interleaving policy for f0 and f1.
We can try to have this interleaving first.
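To make the interleaving concrete, here is a minimal Python sketch of the idea (not the authors' code): run_f0 and run_f1 are hypothetical placeholders for the large and small extractors, and a fixed period stands in for the learned RL interleaving policy.

```python
# Minimal sketch of f0/f1 interleaving with frozen LSTM state (Looking Fast and Slow).
# run_f0 / run_f1 are hypothetical placeholders for the large (320x320, depth 1.4x)
# and small (160x160, depth 0.35x) extractors; a fixed period replaces the RL policy.

def run_f0(frame, state):
    """Large extractor: returns detections and a refreshed LSTM state (placeholder)."""
    new_state = {"keyframe_features": frame}  # dummy state
    return [], new_state

def run_f1(frame, state):
    """Small extractor: reuses the last f0 state, does NOT update it (placeholder)."""
    return []

def detect_video(frames, period=10):
    state = None
    all_detections = []
    for t, frame in enumerate(frames):
        if state is None or t % period == 0:
            dets, state = run_f0(frame, state)   # state c is updated only here
        else:
            dets = run_f1(frame, state)          # state update skipped: last f0 state reused
        all_detections.append(dets)
    return all_detections

if __name__ == "__main__":
    print(detect_video(list(range(25)), period=10))
```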
Comparison of different models on a very small custom dataset - 250 training and 250 validation images from video: https://drive.google.com/open?id=1QzXSCkl9wqr73GHFLIdJ2IIRMgP1OnXG
Validation video: https://drive.google.com/open?id=1rdxV1hYSQs6MNxBSIO9dNkAiBvb07aun
Ideas are based on:
LSTM object detection - model achieves state-of-the-art performance among mobile methods on the Imagenet VID 2015 dataset, while running at speeds of up to 70+ FPS on a Pixel 3 phone: https://arxiv.org/abs/1903.10172v1
PANet reaches the 1st place in the COCO 2017 Challenge Instance Segmentation task and the 2nd place in Object Detection task without large-batch training. It is also state-of-the-art on MVD and Cityscapes: https://arxiv.org/abs/1803.01534v4
There are implemented:
- convolutional-LSTM models for Training and Detection on Video, without the interleaved lightweight network - that may be implemented later
- PANet models:
  - _pan-networks - [reorg3d] + [convolutional] size=1 is used instead of Adaptive Feature Pooling (depth-maxpool) for the Path Aggregation - that may be implemented later
  - _pan2-networks - maxpooling across channels ([maxpool] maxpool_depth=1 out_channels=64) is used as in the original PAN paper, only the previous layers are [convolutional] instead of [connected] for resizability

Model (cfg & weights), network size = 544x544 | Training chart | Validation video | BFlops | Inference time RTX 2070, ms | mAP, % |
---|---|---|---|---|---|
yolo_v3_spp_pan_lstm.cfg.txt (must be trained using frames from the video) | - | - | - | - | - |
yolo_v3_tiny_pan3.cfg.txt and weights-file. Features: PAN3, AntiAliasing, Assisted Excitation, scale_x_y, Mixup, GIoU | - | video | 14 | 8.5 ms | 67.3% |
yolo_v3_tiny_pan5 matrix_gaussian_GIoU aa_ae_mixup_new.cfg.txt and weights-file. Features: MatrixNet, Gaussian-yolo + GIoU, PAN5, IoU_thresh, Deformation-block, Assisted Excitation, scale_x_y, Mixup, 512x512, use -thresh 0.6 | - | video | 30 | 31 ms | 64.6% |
yolo_v3_tiny_pan3 aa_ae_mixup_scale_giou blur dropblock_mosaic.cfg.txt and weights-file | - | video | 14 | 8.5 ms | 63.51% |
yolo_v3_spp_pan_scale.cfg.txt and weights-file | - | video | 137 | 33.8 ms | 60.4% |
yolo_v3_spp_pan.cfg.txt and weights-file | - | video | 137 | 33.8 ms | 58.5% |
yolo_v3_tiny_pan_lstm.cfg.txt and weights-file (must be trained using frames from the video) | - | video | 23 | 14.9 ms | 58.5% |
tiny_v3_pan3_CenterNet_Gaus ae_mosaic_scale_iouthresh mosaic.txt and weights-file | - | video | 25 | 14.5 ms | 57.9% |
yolo_v3_spp_lstm.cfg.txt and weights-file (must be trained using frames from the video) | - | video | 102 | 26.0 ms | 57.5% |
yolo_v3_tiny_pan3 matrix_gaussian aa_ae_mixup.cfg.txt and weights-file | - | video | 13 | 19.0 ms | 57.2% |
resnet152_trident.cfg.txt and weights-file (trained by using resnet152.201 pre-trained weights) | - | video | 193 | 110 ms | 56.6% |
yolo_v3_tiny_pan_mixup.cfg.txt and weights-file | - | video | 17 | 8.7 ms | 52.4% |
yolo_v3_spp.cfg.txt and weights-file (common old model) | - | video | 112 | 23.5 ms | 51.8% |
yolo_v3_tiny_lstm.cfg.txt and weights-file (must be trained using frames from the video) | - | video | 19 | 12.0 ms | 50.9% |
yolo_v3_tiny_pan2.cfg.txt and weights-file | - | video | 14 | 7.0 ms | 50.6% |
yolo_v3_tiny_pan.cfg.txt and weights-file | - | video | 17 | 8.7 ms | 49.7% |
yolov3-tiny_3l.cfg.txt (common old model) and weights-file | - | video | 12 | 5.6 ms | 46.8% |
yolo_v3_tiny_comparison.cfg.txt and weights-file (approximately the same conv-layers as conv+conv_lstm layers in yolo_v3_tiny_lstm.cfg) | - | video | 20 | 10.0 ms | 36.1% |
yolo_v3_tiny.cfg.txt (common old model) and weights-file | - | video | 9 | 5.0 ms | 32.3% |
Great work! Thank you very much for sharing this result.
LSTM indeed improves the results. I wonder whether you have evaluated the inference time with LSTM as well?
Thanks
How to train LSTM networks:
Use one of the cfg-files with lstm in the filename.
Use a pre-trained weights file:
- for Tiny: use yolov3-tiny.conv.14, which you can get from https://pjreddie.com/media/files/yolov3-tiny.weights by using the command ./darknet partial cfg/yolov3-tiny.cfg yolov3-tiny.weights yolov3-tiny.conv.14 14
- for Full: use http://pjreddie.com/media/files/darknet53.conv.74
You should train it on sequential frames from one or several videos:
./yolo_mark data/self_driving cap_video self_driving.mp4 1 - grabs every 1st frame from the video (you can vary this from 1 to 5)
./yolo_mark data/self_driving data/self_driving.txt data/self_driving.names - to mark bboxes, even if at some point the object is invisible (occluded/obscured by another object)
./darknet detector train data/self_driving.data yolo_v3_tiny_pan_lstm.cfg yolov3-tiny.conv.14 -map - to train the detector
./darknet detector demo data/self_driving.data yolo_v3_tiny_pan_lstm.cfg backup/yolo_v3_tiny_pan_lstm_last.weights forward.avi - to run detection
If you encounter a CUDA Out of memory error, then reduce the value of time_steps= in your cfg-file by half.
The only condition is that the frames from the video must go sequentially in the train.txt file.
You should validate results on a separate Validation dataset, for example, divide your dataset into 2:
- train.txt - the first 80% of frames (80% from video 1 + 80% from video 2, if you use frames from 2 videos)
- valid.txt - the last 20% of frames (20% from video 1 + 20% from video 2, if you use frames from 2 videos)
Or you can use, for example:
- train.txt - frames from some 8 videos
- valid.txt - frames from some 2 videos
LSTM:
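As a rough helper sketch for the 80/20 split described above (the directory layout and file naming are assumptions, not part of this repo), keeping frames sequential within each video:

```python
# Rough sketch: build train.txt / valid.txt with an 80/20 split per video,
# keeping the frames of every video in sequential order. Paths are assumptions.
import glob
import os

def split_sequences(video_dirs, train_path="train.txt", valid_path="valid.txt", ratio=0.8):
    train, valid = [], []
    for d in video_dirs:
        frames = sorted(glob.glob(os.path.join(d, "*.jpg")))  # sequential order by filename
        cut = int(len(frames) * ratio)
        train.extend(frames[:cut])   # first 80% of this video's frames
        valid.extend(frames[cut:])   # last 20% of this video's frames
    with open(train_path, "w") as f:
        f.write("\n".join(train) + "\n")
    with open(valid_path, "w") as f:
        f.write("\n".join(valid) + "\n")

if __name__ == "__main__":
    split_sequences(["data/self_driving/video1", "data/self_driving/video2"])
```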
@i-chaochen I added the inference time to the table. When I improve the inference time for LSTM-networks, I will change them.
Thanks for the updates! What do you mean by the inference time, is it in seconds? Is it for the whole video? How about the inference time per frame, or FPS?
@i-chaochen It is in milliseconds, I fixed it )
Interesting, it seems yolo_v3_spp_lstm has fewer BFLOPs (102) than yolo_v3_spp.cfg.txt (112), but it is still slower...
@i-chaochen
I removed some overhead (from calling many functions and from reading/writing GPU-RAM) - I replaced these several separate functions for f, i, g, o, c https://github.com/AlexeyAB/darknet/blob/b9ea49af250a3eab3b8775efa53db0f0ff063357/src/conv_lstm_layer.c#L866-L869
with the single fast function add_3_arrays_activate(float *a1, float *a2, float *a3, size_t size, ACTIVATION a, float *dst);
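Functionally (ignoring the CUDA kernel details), the fused call computes roughly the following; this NumPy snippet is only an illustration of what got merged into one pass, not the actual darknet code:

```python
# Illustration only: the fused function computes dst = activation(a1 + a2 + a3) in one
# pass, instead of several separate element-wise kernels each reading/writing GPU-RAM.
import numpy as np

def add_3_arrays_activate(a1, a2, a3, activation=np.tanh):
    return activation(a1 + a2 + a3)

a1, a2, a3 = (np.random.rand(5).astype(np.float32) for _ in range(3))
print(add_3_arrays_activate(a1, a2, a3))
```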
Hi @AlexeyAB
I am trying to use yolo_v3_tiny_lstm.cfg to improve small-object detection in videos. However, I am getting the following error:
14 Type not recognized: [conv_lstm] Unused field: 'batch_normalize = 1' Unused field: 'size = 3' Unused field: 'pad = 1' Unused field: 'output = 128' Unused field: 'peephole = 0' Unused field: 'activation = leaky'
15 Type not recognized: [conv_lstm] Unused field: 'batch_normalize = 1' Unused field: 'size = 3' Unused field: 'pad = 1' Unused field: 'output = 128' Unused field: 'peephole = 0' Unused field: 'activation = leaky'
Could you please advise me on this? Many thanks.
@NickiBD For these models you must use the latest version of this repository: https://github.com/AlexeyAB/darknet
@AlexeyAB
Thanks a lot for the help. I will update my repository.
@AlexeyAB Hi, how did you run yolov3-tiny on the Pixel smartphone? Could you give some tips? Thanks very much.
Hi @AlexeyAB, I have trained yolo_v3_tiny_lstm.cfg and I want to convert it to .h5 and then to .tflite for the smartphone. However, I am getting "Unsupported section header type: conv_lstm_0" and an unsupported-operation error while converting. I really need to solve this issue. Could you please advise me on this? Many thanks.
@NickiBD Hi,
Which repository and which script do you use for this conversion?
Hi @AlexeyAB, I am using the converter in Adamdad/keras-YOLOv3-mobilenet to convert to .h5, and it worked for other models, e.g. yolo-v3-tiny 3 layers, modified yolov3, ... Could you please tell me which converter to use?
Many thanks.
@NickiBD
[conv_lstm] is a new layer, so there is no converter yet that supports it.
You should ask the converter author to add a convLSTM-layer (with the peephole-connection disabled),
or to add a convLSTM-layer (with the peephole-connection) - but then you should train with peephole=1 in each [conv_lstm]-layer in yolo_v3_tiny_lstm.cfg.
It would then map to the Keras or TensorFlow implementation, so ask the authors of those converters. As I see, conv-LSTM is implemented in:
- keras.layers.ConvLSTM2D (without peephole - that's good): https://keras.io/layers/recurrent/
- tf.contrib.rnn.ConvLSTMCell: https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/ConvLSTMCell
The Conv-LSTM layer is based on this paper - Page 4: http://arxiv.org/abs/1506.04214v1
It can be used with peephole=1 or without (peephole=0).
Peephole-connection (red boxes):
In the peephole I use * (Convolution) instead of o (element-wise product, the Hadamard product), so the convLSTM is still resizable - it can be used with any network input resolution:
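For reference, the ConvLSTM update from that paper (Shi et al., 2015, page 4), which the [conv_lstm] layer follows; in the paper the peephole terms use the Hadamard product, while here they are convolutions so the layer stays resizable (and with peephole=0 they are dropped entirely):

$$
\begin{aligned}
i_t &= \sigma(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f) \\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c) \\
o_t &= \sigma(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + b_o) \\
H_t &= o_t \circ \tanh(C_t)
\end{aligned}
$$

where $*$ denotes convolution and $\circ$ the Hadamard product.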
@AlexeyAB Thank you so much for all the info and the guidance. I truly appreciate it.
So could yolov3_spp_pan.cfg be used with standard pre-trained weights, e.g. COCO?
@LukeAI You must train yolov3_spp_pan.cfg from the beginning by using one of these pre-trained weights:
- darknet53.conv.74, which you can get from: http://pjreddie.com/media/files/darknet53.conv.74 (trained on ImageNet)
- yolov3-spp.conv.85, which you can get from https://pjreddie.com/media/files/yolov3-spp.weights by using the command ./darknet partial cfg/yolov3-spp.cfg yolov3-spp.weights yolov3-spp.conv.85 85 (trained on MS COCO)
@AlexeyAB Sorry to disturb you again. I am now training yolo_v3_tiny_lstm.cfg with my custom dataset for 10000 iterations. I used the weights from 4000 iterations (mAP ~65%) for detection and the detection results were good. However, after 5000 iterations the mAP dropped to zero, and now, at iteration 6500, it is almost mAP ~2%. The frames from the video are sequentially ordered in the train.txt file and random=0. Could you please advise me on what might be the problem? Thanks.
@NickiBD
Can you show me chart.png with the Loss & mAP charts?
And can you show the output of the ./darknet detector map command?
Hi @AlexeyAB These is the output of ./darknet detector map: layer filters size input output 0 conv 16 3 x 3 / 1 416 x 416 x 3 -> 416 x 416 x 16 0.150 BF 1 max 2 x 2 / 2 416 x 416 x 16 -> 208 x 208 x 16 0.003 BF 2 conv 32 3 x 3 / 1 208 x 208 x 16 -> 208 x 208 x 32 0.399 BF 3 max 2 x 2 / 2 208 x 208 x 32 -> 104 x 104 x 32 0.001 BF 4 conv 64 3 x 3 / 1 104 x 104 x 32 -> 104 x 104 x 64 0.399 BF 5 max 2 x 2 / 2 104 x 104 x 64 -> 52 x 52 x 64 0.001 BF 6 conv 128 3 x 3 / 1 52 x 52 x 64 -> 52 x 52 x 128 0.399 BF 7 max 2 x 2 / 2 52 x 52 x 128 -> 26 x 26 x 128 0.000 BF 8 conv 256 3 x 3 / 1 26 x 26 x 128 -> 26 x 26 x 256 0.399 BF 9 max 2 x 2 / 2 26 x 26 x 256 -> 13 x 13 x 256 0.000 BF 10 conv 512 3 x 3 / 1 13 x 13 x 256 -> 13 x 13 x 512 0.399 BF 11 max 2 x 2 / 1 13 x 13 x 512 -> 13 x 13 x 512 0.000 BF 12 conv 1024 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x1024 1.595 BF 13 conv 256 1 x 1 / 1 13 x 13 x1024 -> 13 x 13 x 256 0.089 BF 14 CONV_LSTM Layer: 13 x 13 x 256 image, 128 filters conv 128 3 x 3 / 1 13 x 13 x 256 -> 13 x 13 x 128 0.100 BF conv 128 3 x 3 / 1 13 x 13 x 256 -> 13 x 13 x 128 0.100 BF conv 128 3 x 3 / 1 13 x 13 x 256 -> 13 x 13 x 128 0.100 BF conv 128 3 x 3 / 1 13 x 13 x 256 -> 13 x 13 x 128 0.100 BF conv 128 3 x 3 / 1 13 x 13 x 128 -> 13 x 13 x 128 0.050 BF conv 128 3 x 3 / 1 13 x 13 x 128 -> 13 x 13 x 128 0.050 BF conv 128 3 x 3 / 1 13 x 13 x 128 -> 13 x 13 x 128 0.050 BF conv 128 3 x 3 / 1 13 x 13 x 128 -> 13 x 13 x 128 0.050 BF 15 CONV_LSTM Layer: 13 x 13 x 128 image, 128 filters conv 128 3 x 3 / 1 13 x 13 x 128 -> 13 x 13 x 128 0.050 BF conv 128 3 x 3 / 1 13 x 13 x 128 -> 13 x 13 x 128 0.050 BF conv 128 3 x 3 / 1 13 x 13 x 128 -> 13 x 13 x 128 0.050 BF conv 128 3 x 3 / 1 13 x 13 x 128 -> 13 x 13 x 128 0.050 BF conv 128 3 x 3 / 1 13 x 13 x 128 -> 13 x 13 x 128 0.050 BF conv 128 3 x 3 / 1 13 x 13 x 128 -> 13 x 13 x 128 0.050 BF conv 128 3 x 3 / 1 13 x 13 x 128 -> 13 x 13 x 128 0.050 BF conv 128 3 x 3 / 1 13 x 13 x 128 -> 13 x 13 x 128 0.050 BF 16 conv 256 1 x 1 / 1 13 x 13 x 128 -> 13 x 13 x 256 0.011 BF 17 conv 512 3 x 3 / 1 13 x 13 x 256 -> 13 x 13 x 512 0.399 BF 18 conv 128 1 x 1 / 1 13 x 13 x 512 -> 13 x 13 x 128 0.022 BF 19 upsample 2x 13 x 13 x 128 -> 26 x 26 x 128 20 route 19 8 21 conv 128 1 x 1 / 1 26 x 26 x 384 -> 26 x 26 x 128 0.066 BF 22 CONV_LSTM Layer: 26 x 26 x 128 image, 128 filters conv 128 3 x 3 / 1 26 x 26 x 128 -> 26 x 26 x 128 0.199 BF conv 128 3 x 3 / 1 26 x 26 x 128 -> 26 x 26 x 128 0.199 BF conv 128 3 x 3 / 1 26 x 26 x 128 -> 26 x 26 x 128 0.199 BF conv 128 3 x 3 / 1 26 x 26 x 128 -> 26 x 26 x 128 0.199 BF conv 128 3 x 3 / 1 26 x 26 x 128 -> 26 x 26 x 128 0.199 BF conv 128 3 x 3 / 1 26 x 26 x 128 -> 26 x 26 x 128 0.199 BF conv 128 3 x 3 / 1 26 x 26 x 128 -> 26 x 26 x 128 0.199 BF conv 128 3 x 3 / 1 26 x 26 x 128 -> 26 x 26 x 128 0.199 BF 23 conv 128 1 x 1 / 1 26 x 26 x 128 -> 26 x 26 x 128 0.022 BF 24 conv 256 3 x 3 / 1 26 x 26 x 128 -> 26 x 26 x 256 0.399 BF 25 conv 128 1 x 1 / 1 26 x 26 x 256 -> 26 x 26 x 128 0.044 BF 26 upsample 2x 26 x 26 x 128 -> 52 x 52 x 128 27 route 26 6 28 conv 64 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 64 0.089 BF 29 CONV_LSTM Layer: 52 x 52 x 64 image, 64 filters conv 64 3 x 3 / 1 52 x 52 x 64 -> 52 x 52 x 64 0.199 BF conv 64 3 x 3 / 1 52 x 52 x 64 -> 52 x 52 x 64 0.199 BF conv 64 3 x 3 / 1 52 x 52 x 64 -> 52 x 52 x 64 0.199 BF conv 64 3 x 3 / 1 52 x 52 x 64 -> 52 x 52 x 64 0.199 BF conv 64 3 x 3 / 1 52 x 52 x 64 -> 52 x 52 x 64 0.199 BF conv 64 3 x 3 / 1 52 x 52 x 64 -> 52 x 52 x 64 0.199 BF conv 64 3 x 3 / 1 52 x 52 x 64 -> 52 x 52 x 64 0.199 
BF conv 64 3 x 3 / 1 52 x 52 x 64 -> 52 x 52 x 64 0.199 BF 30 conv 64 1 x 1 / 1 52 x 52 x 64 -> 52 x 52 x 64 0.022 BF 31 conv 64 3 x 3 / 1 52 x 52 x 64 -> 52 x 52 x 64 0.199 BF 32 conv 128 3 x 3 / 1 52 x 52 x 64 -> 52 x 52 x 128 0.399 BF 33 conv 18 1 x 1 / 1 52 x 52 x 128 -> 52 x 52 x 18 0.012 BF 34 yolo 35 route 24 36 conv 256 3 x 3 / 1 26 x 26 x 256 -> 26 x 26 x 256 0.797 BF 37 conv 18 1 x 1 / 1 26 x 26 x 256 -> 26 x 26 x 18 0.006 BF 38 yolo 39 route 17 40 conv 512 3 x 3 / 1 13 x 13 x 512 -> 13 x 13 x 512 0.797 BF 41 conv 18 1 x 1 / 1 13 x 13 x 512 -> 13 x 13 x 18 0.003 BF 42 yolo Total BFLOPS 11.311 Allocate additional workspace_size = 33.55 MB Loading weights from LSTM/yolo_v3_tiny_lstm_7000.weights... seen 64 Done!
calculation mAP (mean average precision)...
2376
detections_count = 886, unique_truth_count = 1409
class_id = 0, name = Person, ap = 0.81% (TP = 0, FP = 0)
for thresh = 0.25, precision = -nan, recall = 0.00, F1-score = -nan
for thresh = 0.25, TP = 0, FP = 0, FN = 1409, average IoU = 0.00 %
IoU threshold = 50 %, used Area-Under-Curve for each unique Recall
mean average precision (mAP@0.50) = 0.008104, or 0.81 %
Total Detection Time: 155.000000 Seconds
Set the -points flag:
- -points 101 for MS COCO
- -points 11 for PascalVOC 2007 (uncomment difficult in voc.data)
- -points 0 (AUC) for ImageNet, PascalVOC 2010-2012, and your custom dataset
Chart :
Many thanks
@NickiBD
The frames from the video are sequentially ordered in the train.txt file and random=0.
How many images do you have in train.txt?
How many different videos (parts of videos) did you use for Training dataset?
It seems something is still unstable in training LSTM, maybe due to SGDR, so try to change these lines:
policy=sgdr
sgdr_cycle=1000
sgdr_mult=2
steps=4000,6000,8000,9000
#scales=1, 1, 0.1, 0.1
seq_scales=0.5, 1, 0.5, 1
to these lines
policy=steps
steps=4000,6000,8000,9000
scales=1, 1, 0.1, 0.1
seq_scales=0.5, 1, 0.5, 1
And train again.
@AlexeyAB
Thank you so much for the advice. I will make the changes and train again. Regarding your questions: I have ~7500 images (including some augmented images) extracted from ~100 videos for training.
@NickiBD
I have ~7500 images (including some augmented images) extracted from ~100 videos for training.
So you have something like ~100 sequences of ~75 frames each.
Yes, you can use that, but it is better to use ~200 sequential frames.
All frames in one sequence must use the same augmentation (the same cropping, scaling, color, ...), so that you could still make a good video from these ~75 frames.
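A rough sketch of what "same augmentation per sequence" means in practice; the transform itself is a placeholder, only the draw-once-per-sequence idea matters:

```python
# Sketch: draw augmentation parameters once per sequence and apply the same
# parameters to every frame, so the ~75 augmented frames still form a coherent video.
import random

def apply_transform(frame, params):
    # placeholder for the real crop / scale / color operations
    return frame

def augment_sequence(frames, seed=None):
    rng = random.Random(seed)
    params = {
        "flip": rng.random() < 0.5,
        "scale": rng.uniform(0.8, 1.2),
        "hue_shift": rng.uniform(-0.1, 0.1),
    }
    return [apply_transform(f, params) for f in frames]  # identical params for the whole clip
```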
@AlexeyAB Many thanks for all the advice.
@AlexeyAB really looking forward to trying this out - very impressive results indeed and surely worth writing a paper on? Are you planning to do so? @NickiBD let us know how those .cfg changes work out :)
@NickiBD If it doesn't help, then also try to add the parameter state_constrain=75 for each [conv_lstm] layer in the cfg-file. This correlates with the maximum number of frames to remember.
Also, do you get a better result with the lstm-model yolo_v3_tiny_lstm.cfg than with https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov3-tiny_3l.cfg, and can you show chart.png for yolov3-tiny_3l.cfg (not lstm)?
@LukeAI Maybe yes, after several improvements.
Have you implemented yolo_v3_spp_pan_lstm.cfg ?
@AlexeyAB Thank you for the guidance. This is the chart for yolov3-tiny_3l.cfg. Based on the results I got in the iterations before training became unstable, the detection results of yolo_v3_tiny_lstm were better than those of yolov3-tiny_3l.cfg.
@NickiBD So do you get a higher mAP with yolov3-tiny_3l.cfg than with yolo_v3_tiny_lstm.cfg?
@AlexeyAB Yes, so far the mAP is higher than with yolo_v3_tiny_lstm.cfg.
Hi @AlexeyAB, I'm using yolo_v3_spp_pan.cfg and trying to modify it for my use case. I see that the filters parameter is set to 24 for classes=1 instead of 18. How did you calculate this?
@sawsenrezig filters = (classes + 5) * 4
@AlexeyAB what is the formula for number of filters in the conv layers before yolo layers for yolov3_tiny_3l ?
@LukeAI In any cfg-file, filters = (classes + 5) * num / number_of_yolo_layers, where you count the number of [yolo] layers in the cfg.
ok! Wait... what is 'num' ?
'num' means the number of anchors.
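A quick worked example of the two answers above, with classes=1: 9 anchors and 3 [yolo] layers for the tiny/tiny_3l case, and 4 masks per [yolo] layer for the _pan case (the 12-anchor value here is only illustrative):

```python
# Worked example of filters = (classes + 5) * num / number_of_yolo_layers for the
# [convolutional] layer right before each [yolo] layer. Anchor counts are examples.

def filters_before_yolo(classes, num_anchors, num_yolo_layers):
    # each [yolo] layer gets num_anchors / num_yolo_layers masks, and every mask
    # predicts (classes + 5) values: x, y, w, h, objectness and the class scores
    return (classes + 5) * num_anchors // num_yolo_layers

print(filters_before_yolo(classes=1, num_anchors=9, num_yolo_layers=3))   # 18
print(filters_before_yolo(classes=1, num_anchors=12, num_yolo_layers=3))  # 24, i.e. (1 + 5) * 4
```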
Hi @AlexeyAB
Once again, thank you for all your help. I tried to apply all your valuable suggestions, except that I don't have 200 frames in each video sequence at the moment. However, the training is still unstable in my case: the accuracy drops significantly after 6000 iterations (to almost 0) and only goes up a bit afterwards. Could you please advise me on this? Many thanks in advance.
@NickiBD Try to set state_constrain=10 for each [conv_lstm] layer in your cfg-file. And use the remaining default settings, except filters= and classes=.
Hi @AlexeyAB Many thanks for the advice. I will apply that and let you know the result.
Hi @AlexeyAB
I tried to apply the change, but it is still unstable in my case and drops to 0 after 7000 iterations. However, yolov3-tiny PAN-LSTM worked fine with almost the same settings as the original cfg file and was stable. Could you please give me advice on what might be the reason?
As I need a very fast and accurate model for very-small-object detection that works on a smartphone, I don't know whether yolov3-tiny PAN-LSTM is better than yolov3-tiny_3l or yolov3-tiny-lstm for very small objects. I would be really grateful if you could assist me on this as well. Many thanks for all the help.
@NickiBD Hi,
I tried to apply the change, but it is still unstable in my case and drops to 0 after 7000 iterations. However, yolov3-tiny PAN-LSTM worked fine with almost the same settings as the original cfg file and was stable. Could you please give me advice on what might be the reason?
Do you mean that yolo_v3_tiny_pan_lstm.cfg.txt works fine, but yolo_v3_tiny_lstm.cfg.txt drops after 7000 iterations?
What are the max, min and average sizes of your objects? Calculate anchors and show me.
What is the average sequence length (how many frames in one sequence) in your dataset?
As I need a very fast and accurate model for very-small-object detection that works on a smartphone, I don't know whether yolov3-tiny PAN-LSTM is better than yolov3-tiny_3l or yolov3-tiny-lstm for very small objects. I would be really grateful if you could assist me on this as well. Many thanks for all the help.
Theoretically, the best models for small objects should be these - use them with the latest version of this repository:
- yolo_v3_tiny_pan_mixup.cfg.txt
- yolo_v3_tiny_pan_lstm.cfg.txt
@AlexeyAB Hi, thanks a lot for the reply and all your advice.
Yes, yolo_v3_tiny_pan_lstm works fine and is stable, but the accuracy of yolo_v3_tiny_lstm.cfg drops to 0 after 7000 iterations.
These are the calculated anchors : 5, 11, 7, 29, 13, 20, 12, 52, 23, 59, 49, 71
The number of frames varies, as some videos are short and some are long; there are 75-100 frames for each video in the dataset.
Many thanks again for all the help.
@NickiBD
So use yolo_v3_tiny_pan_lstm.cfg.txt instead of yolo_v3_tiny_lstm.cfg.txt, since yolo_v3_tiny_pan_lstm.cfg.txt is better in any case, especially for small objects.
Use the default anchors.
Could you please give me advice on what might be the reason?
yolo_v3_tiny_lstm.cfg.txt uses longer sequences (time_steps=16 x augment_speed=3 = 48) than yolo_v3_tiny_pan_lstm.cfg.txt (time_steps=3 x augment_speed=3 = 9), so if you train yolo_v3_tiny_lstm.cfg.txt on short video-sequences, it can lead to unstable training.
Also, yolo_v3_tiny_lstm.cfg.txt isn't good for small objects. Since your dataset contains small objects, this can also lead to unstable training.
@AlexeyAB Thank you so much for all the advice .
@AlexeyAB I see that you added TridentNet already! Do you have any results / training graphs? Maybe on the same dataset and network size as your table above so that we can compare?
Implement a Yolo-LSTM detection network that will be trained on video frames to increase mAP and solve blinking issues.
Think about whether we can use a Transformer (Vaswani et al., 2017) / GPT-2 / BERT for frame-sequences instead of word-sequences: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf and https://arxiv.org/pdf/1706.03762.pdf
Or can we use Transformer-XL https://arxiv.org/abs/1901.02860v2 or Universal Transformers https://arxiv.org/abs/1807.03819v3 for long sequences?