AlexeyAB opened this issue 4 years ago
Cool!
Actually, there is a follow-up work based on this paper, which I mentioned at the beginning of https://github.com/AlexeyAB/darknet/issues/3114
Looking Fast and Slow: Memory-Guided Mobile Video Object Detection (Mason Liu et al.): https://arxiv.org/pdf/1903.10172.pdf
It directly replaces the Bottleneck LSTM with a ConvLSTM, which works better than the original Bottleneck LSTM.
Moreover, to further improve the LSTM, it makes three modifications to the original Bottleneck LSTM.
What do you think about DETR (transformer for object detection)?
- paper: https://arxiv.org/abs/2005.12872
- code: https://github.com/facebookresearch/detr
- news: https://venturebeat.com/2020/05/28/facebook-ai-research-applies-transformer-architecture-to-streamline-object-detection-models/
But I mean for object detection on video (predicting objects in the next frame) rather than for object detection on MSCOCO, since Transformers use attention functions instead of a recurrent neural network to predict what comes next in a sequence.
From the news article: "Transformers use attention functions instead of a recurrent neural network to predict what comes next in a sequence. When applied to object detection, a Transformer is able to cut out steps in building a model, such as the need to create spatial anchors and customized layers. DETR achieves results comparable to Faster R-CNN, an object detection model created primarily by Microsoft Research that has earned nearly 10,000 citations since it was introduced in 2015, according to arXiv."
It isn't SOTA, though.
Yes, I think in general all RNN/LSTM can be replaced by Transformers, and Transformers should always outperform RNN/LSTM.
My consideration is that the input of this seq2seq network is a sequence, which is designed for NLP, but in the video scenario the input is only one frame. (I am not 100% sure what the input of the LSTM is in darknet's implementation.)
Ideally a sequence of frames would be perfect, but that cannot happen in the real-time scenario (you can't have the next frame until it actually arrives). Since the input is simpler than an NLP sequence, I am not sure how much benefit the model can get from the multi-head attention (Transformers). It seems a bit of an overkill.
Also, it could cost more GPU memory, so it cannot be trained on 1 GPU.
Also, according to the original paper (DETR), it seems not very good at small-object detection. I think they probably tried FPN to solve this but still failed; probably not enough GPU memory, or something else?
I think we should use Transformer+Memory to predict intermediate features for the current frame of video, based on several memorized previous frames. Then we should mix (concatenate + conv) the features predicted by the Transformer for the current frame with the features actually extracted from the current frame, as sketched below.
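A hypothetical sketch of this idea (all names and shapes are my own illustration, not darknet code): treat each spatial location's feature vector across the memorized frames as a short sequence for a small Transformer, let it predict the current frame's features per location, then fuse the prediction with the actually extracted features by concatenation + 1x1 conv. Positional encoding is omitted for brevity.

import torch
import torch.nn as nn

class PredictAndFuse(nn.Module):
    def __init__(self, channels=256, n_heads=8, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=n_heads,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)
        # 1x1 conv mixes [predicted, extracted] features back to `channels`
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, memory, current):
        # memory:  (B, T, C, H, W) features of T memorized previous frames
        # current: (B, C, H, W)    features extracted from the current frame
        b, t, c, h, w = memory.shape
        tokens = memory.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        pred = self.temporal(tokens)[:, -1]    # predicted current features
        pred = pred.reshape(b, h, w, c).permute(0, 3, 1, 2)
        return self.fuse(torch.cat([pred, current], dim=1))

# e.g. on a 13x13 feature map with 4 memorized frames:
# out = PredictAndFuse(256)(torch.randn(1, 4, 256, 13, 13),
#                           torch.randn(1, 256, 13, 13))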
Also I don't understand what activation they actually use between C and H: is it tanh, ReLU, nothing...? And what do they call b? Is it the blue conv?
The default conv-LSTM uses tanh, as described on page 4 of their paper: https://arxiv.org/pdf/1711.06368v2.pdf
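(For reference, in a standard LSTM the hidden state is obtained from the cell state through tanh, h_t = o_t ∘ tanh(c_t), where o_t is the output gate; that tanh is the activation between C and H in question.)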
They use ReLU6, as you can see in the paper and the source code:
https://github.com/tensorflow/models/blob/master/research/lstm_object_detection/lstm/lstm_cells.py
Thanks, yes, they use ReLU for c too, although they did not draw it in the picture.
No, I think they use tanh for c:
https://github.com/tensorflow/models/blob/4ce55184dee479ded4d72d70e6a7d5b378edd703/research/lstm_object_detection/lstm/lstm_cells.py#L279
activation=tf.tanh is the default value for BottleneckConvLSTMCell, the same as for a regular LSTM: https://github.com/tensorflow/models/blob/4ce55184dee479ded4d72d70e6a7d5b378edd703/research/lstm_object_detection/lstm/lstm_cells.py#L44
But there the parameter is explicitly overridden with activation=tf.nn.relu6 when lstm_cells.BottleneckConvLSTMCell() is called:
https://github.com/tensorflow/models/blob/4ce55184dee479ded4d72d70e6a7d5b378edd703/research/lstm_object_detection/models/lstm_ssd_mobilenet_v1_feature_extractor.py#L106
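To make the role of that activation parameter concrete, here is a rough, hypothetical sketch of one Bottleneck-LSTM step (my own simplification, not the TF code; 1x1 channel mixing via matmul stands in for the real 3x3 separable convolutions):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu6(x):
    # the activation the TF feature extractor passes into the cell
    return np.clip(x, 0.0, 6.0)

def bottleneck_lstm_step(x, h, c, w_b, w_gates, activation=relu6):
    # x: (H, W, C_in); h, c: (H, W, C_state)
    bott = activation(np.concatenate([x, h], axis=-1) @ w_b)  # bottleneck
    i, f, o, g = np.split(bott @ w_gates, 4, axis=-1)         # the 4 gates
    c_new = sigmoid(f) * c + sigmoid(i) * activation(g)       # cell update
    h_new = sigmoid(o) * activation(c_new)  # `activation` replaces tanh here
    return h_new, c_new

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))
h = np.zeros((8, 8, 8)); c = np.zeros((8, 8, 8))
w_b = 0.1 * rng.standard_normal((24, 8))      # 16 + 8 -> 8 bottleneck channels
w_gates = 0.1 * rng.standard_normal((8, 32))  # 8 -> 4 gates x 8 channels
h, c = bottleneck_lstm_step(x, h, c, w_b, w_gates)             # ReLU6 cell
h2, c2 = bottleneck_lstm_step(x, h, c, w_b, w_gates, np.tanh)  # tanh cell

Swapping activation between np.tanh and relu6 is exactly the difference between the class default (tf.tanh) and the overridden call linked above.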
I implemented BottleneckConvLSTM:
[conv_lstm]
batch_normalize=1
size=3               # 3x3 convolutions
pad=1
output=64            # output channels
groups=2             # grouped convolutions
peephole=0           # no peephole connections
bottleneck=1         # Bottleneck-LSTM variant
#shortcut=1
time_normalizer=1.0  # coefficient for time-backpropagation deltas (see below)
lstm_activation=tanh
activation=leaky
Just note: shortcut=1 uses a partial n/2-channel residual connection (per-element addition) instead of concatenation for the skip connection, and lstm_activation=tanh is used instead of ReLU.
And I think on GPU you should use groups=1 or 2, not higher than 2.
yolo_v3_tiny_lstm_bottleneck_shortcut.cfg.txt
@i-chaochen In my experiments shortcut=1, which uses the partial n/2-channel residual connection (per-element addition), degrades accuracy very much. So don't use it until I change it to concatenation.
Also, I don't know whether time_normalizer=0.5 is the optimal value.
Thanks for your update and sharing! May I ask what time_normalizer in the conv_lstm layer is used for?
time_normalizer is a coefficient for the deltas of time-backpropagation in the LSTM: a higher time_normalizer makes it learn time dependencies more, and a lower one makes it learn spatial dependencies more.

When I train yolov4-tiny with Conv Bottleneck-LSTM, it always shows CUDA out of memory on a Tesla V100, no matter what batch size I set:

[yolo] params: iou loss: ciou (4), iou_norm: 0.07, cls_norm: 1.00, scale_x_y: 1.05
nms_kind: greedynms (1), beta = 0.600000
Total BFLOPS 10.741
avg_outputs = 226932
Allocate additional workspace_size = 26.22 MB
yolov4-tiny-lstm
2 : compute_capability = 700, cudnn_half = 1, GPU: Tesla V100-SXM2-16GB
net.optimized_memory = 0
mini_batch = 1024, batch = 1024, time_steps = 16, train = 1
layer   filters  size/strd(dil)   input   output
Try to set subdivisions=64 in your cfg-file.
CUDA status Error: file: /home/darknet0808/src/dark_cuda.c : () : line: 373 : build time: Aug 8 2020 - 14:55:17
CUDA Error: out of memory
CUDA Error: out of memory: File exists
This is my cfg: yolov4-lstm.cfg.txt
Please help me, thank you!
mini_batch = time_steps * batch / subdivisions
So set time_steps = 4 or 3.
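For example (illustrative numbers only): if the cfg has batch=64 and subdivisions=1, then time_steps=16 gives mini_batch = 16 * 64 / 1 = 1024, which matches the log above, while time_steps=4 gives mini_batch = 256 and time_steps=3 gives 192.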
Thanks, it works! Could you tell me the meaning of track=1 and time_steps=16?
time_steps=4 is the number of sequential frames taken from the video, and track=1 makes it use sequential frames instead of random frames.
Read: https://github.com/AlexeyAB/darknet/wiki/CFG-Parameters-in-the-%5Bnet%5D-section
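So a minimal sequential-training setup in the [net] section might look like this (illustrative values; see the wiki page above for details):
[net]
batch=64
subdivisions=16
time_steps=4
track=1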
Thank you very much, this is great!
@AlexeyAB I trained https://github.com/AlexeyAB/darknet/files/4746552/yolo_v3_tiny_lstm_bottleneck_shortcut.cfg.txt with my own dataset, but it doesn't seem to work well.
The loss drops as normal (I can't upload the loss chart successfully), but the results on the validation dataset are very bad.
Some information follows:

3768: 0.925613, 0.354083 avg loss, 0.000913 rate, 3.341086 seconds, 482304 images, 5.769756 hours left
sequential_subdivisions = 8, sequence = 1
Loaded: 0.000037 seconds
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 33 Avg (IOU: 0.734949, GIOU: 0.724360), Class: 0.998445, Obj: 0.961654, No Obj: 0.000375, .5R: 1.000000, .75R: 0.437500, count: 48, class_loss = 0.061803, iou_loss = 0.556009, total_loss = 0.617812
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 37 Avg (IOU: 0.000000, GIOU: 0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.000702, .5R: 0.000000, .75R: 0.000000, count: 1, class_loss = 0.786307, iou_loss = 0.000000, total_loss = 0.786307
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 41 Avg (IOU: 0.000000, GIOU: 0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.000301, .5R: 0.000000, .75R: 0.000000, count: 1, class_loss = 0.000442, iou_loss = 0.000000, total_loss = 0.000442
total_bbox = 968006, rewritten_bbox = 0.000000 %
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 33 Avg (IOU: 0.000000, GIOU: 0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.000169, .5R: 0.000000, .75R: 0.000000, count: 1, class_loss = 0.000073, iou_loss = 0.000000, total_loss = 0.000073
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 37 Avg (IOU: 0.859079, GIOU: 0.857110), Class: 0.998864, Obj: 0.994741, No Obj: 0.000915, .5R: 1.000000, .75R: 1.000000, count: 32, class_loss = 0.009228, iou_loss = 0.052743, total_loss = 0.061971
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 41 Avg (IOU: 0.000000, GIOU: 0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.000301, .5R: 0.000000, .75R: 0.000000, count: 1, class_loss = 0.000442, iou_loss = 0.000000, total_loss = 0.000442
total_bbox = 968038, rewritten_bbox = 0.000000 %

Is this normal? How can it be improved?
paper: https://arxiv.org/abs/1711.06368
+5 AP for the small model and +2.8 AP for the big model.
Implement conv Bottleneck-LSTM, which gives +3-5 AP at a very cheap cost of ~+1% BFLOPS.
A non-bottleneck conv-LSTM looks like:
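(The referenced figure is not reproduced here. For reference, a standard conv-LSTM cell without peepholes, cf. peephole=0 above, computes:

i_t = sigmoid(W_xi * x_t + W_hi * h_{t-1} + b_i)
f_t = sigmoid(W_xf * x_t + W_hf * h_{t-1} + b_f)
o_t = sigmoid(W_xo * x_t + W_ho * h_{t-1} + b_o)
g_t = tanh(W_xg * x_t + W_hg * h_{t-1} + b_g)
c_t = f_t ∘ c_{t-1} + i_t ∘ g_t
h_t = o_t ∘ tanh(c_t)

where * is convolution over the feature maps and ∘ is the per-element (Hadamard) product; the bottleneck variant first compresses [x_t, h_{t-1}] to fewer channels before computing the gates.)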