AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

Conv Bottleneck-LSTM gives +3-5 AP and very cheap ~+1% BFLOPS #5774

Open AlexeyAB opened 4 years ago

AlexeyAB commented 4 years ago


Non-bottleneck conv-LSTM looks like: [figure: detailed architecture of the peephole LSTM]


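For reference, a standard peephole ConvLSTM cell (the kind of architecture the figure above shows) computes the following, where $*$ denotes convolution and $\circ$ element-wise multiplication. This is the textbook formulation, not necessarily exactly what darknet's [conv_lstm] layer implements:

$$
\begin{aligned}
i_t &= \sigma(W_{xi} * x_t + W_{hi} * h_{t-1} + w_{ci} \circ c_{t-1} + b_i)\\
f_t &= \sigma(W_{xf} * x_t + W_{hf} * h_{t-1} + w_{cf} \circ c_{t-1} + b_f)\\
c_t &= f_t \circ c_{t-1} + i_t \circ \tanh(W_{xc} * x_t + W_{hc} * h_{t-1} + b_c)\\
o_t &= \sigma(W_{xo} * x_t + W_{ho} * h_{t-1} + w_{co} \circ c_t + b_o)\\
h_t &= o_t \circ \tanh(c_t)
\end{aligned}
$$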

i-chaochen commented 4 years ago

Cool!

Actually, there is a follow-up work based on this paper that I mentioned at the beginning of https://github.com/AlexeyAB/darknet/issues/3114:

Looking Fast and Slow: Memory-Guided Mobile Video Object Detection (Mason Liu et al.): https://arxiv.org/pdf/1903.10172.pdf

This one directly uses a ConvLSTM in place of the Bottleneck-LSTM, which works better than the original Bottleneck-LSTM.

Moreover, to further improve the LSTM, it makes three modifications to the original Bottleneck-LSTM (see the rough sketch after the list):

  1. a skip connection;
  2. divide the LSTM state into groups and use grouped convolutions to process each group separately;
  3. concatenate the groups channel-wise to get c_t, h_t and M_t.
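
A rough PyTorch-style sketch of how those three modifications could fit together. The layer sizes, the ReLU on the bottleneck, and the exact placement of the skip connection are illustrative assumptions, not the paper's reference code:

import torch
import torch.nn as nn

class GroupedBottleneckLSTMCell(nn.Module):
    # Sketch: bottleneck conv, per-group gate convs, channel-wise concatenation,
    # and a concatenation-style skip connection from the bottleneck features.
    def __init__(self, in_ch, hidden_ch, groups=4):
        super().__init__()
        assert hidden_ch % groups == 0
        self.groups = groups
        g_ch = hidden_ch // groups
        # bottleneck conv fuses the input features with the previous hidden state
        self.bottleneck = nn.Conv2d(in_ch + hidden_ch, hidden_ch, 3, padding=1)
        # one small conv per group, producing the 4 gates for that group only
        self.gate_convs = nn.ModuleList(
            [nn.Conv2d(g_ch, 4 * g_ch, 3, padding=1) for _ in range(groups)]
        )

    def forward(self, x, h, c):
        b = torch.relu(self.bottleneck(torch.cat([x, h], dim=1)))  # bottleneck features
        b_groups = torch.chunk(b, self.groups, dim=1)
        c_groups = torch.chunk(c, self.groups, dim=1)
        h_out, c_out = [], []
        for k in range(self.groups):                    # 2. process each group separately
            i, f, o, g = torch.chunk(self.gate_convs[k](b_groups[k]), 4, dim=1)
            c_k = torch.sigmoid(f) * c_groups[k] + torch.sigmoid(i) * torch.tanh(g)
            h_out.append(torch.sigmoid(o) * torch.tanh(c_k))
            c_out.append(c_k)
        c_t = torch.cat(c_out, dim=1)                   # 3. channel-wise concatenation
        h_t = torch.cat(h_out, dim=1)
        m_t = torch.cat([h_t, b], dim=1)                # 1. skip connection (concatenation)
        return m_t, h_t, c_t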


AlexeyAB commented 4 years ago

What do you think about DETR (a Transformer for object detection)?

But for object detection on video (to predict objects in the next frame) rather than for object detection on MSCOCO, since Transformers use attention functions instead of a recurrent neural network to predict what comes next in a sequence.

When applied to object detection, a Transformer can cut out steps in building a model, such as the need to create spatial anchors and customized layers.

DETR achieves results comparable to Faster R-CNN, an object detection model created primarily by Microsoft Research that has earned nearly 10,000 citations since it was introduced in 2015, according to arXiv.


It isn't SOTA:

[comparison screenshot]

i-chaochen commented 4 years ago


Yes, I think that in general all RNN/LSTM can be replaced by Transformers, and Transformers should always outperform RNN/LSTM.

My consideration is that the input of this seq2seq network is a sequence, which is designed for NLP, but in the video scenario the input is only one frame. (I am not 100% sure what the input of the LSTM is in darknet's implementation.)

Ideally, a sequence of frames would be perfect, but that cannot happen in the real-time scenario (you can't have the next frame until it actually arrives). Since the input is simpler than an NLP sequence, I am not sure how much benefit the model can get from multi-head attention (Transformers). It seems a bit of overkill.

Also, it could cost more GPU memory, and it cannot be trained on 1 GPU.

i-chaochen commented 4 years ago

Also, according to the original DETR paper, it does not seem very good at small-object detection. I think they probably tried FPN to solve this but still failed, perhaps because of not enough GPU memory or something else?

AlexeyAB commented 4 years ago

I think we should use a Transformer+Memory to predict intermediate features for the current frame of video, based on several memorized previous frames. Then we should mix (concatenate + conv) the features predicted by the Transformer for the current frame with the features extracted from the current frame.
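
A minimal sketch of that mixing step (the class name and shapes are illustrative assumptions): concatenate the Transformer-predicted features with the current-frame features along the channel dimension, then fuse them with a 1x1 convolution.

import torch
import torch.nn as nn

class FeatureMixer(nn.Module):
    # Fuse predicted features with current-frame features: concat + 1x1 conv.
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, predicted, current):
        return self.fuse(torch.cat([predicted, current], dim=1))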


Also I don't understand what activation they actually use between C and H: is it TANH, ReLU, nothing...? And what do they call b, is it the blue conv?



The default Conv-LSTM uses TANH, as described on page 4 of their paper: https://arxiv.org/pdf/1711.06368v2.pdf

i-chaochen commented 4 years ago


They use ReLU6, as you can see in the paper and the source code:

https://github.com/tensorflow/models/blob/master/research/lstm_object_detection/lstm/lstm_cells.py

https://github.com/tensorflow/models/blob/4ce55184dee479ded4d72d70e6a7d5b378edd703/research/lstm_object_detection/models/lstm_ssd_mobilenet_v1_feature_extractor.py#L106

AlexeyAB commented 4 years ago

Thanks, yes, they use ReLU for c too, although they did not draw it in the picture.

https://github.com/tensorflow/models/blob/4ce55184dee479ded4d72d70e6a7d5b378edd703/research/lstm_object_detection/lstm/lstm_cells.py#L559

i-chaochen commented 4 years ago


No, I think they use tanh for c. https://github.com/tensorflow/models/blob/4ce55184dee479ded4d72d70e6a7d5b378edd703/research/lstm_object_detection/lstm/lstm_cells.py#L279

AlexeyAB commented 4 years ago

activation=tf.tanh is the default value for BottleneckConvLSTMCell, as for a regular LSTM: https://github.com/tensorflow/models/blob/4ce55184dee479ded4d72d70e6a7d5b378edd703/research/lstm_object_detection/lstm/lstm_cells.py#L44

But there the parameter is explicitly overridden to activation=tf.nn.relu6 when lstm_cells.BottleneckConvLSTMCell() is called: https://github.com/tensorflow/models/blob/4ce55184dee479ded4d72d70e6a7d5b378edd703/research/lstm_object_detection/models/lstm_ssd_mobilenet_v1_feature_extractor.py#L106
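
A toy Python illustration of that default-vs-override distinction (this is not the linked TensorFlow code, just the idea): the activation applied to the cell state before producing h defaults to tanh, and the LSTM-SSD feature extractor passes relu6 explicitly.

import torch
import torch.nn.functional as F

def lstm_output(c, o_gate, activation=torch.tanh):
    # h = sigmoid(o) * activation(c); tanh is the default, like in BottleneckConvLSTMCell
    return torch.sigmoid(o_gate) * activation(c)

c = torch.randn(1, 64, 13, 13)
o = torch.randn(1, 64, 13, 13)
h_default = lstm_output(c, o)                     # uses the tanh default
h_relu6 = lstm_output(c, o, activation=F.relu6)   # explicit override, as in the feature extractor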

AlexeyAB commented 4 years ago

I implemented BottleneckConvLSTM

[conv_lstm]
batch_normalize=1
size=3
pad=1
output=64
groups=2
peephole=0
bottleneck=1
#shortcut=1
time_normalizer=1.0
lstm_activation=tanh
activation=leaky

Note that shortcut=1 uses a partial residual connection over n/2 channels (per-element addition) instead of concatenation for the skip connection.

lstm_activation=tanh instead of ReLU.

And I think groups=1 or 2 should be used on GPU, not higher than 2.
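
To make the shortcut=1 behavior concrete, here is a toy sketch of the two skip-connection variants (the shapes are illustrative; this is not darknet's actual code):

import torch

x = torch.randn(1, 64, 13, 13)      # layer input
out = torch.randn(1, 64, 13, 13)    # conv_lstm output

# shortcut=1: partial residual over the first n/2 channels (per-element addition)
n = out.shape[1]
partial = out.clone()
partial[:, : n // 2] += x[:, : n // 2]

# concatenation-style skip connection (doubles the channel count)
concatenated = torch.cat([out, x], dim=1)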

yolo_v3_tiny_lstm_bottleneck_shortcut.cfg.txt



AlexeyAB commented 4 years ago

@i-chaochen In my experiments, shortcut=1, which uses a partial n/2-channel residual connection (per-element addition), degrades accuracy very much.

So don't use it until I change it to concatenation.


Also, I don't know whether time_normalizer=0.5 is the optimal value.

i-chaochen commented 4 years ago


Thanks for your update and sharing!

May I ask what time_normalizer in [conv_lstm] is used for?

AlexeyAB commented 4 years ago

time_normalizer is a coefficient applied to the deltas during backpropagation through time in the LSTM.
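
Roughly, you can think of it as scaling the recurrent part of the delta at every step of backpropagation through time. This toy snippet only illustrates that idea and is not darknet's actual code:

def bptt_deltas(local_deltas, time_normalizer=1.0):
    # local_deltas[t] is the gradient contribution at time step t
    carried = 0.0
    out = []
    for d in reversed(local_deltas):
        total = d + time_normalizer * carried  # the carried (recurrent) delta is scaled
        out.append(total)
        carried = total
    return list(reversed(out))

print(bptt_deltas([1.0, 1.0, 1.0], time_normalizer=0.5))  # [1.75, 1.5, 1.0]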

qingchunlizhi commented 3 years ago

When I train yolov4-tiny with the Conv Bottleneck-LSTM, it always shows CUDA out of memory on a Tesla V100 no matter what batch size I set:

[yolo] params: iou loss: ciou (4), iou_norm: 0.07, cls_norm: 1.00, scale_x_y: 1.05
nms_kind: greedynms (1), beta = 0.600000
Total BFLOPS 10.741
avg_outputs = 226932
Allocate additional workspace_size = 26.22 MB
yolov4-tiny-lstm 2 : compute_capability = 700, cudnn_half = 1, GPU: Tesla V100-SXM2-16GB
net.optimized_memory = 0
mini_batch = 1024, batch = 1024, time_steps = 16, train = 1
layer filters size/strd(dil) input output
0 Try to set subdivisions=64 in your cfg-file.
CUDA status Error: file: /home/darknet0808/src/dark_cuda.c : () : line: 373 : build time: Aug 8 2020 - 14:55:17

CUDA Error: out of memory CUDA Error: out of memory: File exists

this is my cfg: yolov4-lstm.cfg.txt

Please help me, thank you!

AlexeyAB commented 3 years ago

mini_batch = time_steps * batch / subdivisions

So set time_steps = 4 or 3
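
For example, with illustrative cfg values batch=64 and subdivisions=16: time_steps=16 gives mini_batch = 16 * 64 / 16 = 64 images resident on the GPU at once, while time_steps=4 gives mini_batch = 4 * 64 / 16 = 16, i.e. roughly 4x less activation memory.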

qingchunlizhi commented 3 years ago

Thanks, it works! Could you tell me the meaning of "track=1" and "time_steps=16"?

AlexeyAB commented 3 years ago

time_steps=4 - the number of sequential frames taken from a video.
track=1 - use sequential frames instead of random frames.
Read: https://github.com/AlexeyAB/darknet/wiki/CFG-Parameters-in-the-%5Bnet%5D-section

qingchunlizhi commented 3 years ago

Thank you very much, this is great!

HaolyShiit commented 3 years ago

@AlexeyAB I trained https://github.com/AlexeyAB/darknet/files/4746552/yolo_v3_tiny_lstm_bottleneck_shortcut.cfg.txt on my own dataset, but it doesn't seem to work well.

The loss drops as normal (I can't upload the loss chart successfully). The results on the validation dataset are very bad.

Some information is as follows:

3768: 0.925613, 0.354083 avg loss, 0.000913 rate, 3.341086 seconds, 482304 images, 5.769756 hours left
sequential_subdivisions = 8, sequence = 1
Loaded: 0.000037 seconds
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 33 Avg (IOU: 0.734949, GIOU: 0.724360), Class: 0.998445, Obj: 0.961654, No Obj: 0.000375, .5R: 1.000000, .75R: 0.437500, count: 48, class_loss = 0.061803, iou_loss = 0.556009, total_loss = 0.617812
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 37 Avg (IOU: 0.000000, GIOU: 0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.000702, .5R: 0.000000, .75R: 0.000000, count: 1, class_loss = 0.786307, iou_loss = 0.000000, total_loss = 0.786307
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 41 Avg (IOU: 0.000000, GIOU: 0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.000301, .5R: 0.000000, .75R: 0.000000, count: 1, class_loss = 0.000442, iou_loss = 0.000000, total_loss = 0.000442
total_bbox = 968006, rewritten_bbox = 0.000000 %
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 33 Avg (IOU: 0.000000, GIOU: 0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.000169, .5R: 0.000000, .75R: 0.000000, count: 1, class_loss = 0.000073, iou_loss = 0.000000, total_loss = 0.000073
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 37 Avg (IOU: 0.859079, GIOU: 0.857110), Class: 0.998864, Obj: 0.994741, No Obj: 0.000915, .5R: 1.000000, .75R: 1.000000, count: 32, class_loss = 0.009228, iou_loss = 0.052743, total_loss = 0.061971
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 41 Avg (IOU: 0.000000, GIOU: 0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.000301, .5R: 0.000000, .75R: 0.000000, count: 1, class_loss = 0.000442, iou_loss = 0.000000, total_loss = 0.000442
total_bbox = 968038, rewritten_bbox = 0.000000 %

Is this normal? How can I improve it?