AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet )
http://pjreddie.com/darknet/

Question: simultaneously train yolo for object detection and tracking using auxiliary layers and cosine similarity #6004

pfeatherstone opened 4 years ago

pfeatherstone commented 4 years ago

Can you fork some of the intermediate outputs of the backbone network into a few auxiliary layers which can be trained, using a cosine similarity loss function (or something similar), to output features suitable for tracking objects? Maybe this could be a two-stage training process, or it could be done end-to-end.

I've been working on object tracking using detections. Kalman filter solutions don't always work, and it's the same story with optical-flow methods. The best I've seen so far is to use cross-correlation techniques (like the correlation tracker in the dlib library), but they are quite slow and scale with the number of detections. Methods like deepsort use a separate network to extract features which can be used to compare objects. But the YOLO network is already calculating features. Surely you only need to fork off some of those features, maybe pass them through a few additional conv layers, into a cosine similarity loss function, and voilà.
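For what it's worth, the post-processing half of that idea is roughly this (a minimal sketch, assuming per-detection feature vectors have already been extracted from the network; SciPy's linear_sum_assignment provides the Hungarian algorithm):

```python
# Minimal sketch of the matching idea above (not darknet code): associate
# detections across frames by cosine similarity of their features, with
# the Hungarian algorithm doing the assignment.
import numpy as np
from scipy.optimize import linear_sum_assignment

def cosine_similarity_matrix(prev_feats, curr_feats):
    # prev_feats: (M, D), curr_feats: (N, D) rows of per-object features
    a = prev_feats / np.linalg.norm(prev_feats, axis=1, keepdims=True)
    b = curr_feats / np.linalg.norm(curr_feats, axis=1, keepdims=True)
    return a @ b.T  # (M, N) cosine similarities in [-1, 1]

def match_detections(prev_feats, curr_feats, sim_thresh=0.8):
    sim = cosine_similarity_matrix(prev_feats, curr_feats)
    rows, cols = linear_sum_assignment(-sim)  # maximize total similarity
    # keep only assignments similar enough to be the same object
    return [(r, c) for r, c in zip(rows, cols) if sim[r, c] >= sim_thresh]
```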

Am I crazy or does this sound sensible?

AlexeyAB commented 4 years ago

This is in progress: Conv-LSTM + Contrastive loss (Embeddings for cosine similarity) + Yolo

Am I crazy or does this sound sensible?

So you are not crazy; this is the most obvious way to do Detection & Tracking on Video )


Experimental YOLOv4-tiny-contrastive model that is trained on MSCOCO:

./darknet detector demo cfg/coco.data cfg/yolov4-tiny_contrastive.cfg yolov4-tiny_contrastive_last.weights test.avi -out_filename out_test.avi -ext_output

pfeatherstone commented 4 years ago

Do you need the LSTM bit? Are you trying to do re-identification inside the network? I was thinking of doing re-identification as post-processing, using the feature set of each detected object together with those of the previous frame. Essentially doing deepsort but recycling the features calculated by the detection network, and letting the Hungarian algorithm do the re-identification by maximising similarity.

pfeatherstone commented 4 years ago

I was thinking the features could be added to the list of yolo features. So instead of having features [tx,ty,tw,th,p,c0,...,c79], have [tx,ty,tw,th,p,c0,...,c79,f0,...,fN-1] at each yolo layer. The features aren't taken from the last conv layer the yolo layer saw; they are taken from earlier on.
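Schematically, the proposed per-cell layout would look like this (the embedding size N and the split below are illustrative assumptions, not darknet's actual layout):

```python
# Hypothetical per-cell prediction layout with tracking features appended
# to the usual YOLO outputs: [tx, ty, tw, th, p, c0..c79, f0..fN-1].
import numpy as np

NUM_CLASSES = 80
EMB_SIZE = 128  # N, an assumed embedding size

def split_cell(pred):
    # pred: 1-D array of length 5 + NUM_CLASSES + EMB_SIZE for one anchor/cell
    box = pred[:4]                        # tx, ty, tw, th
    objectness = pred[4]                  # p
    class_scores = pred[5:5 + NUM_CLASSES]
    embedding = pred[5 + NUM_CLASSES:]    # f0..fN-1, used only for re-id
    return box, objectness, class_scores, embedding
```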

pfeatherstone commented 4 years ago

Similarly to normal yolo logic, the cell responsible for the detection is also responsible for having the correct set of features, somehow...

pfeatherstone commented 4 years ago

Actually, that might only make sense for training end-to-end detection + tracking. I was hoping to just fork off some features and use them as input to another 'tracking' network with a very minimal set of layers, which wouldn't require retraining the detection network.

rbgreenway commented 4 years ago

Use the same features to track that were used to detect. This makes so much sense. Very excited that this is being worked on. Alexey, thanks for your excellent work and dedication to this technology.

gameliee commented 4 years ago

I personally think that a detection network tries to generalize the characteristics of a class: all people should map to similar features. On the other hand, tracking requires contrastiveness: one person's features need to be as far as possible from everyone else's. Training a network to serve those conflicting targets might be unrealistic. That's why we haven't seen a network like that.

pfeatherstone commented 4 years ago

Yeah. I think that's why you might need to attach an auxiliary network that uses some of the features output by the early inner layers of the detection network, and that is trained separately using a similarity loss or contrastive loss. Basically you end up with deepsort, but instead of using a completely separate VGG16 network, you're using X% of the detection network and (100-X)% of a new auxiliary network. It feels slightly more efficient and faster to recycle features. I like deepsort because all the re-id is undertaken by the Hungarian algorithm and all you need is a good metric, and in this case, some good tracking features to evaluate the metric on. The only problem with deepsort is that it is super slow. So recycling features from a detection network seems like a good way to solve that. But it would be interesting to see if @AlexeyAB's idea of a fully blown LSTM is better and more accurate.

AlexeyAB commented 4 years ago

Yes, these 2 tasks (Detection and Re-identification) partially contradict each other. Therefore, what is expected is a slight decrease in accuracy on images, but a large increase in accuracy on video. I think we can regulate this through the normalization factor.

But perhaps the task of re-identification will make the network remember more details of objects, which theoretically can improve detection even on images for large networks (networks with high capacity).
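Presumably the knob here is the cls_normalizer weight on the contrastive term (a hedged reading based on later comments in this thread, not darknet's actual loss code):

```python
# Hedged reading of the trade-off knob mentioned above: cls_normalizer
# scales the re-identification (contrastive) term against the detection
# loss; 0.0 disables it entirely, as noted later in this thread.
def total_loss(detection_loss, contrastive_loss, cls_normalizer=1.0):
    return detection_loss + cls_normalizer * contrastive_loss
```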

pfeatherstone commented 4 years ago

You make a good point. Maybe re-id using LSTM will make the detection network pay special attention to object features. If you're right, then maybe you need to write a paper titled "LSTM is all the attention you need" :)

AlexeyAB commented 4 years ago

Conv-LSTM + Contrastive loss (Embeddings for cosine similarity) + Yolo

This is done; we just need to improve and test it.

pfeatherstone commented 4 years ago

Awesome! Have you posted some weights?

AlexeyAB commented 4 years ago

Not yet.

AlexeyAB commented 4 years ago

There is a proof-of-concept cfg-file; you can try to train Contrastive loss (Embeddings for cosine similarity) + Yolo (without Conv-LSTM) on any non-sequence dataset (MSCOCO, BDD, OpenImages, PascalVOC, ...): yolov3-tiny_contrastive.cfg.txt

Train as usual, without pre-trained weights or with https://drive.google.com/file/d/18v36esoXCh-PsOKwyP2GWrpYDptDY8Zf/view?usp=sharing

This model will count your objects when you run it on video: ./darknet detector demo data/sobj.data yolov3-tiny_contrastive.cfg backup/yolov3-tiny_contrastive_last.weights video.avi -out_filename out_video.avi

You can play with the following detection parameters (after training):

[yolo]
# for tracking
track_history_size = 5   # find similarity over the 5 previous frames
sim_thresh = 0.8         # similarity threshold to consider an object the same on two frames
dets_for_show = 2        # number of frames with this object before showing it
dets_for_track = 8       # number of frames with this object before tracking it
track_ciou_norm = 0.3    # how much CIoU is taken into account (0.0 to 1.0)
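To make these concrete, here is a rough sketch of how the parameters could interact (illustrative only, not darknet's actual tracking code):

```python
# Illustrative sketch (not darknet's implementation) of the [yolo] tracking
# parameters above. Each track keeps the embeddings of its last
# track_history_size frames; a detection joins the track it matches best
# above sim_thresh.
from collections import deque
import numpy as np

TRACK_HISTORY_SIZE = 5  # track_history_size
SIM_THRESH = 0.8        # sim_thresh
DETS_FOR_SHOW = 2       # dets_for_show
DETS_FOR_TRACK = 8      # dets_for_track

class Track:
    def __init__(self, track_id, embedding):
        self.id = track_id
        self.history = deque([embedding], maxlen=TRACK_HISTORY_SIZE)
        self.hits = 1

    def similarity(self, embedding):
        # best cosine similarity against the last few frames of this track
        return max(
            np.dot(h, embedding) / (np.linalg.norm(h) * np.linalg.norm(embedding))
            for h in self.history
        )

    def update(self, embedding):
        self.history.append(embedding)
        self.hits += 1

    @property
    def show(self):   # enough detections to display the object
        return self.hits >= DETS_FOR_SHOW

    @property
    def track(self):  # enough detections to consider it a stable track
        return self.hits >= DETS_FOR_TRACK
```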
pfeatherstone commented 4 years ago

Thanks @AlexeyAB

AlexeyAB commented 4 years ago

I re-uploaded the contrastive detection model yolov3-tiny_contrastive.cfg.txt with [contrastive] cls_normalizer=1.0; use this one (previously I uploaded it with [contrastive] cls_normalizer=0.0, so the contrastive loss was disabled).

AlexeyAB commented 4 years ago

Currently a very simple and inaccurate model is used: https://drive.google.com/file/d/1g18BgkIRbZGykHYxKvWgH1_UUTQnONwG/view?usp=sharing

A simple test on a small dataset:

- Contrastive loss enabled ([contrastive] cls_normalizer=1.0): chart det_cl_fwonly_mi_b64
- Contrastive loss disabled ([contrastive] cls_normalizer=0.0): chart det_cl_fwonly_mi_b64_cl-disabled
- Contrastive, flip and jitter enabled ([net] contrastive_jit_flip=1): chart det_cl_fwonly_mi_b64_jitter
pfeatherstone commented 4 years ago

This is good news

MsWik commented 4 years ago

What could be the problem?

Warning: in txt-labels class_id=-2147483648 >= classes=1 in cfg-file. In txt-labels class_id should be [from 0 to 0]

truth.x = 0.000000, truth.y = -nan, truth.w = 0.000000, truth.h = 0.000000, class_id = -2147483648

Wrong label: truth.x = -0.000000, truth.y = 0.000000, truth.w = -0.000000, truth.h = 0.000000
Wrong label: truth.x = -nan, truth.y = -885731875545466011648.000000, truth.w = -0.000000, truth.h = 0.000000
Wrong label: truth.x = 0.000000, truth.y = 0.000000, truth.w = -0.000000, truth.h = 0.000000
v3 (mse loss, Normalizer: (iou: 0.75, cls: 1.00) Region 38 Avg (IOU: 0.000000, GIOU: 0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 3, class_loss = 0.750000, iou_loss = -nan, total_loss = -nan
Contrast accuracy = 0.000000 %
Error: N == 0 || temperature == 0 || vec_len == 0. N=1.000000, temperature=nan, vec_len=0.000000

Standard Labeling, 1 class (0)

0 0.241 0.854 0.149 0.285
0 0.518 0.529 0.126 0.24

With other cfg files there is no problem.

AlexeyAB commented 4 years ago

Look at bad.list and bad_label.list files.

Try to download the latest Darknet version and recompile; I added a minor fix.

What dataset do you use and how many images?

MsWik commented 4 years ago

Thanks. Yes, that helped. I need to find and track cars. I use COCO-17 + my own data (~15,000 photos).

AlexeyAB commented 4 years ago

Yes, it will work. But it will work much better once the yolov4-tiny-contrastive.cfg and full yolov4-contrastive.cfg models are implemented. Currently a very simple and inaccurate model is used: https://drive.google.com/file/d/1g18BgkIRbZGykHYxKvWgH1_UUTQnONwG/view?usp=sharing

pfeatherstone commented 4 years ago

Are you training these models on MOT datasets? I'm not quite sure how the contrastive loss works on COCO if no two objects are related. Doesn't that mean you only ever have negative samples? Do you need tracked objects to get positive samples? This is probably a stupid question.

AlexeyAB commented 4 years ago
pfeatherstone commented 4 years ago

I see, so the key here is augmentation.
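If that's right, two differently augmented views of the same image provide the positive pairs. A minimal sketch of a temperature-scaled contrastive loss in that spirit (illustrative, not the [contrastive] layer's actual code; only the temperature parameter is taken from this thread):

```python
# Sketch of a temperature-scaled contrastive loss over object embeddings.
# Positive pairs: the same object instance under two different augmentations;
# every other object in the batch acts as a negative.
import numpy as np

def contrastive_loss(emb_a, emb_b, temperature=1.0):
    # emb_a, emb_b: (N, D) embeddings of the SAME N objects under two
    # augmentations; row i of emb_a pairs with row i of emb_b.
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature             # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # cross-entropy on the diagonal
```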

pfeatherstone commented 4 years ago

If the contrastive loss depends highly on augmentation, should there be other augmentation transformations like stretching, warping, etc.?

AlexeyAB commented 4 years ago

@pfeatherstone

other augmentation transformations like stretching, warping, etc?

What is warping?

It will use:

[net]
contrastive_jit_flip=1

[yolo]
jitter=0.3

It depends on how much the same object can differ in two frames:


Also it depends on the model: we should use strong data augmentation only for the big yolov4-contrastive model, rather than for yolov4-tiny-contrastive.

MsWik commented 4 years ago

Hello. Tell me, will 1 [contrastive] layer be enough in yolov4 / yolov4-tiny, or are more needed? Do we write the tracking parameters in only one [yolo] layer? Thanks for your work.

AlexeyAB commented 4 years ago

There should be 1 [contrastive] layer and 1-5 [yolo] layers.

It will be tested. For simplicity of implementation and calculation of embeddings, the [contrastive] layer can work with only 1 [yolo] layer so far.

MsWik commented 4 years ago

Hello. Thanks for the answer. I got a good training chart, but no object numbers. I am not quite sure what "embedding_layer" should be pointing to. Attached are the cfg and the training chart.

yolov4-tiny-3l-supertest.cfg.txt result

alexanderfrey commented 4 years ago

Do you have an example of how to extract the embeddings for a given frame?

AlexeyAB commented 4 years ago

Hello. Thanks for the answer. I got a good training chart, but no object numbers. I am not quite sure what "embedding_layer" should be pointing to. Attached are the cfg and the training chart.

yolov4-tiny-3l-supertest.cfg.txt result

A better cfg/weights file for MS COCO will be provided later.

AlexeyAB commented 4 years ago

Do you have an example of how to extract the embeddings for a given frame?

If you use the DLL/SO library:

C API: https://github.com/AlexeyAB/darknet/blob/20760d29715bb34bb2fbd0a05318dafe8150b325/include/darknet.h#L864-L878

Python example: https://github.com/AlexeyAB/darknet/blob/20760d29715bb34bb2fbd0a05318dafe8150b325/darknet.py#L56-L68

Using ./darknet detector demo ... you will see the track_id.
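For example, something along these lines should work with the Python bindings (a hedged sketch: field names follow the linked darknet.h detection struct, i.e. embeddings, embedding_size, sim, track_id, but verify them against your darknet.py version, as the wrapper API has changed over time):

```python
# Hedged sketch: reading embeddings/track_id from the detections array
# returned by get_network_boxes() via the ctypes bindings. Field names
# follow the linked darknet.h struct; check your darknet.py version.
import numpy as np

def extract_embeddings(detections, num):
    # detections: POINTER(DETECTION); num: number of detections
    results = []
    for i in range(num):
        det = detections[i]
        emb = None
        if det.embeddings:  # NULL unless the model has a [contrastive] layer
            emb = np.array([det.embeddings[j] for j in range(det.embedding_size)])
        results.append((det.track_id, det.sim, emb))
    return results
```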

MsWik commented 4 years ago

Labels are displayed, but they are not correct. I tried changing sim_thresh. What could be the reason? Thanks in advance for your reply. yolov4-nn-2207n.cfg.txt

AlexeyAB commented 4 years ago

Your cfg-file is incorrect.

[yolo] embedding_layer = 8 can't point to one of the next layers. It should point to one of the previous layers.

Anafeyka commented 4 years ago

Your cfg-file is incorrect.

[yolo] embedding_layer = 8 can't point to one of the next layers. It should point to one of the previous layers.

If

[contrastive]
classes=1
temperature=1.0
yolo_layer=-8
cls_normalizer=1.0
max_delta=10

then embedding_layer = -4? Did I understand correctly? testTiny4Contr.txt

AlexeyAB commented 4 years ago

[yolo] embedding_layer = -4 should point to the conv-layer which precedes the [contrastive] layer.

Don't change the last 7 layers: https://github.com/AlexeyAB/darknet/files/4887782/yolov3-tiny_contrastive.cfg.txt You can only change the other layers, and you can change parameters in the [yolo] layer to make it yolov4-tiny.

Anafeyka commented 4 years ago

[yolo] embedding_layer = -4 should point to the conv-layer which precedes the [contrastive] layer.

Don't change the last 7 layers: https://github.com/AlexeyAB/darknet/files/4887782/yolov3-tiny_contrastive.cfg.txt You can only change the other layers, and you can change parameters in the [yolo] layer to make it yolov4-tiny.

Like this? Or do I need to sleep... testTiny4Contr2.txt

AlexeyAB commented 4 years ago

@Anafeyka Yes!

Anafeyka commented 4 years ago

@Anafeyka Yes!

Thanks! Now I understand how it works.

rbgreenway commented 4 years ago

Is there an example cfg file for the full YOLOv4 network that includes the embedding? I'd really like to train a network and test the tracking based on the calculated embedding. Thanks for all the great work you do, AlexeyAB. I've learned a lot from this site, and it is greatly appreciated.

MsWik commented 4 years ago

Hey. How can I get the similarity identifiers through the Python script? Can I use OpenCV 4.4 / cuDNN with the [contrastive] layer? Thanks.

sctrueew commented 4 years ago

@AlexeyAB Hi,

How can I use tracking? Could you please give us an example?

Thanks

haviduck commented 4 years ago

Trying to understand @Anafeyka's cfg, and wanting to try it with 2 classes: what would the convolutional filters be at the end? // Follow-up: it's masks * (classes + 1 + 4), so with 3 masks and 2 classes that's 3 * (2 + 1 + 4) = 21.
It's training now; I'm really excited to see the results :)

jylink commented 3 years ago

@MsWik Hi,

Have you successfully trained the full yolov4 + contrastive? Could you please share the cfg? My version always throws random segmentation faults.

Thanks

scianand commented 3 years ago

Hi, please can you provide me the cfg of the yolov3-tiny_contrastive model? I am having segmentation fault errors.

scianand commented 3 years ago

Can someone please tell me the reason behind this segmentation fault error?

mrxuehb commented 3 years ago

I have also encountered a segmentation fault when using yolov3-tiny_contrastive.cfg. I set classes=7 and filters=108. It trains for 200-600 iterations before randomly failing with a segmentation error. Can yolov3-tiny_contrastive.cfg only be used with classes=1 currently? Or is there some other problem?

MsWik commented 3 years ago

Sorry, I haven't watched this topic for a long time. If it's still relevant, here's my cfg: yolov4-nn-22_07.cfg.txt

javier-box commented 3 years ago

Hello @MsWik ... Did you get contrastive to work? Can you share your test result?

I'm also wondering how to add the contrastive layer to tkDNN / cuDNN.