carrierlxk / COSNet

See More, Know More: Unsupervised Video Object Segmentation with Co-Attention Siamese Networks (CVPR19)

Question about unsupervision #12

Closed ellick53 closed 4 years ago

ellick53 commented 4 years ago

Hi, and thanks a lot for sharing your work. I am trying to train the network, and I have two questions: 1) You use the datasets DUTS and MSRA10K to train the 'object detection' part, correct?

2) And you do not need the annotations of the DAVIS dataset, right? I am confused about this code:

            if i_iter % 3 == 0:
                # Siamese branch fed the same image twice
                pred1, pred2, pred3 = model(images, images)
                loss = 0.1 * (loss_calc1(pred3, labels) + 0.8 * loss_calc2(pred3, labels))
                loss.backward()
            else:
                # Siamese branch fed a pair of frames
                pred1, pred2, pred3 = model(target, search)
                loss = loss_calc1(pred1, target_gt) + 0.8 * loss_calc2(pred1, target_gt) \
                     + loss_calc1(pred2, search_gt) + 0.8 * loss_calc2(pred2, search_gt)
                loss.backward()

If I understand correctly, the first part uses the DUTS + MSRA10K images and the second part the DAVIS images? But what are `target_gt` and `search_gt`? I thought the training was unsupervised.

Thanks for your help!

carrierlxk commented 4 years ago

Hi, the term 'unsupervised' indicates that there is no human interaction or annotation during the test phase. During the training phase, our model still needs labeled samples (the annotations of the DAVIS dataset) to train the whole network. So the training itself is fully supervised.
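To make the alternating schedule in the snippet above explicit, here is a minimal sketch of the branch selection logic only (the helper name `training_branch` is hypothetical, not from the repository): every third iteration trains on static saliency data (DUTS + MSRA10K), and the remaining iterations train on DAVIS frame pairs with their annotation masks.

```python
def training_branch(i_iter):
    """Return which data source iteration i_iter trains on.

    Mirrors the `i_iter % 3 == 0` check in the training loop:
    one static-image iteration for every two video-pair iterations.
    """
    return "static" if i_iter % 3 == 0 else "davis"

# Example: the first six iterations alternate as
# static, davis, davis, static, davis, davis
schedule = [training_branch(i) for i in range(6)]
```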

ellick53 commented 4 years ago

I see, sorry for the misunderstanding. How well does the learning transfer to unseen classes? For instance, would it correctly identify moving objects such as, say, a turtle, a kite, or a football?

carrierlxk commented 4 years ago

Hi, the core issue in unsupervised video object segmentation is identifying the primary object to be segmented. Traditional methods are mainly based on motion information; many deep learning methods now exploit saliency cues or local sequential information (ConvLSTM, optical flow) to discriminate the primary object from the background. Our COSNet instead takes advantage of global temporal correlation information to identify the primary object.
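For readers unfamiliar with the mechanism, here is a minimal NumPy sketch of a co-attention step between two frames' features, assuming flattened feature maps of shape `(C, N)` (with `N = H * W`) and a learnable weight matrix `W` of shape `(C, C)`. This is only an illustration of the affinity-plus-softmax pattern, not the repository's actual implementation.

```python
import numpy as np

def co_attention(feat_a, feat_b, W):
    """Compute mutually attended features for two frames.

    feat_a, feat_b: (C, N) flattened per-frame features.
    W: (C, C) weight matrix for the bilinear affinity.
    Returns attended features of shape (C, N) for each frame.
    """
    # Pairwise affinity between every location of frame a and frame b.
    S = feat_a.T @ W @ feat_b            # (N, N)
    # Softmax normalisations (with max-subtraction for stability).
    E = np.exp(S - S.max())
    A_col = E / E.sum(axis=0, keepdims=True)   # each column sums to 1
    A_row = E / E.sum(axis=1, keepdims=True)   # each row sums to 1
    # Summarise the other frame's features at every spatial location.
    att_a = feat_b @ A_col.T             # (C, N): frame-b context for frame a
    att_b = feat_a @ A_row               # (C, N): frame-a context for frame b
    return att_a, att_b
```

The attended features are then typically concatenated or gated with the original per-frame features before decoding, so each frame's segmentation is informed by correlations with the other frame.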