Gitsamshi / WeakVRD-Captioning

Implementation of the paper "Improving Image Captioning with Better Use of Captions"

Positive Bag Negative Bag #19

Open Monikshah opened 3 years ago

Monikshah commented 3 years ago

I am trying to reproduce your model for weakly supervised multi-instance learning, and I am a bit confused about the formation of the positive and negative bags. The paper says that for a predicate r associated with an object region pair, the region pair is labeled as a positive bag if the predicate r is in the caption S. My question: the predicates are extracted from the triplets, and the triplets are extracted from the caption, so the predicate will always be present in the caption.

How, then, should the positive and negative bags be labeled? Can you please help me understand this?

Thank you very much.

Gitsamshi commented 3 years ago

Thanks for asking. Given an image with n regions, there are n(n-1) region pairs. There may be k pairs related to a predicate r extracted from the caption. These k pairs make up a positive bag, and the remaining n(n-1)-k pairs make up a negative bag.

What we care about in training are the positive bags, since the pairs in negative bags are all labeled as negative (the same as binary classification).

You may view MIL as a better method than simply assigning every one of the k pairs a positive label for the binary classification of predicate r.
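
A minimal sketch of that split (hypothetical names, not from the repo): for one predicate, the k caption-matched pairs form the positive bag and the remaining n(n-1)-k pairs form the negative bag.

```python
from itertools import permutations

def split_bags(n_regions, positive_pairs):
    """Split the n(n-1) ordered region pairs of an image into a positive
    and a negative bag for one predicate.

    positive_pairs: set of (i, j) region-index pairs that the
    caption-derived triples link with this predicate (the k pairs).
    """
    all_pairs = set(permutations(range(n_regions), 2))  # n(n-1) ordered pairs
    pos_bag = all_pairs & set(positive_pairs)           # the k pairs
    neg_bag = all_pairs - pos_bag                       # the other n(n-1)-k
    return pos_bag, neg_bag
```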

Monikshah commented 3 years ago

Thank you very much for the response. I am very new to this field of research, and I have been struggling for around a month to implement this model.

So what I understand is that there can be k pairs related to a predicate extracted from the caption. For the caption "a woman in a hat feeds the giraffe from her hand", the triplets extracted are 'woman in hat' and 'woman feeds giraffe'. The predicates are 'in' and 'feed'. So the k pairs are all the pairs of 'woman' and 'hat' for the predicate 'in' and all the pairs of 'woman' and 'giraffe' for the predicate 'feed'. Only these k pairs will make up a positive bag, and the remaining pairs, such as 'woman in giraffe', 'woman feeds hat', etc., will make up the negative bag, right? Also, do we need to assign all the remaining pairs of objects to the negative bag?

Also, for the formation of the bags: we have object features (att_feat) from the images, labels for those objects (coco_img_sg), and the triplets from the sentences (coco_spice_sg2). Now, for each object pair (att_feat concatenated), we check whether its labels (from coco_img_sg) appear with a predicate in the extracted triples; if true, we assign the pair to the positive bag, otherwise to the negative bag. Should the object labels detected in the image match the objects in the triplets?

Am I making my understanding too complicated? Please let me know if my understanding is right, and correct me if I am wrong. Thank you.

Gitsamshi commented 3 years ago

Hi there, please refer to scripts/prepro_predicates.py for getting positive bags; it should answer your second question. For the first question, let's first be clear about "pair" and "bag":

  1. A pair is defined across any two regions in the image, and a bag is a combination of pairs.
  2. The reason we use bags is that we are not sure about the labels of some pairs. If we are certain about the label of a pair, we just use regular classification, e.g., for the pairs in the negative bags.
  3. In our case, you have to train multiple binary predicate classifiers, and the positive bags differ per predicate. Assuming there are k1 pairs for 'in' and k2 for 'feed' (k1 + k2 = k), that gives two positive bags, and the rest of the pairs are labeled as negative for every predicate.
  4. For the negative pairs, just use a regular cross-entropy loss; for the positive bags, compute the bag probability from the pair probabilities and then apply the cross-entropy loss on the bag.

Hope the above helps. Please feel free to leave further comments.
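
A sketch of point 4, assuming a noisy-OR bag probability over the pair probabilities (max-pooling is another common MIL choice; the repo's exact aggregation may differ):

```python
import torch

def mil_loss(pair_logits, pos_bag_mask):
    """Binary MIL loss for one predicate.

    pair_logits:  (P,) logits, one per region pair in the image
    pos_bag_mask: (P,) bool tensor, True for pairs in the positive bag

    Pairs outside the positive bag are individually negative; the
    positive bag contributes one bag-level term.
    """
    eps = 1e-6
    probs = torch.sigmoid(pair_logits)

    # Regular cross-entropy on every pair in the negative bag.
    neg_loss = -torch.log(1.0 - probs[~pos_bag_mask] + eps).sum()

    # Noisy-OR bag probability: the bag is positive if at least one pair is.
    pos_probs = probs[pos_bag_mask]
    if pos_probs.numel() > 0:
        bag_prob = 1.0 - torch.prod(1.0 - pos_probs)
        pos_loss = -torch.log(bag_prob + eps)
    else:
        pos_loss = pair_logits.new_zeros(())

    return neg_loss + pos_loss
```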

Monikshah commented 3 years ago

Thank you very much. This gives me the answers to so many of my questions.

I checked scripts/prepro_predicates.py. It extracts the predicates and the triples from images and sentences. From these generated triples, can we directly form the positive bag, or is there a process, such as matching the object categories from the images with the objects from the triples, to form the positive bag? I hope this makes sense.

Thank you.

Gitsamshi commented 3 years ago

Yeah, there is a matching process in the code.

Monikshah commented 3 years ago

There are two files used in scripts/prepro_predicates.py that I could not find for download:

  1. '--pred_category', default='data/all_predicates_final.json', help='get all predicates'
  2. '--aligned_triplets', default='data/aligned_triplets_final.json', help='get aligned weak supervision'

Are these the predicates and the triples extracted from the sentences, respectively? Do I save all the predicates as 'all_predicates_final.json' and all the triples as 'aligned_triplets_final.json' and put them in the data folder to run the code?

Monikshah commented 3 years ago

I got the answers to these.

Monikshah commented 3 years ago

Sorry to bother you again and thank you for patiently responding :)

I do not understand which part of scripts/prepro_predicates.py does the matching, or which variable represents the object pairs that go into the positive bag, so I will write code for this part on my own. I will write out the steps for forming the bags; please let me know if the process is right:

  1. I have the predicates and triples from the sentences, as well as the objects detected in the images and their labels.
  2. Form the object pairs (OP) for the objects detected in the images.
  3. For each predicate in the triples, check whether an object pair appears in a triple; if true, put that pair of objects in the positive bag.

Consider the triples given in the paper. Triples: 'woman in hat', 'woman feeds giraffe'; predicates: 'in', 'feed'; object labels: woman, hat, giraffe; object pairs: woman-hat, woman-giraffe, hat-giraffe. For the predicate 'in', check whether an object pair matches the subject and object of a triple; if true, put the features of that object pair in the positive bag (all the pairs of 'woman' and 'hat').

We do the same for all the predicates, i.e., 200*2 bags for 200 predicates (one positive and one negative bag each).

Thank you :)
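
A sketch of the matching step outlined above; build_positive_bags, obj_labels, and triples are hypothetical names, and the region labels are assumed to be already mapped into caption vocabulary:

```python
from collections import defaultdict
from itertools import permutations

def build_positive_bags(obj_labels, triples):
    """Group region pairs into one positive bag per predicate.

    obj_labels: class name per detected region, already mapped to
                caption words (e.g. via data/coco_class_names.txt)
    triples:    caption triples as (subject, predicate, object) strings

    Returns {predicate: [(i, j), ...]}; for each predicate, every pair
    not listed belongs to that predicate's negative bag.
    """
    pos_bags = defaultdict(list)
    for i, j in permutations(range(len(obj_labels)), 2):
        for subj, pred, obj in triples:
            # A pair joins the positive bag when its region labels
            # match the triple's subject and object.
            if obj_labels[i] == subj and obj_labels[j] == obj:
                pos_bags[pred].append((i, j))
    return dict(pos_bags)

labels = ["woman", "hat", "giraffe"]
triples = [("woman", "in", "hat"), ("woman", "feed", "giraffe")]
print(build_positive_bags(labels, triples))
# {'in': [(0, 1)], 'feed': [(0, 2)]}
```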

Gitsamshi commented 3 years ago

Exactly, remember to use "data/coco_class_names.txt" to map object labels to caption words.
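
A minimal sketch of loading that mapping, assuming the file holds one class name per line in detector-label order (check the actual file format in the repo):

```python
def load_class_names(path="data/coco_class_names.txt"):
    """Detector label id -> class-name string, assuming the file holds
    one name per line in label-id order."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

# Usage: names = load_class_names(); names[label_id] is then the word
# compared against the objects in the caption triples.
```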

Monikshah commented 3 years ago

I have been going through scripts/prepro_predicates.py. I came to understand that all_predicates_final.json contains all the predicates and aligned_triplets_final.json contains the pairs of objects for each predicate, so these object pairs can be used directly to form the positive bags. But I am still wondering how to form the negative bag. Can I use the same pairs for the respective predicate to form the negative bag?

These questions might look dumb, but I am really a layman and am struggling to put all the pieces together.

Thank you

Gitsamshi commented 3 years ago

Given a predicate, the negative bag is the complement of the positive bag with respect to all pairs.
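
To make the complement concrete (hypothetical indices):

```python
from itertools import permutations

# Hypothetical indices: 'woman' = 0, 'hat' = 1, 'giraffe' = 2.
# The caption triple (woman, in, hat) puts (0, 1) in the positive bag for 'in'.
all_pairs = set(permutations(range(3), 2))  # 3 * 2 = 6 ordered pairs
pos_bag = {(0, 1)}
neg_bag = all_pairs - pos_bag               # the complement: the other 5 pairs
```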

Monikshah commented 3 years ago

Great! Thank you very much for all the answers. Hopefully I will be able to implement it now.

Monikshah commented 3 years ago

Hello again,

I have been able to implement the bag model and have trained it. Now, while testing, we should be able to predict the predicates in an image using just the object pairs in the image, right?

Also, we will need the ground-truth predicates of the images to evaluate the model, but I don't see any file that contains the ground-truth predicate labels. Can you please give me some information about how/where I can get the test data?

Thank you!

Gitsamshi commented 3 years ago

Exactly. You can use the same splits as the Karpathy split for train/dev/test. As we didn't have predicate annotations between each pair, I just used predicate recall over the whole image (#(predicted predicates ∩ gt predicates) / #(gt predicates)) as a metric to roughly evaluate the model.
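
That metric, written out as a short sketch:

```python
def predicate_recall(predicted, ground_truth):
    """Image-level predicate recall:
    #(predicted predicates ∩ gt predicates) / #(gt predicates)."""
    gt = set(ground_truth)
    if not gt:
        return 0.0  # or skip images without ground-truth predicates
    return len(set(predicted) & gt) / len(gt)

# e.g. predicate_recall({"in", "on"}, {"in", "feed"}) == 0.5
```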

Monikshah commented 3 years ago

Thank you very much for responding.

I think the Karpathy split for train/dev/test does not have predicate annotations over the whole image. How did you get the ground-truth predicates?

Monikshah commented 3 years ago

Do you consider the predicates in the captions of each test image as the ground truth predicates?

Gitsamshi commented 3 years ago

Yes, I mean cut the Karpathy train split into train/dev/test and use the caption predicates as the reference.

Monikshah commented 3 years ago

Awesome! Thank you very much. :)

Monikshah commented 3 years ago

I have one more question. We build positive and negative bags for each predicate and train them separately. If we train them separately, we will have separate models; for example, 10 models for 10 predicates. Do we need to combine these 10 models into one, or do we get just one model from training? (Because eventually we need just one model to predict any predicate, right?)

Gitsamshi commented 3 years ago

I would suggest adding 10 different top-layer classifiers, one per predicate, and sharing the other parameters across all predicates. That makes one model.
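
One way to realize this in PyTorch (a sketch with placeholder sizes, not the repo's actual model): share a pair encoder and give each predicate its own binary logit.

```python
import torch
import torch.nn as nn

class PredicateMIL(nn.Module):
    """Shared pair encoder with one binary top-layer classifier per
    predicate, so all predicates live in a single model."""

    def __init__(self, pair_feat_dim, num_predicates, hidden=512):
        super().__init__()
        # Parameters shared across all predicates.
        self.shared = nn.Sequential(
            nn.Linear(pair_feat_dim, hidden),
            nn.ReLU(),
        )
        # One binary logit per predicate: num_predicates "top layers"
        # packed into a single linear layer.
        self.heads = nn.Linear(hidden, num_predicates)

    def forward(self, pair_feats):   # (P, pair_feat_dim)
        h = self.shared(pair_feats)
        return self.heads(h)         # (P, num_predicates) pair logits
```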

Monikshah commented 3 years ago

Sure. Thank you very much!

ababababababababababab commented 2 years ago

Can I ask you some details about the training? For different predicates, the division into positive and negative bags is different; how do you realize batch training? In addition, what probability threshold do you use after training the classifier?

Monikshah commented 2 years ago

> Can I ask you some details about the training? For different predicates, the division into positive and negative bags is different; how do you realize batch training? In addition, what probability threshold do you use after training the classifier?

I have not been able to train the model yet. I am still working on the MIL part.

What library are you using for multi-instance learning?

Monikshah commented 2 years ago

Hello author,

Thank you so much for being very helpful. I have some more questions; it would be very kind of you to share how you resolved these:

  1. How did you combine the visual features and the bounding-box coordinates of the objects?
  2. After getting the many binary predicate models (classifiers), how do you combine these into one?

Thank you!
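
On the first question, a common approach (an assumption here, not confirmed as this repo's method) is to normalize the box coordinates by the image size and concatenate them with the two region features:

```python
import torch

def pair_feature(feat_i, feat_j, box_i, box_j, img_w, img_h):
    """Build a region-pair feature by concatenating the two visual
    features with image-size-normalized box coordinates (an assumed
    scheme, not necessarily the repo's).

    feat_i, feat_j: (D,) region visual features (e.g. rows of att_feat)
    box_i, box_j:   (4,) [x1, y1, x2, y2] boxes in pixels
    """
    scale = torch.tensor([img_w, img_h, img_w, img_h], dtype=torch.float32)
    geo = torch.cat([
        torch.as_tensor(box_i, dtype=torch.float32) / scale,
        torch.as_tensor(box_j, dtype=torch.float32) / scale,
    ])                                       # (8,) coordinates in [0, 1]
    return torch.cat([feat_i, feat_j, geo])  # (2D + 8,)
```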