Monikshah opened this issue 3 years ago
Thanks for asking. Given an image with n regions, there are n(n-1) region pairs. There may be k pairs related to a predicate r extracted from the caption. These k pairs make up a pos bag, and the remaining n(n-1)-k pairs make up a neg bag.
What we care about during training are the pos bags, since all pairs in the neg bags are labeled negative (the same as binary classification).
You may view MIL as a better method than simply assigning every one of the k pairs a positive label for a binary classifier of predicate r.
Thank you very much for the response. I am very new to this field of research, and I have been struggling for around a month to implement this model.
So what I understand is that there can be k pairs related to a predicate extracted from the caption. For the caption "a woman in a hat feeds the giraffe from her hand", the triplets extracted are 'woman in hat' and 'woman feeds giraffe'. The predicates are 'in' and 'feed'. So the k pairs are all the region pairs for 'woman' and 'hat' under the predicate 'in', plus all the region pairs for 'woman' and 'giraffe' under 'feed'. Only these k pairs make up the positive bag, and the remaining pairs such as 'woman in giraffe', 'woman feeds hat', etc. make up the negative bag, right? Also, do we need to assign all the remaining pairs of objects to the negative bag?
Also, for the formation of bags, we have object features (att_feat) from the images, labels of these objects (coco_img_sg), and the triplets from the sentences (coco_spice_sg2). Now, for each object pair (att_feat concatenated), we check whether their respective labels (from coco_img_sg) appear with a given predicate in the extracted triplets; if true, we assign the pair to the positive bag, else to the negative bag. The object labels detected from the image should match the objects in the triplets?
Am I overcomplicating this? Please let me know if my understanding is right, and correct me if I am wrong. Thank you.
Hi there,
Pls refer to scripts/prepro_predicates.py for getting positive bags. It should answer your second question.
For the first question, let's first make it clear about "pair" and "bag".
(1) A pair is defined across any two regions in the image, and a bag is a combination of pairs.
(2) The reason why we use bags is that we are not sure about the label of some pairs. If we are certain about the label of a pair, we just use regular classification, e.g., for pairs in the negative bags.
(3) In our case, you have to train multiple binary predicate classifiers, and the pos bags differ per predicate. Assuming there are k1 pairs for 'in' and k2 for 'feed' (k1+k2=k), there are two pos bags, and the rest of the pairs are labeled negative for every predicate.
(4) For the negative pairs, just use a regular cross-entropy loss; for pos bags, compute the bag probability from the pair probabilities and then apply the cross-entropy loss at the bag level.
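The losses in (4) can be sketched as below. One common way to get a bag probability from pair probabilities is the noisy-OR (the bag is positive if at least one pair is); the thread doesn't pin down the exact aggregation the authors used, so treat this as an illustrative assumption:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pos_bag_loss(pair_logits):
    """Noisy-OR bag probability over the pairs in one pos bag,
    then cross-entropy on the bag (label 1)."""
    probs = [sigmoid(z) for z in pair_logits]
    bag_prob = 1.0 - math.prod(1.0 - p for p in probs)
    return -math.log(bag_prob + 1e-8)

def neg_pairs_loss(pair_logits):
    """Negative pairs get a regular per-pair cross-entropy (label 0)."""
    return -sum(math.log(1.0 - sigmoid(z) + 1e-8) for z in pair_logits)
```

With a confident pair (large positive logit) inside a pos bag, the bag loss goes to zero even if the other pairs in the bag look negative, which is exactly why MIL tolerates the label noise in the k pairs.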
Hope the above helps; pls feel free to leave further comments.
Thank you very much. This answers so many of my questions.
I checked scripts/prepro_predicates.py. It gives the predicates and the triplets from images and sentences. From these generated triplets, can we directly form the positive bag, or is there a process such as matching the object categories from the images with the objects in the triplets before forming the positive bag? I hope this makes sense.
Thank you.
Yeah, there is a matching process in the code.
There are two files used in scripts/prepro_predicates.py which I could not find to download:
Are these the predicates and the triplets extracted from sentences, respectively? Do I save all the predicates as "all_predicates_final.json" and all the triplets as 'aligned_triplets_final.json' and put them in the data folder to run the code?
I got answers to these.
Sorry to bother you again and thank you for patiently responding :)
I am not understanding which part of scripts/prepro_predicates.py does the matching, and which variable represents the object pairs to put in the positive bag. So I will write the code for this part myself. I will write out the steps to form the bags; please let me know if the process is right.
We do the same for all the predicates, i.e., 200*2 bags for 200 predicates.
Thank you :)
Exactly, remember to use "data/coco_class_names.txt" to map object labels and caption words.
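The matching step described above could look roughly like this. This is a hypothetical sketch, not the repo's actual code: `det_labels` (detected class indices per region) and `triplets` (a set of (subject, predicate, object) tuples from the caption) are assumed names, and only the file path `data/coco_class_names.txt` comes from the thread:

```python
def load_class_names(path="data/coco_class_names.txt"):
    """One class name per line, mapping detector label index -> word."""
    with open(path) as f:
        return [line.strip() for line in f]

def pairs_for_predicate(det_labels, triplets, predicate, class_names):
    """Collect the ordered region pairs whose class names match a
    (subject, predicate, object) triplet extracted from the caption."""
    pos = []
    for i, li in enumerate(det_labels):
        for j, lj in enumerate(det_labels):
            if i == j:
                continue
            if (class_names[li], predicate, class_names[lj]) in triplets:
                pos.append((i, j))
    return pos
```

For example, with regions labeled woman/hat/giraffe and the triplet ('woman', 'in', 'hat'), only the (woman, hat) region pair lands in the pos bag for 'in'.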
I have been going through scripts/prepro_predicates.py. I came to understand that all_predicates_final.json contains all the predicates and aligned_triplets_final.json contains the pairs of objects for each predicate. So I understand now that these object pairs can directly be used to form the positive bags. But I am still wondering how to form the negative bag. Can I use the same pairs for the respective predicate to form the negative bag?
These questions might look dumb, but I am really a layman here and struggling to put all the pieces together.
Thank you
Given a predicate, the neg bag is the complement of the positive bag with respect to all pairs.
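The complement rule above can be written in a few lines; `make_bags` is a hypothetical helper name, enumerating all n(n-1) ordered region pairs and splitting them against the given positive pairs:

```python
from itertools import permutations

def make_bags(n_regions, pos_pairs):
    """Pos bag = matched pairs; neg bag = complement over all
    n(n-1) ordered region pairs for this predicate."""
    pos_set = set(pos_pairs)
    all_pairs = list(permutations(range(n_regions), 2))
    pos = [p for p in all_pairs if p in pos_set]
    neg = [p for p in all_pairs if p not in pos_set]
    return pos, neg
```

So for n=3 regions there are 6 ordered pairs, and one matched pair leaves 5 in the neg bag.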
Great! Thank you very much for all the answers. Hopefully I should be able to implement now.
Hello again,
I have been able to implement the bag model and trained it. Now, while testing, we should be able to predict the predicates in an image using just the object pairs in that image, right?
Also, we will need the ground-truth predicates of the images to evaluate the model. I don't see any file which contains the ground-truth predicate labels. Can you please provide some information about how/where I can get the test data?
Thank you!
Exactly. You can use the same split as the Karpathy split for train/dev/test. As we didn't have predicate annotations between each pair, I just used predicate recall over the whole image (#(predicted predicates ∩ gt predicates) / #(gt predicates)) as a metric to roughly evaluate the model.
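The image-level recall metric quoted above is direct to implement; the function name and the handling of images with no gt predicates are my own choices, not from the thread:

```python
def predicate_recall(predicted, ground_truth):
    """Image-level predicate recall:
    |predicted ∩ gt| / |gt|, computed over predicate sets."""
    pred, gt = set(predicted), set(ground_truth)
    if not gt:
        return None  # assumption: skip images without gt predicates
    return len(pred & gt) / len(gt)
```

Averaging this over the test images gives the rough evaluation described; note it ignores which region pair a predicate was predicted for, since there are no per-pair annotations.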
Thank you very much for responding.
I think the Karpathy train/dev/test split does not have predicate annotations over the whole image. How did you get the ground-truth predicates?
Do you consider the predicates in the captions of each test image as the ground-truth predicates?
Yes, I mean we cut the Karpathy train split into train/dev/test and use the caption predicates as the reference.
Awesome! Thank you very much. :)
I have one more question. We build a positive and a negative bag for each predicate and train them separately. If we train them separately, we will have separate models, e.g., 10 models for 10 predicates. Do we need to combine these 10 models into one, or do we get just one model from training? (Because eventually we need a single model to predict any predicate, right?)
I would suggest adding 10 different top-layer classifiers, one per predicate, and sharing the other parameters across all predicates. That makes one model.
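The shared-parameters-plus-per-predicate-heads suggestion can be sketched in NumPy; all layer sizes, names, and the single-hidden-layer shape here are illustrative assumptions, not the repo's architecture:

```python
import numpy as np

class SharedMILModel:
    """One model: a shared pair encoder plus one binary
    logistic head per predicate."""

    def __init__(self, in_dim, hid_dim, n_predicates, seed=0):
        rng = np.random.default_rng(seed)
        self.W_shared = rng.normal(0.0, 0.02, (in_dim, hid_dim))
        self.heads = rng.normal(0.0, 0.02, (hid_dim, n_predicates))

    def forward(self, pair_feats):
        h = np.maximum(pair_feats @ self.W_shared, 0.0)  # shared ReLU features
        logits = h @ self.heads                          # one logit per predicate
        return 1.0 / (1.0 + np.exp(-logits))             # per-predicate probabilities
```

At training time, each predicate's pos/neg bags would only contribute gradients to that predicate's head column plus the shared encoder, so all 200 binary classifiers live in one set of weights.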
Sure. Thank you very much!
Can I ask you some details about the training? For different predicates, the division into positive and negative bags is different; how do you realize batch training? In addition, what probability threshold do you use after training the classifier?
I have not been able to train the model yet. I am still working on the MIL part.
What library are you using for multi-instance learning?
Hello author,
Thank you so much for being very helpful. I have some more questions; it would be very kind of you if you could share how you resolved these:
Thank you!
I am trying to reproduce your model for weakly supervised multi-instance learning, and I am a bit confused about the formation of the positive and negative bags. The paper says that, for a predicate r associated with an object region pair, the region pair is labeled as a positive bag if the predicate r is in the caption S. My question: the predicates are extracted from the triplets, and the triplets are extracted from the caption, so the predicate will always be present in the caption.
How do I label the positive and negative bags? Can you please help me understand this?
Thank you very much.