DirtyHarryLYL / Transferable-Interactiveness-Network

Code for Transferable Interactiveness Knowledge for Human-Object Interaction Detection. (CVPR'19, TPAMI'21)
MIT License

about the hico-det metric? #3

Closed: ZHUXUHAN closed this issue 5 years ago

ZHUXUHAN commented 5 years ago

Hi, I am quite confused about the metric on the HICO-DET dataset. As I understand it, the mAP is computed as the mean average precision over every HOI class. Are the bounding boxes taken from the ground truth or from a detector's results? If you use the detector's results to compute the mAP, could you tell me your detector's mAP on HICO-DET, and did you train the detector yourself? Thank you very much.

ZHUXUHAN commented 5 years ago

For the HICO dataset there are no keypoint annotations, so how do you build the keypoint stream? I am confused because I used OpenPose for the same purpose and it did not work. Could you share the details of how you implement the keypoint stream?

ZHUXUHAN commented 5 years ago

Also, the HICO dataset is unbalanced between interactive and non-interactive pairs. How do you train NIS as a prior, and how do you measure NIS performance? Thanks.

DirtyHarryLYL commented 5 years ago
  1. Are the bboxes from the ground truth or from the detector's results? --- From the detector. Specifically, we use the Detectron results supplied by iCAN (https://github.com/vt-vl-lab/iCAN). No, we have not finetuned or retrained the object detection model; we used the bbox results directly.
  2. We use AlphaPose (https://github.com/MVIG-SJTU/AlphaPose) for pose estimation. First we get 17 keypoints for each human bbox, then generate a visualized map of the skeleton. This map and the two spatial maps following HICO-DET (WACV) are the inputs of a single stream. The visualized skeleton is drawn with OpenCV functions.
  3. The ratio of positive to negative pairs for training the interactiveness network (binary classification) is 1:1. Better performance is obtained by training the interactiveness network alone; you can then use this binary classifier to do NIS on any trained HOI model. We use a hard threshold for NIS (see the sketch right after this list). In my experience, this threshold needs to be small, e.g. 0.1, because some images have few positive pairs, and a larger threshold would kill too many of them and hurt performance.
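
A minimal sketch of that hard-threshold NIS (the `pairs` container and score array are hypothetical names; 0.1 is the threshold suggested above):

```python
import numpy as np

def apply_nis(pairs, interactiveness_scores, threshold=0.1):
    """Non-Interaction Suppression: drop candidate human-object pairs
    whose binary interactiveness score falls below a hard threshold.
    A small threshold is safer: images with few positive pairs would
    lose too many of them under an aggressive cutoff."""
    keep = np.asarray(interactiveness_scores) >= threshold
    return [pair for pair, kept in zip(pairs, keep) if kept]
```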

Thanks for your attention. Now we are working on a bigger project on activity understanding which includes our TIN. We plan to make this project (with TIN) open source this summer.

ZHUXUHAN commented 5 years ago

Oh, excellent answers. I am very grateful that you took the time to resolve my confusion. I am doing similar work, but I work slowly; your work is excellent, and the activity-understanding project you are doing sounds very advanced. I admire your academic ability and your selfless dedication.

  1. I tested the detector using Detectron's results with R50-FPN-1x (https://github.com/facebookresearch/Detectron/blob/master/MODEL_ZOO.md), and the bbox mAP is only about 20. I also looked at the iCAN project (Chen Gao, BMVC 2018), but my work uses PyTorch, so I have not evaluated its detector; maybe it performs better. I think the detector's ability sets the upper bound of HOI classification, so the detector is a key point.
  2. About the keypoint stream: can you explain in detail how to generate the skeleton map, and what the result looks like?
  3. Thanks for your answer; I had been thinking about it wrongly, and now I can do it the right way.

ZHUXUHAN commented 5 years ago

About the detector: when I try Detectron, it misses some objects' bboxes, especially for rare object classes, so HOI detection works badly because the recall is low.

DirtyHarryLYL commented 5 years ago
  1. Object detection: yes, it is quite important for HOI detection. If the detection results cannot reach a certain bar, HOI performance will also be limited. I have tested different detectors, e.g. FPN with ResNet-101 and ResNet-152, but the performance difference is not very large. That is why I propose the LIS function, which suppresses bad detections softly (see the first sketch after this list). In my opinion, bad pairing of humans and objects is more influential.
  2. I put the 17 points on a binary map, set their grey values to 0.05, 0.10, 0.15, ..., 0.95, and link the points according to the human body skeleton with 2-pixel-wide edges (see the second sketch after this list). There are some examples in our paper too. In our code, the keypoints are paired with their corresponding human box, and the dataloader generates input tensors from them.
  3. Yeah, some rare classes are pretty tricky: detectors trained on COCO may perform badly on HICO-DET. But an ad hoc detector, or one specially enhanced for the HICO-DET dataset, has limitations too. In my opinion, it may be better to focus on the interaction understanding itself: use the same detections and compare only the interaction recognition performance. Good luck!
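
For point 1, a minimal sketch of a logistic-style LIS weighting that softly down-weights low-confidence detections; the hyper-parameter values below are illustrative, not necessarily those used in the released code:

```python
import numpy as np

def lis(score, T=8.3, k=12.0, w=10.0):
    """Low-grade Instance Suppressive function: a scaled, shifted sigmoid
    that maps a detection confidence in [0, 1] to a soft weight, so bad
    detections are suppressed smoothly instead of hard-thresholded.
    T, k and w are illustrative hyper-parameters."""
    return T / (1.0 + np.exp(k - w * np.asarray(score, dtype=np.float64)))

# e.g. re-weight a pair's HOI score with both detection confidences:
# s_pair = s_hoi * lis(s_human) * lis(s_object)
```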
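For point 2, a sketch of the skeleton-map rasterization; the 64x64 map size, the COCO 17-keypoint edge list and the edge grey value are my assumptions, while the graded point grey values and the 2-pixel edge width follow the description above:

```python
import cv2
import numpy as np

# Standard COCO 17-keypoint skeleton (pairs of keypoint indices).
COCO_EDGES = [(15, 13), (13, 11), (16, 14), (14, 12), (11, 12),
              (5, 11), (6, 12), (5, 6), (5, 7), (6, 8), (7, 9),
              (8, 10), (1, 2), (0, 1), (0, 2), (1, 3), (2, 4),
              (3, 5), (4, 6)]

def skeleton_map(keypoints, human_box, size=64):
    """Rasterize 17 keypoints (in image coordinates) into a single-channel
    pose map normalized to the human box. Each point gets a distinct grey
    value (0.05, 0.10, ...) and the points are linked along the skeleton
    with 2-pixel-wide edges."""
    x1, y1, x2, y2 = human_box
    sx = size / max(x2 - x1, 1e-6)
    sy = size / max(y2 - y1, 1e-6)
    pts = [(int((x - x1) * sx), int((y - y1) * sy)) for x, y in keypoints]

    canvas = np.zeros((size, size), dtype=np.float32)
    for a, b in COCO_EDGES:
        cv2.line(canvas, pts[a], pts[b], color=0.5, thickness=2)
    for i, pt in enumerate(pts):
        cv2.circle(canvas, pt, radius=1, color=0.05 * (i + 1), thickness=-1)
    return canvas
```
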
ZHUXUHAN commented 5 years ago

Oh yes, I will pay more attention to interaction recognition. I have another question. For HICO, suppose you detect one human and one bottle, but there are two HOIs: one might be "hold bottle" and the other "drink with bottle". How do you judge a true positive? In other words, when one person and one object share many HOIs, how do you evaluate that?

DirtyHarryLYL commented 5 years ago

I don't exactly know what you mean by "metric". Do you mean how to decide whether a pair is positive or negative? Strictly speaking, HOI detection is a multi-label classification problem: the 600 HOI classes are separate during inference (600 sigmoids), so a pair can be positive for one class and negative for another. In our paper, we focus on the interactive vs. non-interactive pair problem, that is, whether a pair has one or more HOIs or zero HOIs.
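
A minimal illustration of that multi-label setup (plain NumPy, hypothetical names):

```python
import numpy as np

def hoi_scores(logits):
    """600 independent sigmoids, one per HOI class: each class is
    decided on its own, with no softmax competition between classes."""
    return 1.0 / (1.0 + np.exp(-np.asarray(logits)))

scores = hoi_scores(np.random.randn(600))
# A pair can be positive for several classes at once, e.g. both
# 'hold bottle' and 'drink with bottle', and negative for the rest:
positive_classes = np.where(scores > 0.5)[0]
```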

ZHUXUHAN commented 5 years ago

> Strictly speaking, HOI detection is a multi-label classification problem: the 600 HOI classes are separate during inference (600 sigmoids), so a pair can be positive for one class and negative for another.

So is your evaluation code like the VOC2007 (or VOC2012) mAP, or something else? That is, how do you evaluate your model? I used a different evaluation method and got a different result, so I am quite confused. If you have time, could you release your evaluation code? Thank you very much.

DirtyHarryLYL commented 5 years ago

I use the code supplied by ho-rcnn (https://github.com/ywchao/ho-rcnn).
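
For reference, that protocol counts a detected pair as a true positive for an HOI class only when both its human box and its object box overlap a ground-truth pair of the same class with IoU >= 0.5, then computes VOC-style AP per class. A minimal sketch of the matching rule (helper and key names are hypothetical):

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def matches_gt(det, gt, thresh=0.5):
    """Both the human and the object box must reach the IoU threshold
    against the same ground-truth pair of the same HOI class."""
    return (iou(det['human_box'], gt['human_box']) >= thresh and
            iou(det['object_box'], gt['object_box']) >= thresh)
```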

ZHUXUHAN commented 5 years ago

Hi, for the HICO-DET dataset, how long do you train a model, and how many GPUs do you use?

ZHUXUHAN commented 5 years ago

I am very sorry to bother you so often. Since this is a multi-label problem, when you evaluate, do you select the top-n predictions or the predictions above a score threshold? I don't know how to evaluate correctly, so I can't tell whether my method works. Apologies again for the interruptions!

DirtyHarryLYL commented 5 years ago

You can find the hyper-parameters in our papers. We use a single Titan X. Yes, usually we need to discard some results with low scores, either detection scores or NIS scores. You can also limit the number of pairs within each image (see the sketch below).
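
A small sketch of that kind of post-processing (all names and values are illustrative, not tuned):

```python
def postprocess(pairs, min_score=0.05, max_pairs=100):
    """Drop low-scoring pair results, then keep at most the top-k
    pairs per image ranked by score."""
    kept = [p for p in pairs if p['score'] >= min_score]
    kept.sort(key=lambda p: p['score'], reverse=True)
    return kept[:max_pairs]
```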

ZHUXUHAN commented 5 years ago

For the V-COCO dataset, the training set has only 2k+ images while the test set has 4k+ images, and my evaluation result is a little low. Do you train on just the 2k+ images and evaluate on the 4k+ images?

DirtyHarryLYL commented 5 years ago

https://github.com/s-gupta/v-coco

ZHUXUHAN commented 5 years ago

> https://github.com/s-gupta/v-coco

When you evaluate the model on HICO, do you just use the original MATLAB code uploaded by ho-rcnn (https://github.com/ywchao/ho-rcnn)?

DirtyHarryLYL commented 5 years ago

For a fair comparison, and to exclude the influence of object detection, we use the detection results (Detectron) provided by iCAN and follow their evaluation process.

You can find all the pkl files and the data processing here: https://github.com/vt-vl-lab/iCAN/blob/master/misc/download_dataset.sh, https://github.com/vt-vl-lab/iCAN/blob/master/misc/download_detection_results.sh, https://github.com/vt-vl-lab/iCAN/blob/master/misc/download_training_data.sh.

Some details about the evaluation can be found here: https://github.com/vt-vl-lab/iCAN/blob/master/lib/ult/vsrl_eval.py, https://github.com/vt-vl-lab/iCAN/blob/master/lib/ult/Generate_HICO_detection.py.

Before the MATLAB evaluation, some pre-processing must be done, e.g. discarding HOI results with wrong object classes according to the HOI class settings of HICO-DET and V-COCO (see the sketch below).
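
A sketch of that filtering step (the `hoi_to_object` table is a hypothetical name; in practice it comes from the dataset's HOI class definitions):

```python
def filter_by_object_class(detections, hoi_to_object):
    """Discard HOI detections whose detected object category does not
    match the object required by the predicted HOI class, e.g. a
    'ride horse' prediction attached to a detected 'dog' box."""
    return [det for det in detections
            if hoi_to_object[det['hoi_class']] == det['object_class']]
```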

The result processing is tricky: if you use your own model and predictions, you may need to read the MATLAB code and its API carefully and design your own post-processing code. If you want to bypass this tricky course, you can also refer to the code above, which was carefully written by the authors of iCAN (thanks for their great work).

ZHUXUHAN commented 5 years ago

You are such a humble and helpful researcher. I have studied the iCAN codebase seriously, but my project is based on PyTorch, so the whole process feels quite awkward. I am following your answers carefully and making gradual progress. Your work is very good; I may not be able to match it, but I am studying this field seriously. Thank you very much for your patience and excellent answers; they are very helpful to me.

DirtyHarryLYL commented 5 years ago

I'm sorry I cannot be more helpful. Our project is based on TF too. Good luck, you can make it.

ZHUXUHAN commented 5 years ago

> I'm sorry I cannot be more helpful. Our project is based on TF too. Good luck, you can make it.

I may use a different detector (possibly trained by myself; it may evaluate better than iCAN's detector). When I evaluate on the V-COCO dataset, my result may be better than yours (10+% higher on AP_role), which is too large a gap. I only use my baseline model (a three-stream model with person keypoints from a pretrained OpenPose model as prior information), without any other strategies. I may be using a wrong evaluation method, so could this cause an unfair comparison?

DirtyHarryLYL commented 5 years ago

50+ mAP? Wow, that is so high. No idea; maybe something went wrong, or it is the benefit of better object detections. Maybe you can convert iCAN's detections to your format and use the same model to verify the 10+% difference.

ZHUXUHAN commented 5 years ago

> 50+ mAP? Wow, that is so high. No idea; maybe something went wrong, or it is the benefit of better object detections. Maybe you can convert iCAN's detections to your format and use the same model to verify the 10+% difference.

I had made a mistake in the evaluation code. I rewrote it with a correct method and then evaluated on the val data, where it reaches about 43 AP_role. I have read several papers and they differ from each other: do you evaluate on the val data or on the test data?