fpv-iplab / rulstm

Code for the Paper: Antonino Furnari and Giovanni Maria Farinella. What Would You Expect? Anticipating Egocentric Actions with Rolling-Unrolling LSTMs and Modality Attention. International Conference on Computer Vision, 2019.
http://iplab.dmi.unict.it/rulstm

Training data of Object Detector #9

Closed: staceycy closed this issue 4 years ago

staceycy commented 4 years ago

Hi,

Thank you very much for your awesome work!

Could you please tell me what dataset you used to train the object detector? If you are using the EPIC-Kitchens object detection dataset, how did you convert the class labels, given that the object detection annotations contain 295 classes, which is fewer than the number of nouns (351 classes) in the action anticipation task?

Thank you.

Best, Stacey

antoninofurnari commented 4 years ago

Hello,

Thank you for your interest in our work!

Yes, we have used the EPIC-Kitchens dataset to train the detector. It is true that some noun classes are not present in the training set, and hence the total number of objects is smaller than the official number of 352 classes.

In our experiments, we set up the object detector to recognize the 352 classes reported in https://github.com/epic-kitchens/annotations/blob/master/EPIC_noun_classes.csv. Then we trained the detector using the labels provided in https://raw.githubusercontent.com/epic-kitchens/annotations/master/EPIC_train_object_labels.csv. Specifically, the noun_class column of the latter file corresponds to the noun_id column of the former file.

With this setup, the model never observes some of the objects during training, since their labels never appear in the annotations, but it will still output boxes for all 352 classes (probably detecting nothing, or garbage, for the classes not seen during training).
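If it helps, here is a minimal sketch of how the two files line up (this is not our training code; `noun_id` and `noun_class` are the columns described above, while the `class_key` column name and the local file paths are assumptions):

```python
import csv

# Build the full 352-class list from EPIC_noun_classes.csv and check which
# classes actually have boxes in EPIC_train_object_labels.csv.
with open('EPIC_noun_classes.csv') as f:
    classes = {int(row['noun_id']): row['class_key'] for row in csv.DictReader(f)}

with open('EPIC_train_object_labels.csv') as f:
    annotated = {int(row['noun_class']) for row in csv.DictReader(f)}

print(len(classes))                      # 352 detector classes in total
print(len(annotated))                    # classes that actually have boxes
never_seen = sorted(set(classes) - annotated)
print([classes[i] for i in never_seen])  # classes the detector never observes
```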

Hope this helps.

Best, Antonino

staceycy commented 4 years ago

@antoninofurnari Thank you very much for your detailed reply. I understand it now :)

staceycy commented 4 years ago

Hi @antoninofurnari, I have another question regarding the object detector.

I noticed that the RGB images used for object detection (1920*1080) have a much larger resolution than those used for action recognition (456*256). May I know how you dealt with the image size difference between training and testing images? Did you resize the action recognition frames to a larger resolution at inference time?

Thank you very much.

Best, Stacey

antoninofurnari commented 4 years ago

Hello Stacey,

To avoid any bias in object detection, we processed each frame at its full resolution, which is 1920*1080 most of the time, but occasionally 1440p. To avoid extracting all frames to disk at full resolution, I modified a script originally included in the Detectron library so that it extracts bounding boxes directly from each frame of a video. You can find the script here: https://github.com/fpv-iplab/rulstm/blob/master/FasterRCNN/tools/detect_video.py.
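The core idea is roughly the following (a simplified sketch, not the actual code in detect_video.py; `run_detector` is a hypothetical stand-in for the Detectron model call):

```python
import cv2

def detect_video(video_path, run_detector):
    """Decode each frame at its native resolution and run the detector on it,
    so frames never have to be written to disk at full size."""
    cap = cv2.VideoCapture(video_path)
    detections = []
    frame_idx = 0
    while True:
        ok, frame = cap.read()  # native resolution, e.g. 1920x1080
        if not ok:
            break
        detections.append((frame_idx, run_detector(frame)))
        frame_idx += 1
    cap.release()
    return detections
```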

I didn't try to upsample the low resolution frames because I was afraid that could harm the detection of small objects.

Hope this helps.

Best, Antonino

staceycy commented 4 years ago

Hi Antonino,

Thank you very much for your detailed explanation.

I have checked the code, and it is nice! May I ask what frame rate you are using to extract frames from videos? Is it the default one in OpenCV?

Thank you again for your kind help.

Best, Stacey

antoninofurnari commented 4 years ago

Since some videos in EPIC-Kitchens have different framerates, I have converted all videos to a fixed framerate of 30fps as discussed in https://github.com/fpv-iplab/rulstm/issues/3#issuecomment-562628974.
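In case it is useful, the conversion can be done along these lines (a sketch assuming ffmpeg is on the PATH; the directory names are only illustrative, and the exact command may differ from the one in the linked comment):

```python
import subprocess
from pathlib import Path

src_dir = Path('videos')        # original EPIC-Kitchens videos
dst_dir = Path('videos_30fps')  # converted copies at a fixed 30 fps
dst_dir.mkdir(exist_ok=True)

for video in sorted(src_dir.glob('*.MP4')):
    out = dst_dir / video.name
    # Re-encode each video at a constant 30 fps.
    subprocess.run(['ffmpeg', '-i', str(video), '-r', '30', str(out)], check=True)
```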

I then used the converted videos as input to detect_video.py. The conversion also makes sure that OpenCV can successfully decode the videos.

Best, Antonino

staceycy commented 4 years ago

I got it. Thanks a lot! Please take care and stay healthy :)

Best, Stacey

antoninofurnari commented 4 years ago

Sure you too!

Glad to help :) Antonino