Hello,
Thank you for your interest in our work!
Yes, we have used the EPIC-Kitchens dataset to train the detector. It is true that some noun classes are not present in the training set, and hence the number of object classes actually seen during training is smaller than the official 352.
In our experiments, we set up the object detector to recognize the 352 classes listed in https://github.com/epic-kitchens/annotations/blob/master/EPIC_noun_classes.csv. We then trained the detector using the labels provided in https://raw.githubusercontent.com/epic-kitchens/annotations/master/EPIC_train_object_labels.csv. Specifically, the noun_class column of the latter file corresponds to the noun_id column of the former file.
Since some labels never appear, with this setup the model will never observe some of the objects during training, but it will still output boxes for all 352 classes (probably detecting nothing, or garbage, for the classes not seen during training).
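To make the mapping concrete, here is a rough pandas sketch (not code from our pipeline; the class_key column name is assumed from the 2018 annotation files, while noun_id and noun_class are the columns mentioned above):

```python
# Rough sketch: join the two annotation files linked above.
# Assumes the 2018 EPIC-Kitchens column names; class_key is an assumption.
import pandas as pd

classes = pd.read_csv("EPIC_noun_classes.csv")        # 352 rows, one per noun class
labels = pd.read_csv("EPIC_train_object_labels.csv")  # per-frame object annotations

# noun_class in the labels file corresponds to noun_id in the classes file.
id_to_name = dict(zip(classes["noun_id"], classes["class_key"]))
labels["class_name"] = labels["noun_class"].map(id_to_name)

# Classes that never appear in the training annotations.
unseen = set(classes["noun_id"]) - set(labels["noun_class"].unique())
print(f"{len(unseen)} of {len(classes)} classes have no training boxes")
```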
Hope this helps.
Best, Antonino
@antoninofurnari Thank you very much for your detailed reply. I understand it now :)
Hi @antoninofurnari, I have another question regarding the object detector.
I noticed that the RGB frames used for object detection (1920*1080) have a much larger resolution than those used for action recognition (456*256). May I know how you deal with the size difference between training and testing images? Did you resize the action recognition frames to a larger resolution when running inference?
Thank you very much.
Best, Stacey
Hello Stacey,
To avoid any bias in object detection, we processed each frame at its full resolution, which is 1920*1080 most of the time, but sometimes 1440p. To avoid having to extract and store all frames at full resolution, I modified a script originally included in the Detectron library to extract bounding boxes directly from each frame of a video. You can find the script here: https://github.com/fpv-iplab/rulstm/blob/master/FasterRCNN/tools/detect_video.py.
I didn't try to upsample the low-resolution frames because I was afraid it could harm the detection of small objects.
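Just for illustration, the per-frame loop looks roughly like the sketch below (the actual detect_video.py in the repository is Detectron-based; detect_objects here is a hypothetical placeholder for whatever model you load):

```python
# Minimal sketch of a per-frame detection loop at native resolution.
# detect_objects() is a placeholder, not the real Detectron model call.
import cv2

def detect_video(video_path, detect_objects):
    cap = cv2.VideoCapture(video_path)
    detections = []
    frame_idx = 0
    while True:
        ret, frame = cap.read()  # frame is kept at its native resolution
        if not ret:
            break
        # Run the detector on the full-resolution frame (no resizing).
        boxes = detect_objects(frame)
        detections.append((frame_idx, boxes))
        frame_idx += 1
    cap.release()
    return detections
```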
Hope this helps.
Best, Antonino
Hi Antonino,
Thank you very much for your detailed explanation.
I have checked the code, and it is nice! May I ask what frame rate you are using to extract frames from videos? Is it the default one in OpenCV?
Thank you again for your kind help.
Best, Stacey
Since some videos in EPIC-Kitchens have different framerates, I converted all videos to a fixed framerate of 30fps, as discussed in https://github.com/fpv-iplab/rulstm/issues/3#issuecomment-562628974.
I then used the converted videos as input to detect_video.py. The conversion also ensures that OpenCV can successfully decode the videos.
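As a rough sketch of that kind of conversion (assuming ffmpeg is installed; the exact command I used is in the linked issue), something like the following works:

```python
# Sketch of a fixed-framerate conversion via ffmpeg (assumes ffmpeg is on PATH;
# see the linked issue for the exact command used for EPIC-Kitchens).
import subprocess
from pathlib import Path

def convert_to_30fps(src, dst_dir):
    dst = Path(dst_dir) / Path(src).name
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src),
         "-r", "30",         # force a constant 30 fps output
         "-c:v", "libx264",  # re-encode the video stream
         "-an",              # audio is not needed for detection
         str(dst)],
        check=True,
    )
    return dst
```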
Best, Antonino
I got it. Thanks a lot! Please take care and stay healthy :)
Best, Stacey
Sure you too!
Glad to help :) Antonino
Hi,
Thank you very much for your awesome work!
Could you please tell me which dataset you used to train the object detector? If you used the EPIC-Kitchens object detection dataset, how did you convert the class labels, given that the object detection annotations contain 295 classes, which is fewer than the number of noun classes (351) in the action anticipation task?
Thank you.
Best, Stacey