Knowledge Distillation for Smart Distancing

alpha-carinae29 commented 4 years ago

I have an idea: use knowledge distillation in each new environment to train an environment-specific model. @mhejrati what do you think?

mhejrati commented 4 years ago

I think a good starting point is to evaluate a few strong object detectors and see how they perform on Oxford Town Center and similar datasets without fine-tuning. Then we can run an end-to-end experiment with distillation and see if we get acceptable results before starting to optimize for edge devices.

Do you have a list of candidate detector models to test?

alpha-carinae29 commented 4 years ago

Yeah, I am testing some SOTA object detectors on Town Center. Here is the list:

Title	Paper	Code	mAP on Town Center
Faster RCNN with NAS	Paper	Code	83.43
ResNeSt: Split-Attention Networks	Paper	Code	-
IterDet: Iterative Scheme for Object Detection in Crowded Environments	Paper	Code	91.59 (checkpoint trained on Crowd Human dataset) / 86.61 (checkpoint trained on Wider Person dataset)
Memory Enhanced Global-Local Aggregation for Video Object Detection	Paper	Code	Canceled

This list will be updated.
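For anyone who wants to reproduce the mAP column, here is a minimal sketch of single-class (pedestrian) average precision. The thread does not state which evaluation protocol was used, so the IoU threshold of 0.5, the all-point interpolation, and the function names `iou` and `average_precision` are assumptions made for illustration only.

```python
import numpy as np


def iou(box, boxes):
    """IoU between one box and an (N, 4) array of boxes, all [x0, y0, x1, y1]."""
    x0 = np.maximum(box[0], boxes[:, 0])
    y0 = np.maximum(box[1], boxes[:, 1])
    x1 = np.minimum(box[2], boxes[:, 2])
    y1 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x1 - x0, 0, None) * np.clip(y1 - y0, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)  # assumes boxes with positive area


def average_precision(detections, ground_truth, iou_thr=0.5):
    """Single-class AP.

    detections: list of (frame_id, score, [x0, y0, x1, y1]).
    ground_truth: dict mapping frame_id -> list/array of boxes.
    """
    gt = {f: np.asarray(b, dtype=float).reshape(-1, 4) for f, b in ground_truth.items()}
    used = {f: np.zeros(len(b), dtype=bool) for f, b in gt.items()}
    n_gt = sum(len(b) for b in gt.values())

    hits = []
    # Visit detections from most to least confident.
    for frame_id, _, box in sorted(detections, key=lambda d: -d[1]):
        boxes = gt.get(frame_id, np.zeros((0, 4)))
        hit = False
        if len(boxes):
            overlaps = iou(np.asarray(box, dtype=float), boxes)
            overlaps[used[frame_id]] = -1.0  # each ground-truth box matches once
            best = int(np.argmax(overlaps))
            if overlaps[best] >= iou_thr:
                used[frame_id][best] = True
                hit = True
        hits.append(hit)

    if n_gt == 0 or not hits:
        return 0.0
    hits = np.array(hits)
    tp = np.cumsum(hits)
    fp = np.cumsum(~hits)
    recall = tp / n_gt
    precision = tp / (tp + fp)
    # All-point interpolation: make precision non-increasing in recall.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    recall = np.concatenate([[0.0], recall])
    return float(np.sum(np.diff(recall) * precision))
```

Each detection is matched to the best still-unmatched ground-truth box in its frame, which is one common convention; other protocols differ in this detail and can give slightly different numbers.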

alpha-carinae29 commented 4 years ago

For now I am testing the Faster RCNN model on Town Center and trying to create a distillation pipeline.

alpha-carinae29 commented 4 years ago

Hi everyone, I wanted to give you an update. I tested a Faster RCNN model, trained with NAS model selection strategies, on the Oxford Town Center dataset. (You can download the model from the TensorFlow Model Zoo.) Its mAP was 83.43 on the first 4501 frames of the dataset, and the frame rate on an Nvidia RTX 1070 Super GPU was 1 FPS.

Then I did some post-processing to clean the predictions: filtering boxes based on background subtraction and removing very large boxes. Finally, I used the cleaned predictions of this model as ground truth for training an SSD MobileNet V2 model, which is much lighter and faster than the RCNN model. On the last 1000 frames of Town Center (the validation set), the mAP of this student model was 15% lower than the teacher model (RCNN), and its frame rate was 100 times higher (100 FPS).

Then I tried substituting the green channel of each RGB frame with a foreground mask and training a new SSD MobileNet V2 as the student model. This time the mAP on the validation set was only 13% lower than the teacher model. Now I am trying to add more features to the input image, such as optical flow, to improve the performance of the student model.
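A rough sketch of the two preprocessing steps described above, using only NumPy. This is not the actual pipeline code: the function names, the thresholds `max_area_frac` and `min_fg_frac`, and the assumption that the foreground mask is already computed by some background subtractor are all hypothetical choices for illustration.

```python
import numpy as np


def filter_teacher_boxes(boxes, fg_mask, max_area_frac=0.25, min_fg_frac=0.1):
    """Drop teacher boxes that are very large or contain almost no foreground.

    boxes: (N, 4) array of [x0, y0, x1, y1] in pixels.
    fg_mask: (H, W) array, nonzero where background subtraction found motion.
    The two thresholds are illustrative values, not taken from the thread.
    """
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
    height, width = fg_mask.shape
    kept = []
    for x0, y0, x1, y1 in boxes:
        # Clip to the frame and convert to integer pixel indices.
        x0, x1 = int(max(0, x0)), int(min(width, x1))
        y0, y1 = int(max(0, y0)), int(min(height, y1))
        area = (x1 - x0) * (y1 - y0)
        if area <= 0:
            continue  # degenerate box
        if area > max_area_frac * height * width:
            continue  # very large box, likely a false positive
        fg_frac = np.count_nonzero(fg_mask[y0:y1, x0:x1]) / area
        if fg_frac >= min_fg_frac:
            kept.append([x0, y0, x1, y1])
    return np.array(kept, dtype=float).reshape(-1, 4)


def replace_green_with_mask(frame_rgb, fg_mask):
    """Return a copy of an (H, W, 3) RGB frame whose green channel is the mask."""
    out = frame_rgb.copy()
    out[..., 1] = np.where(fg_mask > 0, 255, 0).astype(frame_rgb.dtype)
    return out
```

The boxes that survive `filter_teacher_boxes` would then be written out as pseudo ground truth for the student, and `replace_green_with_mask` would be applied to every training and inference frame so the student always sees the same input layout.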

alpha-carinae29 commented 4 years ago

Hi everyone. I ran another experiment using 3 different input channels for training the student model:

channel 1: grayscale of each frame
channel 2: foreground mask of each frame
channel 3: magnitude of the optical flow of each frame

The performance improved by 3% compared with the last experiment (replacing one RGB channel with the foreground mask), and now the mAP of this student model is just 10% lower than the huge teacher model.
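To make the channel layout concrete, here is a minimal sketch of how such a 3-channel input could be assembled. It assumes the foreground mask and a dense optical flow field of shape (H, W, 2) are computed elsewhere; the name `build_student_input`, the BT.601 grayscale weights, and the per-frame scaling of the flow magnitude to 0..255 are illustrative assumptions, not details from the experiment.

```python
import numpy as np


def build_student_input(frame_rgb, fg_mask, flow):
    """Stack grayscale, foreground mask, and flow magnitude into one image.

    frame_rgb: (H, W, 3) uint8 RGB frame.
    fg_mask:   (H, W) array, nonzero where the pixel is foreground.
    flow:      (H, W, 2) float array of per-pixel (dx, dy) displacements.
    Returns an (H, W, 3) uint8 array ordered [gray, mask, flow magnitude].
    """
    # Channel 1: luminance using the common BT.601 weights (an assumption).
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    gray = frame_rgb[..., :3].astype(np.float32) @ weights

    # Channel 2: binary foreground mask scaled to the 0..255 range.
    mask = np.where(fg_mask > 0, 255.0, 0.0)

    # Channel 3: flow magnitude, scaled by the per-frame maximum (an assumption).
    mag = np.hypot(flow[..., 0], flow[..., 1])
    peak = mag.max()
    mag = mag / peak * 255.0 if peak > 0 else np.zeros_like(mag)

    stacked = np.stack([gray, mask, mag], axis=-1)
    return np.clip(np.rint(stacked), 0, 255).astype(np.uint8)
```

Because the result has the same shape and dtype as a normal RGB frame, it can be fed to an unmodified SSD MobileNet V2 input pipeline. Scaling the magnitude per frame loses absolute speed information, so a fixed scale may be worth trying as well.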

alpha-carinae29 commented 4 years ago

Now I am trying to use this model as the teacher model. There is no inference code for testing this model on an arbitrary video file, so I am trying to implement this part.

undefined-references commented 4 years ago

Hi @alpha-carinae29, interesting idea. I've joined this effort and am trying to use the ResNeSt: Split-Attention Networks model as the teacher model.

alpha-carinae29 commented 4 years ago

Thank you @emma-w-dev. I am very happy that you find this topic interesting. Testing different teacher models is a big step and can boost the results significantly. Please share your experiences and updates here. Thanks

alpha-carinae29 commented 4 years ago

Hi everyone, I stopped working on Memory Enhanced Global-Local Aggregation for Video Object Detection as the teacher model, since its checkpoints were trained on the ImageNet VID dataset, which has no pedestrian class among its categories. So I tried the IterDet model, which has checkpoints for multiple pedestrian detection datasets. The result on Oxford Town Center was interesting: the model trained on the Crowd Human dataset runs at around 5 FPS and reaches 91.59 mAP on the first 4500 frames of Oxford Town Center. Now I want to train a student model using the predictions of this teacher model. Stay tuned :)

galliot-us / neuralet

Knowledge Distillation for Smart Distancing #108