andrewowens / multisensory

Code for the paper: Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
http://andrewowens.com/multisensory/
Apache License 2.0

Questions about the models #6

Open · orthosiphon opened this issue 6 years ago

orthosiphon commented 6 years ago

Great work! The idea is very interesting, and thank you for providing the code.

After running the script download_models.sh, I found that there are several pretrained models in the folder: cam, sep, and shift. I am a little confused about which model serves which purpose. For example, which is the model for:

Self-supervised audio-visual features: a pretrained 3D CNN that can be used for downstream tasks (e.g. action recognition, source separation).

Thank you.

andrewowens commented 6 years ago

Sorry that wasn't clear! You should use the "shift" model if you are using the pretrained network for downstream tasks (e.g. action recognition). Please let me know if you run into any issues with that (for reference, the "sep" model is for source separation, and "cam" is the class activation map model; we use it for localization).
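For anyone else landing here, a rough sketch of the checkpoint-to-task mapping (the directory names follow download_models.sh, and the shift paths are the ones shift_example.py uses; the dictionary itself is just illustrative, not code from the repo):

import shift_params  # module from this repo

# Illustrative map from task to pretrained checkpoint directory
# (directories created by download_models.sh).
MODEL_DIRS = {
    'downstream features': '../results/nets/shift/',  # self-supervised audio-visual features
    'source separation':   '../results/nets/sep/',    # on/off-screen source separation
    'localization (CAM)':  '../results/nets/cam/',    # higher-resolution class activation maps
}

# Loading the "shift" net for downstream tasks, as in shift_example.py:
pr = shift_params.shift_v1()
model_file = '../results/nets/shift/net.tf-650000'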

orthosiphon commented 6 years ago

Thank you for your quick response!

I have two other questions: 1. In shift_example.py,

# Example using the pretrained audio-visual net. This example predicts
# a class activation map (CAM) for an input video, then saves a
# visualization of the CAM.

pr = shift_params.shift_v1()
model_file = '../results/nets/shift/net.tf-650000'
gpu = None

# uncomment for higher-resolution CAM (like the ones in the paper)
# pr = shift_params.cam_v1()
# model_file = '../results/nets/cam/net.tf-675000'

Does this mean that the pretrained audio-visual features can localize the sound source out of the box, since we can switch between the two models?

2. About the make_net function in shift_net.py, I guess there is a normal net and also a 'resized' net, according to:

self.ims_ph = tf.placeholder(tf.uint8, [1, pr.sampled_frames, pr.crop_im_dim, pr.crop_im_dim, 3])
self.ims_resize_ph = tf.placeholder(tf.uint8, [1, pr.sampled_frames, None, None, 3])

predict_cam_resize() is used in shift_example.py, which corresponds to the second placeholder in the code above. As for the first placeholder, is it for feeding the net images that have already been randomly cropped?

Thank you! :smiley: :smiley:

andrewowens commented 6 years ago
  1. Yeah, you can compute sound localization using either net -- the "cam" net just has 2x the spatial resolution. In our paper, the sound localization comes from the network's attention map. To get the "cam" model, we fine-tuned the "shift" model after removing the spatial stride in the last layer of the net (i.e. it's just "shift" but with higher resolution in the final layer; see the first sketch below).

  2. Yeah, that instantiation of the network simply resizes the images to 224 x 224 before processing them. You could resize the images yourself, pass them into the first net, and get the same result (see the second sketch below).
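To make the stride change in (1) concrete, here is a hypothetical sketch; the layer name, filter count, and kernel size are illustrative, not the repo's actual values:

import tensorflow as tf  # TF 1.x style, matching the repo's tf.placeholder usage

def final_conv_block(x, high_res_cam=False):
    # "shift": spatial stride 2 in the last block halves the feature map.
    # "cam": the same weights fine-tuned with spatial stride 1, so the
    # final layer (and hence the CAM) has 2x the spatial resolution.
    s = 1 if high_res_cam else 2
    return tf.layers.conv3d(x, filters=512, kernel_size=3,
                            strides=(1, s, s), padding='same',
                            name='conv_final')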
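And for (2), a minimal sketch of doing the resize yourself and feeding the fixed-size placeholder instead of ims_resize_ph (cv2 is just one possible resizer, and the variable names here are assumptions):

import numpy as np
import cv2  # any image-resizing library works here

def to_fixed_size(ims, dim=224):
    # ims: uint8 video frames of shape (T, H, W, 3).
    # Resize each frame to dim x dim, then add a batch dimension so the
    # result matches ims_ph's shape: (1, T, dim, dim, 3).
    resized = np.stack([cv2.resize(im, (dim, dim)) for im in ims])
    return resized[None].astype(np.uint8)

# feed_dict = {net.ims_ph: to_fixed_size(frames, pr.crop_im_dim)}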

Hope that helps!