orthosiphon opened this issue 6 years ago
Sorry that wasn't clear! You should use the "shift" model if you are using the pretrained network for downstream tasks (e.g. action recognition). Please let me know if you run into any issues with that (for reference, the "sep" model is for source separation, and "cam" is the class activation map model; we use it for localization).
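For downstream use, restoring the "shift" checkpoint in TensorFlow 1.x looks roughly like the sketch below. This is only a minimal illustration: the checkpoint path is the one used in `shift_example.py`, the existence of a matching `.meta` graph file is an assumption, and in practice you would build the graph through the repo's own helpers (`shift_params.shift_v1()` and `shift_net`) as that script does.

```python
# Hypothetical sketch, not the repo's exact API: restore the pretrained
# "shift" model for downstream feature extraction (e.g. action recognition).
import tensorflow as tf  # TensorFlow 1.x

model_file = '../results/nets/shift/net.tf-650000'  # path from shift_example.py

with tf.Session() as sess:
    # Assumption: a net.tf-650000.meta graph definition sits next to the checkpoint.
    saver = tf.train.import_meta_graph(model_file + '.meta')
    saver.restore(sess, model_file)
    # ... feed your video/audio batch through the restored graph and use the
    # resulting features for your downstream classifier.
```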
Thank you for your quick response!
I have two other questions:
1. In `shift_example.py`:

   ```python
   # Example using the pretrained audio-visual net. This example predicts
   # a class activation map (CAM) for an input video, then saves a
   # visualization of the CAM.
   pr = shift_params.shift_v1()
   model_file = '../results/nets/shift/net.tf-650000'
   gpu = None
   # uncomment for higher-resolution CAM (like the ones in the paper)
   # pr = shift_params.cam_v1()
   # model_file = '../results/nets/cam/net.tf-675000'
   ```

   Does it mean that, from the start, the pretrained audio-visual features can detect sound-source locations, since we can switch between the two models?
2. About the `make_net` function in `shift_net.py`: I guess there are a "normal" net and a "resized" net, judging from:

   ```python
   self.ims_ph = tf.placeholder(tf.uint8, [1, pr.sampled_frames, pr.crop_im_dim, pr.crop_im_dim, 3])
   self.ims_resize_ph = tf.placeholder(tf.uint8, [1, pr.sampled_frames, None, None, 3])
   ```

   `predict_cam_resize()` is used in `shift_example.py`, which corresponds to the second line above. As for the first line, is it for feeding the net images after they have been randomly cropped?
Thank you! :smiley: :smiley:
Yeah, you can compute sound localization using either net -- the "cam" net is just 2x the spatial resolution. In our paper, the sound localization comes from the network's attention map. To get the "cam" model we fine-tuned the "shift" model after removing spatial stride in the last layer of the net (i.e. it's just "shift" but with a higher resolution in the final layer).
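To make the stride change concrete, here is a toy 2D sketch (made-up shapes, not the repo's actual layers, which operate on fused audio-visual features): removing the spatial stride in the final conv layer doubles the resolution of the output map.

```python
# Toy illustration only: the same conv layer with and without spatial stride.
import tensorflow as tf  # TensorFlow 1.x

feats = tf.placeholder(tf.float32, [1, 14, 14, 512])  # stand-in for the penultimate features
strided   = tf.layers.conv2d(feats, 128, 3, strides=2, padding='same')  # -> (1, 7, 7, 128)
unstrided = tf.layers.conv2d(feats, 128, 3, strides=1, padding='same')  # -> (1, 14, 14, 128)
print(strided.shape, unstrided.shape)
```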
Yeah, that instantiation of the network simply resizes the images to be 224 x 224 before processing them. You could resize the images yourself and pass them into the first net and get the same result.
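In case it helps, here is a minimal sketch of doing that resize yourself before feeding the fixed-size placeholder; `cv2` is just one way to resize, and the frame count and the `net.ims_ph` name are placeholders rather than the repo's API:

```python
# Hypothetical sketch: resize frames to the fixed 224 x 224 input size yourself,
# then feed the fixed-size placeholder instead of calling predict_cam_resize().
import numpy as np
import cv2  # any resize routine works here

def resize_frames(frames, dim=224):
    """frames: (num_frames, H, W, 3) uint8 -> (num_frames, dim, dim, 3) uint8."""
    return np.stack([cv2.resize(f, (dim, dim)) for f in frames]).astype(np.uint8)

raw_frames = np.zeros((64, 480, 640, 3), np.uint8)  # dummy clip; use pr.sampled_frames frames
ims = resize_frames(raw_frames)[None]               # add batch dim -> (1, 64, 224, 224, 3)
# the feed_dict would then use the fixed-size placeholder, e.g. {net.ims_ph: ims}
```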
Hope that helps!
Great work! The idea is very interesting, and thank you for providing the code.

After running the script `download_models.sh`, I found that there are several pretrained models in the folder: cam, sep, and shift. I am a little confused about which model is for which purpose. For example, which model should I use if I want to apply the pretrained network to a downstream task such as action recognition? Thank you.