andrewowens / multisensory

Code for the paper: Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
http://andrewowens.com/multisensory/
Apache License 2.0

Question about shift_net.py training #16

Open xiaoyiming opened 5 years ago

xiaoyiming commented 5 years ago

It's really nice work! However, I have some questions from reading shift_net.py, specifically this passage:

```python
ims = self.inputs[i]['ims']
samples_ex = self.inputs[i]['samples']
assert pr.both_examples
assert not pr.small_augment
# Draw a random 0/1 label for each example in the mini-batch.
labels = tf.random_uniform(
    [shape(ims, 0)], 0, 2, dtype = tf.int64, name = 'labels_sample')
# For each example, pick one of the two audio tracks according to the label;
# the second branch gets the complementary track and the complementary label.
samples0 = tf.where(tf.equal(labels, 1), samples_ex[:, 1], samples_ex[:, 0])
samples1 = tf.where(tf.equal(labels, 0), samples_ex[:, 1], samples_ex[:, 0])
labels1 = 1 - labels

net0 = make_net(ims, samples0, pr, reuse = reuse, train = self.is_training)
net1 = make_net(None, samples1, pr, im_net = net0.im_net, reuse = True, train = self.is_training)
labels = tf.concat([labels, labels1], 0)
```

My understanding is that samples_ex is the stereo audio, with shape batch_size × N × 2 (where N is the length of the audio signal). But why are the labels random variables? Shouldn't they be constant (0 meaning not synchronized, 1 meaning synchronized)? I'm looking forward to your reply.

andrewowens commented 5 years ago

Yes, it probably would have made more sense to make the labels constant. I did it this way so that each GPU's mini-batch would contain equal numbers of non-shifted and shifted examples, and so that every example appears twice (once as a shifted and once as a non-shifted example). I don't think this was necessary, though.
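
To see why the batch always comes out balanced, here is a minimal NumPy sketch of the same logic (not the repo's code; in particular, treating index 0 as the aligned track and index 1 as the shifted one is my assumption, made only for illustration):

```python
import numpy as np

batch = 4
# Stand-in for samples_ex: per clip, entry 0 marks the "aligned" track and
# entry 1 the "shifted" track (0.0/1.0 markers instead of real waveforms).
samples_ex = np.stack([np.zeros(batch), np.ones(batch)], axis=1)

labels = np.random.randint(0, 2, size=batch)  # random 0/1 label per clip
samples0 = np.where(labels == 1, samples_ex[:, 1], samples_ex[:, 0])
samples1 = np.where(labels == 0, samples_ex[:, 1], samples_ex[:, 0])
labels1 = 1 - labels

# Each clip feeds one branch with one track and the other branch with the
# complementary track and label, so the concatenated batch is always exactly
# half shifted and half non-shifted, and every clip appears twice.
all_labels = np.concatenate([labels, labels1])
assert all_labels.sum() == batch
```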

xiaoyiming commented 5 years ago

@andrewowens thanks for the reply! However, I have some other questions. In shift_dset.py there is:

```python
feats['im_0'] = tf.FixedLenFeature([], dtype=tf.string)
feats['im_1'] = tf.FixedLenFeature([], dtype=tf.string)
```

1. What is stored in 'im_0' and 'im_1'?
2. Is it the output of the tf.gfile.FastGFile function?
3. Does 'im_0' contain the first half of the video's frames and 'im_1' the second half?
4. If 3 is true, why divide a video into two parts?

I'm looking forward to your reply.
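
On the mechanics (not the contents, which only the authors can answer): a tf.FixedLenFeature([], dtype=tf.string) declares a single scalar bytes field per record, so whatever byte string was serialized into 'im_0'/'im_1' comes back as one string tensor. Here is a minimal TF1-style parsing sketch; the feature names come from the question above, while the decoding step (tf.decode_raw and the uint8 type) is only a guess for illustration:

```python
import tensorflow as tf  # TensorFlow 1.x, matching the code in this thread

def parse_record(serialized):
    feats = {
        'im_0': tf.FixedLenFeature([], dtype=tf.string),
        'im_1': tf.FixedLenFeature([], dtype=tf.string),
    }
    example = tf.parse_single_example(serialized, features=feats)
    # Each field is one scalar bytes string; how to decode it depends on how
    # the record was written. decode_raw below is just an assumption.
    im_0 = tf.decode_raw(example['im_0'], tf.uint8)
    im_1 = tf.decode_raw(example['im_1'], tf.uint8)
    return im_0, im_1
```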

ruizewang commented 5 years ago


I have the same questions. Looking forward to sample code for generating the shift dataset.
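
In case it helps while waiting, here is a rough sketch of how such TFRecords could be written. Everything in it beyond the 'im_0'/'im_1' names (the 'samples' feature, storing frames as raw bytes, the half/half frame split, and the shift itself) is an assumption made for illustration, not the repo's actual pipeline:

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def write_shift_example(writer, frames, audio, shift_samples=8000):
    # frames: uint8 array [T, H, W, 3]; audio: float32 waveform [N].
    # The half/half split into 'im_0'/'im_1' is a guess based on question 3.
    n = frames.shape[0] // 2
    shifted = np.roll(audio, shift_samples)  # crude stand-in for a misaligned track
    feature = {
        'im_0': _bytes_feature(frames[:n].tobytes()),
        'im_1': _bytes_feature(frames[n:].tobytes()),
        'samples': _bytes_feature(
            np.stack([audio, shifted]).astype(np.float32).tobytes()),
    }
    ex = tf.train.Example(features=tf.train.Features(feature=feature))
    writer.write(ex.SerializeToString())

with tf.python_io.TFRecordWriter('shift_train.tfrecords') as writer:
    frames = np.zeros((60, 128, 128, 3), np.uint8)     # dummy video frames
    audio = np.random.randn(88200).astype(np.float32)  # dummy 4 s of 22.05 kHz audio
    write_shift_example(writer, frames, audio)
```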