wincle closed this issue 6 years ago
Hi, I asked the engineers at Google whether they plan to release all the code to extract the video and audio features, and they told me they have no plans yet. However, they said we should be able to compute a very similar PCA matrix for the video features ourselves, since the CNN they used for the video features is the same ImageNet pre-trained Inception model from the TensorFlow website. For the audio features, since the pre-trained CNN model is not published, there is so far no hope of extracting them :(. In practice, since the audio features are not very important (you lose maybe 2% in GAP), you can retrain the models without the audio features and still get very good results.
Thanks! That's so kind of you! And I found this: https://github.com/tensorflow/models/tree/master/audioset — I think it may be the feature extractor for AudioSet.
Oh, thank you very much! I think they may have released it recently, yes; they were not sure they would. This is nice because they did not perform any PCA on the audio, so we don't need to retrieve a PCA matrix for the audio. Please tell me if you are brave enough to recompute the PCA matrix for the video features, so I can release an end-to-end pre-trained pipeline for video classification. I think this is something a lot of people might want to have!
I'm trying to recompute the PCA matrix, but I'm not sure it can be done. I'm afraid the real reason Google didn't release the matrix is that they don't want to fully share YouTube-8M, not that they forgot it or some other excuse. If my guess is right, the problem is too hard for me to solve.
I talked to the engineer in charge of this dataset, and they assured me we should definitely be able to reproduce this PCA matrix. They just did not release it because they "do not have time for it". It is indeed hard to believe, as it should not take any time to upload a matrix file, but whatever, this is what I was told :).
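For anyone attempting the recomputation: as I understand it from the YouTube-8M paper, the pipeline is to take the Inception features, subtract the mean, project onto the top 1024 principal components, and whiten before quantization. Here is a minimal NumPy sketch of that recipe; the function names `fit_pca`/`apply_pca` and the small dimensions are my own for illustration, and this is a guess at the procedure, not Google's actual code:

```python
import numpy as np

def fit_pca(features, dim=1024):
    """Fit a whitening PCA on a [num_samples, feat_dim] matrix of
    Inception features. Returns the mean, the top-`dim` eigenvectors,
    and their eigenvalues -- together these play the role of the
    unreleased PCA matrix."""
    mean = features.mean(axis=0)
    centered = features - mean
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:dim]     # keep top `dim` components
    return mean, eigvecs[:, order], eigvals[order]

def apply_pca(feat, mean, eigvecs, eigvals, eps=1e-8):
    """Project features onto the PCA basis and whiten them."""
    projected = (feat - mean) @ eigvecs
    return projected / np.sqrt(eigvals + eps)
```

To match Google's matrix you would have to fit this on a very large, representative sample of frame-level Inception features from the dataset, which is the expensive part.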
Thanks for the information. In any case, I will try to do it.
It seems someone took the time to reconstruct this PCA matrix here: https://github.com/LittleWat/video_labelling_using_youtube8m
I am not sure if it is similar to Google's PCA matrix but you might want to take a look at it. (I found the repo by reading this thread: https://groups.google.com/forum/#!searchin/youtube8m-users/pca%7Csort:relevance/youtube8m-users/GW1HsNe4zGA/Bn2V4gHSBAAJ)
I tried that repo and tested some videos with its code; the results do not seem very good. When I use the PCA matrix from that repo with the model I trained with your code (all training and validation data, 300,000 steps), the results are strange, which shouldn't happen, because my model is more complex than his. I simply concatenate the 1024-dimensional visual features and the 128-dimensional audio features and then Dequantize them — is that right? The code is shown below; vfeature is the visual feature [num_frames, 1024] and afeature is the audio feature [num_frames, 128], extracted with Inception V3 and the AudioSet model. The visual features should be correct because I extract them with the same method as that repo:
```python
import pprint

import numpy as np
import pandas as pd
import tensorflow as tf

from utils import Dequantize  # from the YouTube-8M starter code


def print_predicted_label(vfeature, afeature, topn=10,
                          id2label_csv='./videoModel/label_names.csv'):
    # vfeature: visual features [num_frames, 1024]
    # afeature: audio features [num_frames, 128]
    id2label_ser = pd.read_csv(id2label_csv, index_col=0)
    id2label = id2label_ser.to_dict()['label_name']

    with tf.Session() as sess:
        meta_graph_location = "./videoModel/model.ckpt-265840.meta"
        latest_checkpoint = "./videoModel/model.ckpt-265840"
        saver = tf.train.import_meta_graph(meta_graph_location, clear_devices=True)
        saver.restore(sess, latest_checkpoint)

        input_tensor = tf.get_collection("input_batch_raw")[0]
        num_frames_tensor = tf.get_collection("num_frames")[0]
        predictions_tensor = tf.get_collection("predictions")[0]

        # pad to 300 frames; concatenate 1024 visual dims + 128 audio dims
        padded_feature = np.zeros([300, 1152])
        padded_feature[:vfeature.shape[0], :1024] = Dequantize(vfeature)
        padded_feature[:afeature.shape[0], 1024:1152] = Dequantize(afeature)

        video_batch_val = padded_feature[np.newaxis, :, :].astype(np.float32)
        num_frames_batch_val = np.array([vfeature.shape[0]], dtype=np.int32)

        predictions_val, = sess.run(
            [predictions_tensor],
            feed_dict={input_tensor: video_batch_val,
                       num_frames_tensor: num_frames_batch_val})
        predictions_val = predictions_val.flatten()
        top_idxes = np.argsort(predictions_val)[::-1][:topn]
        pprint.pprint([(id2label[x], predictions_val[x]) for x in top_idxes])
```
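For reference, the `Dequantize` used above is, if I recall the YouTube-8M starter code (utils.py) correctly, just a linear rescaling of the stored 8-bit feature values back into roughly [-2, 2]:

```python
import numpy as np

def Dequantize(feat_vector, max_quantized_value=2, min_quantized_value=-2):
    """Map 8-bit quantized feature values (0..255) back to floats in
    roughly [min_quantized_value, max_quantized_value]."""
    assert max_quantized_value > min_quantized_value
    quantized_range = max_quantized_value - min_quantized_value
    scalar = quantized_range / 255.0
    bias = (quantized_range / 512.0) + min_quantized_value
    return feat_vector * scalar + bias
```

Note that it only applies if your extracted features were actually quantized the same way; features computed directly from Inception V3 are already floats and should not be dequantized.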
Would it be possible to find the PCA matrix ourselves? Thanks!