antoine77340 / Youtube-8M-WILLOW

Kaggle Youtube 8M WILLOW approach
Apache License 2.0
465 stars · 165 forks

Have you found the PCA matrix used in YouTube-8M? #3

Closed wincle closed 6 years ago

wincle commented 7 years ago

Would it be possible to find the PCA matrix ourselves? Thx!

antoine77340 commented 7 years ago

Hi, I asked the engineers at Google whether they plan to release the code to extract the video and audio features, and they told me they have no plans yet. However, they said we should be able to compute a very similar PCA matrix for the video features ourselves, since the CNN they used for the video features is the same ImageNet pre-trained Inception model from the TensorFlow website. For the audio features, the pre-trained CNN model is not published, so for now there is no hope of extracting them :(. In practice, since the audio features are not very important (you lose maybe 2% in GAP), you can retrain the models without the audio features and still get very good results.
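(For anyone attempting this: below is a minimal sketch of what recomputing the video PCA might look like, assuming you have already extracted the 2048-D Inception bottleneck features at 1 frame per second for a large set of videos. The file names and the use of scikit-learn are illustrative, not Google's actual pipeline.)

    import numpy as np
    from sklearn.decomposition import PCA

    # frames.npy: [num_frames_total, 2048] Inception bottleneck features
    # pooled over many videos (hypothetical file name).
    frames = np.load('frames.npy')

    # The YouTube-8M paper describes PCA with whitening down to 1024 dims.
    # For data that does not fit in memory, sklearn's IncrementalPCA works too.
    pca = PCA(n_components=1024, whiten=True)
    pca.fit(frames)

    # Keep the pieces needed to project new frame features at inference time.
    np.save('pca_mean.npy', pca.mean_)              # [2048]
    np.save('pca_components.npy', pca.components_)  # [1024, 2048]

    # Projecting a new frame feature into the 1024-D space:
    projected = pca.transform(frames[:1])           # [1, 1024]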

wincle commented 7 years ago

Thanks! It's so kind of you! And I found this: https://github.com/tensorflow/models/tree/master/audioset , I think it may be the feature extractor for AudioSet.
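If it is, extracting embeddings should look roughly like the inference demo in that repo, something along these lines (the wav path is just an example; the checkpoint files are the ones distributed with the repo):

    import tensorflow as tf
    import vggish_input
    import vggish_params
    import vggish_postprocess
    import vggish_slim

    # Convert a wav file into the log-mel spectrogram patches VGGish expects.
    examples = vggish_input.wavfile_to_examples('audio.wav')

    with tf.Graph().as_default(), tf.Session() as sess:
        vggish_slim.define_vggish_slim(training=False)
        vggish_slim.load_vggish_slim_checkpoint(sess, 'vggish_model.ckpt')
        features_tensor = sess.graph.get_tensor_by_name(vggish_params.INPUT_TENSOR_NAME)
        embedding_tensor = sess.graph.get_tensor_by_name(vggish_params.OUTPUT_TENSOR_NAME)
        [embedding] = sess.run([embedding_tensor],
                               feed_dict={features_tensor: examples})

    # Post-process (PCA + quantization) to match the released 128-D features.
    pproc = vggish_postprocess.Postprocessor('vggish_pca_params.npz')
    audio_features = pproc.postprocess(embedding)  # [num_examples, 128], uint8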

antoine77340 commented 7 years ago

Oh, thank you very much! I think they may have released it recently, yes; they were not sure they would. This is nice because they did not perform any PCA on the audio, so we don't need to retrieve a PCA matrix for the audio. Please tell me if you are brave enough to recompute the PCA matrix for the video features, so I can release an end-to-end pre-trained pipeline for video classification. I think this is something a lot of people might want to have!

wincle commented 7 years ago

I'm trying to recompute the PCA matrix, but I'm not sure it can be done. I'm afraid the real reason Google didn't release the matrix is that they don't want to fully share YouTube-8M, not that they forgot it or some other excuse. If my guess is right, the problem is too hard for me to solve.

antoine77340 commented 7 years ago

I talked to the engineer in charge of this dataset, and they assured me we should definitely be able to reproduce this PCA matrix. They just did not release it because they "do not have time for it". That is indeed hard to believe, since it should not take any time to upload a matrix file, but whatever, this is what I was told :).

wincle commented 7 years ago

Thanks for the information. In any case, I will try to do it.

antoine77340 commented 7 years ago

It seems someone took the time to reconstruct this PCA matrix here: https://github.com/LittleWat/video_labelling_using_youtube8m

I am not sure if it is similar to Google's PCA matrix but you might want to take a look at it. (I found the repo by reading this thread: https://groups.google.com/forum/#!searchin/youtube8m-users/pca%7Csort:relevance/youtube8m-users/GW1HsNe4zGA/Bn2V4gHSBAAJ)

wincle commented 7 years ago

I tried that repo and tested some videos with its code; the results don't seem very good. Then I used that PCA matrix with a model trained with your code (on all the training and validation data, for 300,000 steps), and the results are so strange that I think something must be wrong, because my model is more complex than the one in that repo. I simply concatenate the 1024-dimensional visual features and the 128-dimensional audio features, then Dequantize them. Am I right? The code is shown below: vfeature is the visual feature [num_frames, 1024] and afeature is the audio feature [num_frames, 128], extracted with Inception V3 and the AudioSet model. The visual features should be correct because I extract them with the same method as that repo:

    import pprint

    import numpy as np
    import pandas as pd
    import tensorflow as tf

    from utils import Dequantize  # from the YouTube-8M starter code


    def print_predicted_label(vfeature, afeature, topn=10,
                              id2label_csv='./videoModel/label_names.csv'):
        # Map label ids to human-readable names.
        id2label_ser = pd.read_csv(id2label_csv, index_col=0)
        id2label = id2label_ser.to_dict()['label_name']

        with tf.Session() as sess:
            meta_graph_location = "./videoModel/model.ckpt-265840.meta"
            latest_checkpoint = "./videoModel/model.ckpt-265840"

            # Restore the trained graph and pull out its input/output tensors.
            saver = tf.train.import_meta_graph(meta_graph_location, clear_devices=True)
            saver.restore(sess, latest_checkpoint)
            input_tensor = tf.get_collection("input_batch_raw")[0]
            num_frames_tensor = tf.get_collection("num_frames")[0]
            predictions_tensor = tf.get_collection("predictions")[0]

            # Pad to 300 frames; concatenate 1024-D visual and 128-D audio features.
            padded_feature = np.zeros([300, 1152])
            padded_feature[:vfeature.shape[0], :1024] = Dequantize(vfeature)
            padded_feature[:afeature.shape[0], 1024:1152] = Dequantize(afeature)
            video_batch_val = padded_feature[np.newaxis, :, :].astype(np.float32)
            num_frames_batch_val = np.array([vfeature.shape[0]], dtype=np.int32)

            predictions_val, = sess.run([predictions_tensor],
                                        feed_dict={input_tensor: video_batch_val,
                                                   num_frames_tensor: num_frames_batch_val})
            predictions_val = predictions_val.flatten()

            # Print the top-n predicted labels with their scores.
            top_idxes = np.argsort(predictions_val)[::-1][:topn]
            pprint.pprint([(id2label[x], predictions_val[x]) for x in top_idxes])
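For reference, the Dequantize helper imported above is the one from the YouTube-8M starter code, which maps the 8-bit quantized features back to floats in [-2, 2]:

    def Dequantize(feat_vector, max_quantized_value=2, min_quantized_value=-2):
        """Dequantize a feature from the byte format to the float format."""
        assert max_quantized_value > min_quantized_value
        quantized_range = max_quantized_value - min_quantized_value
        scalar = quantized_range / 255.0
        bias = (quantized_range / 512.0) + min_quantized_value
        return feat_vector * scalar + bias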