How to merge video and audio features: as I understand it, there are several ways:
1) Extract features from the original files, merge them directly, then feed the result to a deep neural network.
2) After extracting features from the original files, use a deep neural network on each to extract deep features, then combine those and feed them to another deep neural network or a machine learning algorithm for classification.
3) As Enis recommended, classify them separately, then choose the result with higher accuracy.
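To make the three options concrete, here is a minimal NumPy sketch under several assumptions: each clip already has a fixed-length video feature vector and audio feature vector (shapes `(n, dv)` and `(n, da)`), the "deep network" in option 2 is replaced by an untrained random linear layer plus ReLU as a stand-in, and the classifier in option 3 is a hypothetical tiny nearest-centroid model used only so the example runs. All function names and the toy data are illustrative, not from any library or from our project.

```python
import numpy as np

rng = np.random.default_rng(0)

def early_fusion(video_feats, audio_feats):
    """Option 1: concatenate the raw per-clip features into one vector per clip."""
    return np.concatenate([video_feats, audio_feats], axis=1)

def encode(feats, weights):
    """Stand-in for a per-modality deep network (assumption): one untrained linear layer + ReLU."""
    return np.maximum(feats @ weights, 0.0)

def intermediate_fusion(video_feats, audio_feats, w_video, w_audio):
    """Option 2: encode each modality separately, then concatenate the deep features."""
    return np.concatenate([encode(video_feats, w_video), encode(audio_feats, w_audio)], axis=1)

class NearestCentroid:
    """Hypothetical tiny classifier, only here so the sketch is runnable."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Squared Euclidean distance from every sample to every class centroid.
        dists = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        return self.classes_[dists.argmin(axis=1)]

def accuracy(clf, X, y):
    return float((clf.predict(X) == y).mean())

def late_fusion_pick_best(video_tr, audio_tr, y_tr, video_val, audio_val, y_val):
    """Option 3: one classifier per modality; keep the one with higher validation accuracy."""
    clf_v = NearestCentroid().fit(video_tr, y_tr)
    clf_a = NearestCentroid().fit(audio_tr, y_tr)
    acc_v = accuracy(clf_v, video_val, y_val)
    acc_a = accuracy(clf_a, audio_val, y_val)
    return ("video", clf_v, acc_v) if acc_v >= acc_a else ("audio", clf_a, acc_a)

# Toy data (made up): 200 clips, 64-dim video features, 20-dim audio features, 2 classes.
n, dv, da = 200, 64, 20
y = rng.integers(0, 2, size=n)
video = rng.normal(size=(n, dv)) + y[:, None] * 0.5
audio = rng.normal(size=(n, da)) + y[:, None] * 1.0
tr, val = slice(0, 150), slice(150, None)

X_early = early_fusion(video, audio)
print(X_early.shape)  # (200, 84)

w_v = rng.normal(size=(dv, 16)) / np.sqrt(dv)
w_a = rng.normal(size=(da, 16)) / np.sqrt(da)
X_mid = intermediate_fusion(video, audio, w_v, w_a)
print(X_mid.shape)  # (200, 32)

name, best_clf, best_acc = late_fusion_pick_best(
    video[tr], audio[tr], y[tr], video[val], audio[val], y[val]
)
print(name, best_acc)
```

The fused matrices from options 1 and 2 would then go into whatever classifier we train; the sketch only shows where the merge happens in each pipeline. In a real version of option 2 the two encoders would be trained networks rather than random projections.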
From what we learned in class, I prefer 2), or we could try them separately and compare the results.
What do you guys think?