How to merge video and audio features?

As I understand it, there are several ways to merge video and audio features:

1) Extract features from the original files, merge them directly, then feed the result into a deep neural network.
2) Extract features from the original files, use a deep neural network to extract deep features from each, then combine those and feed them into another deep neural network or a machine learning algorithm for classification.
3) As Enis recommended, classify the two separately, then pick the result with the higher accuracy.

From what we learned in class, I prefer 2), or we could try the approaches separately and compare the results. What do you all think?
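To make the options concrete, here is a rough sketch of the fusion step only, not a full pipeline. The function names and toy numbers are made up for illustration. Options 1 and 2 both end in a concatenation (of raw features or of deep features from the per-modality networks), so they share one helper. For option 3, I am assuming "choose the higher accuracy result" can be approximated per sample by taking the more confident classifier, and I also show plain averaging as a common alternative. In practice this would be done on NumPy or PyTorch arrays rather than Python lists.

```python
from typing import List, Sequence

Vector = List[float]


def fuse_features(audio: Sequence[float], video: Sequence[float]) -> Vector:
    """Options 1 and 2: concatenate the audio and video vectors for one sample.

    For option 1 the inputs are hand-crafted features; for option 2 they would
    be the embeddings produced by the per-modality networks.
    """
    return list(audio) + list(video)


def average_decisions(p_audio: Sequence[float], p_video: Sequence[float],
                      w_audio: float = 0.5) -> Vector:
    """Option 3 variant: weighted average of the two class-probability vectors."""
    if len(p_audio) != len(p_video):
        raise ValueError("both classifiers must predict the same set of classes")
    return [w_audio * a + (1.0 - w_audio) * v for a, v in zip(p_audio, p_video)]


def pick_more_confident(p_audio: Sequence[float], p_video: Sequence[float]) -> Vector:
    """Option 3 variant (assumption): keep the output of the more confident classifier."""
    return list(p_audio) if max(p_audio) >= max(p_video) else list(p_video)


def argmax(probs: Sequence[float]) -> int:
    """Index of the highest-probability class."""
    return max(range(len(probs)), key=lambda i: probs[i])


# Toy values, purely illustrative.
audio_feat = [0.1, 0.2, 0.3]   # e.g. 3 audio features
video_feat = [0.9, 0.8]        # e.g. 2 video features
fused = fuse_features(audio_feat, video_feat)   # 5-dimensional joint vector

p_audio = [0.6, 0.3, 0.1]      # audio classifier output over 3 classes
p_video = [0.2, 0.7, 0.1]      # video classifier output over 3 classes
avg = average_decisions(p_audio, p_video)
chosen = pick_more_confident(p_audio, p_video)

print(len(fused), argmax(avg), argmax(chosen))
```

With this split, trying 1) versus 2) only changes what gets passed into `fuse_features`, while 3) needs the two classifiers trained independently first, so comparing all three on the same train/test split should be straightforward.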

Yunhua468 / Audio-Visual-Emotion-and-Sentiment-Research