dmsfabiano opened this issue 6 years ago
There are basically two main approaches people use for this. The first is to hand-craft features, as Diego suggested; usually you extract features from the images and from the audio separately. The second is to project both the image and the audio signal into a new subspace and do the fusion there. For example, this can be done with PCA, although PCA might not be the best representation of the data. When thinking about this, ask what the important parts of each modality are for emotion (e.g., the eyes and mouth are important in the face; what is important in the audio?). Here is a link with some information about audio and emotion: http://www.scholarpedia.org/article/Speech_emotion_analysis
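A minimal sketch of the subspace idea, assuming we already have one feature vector per clip for each modality (the array names, sizes, and the choice of 10 components below are placeholders, not a recommendation): project each modality with PCA and fuse by concatenating the projections.

import numpy as np
from sklearn.decomposition import PCA

# Placeholder feature matrices: one row per clip.
# face_feats could be flattened landmark coordinates; audio_feats could be
# prosodic/spectral descriptors per clip.
rng = np.random.default_rng(0)
face_feats = rng.normal(size=(200, 136))   # e.g. 68 landmarks * (x, y)
audio_feats = rng.normal(size=(200, 40))   # e.g. 40 audio descriptors

# Project each modality into a lower-dimensional subspace.
face_pca = PCA(n_components=10).fit(face_feats)
audio_pca = PCA(n_components=10).fit(audio_feats)

# Fuse in the reduced space by concatenation; this fused matrix would then
# feed whatever classifier we pick.
fused = np.hstack([face_pca.transform(face_feats),
                   audio_pca.transform(audio_feats)])
print(fused.shape)  # (200, 20)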
import dlib
import numpy as np
from skimage import io

predictor_path = "shape_predictor_68_face_landmarks.dat"

# load dlib's frontal face detector and the 68-point landmark predictor
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor(predictor_path)

img = io.imread("FDT.jpg")

# detect faces; each detection is a rectangle
dets = detector(img)

# output the landmark points inside each rectangle;
# shape.part(i) is a dlib.point (http://dlib.net/python/#dlib.point)
for k, d in enumerate(dets):
    shape = predictor(img, d)
    vec = np.empty([68, 2], dtype=int)
    for b in range(68):
        vec[b][0] = shape.part(b).x
        vec[b][1] = shape.part(b).y
    print(vec)
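As one possible metric from those landmarks (just a sketch, not necessarily the measure we would settle on), we could take a few distances between landmark pairs such as mouth width, mouth opening, and eye opening, normalized by the inter-ocular distance so they are scale-invariant. This continues from the snippet above (np and vec already defined); the indices follow the standard dlib 68-point layout.

def landmark_features(vec):
    # vec: 68x2 array of (x, y) landmark coordinates for one face
    vec = vec.astype(float)
    inter_ocular   = np.linalg.norm(vec[36] - vec[45])  # outer eye corners
    mouth_width    = np.linalg.norm(vec[48] - vec[54])  # mouth corners
    mouth_open     = np.linalg.norm(vec[62] - vec[66])  # inner lip, top vs. bottom
    left_eye_open  = np.linalg.norm(vec[37] - vec[41])  # upper vs. lower left eyelid
    right_eye_open = np.linalg.norm(vec[43] - vec[47])  # upper vs. lower right eyelid
    return np.array([mouth_width, mouth_open,
                     left_eye_open, right_eye_open]) / inter_ocular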
I've been giving some thought to how to fuse the images with the audio. I am not sure yet how we would fuse the images directly with the audio.
However, if we can compute some kind of metric from the images (i.e., face landmarks), we can use a linear math fusion technique that I developed last semester. It has shown really good results, but we would have to see how it works with this data.
Please share thoughts on this and/or other fusion techniques.
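To keep the discussion concrete, here is one generic baseline we could compare against (this is not the technique mentioned above, and the inputs are placeholders): weighted decision-level fusion, where each modality gets its own classifier and the per-class scores are combined with a weighted sum.

import numpy as np

def late_fusion(face_scores, audio_scores, alpha=0.5):
    # face_scores, audio_scores: arrays of shape (n_samples, n_classes),
    # e.g. class probabilities from separate face and audio classifiers.
    # alpha weights the face modality; (1 - alpha) weights the audio.
    fused = alpha * face_scores + (1.0 - alpha) * audio_scores
    return fused.argmax(axis=1)  # predicted emotion class per sample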