face-analysis / emonet

Official implementation of the paper "Estimation of continuous valence and arousal levels from faces in naturalistic conditions", Antoine Toisoul, Jean Kossaifi, Adrian Bulat, Georgios Tzimiropoulos and Maja Pantic, Nature Machine Intelligence, 2021
https://www.nature.com/articles/s42256-020-00280-0

Preprocessing of input data #4

Status: Open. Girish-03 opened this issue 3 years ago.

Girish-03 commented 3 years ago

Hi,

The work is really amazing and the results seem astonishing.

I am a student and am trying to use this code for one of my research projects. I would like to know whether there is a specific preprocessing technique to apply before feeding images to the network. Currently, I detect faces in video frames using the OpenCV Caffe-model DNN face detector, crop them, resize them to 256x256, and feed them to the network. However, the valence and arousal values, along with the categorical emotion, do not match for many frames. I assume I may be missing some preprocessing of the input frames required by the EmoNet model, and I would also like to know whether a specific technique should be used for detecting and cropping the faces. Therefore, I am requesting your guidance here. I performed the estimation and visualization on the same video provided in the paper so I could compare against your results, but they are not the same. Below are links to the video with the original results (valence/arousal bars and categorical emotions) and the results from my preprocessing (as explained above).

(The green vertical and blue horizontal bars, with the emotion in red text, are my results.)
Using the 5-class model: https://drive.google.com/file/d/1--GW_J3XUDNbo59YOTbLJ-VPWS4-2oey/view?usp=sharing
Using the 8-class model: https://drive.google.com/file/d/1jJ9Ah7rcoN3aVkLYPq8cDajdRTnwsamU/view?usp=sharing
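For reference, a rough sketch of the face detection and cropping I am doing (the detector file names and the confidence threshold are placeholders from my local setup):

import cv2
import numpy as np

# OpenCV's Caffe-based SSD face detector; the prototxt/caffemodel paths are placeholders.
detector = cv2.dnn.readNetFromCaffe("deploy.prototxt", "res10_300x300_ssd_iter_140000.caffemodel")

def crop_face(frame, conf_threshold=0.5):
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)), 1.0, (300, 300), (104.0, 177.0, 123.0))
    detector.setInput(blob)
    detections = detector.forward()  # shape (1, 1, N, 7)
    best = None
    for i in range(detections.shape[2]):
        confidence = float(detections[0, 0, i, 2])
        if confidence > conf_threshold and (best is None or confidence > best[0]):
            box = (detections[0, 0, i, 3:7] * np.array([w, h, w, h])).astype(int)
            best = (confidence, box)
    if best is None:
        return None
    x1, y1, x2, y2 = best[1]
    face = frame[max(y1, 0):y2, max(x1, 0):x2]  # note: frames read with cv2 are BGR
    return cv2.resize(face, (256, 256))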

antoinetlc commented 3 years ago

Hello,

Thank you for your interest. It is hard to say what is wrong without seeing the code. Are the values of the input image in the range [0;1]? Otherwise, I would advise you to have a look at the dataloader we provide for the AffectNet dataset, and in particular these lines: https://github.com/face-analysis/emonet/blob/master/emonet/data/affecnet.py#L122#L131 This is where we apply the transformations to the cropped images obtained from a face detector.

You can also look at these lines in the test.py file: https://github.com/face-analysis/emonet/blob/master/test.py#L35#L51 This is where the transformations are created and passed to the dataloaders.
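In short, on an already cropped face the preprocessing amounts to a resize to 256x256 and a conversion to a float tensor in [0;1] (no mean/std normalization). A rough single-image sketch of that (the image path is a placeholder and the loading library is up to you, as long as the result is RGB):

from PIL import Image
from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),   # HWC uint8 in [0, 255] -> CHW float in [0, 1]
])

face = Image.open('cropped_face.jpg').convert('RGB')   # placeholder path; must be RGB
tensor = transform(face).unsqueeze(0)                   # add a batch dimension -> (1, 3, 256, 256)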

Hope this helps

AhmadAsnaashari commented 3 years ago


Hello @Girish-03, like you, I got different results compared to the demo. Has your problem been solved?

antoinetlc commented 3 years ago

Hello, sorry for the delay in answering. We do not do any specific preprocessing apart from what is done in the DataAugmentor class: https://github.com/face-analysis/emonet/blob/master/emonet/data_augmentation.py#L47

One issue I can think of is that OpenCV loads images in BGR format, whereas our network was trained on RGB input (we load images using skimage - see the AffectNet dataloader's __getitem__ function: https://github.com/face-analysis/emonet/blob/master/emonet/data/affecnet.py#L120). Maybe this is the issue...
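If you are reading frames with OpenCV, converting them to RGB before the tensor conversion should be enough; roughly (the variable names are just for illustration):

import cv2

frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)  # frame_bgr as returned by cv2.imread / cv2.VideoCapture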

Hope this helps!

uf644 commented 3 years ago

Hello, I wonder what the "4-dimensional input" exactly is. I followed the DataAugmentor but only got a 3-dimensional input. This is the error: "RuntimeError: Expected 4-dimensional input for 4-dimensional weight [64, 3, 7, 7], but got 3-dimensional input of size [3, 256, 256] instead"
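I suspect the missing dimension is the batch dimension, so presumably something like this is needed before the forward pass (the names 'image' and 'model' are just for illustration):

# 'image' is my (3, 256, 256) tensor produced by the preprocessing
image = image.unsqueeze(0)   # -> (1, 3, 256, 256): the 4-dimensional (batch, channels, H, W) input the conv weights expect
output = model(image)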

mdabbah commented 1 year ago

I'm also having an issue validating the network's predictions on stock images. I suspect the problem lies in the normalization and data preparation.

I've tried many variations, including:

flipping the channels from RGB to BGR
cropping the image to include only the face using an off-the-shelf face detector (I validated that the cropped image only includes my face)
normalizing the input array
always resizing the image to 256x256

None of the above variations worked, and the network still predicts the wrong emotion, valence and arousal (the target is a happy expression, which should give high positive valence and positive arousal).

In the repository code there is no input normalization, only a resize transform.

Could you please point me to the correct data preparation steps?

Thanks

kdoodoo commented 10 months ago
import numpy as np
import torch
from PIL import Image
from torchvision import transforms

# Resize to the 256x256 input expected by the network; ToTensor scales pixels to [0, 1].
image_transforms = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
])

def classify1(model, image_transforms, image_path):
    # Load the image as RGB (the network was trained on RGB input).
    image = Image.open(image_path).convert('RGB')
    image = image_transforms(image)
    image = image.unsqueeze(0)  # add the batch dimension
    image = image.cuda()
    with torch.no_grad():
        output = model(image)
    expression = output['expression'][0, :].tolist()
    print(image_path, ',', expression, ',', int(np.argmax(expression)), ',',
          output['arousal'].tolist(), ',', output['valence'].tolist())

This is how I got my results.
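For completeness, I load the pretrained network roughly like this before calling classify1 (the weight path and the EmoNet constructor follow the repository's test.py; the paths and 'my_face.jpg' are placeholders for my setup):

from pathlib import Path
import torch
from emonet.models import EmoNet

n_expression = 8   # or 5, depending on which pretrained weights you use
device = 'cuda:0'

# Pretrained weights shipped with the repository (pretrained/emonet_{5,8}.pth).
state_dict_path = Path('pretrained') / f'emonet_{n_expression}.pth'
state_dict = torch.load(str(state_dict_path), map_location='cpu')
state_dict = {k.replace('module.', ''): v for k, v in state_dict.items()}

model = EmoNet(n_expression=n_expression).to(device)
model.load_state_dict(state_dict, strict=False)
model.eval()

classify1(model, image_transforms, 'my_face.jpg')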

SuperRuarua commented 3 months ago

nice!
