Justin1904 / TensorFusionNetworks

PyTorch implementation of Tensor Fusion Networks for multimodal sentiment analysis.

calculating features #2

Closed amirim closed 6 years ago

amirim commented 6 years ago

Hi,

Let's say I'd like to compute everything from scratch for a given video (a test sample). Can you post the code that calculates the tri-modal features: visual (OpenFace?), audio (pyAudioAnalysis / COVAREP?) and language? And how do you extract utterances from the audio?

Justin1904 commented 6 years ago

Hi, this implementation of TFN is based on the off-the-shelf features offered by the CMU Multimodal Data SDK, so there's actually no code here for feature extraction from raw videos. Previously we processed the data for the MOSI and MOSEI datasets using toolkits like COVAREP, FACET and P2FA. In the near future we are planning to clean up and release the raw videos of the MOSEI dataset, and then you can use whatever feature extraction toolkits you like to extract your own features from them.

amirim commented 6 years ago

Hi Justin, many thanks for your quick reply. Can you please describe in more detail the library names, the feature extraction procedure (especially for language), and the number of features for each modality, so that I can walk through the whole process from scratch for a new video example? Let's assume the utterance is already segmented, so the remaining steps are feature extraction, the alignment you mentioned, and training. Have I forgotten anything? According to the paper, OpenFace + FACET is used, but according to your code only FACET is used (the CSV file contains 48 columns). The embedding + words file contains 255 columns; which library generates the embeddings? P2FA, which you mentioned, is for alignment, right? In which part of the processing do you need the alignment? The audio is the simplest, with 74 features from COVAREP. Also, what accuracy did you achieve after training? Have you published your model anywhere?

Many thanks in advance for answering my (long) comment.

Justin1904 commented 6 years ago

Given that you've asked quite a few questions, my answer might also be a bit long.

For the language modality we merely used pretrained GloVe embeddings without doing any additional feature extraction. Since I was not involved in the feature extraction for these datasets, I don't actually know the exact procedures. Roughly, the raw data contains segmented videos, their corresponding audio and raw transcripts; first P2FA is applied to generate time-stamped transcripts, whose words are then substituted with embeddings (just look up the GloVe vector for each word). Then toolkits like FACET and OpenFace are applied to the raw videos to generate visual features, and COVAREP and openSMILE are applied to the raw audio to generate acoustic ones.
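If it helps, here's a rough sketch of what the word-to-embedding substitution step might look like (this is not our actual preprocessing code; the GloVe file path and the timestamps below are made up for illustration):

```python
import numpy as np

def load_glove(path, dim=300):
    """Load a plain-text GloVe file into a {word: vector} dict."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, values = parts[0], parts[1:]
            if len(values) == dim:  # skip malformed lines
                embeddings[word] = np.asarray(values, dtype=np.float32)
    return embeddings

glove = load_glove("glove.840B.300d.txt")  # example path

# P2FA-style output: (word, start_time, end_time) for each word in a segment
timed_words = [("and", 0.00, 0.21), ("i", 0.21, 0.35),
               ("liked", 0.35, 0.80), ("it", 0.80, 1.02)]

# Substitute each word with its GloVe vector (zeros for out-of-vocabulary words)
word_vectors = np.stack([glove.get(w, np.zeros(300, dtype=np.float32))
                         for w, _, _ in timed_words])
print(word_vectors.shape)  # (4, 300)
```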

A side note: the "alignment" I mentioned (did I even mention it? did you see it somewhere in my code or elsewhere?) refers not to the use of P2FA but to the alignment of multimodal features. You can imagine that after all the feature extraction mentioned above, there will be a bunch of different types of features, all at different temporal frequencies, which makes them hard to use together. This is where the CMU Multimodal Data SDK steps in: if you put your features in the format required by the SDK, you can use it to perform temporal alignment of the different features. However, we currently haven't documented that part of the SDK's functionality, and if you look at the current SDK repo you can only find a guide on how to use the pre-defined features. Sorry about that.

As for this repo, no, this is not an exact replica of the paper in terms of the features used. I only used FACET, COVAREP and GloVe vectors in training. The "words" are not actually used; they're just one-hot representations, which are almost useless when you already have the embeddings. You mentioned that the embedding + words file contains 255 columns; what do you mean by that? The embeddings are supposed to be 300-dimensional.

The accuracy I got with the default hyperparameter settings in this repo is around 70% on CMU-MOSI, but I managed to reach around 74% with different hyperparameters previously (sorry, I didn't record them though). It will be very hard to replicate the exact numbers reported in the TFN paper because they used a different train/valid/test split than the current standard.

amirim commented 6 years ago

The following answers assume that the CMU SDK uses this data.

> You mentioned that the embedding + words file contains 255 columns; what do you mean by that? The embeddings are supposed to be 300-dimensional.

Look at the CSV files in the MOSI directories under: processed->Transcript->embeddings

> I only used FACET

Look at the CSV files under processed->Video->FACET; only the full-video files are available. Did you use those or the segmented (utterance-level) data?

> However, we currently haven't documented that part of the SDK's functionality, and if you look at the current SDK repo you can only find a guide on how to use the pre-defined features

Where in your code do you account for the alignment? According to section 3.4 it should be done by calling mosei_facet_n_words.align, but I can't find that anywhere in your code.

> the "alignment" I mentioned (did I even mention it? did you see it somewhere in my code or elsewhere?)

I just googled it

Justin1904 commented 6 years ago

I haven't had time to check that data yet and will get back to it as soon as possible. But I can assure you that the embeddings are supposed to be 300-dimensional because they are GloVe vectors, and if the embeddings in the tarball don't correspond to that, I suggest you use the CMU SDK to acquire the processed data (look at the example provided in its readme).

For FACET, the files are at the full-video level, but the SDK splits them into segment level for all the video segments based on the timestamps. There is no originally segmented FACET for MOSI at this point.
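Conceptually, that split is just slicing the full-video feature matrix by each segment's start and end times. Here is a minimal sketch of my own (assuming, as with the other CSVs, that the first two columns hold per-frame start/end timestamps):

```python
import numpy as np

def slice_segment(full_features, seg_start, seg_end):
    """Keep the rows of a full-video feature matrix whose time span falls
    inside [seg_start, seg_end).

    full_features: array of shape (T, 2 + d); columns 0 and 1 are assumed to
    be the per-frame start/end timestamps, the rest are feature values.
    """
    starts, ends = full_features[:, 0], full_features[:, 1]
    mask = (starts >= seg_start) & (ends <= seg_end)
    return full_features[mask, 2:]  # drop the timestamp columns

# Example: a segment annotated as spanning 12.3s to 15.7s of the full video
# segment_facet = slice_segment(facet_full, 12.3, 15.7)
```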

For the third question, I did not fully understand it. P2FA is used for so-called "forced alignment", which is aligning the transcript with a given audio track. That "alignment" is not the "alignment" done by the SDK; what the SDK aligns are the different features from the different modalities. There is NO copy of aligned features, because different people might want to align the features in different ways. But the SDK can help you align them once you have them, and you can find how to do that in the SDK repo's readme; it's just one line of code.
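To make concrete what "aligning the different features" means, here is a toy sketch of my own (not the SDK's code) that collapses a feature stream onto word intervals by averaging the frames that overlap each word, so that every modality ends up with one vector per word:

```python
import numpy as np

def align_to_reference(ref_intervals, feat_times, feat_values):
    """Collapse a feature stream onto reference intervals (e.g. words).

    ref_intervals: list of (start, end) spans, one per reference unit
    feat_times:    (T, 2) array of per-frame start/end timestamps
    feat_values:   (T, d) array of per-frame feature vectors
    Returns a (len(ref_intervals), d) array holding the mean of all frames
    that overlap each reference interval (zeros if none overlap).
    """
    out = np.zeros((len(ref_intervals), feat_values.shape[1]),
                   dtype=feat_values.dtype)
    for i, (start, end) in enumerate(ref_intervals):
        overlap = (feat_times[:, 0] < end) & (feat_times[:, 1] > start)
        if overlap.any():
            out[i] = feat_values[overlap].mean(axis=0)
    return out
```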

amirim commented 6 years ago

> I haven't had time to check that data yet and will get back to it as soon as possible.

To make your life easier, I already found an easy example: _id--eQ6qVU_2, which includes only four words: "and I liked it". According to the pretrained glove.6B.300d.txt, the vector for the word "and" should start with 0.038466, but according to the CSV file it starts with: 1.79999659622e-09, 0.4190476209 ...

> But I can assure you that the embeddings are supposed to be 300-dimensional because they are GloVe vectors

Please take a look at your code: model = TFN(input_dims, (4, 16, 128), 64, (0.3, 0.3, 0.3, 0.3), 32)

Can you explain why those are the dimensions? According to the paper, the dimensions are audio: 32, visual: 32, text: 128.

Justin1904 commented 6 years ago

I would suggest you use the pickled data provided by the CMU SDK directly instead of trying to interpret the CSVs yourself (although they're supposed to contain the same data, the CSVs may not be in the format people generally assume). In our case, the first two columns in the CSVs are not actual feature values but the timestamps of that feature value. So 1.79999659622e-09 actually indicates that the first word occurs at almost the start of that segment; it is not the first dimension of the embedding. BTW, we used GloVe 840B trained on Common Crawl, so the values may differ from the 6B version.
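If you do want to read the CSVs directly anyway, here is a minimal sketch under that layout (two leading timestamp columns, then the feature values; the file path is just an example based on the segment you mentioned):

```python
import numpy as np

# Load one segment's embedding CSV (add skiprows=1 if the file has a header row)
raw = np.loadtxt("embeddings/_id--eQ6qVU_2.csv", delimiter=",", ndmin=2)

timestamps = raw[:, :2]  # per-word (start, end) times within the segment
embeddings = raw[:, 2:]  # the actual feature values; 300 columns expected for GloVe

print(embeddings.shape)  # (num_words, 300) if the layout matches
```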

As for the hyperparameters, I didn't stick exactly to the original paper. In practice the model seems to be easier to train with lower visual and acoustic dimensions, but that might also be one of the reasons preventing it from achieving better results. Thanks for pointing it out.
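For example, if you wanted to try the paper's subnetwork sizes instead, you would only need to change that tuple in the constructor call (keeping the other arguments at this repo's defaults; whether that reproduces the paper's numbers is another question):

```python
# assuming TFN and input_dims are set up as in this repo's training script

# repo default: small audio and visual subnetworks (audio=4, visual=16, text=128)
model = TFN(input_dims, (4, 16, 128), 64, (0.3, 0.3, 0.3, 0.3), 32)

# closer to the paper's reported subnetwork sizes (audio=32, visual=32, text=128)
model = TFN(input_dims, (32, 32, 128), 64, (0.3, 0.3, 0.3, 0.3), 32)
```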

amirim commented 6 years ago

> The accuracy I got with the default hyperparameter settings in this repo is around 70% on CMU-MOSI

Have you tried training this network on MOSEI? Can it be done simply by changing the URL?

Justin1904 commented 6 years ago

I think it is possible, with a few additional steps: currently the MOSEI dataset contains a bunch of data points that use an older version of COVAREP, which makes their feature vectors' shapes different from the others. So you'll probably need to modify some of the data preprocessing code to replace those inconsistently shaped vectors with dummy values of the correct shape.
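A minimal sketch of that kind of patch-up (my own illustration; 74 is the per-frame COVAREP dimension mentioned earlier in this thread):

```python
import numpy as np

COVAREP_DIM = 74  # expected per-frame COVAREP feature size

def fix_covarep(frames):
    """Replace frames whose COVAREP vector has the wrong shape with zero vectors.

    frames: list of 1-D arrays, one per time step of a segment.
    """
    fixed = []
    for vec in frames:
        vec = np.asarray(vec, dtype=np.float32).ravel()
        if vec.shape[0] != COVAREP_DIM:
            vec = np.zeros(COVAREP_DIM, dtype=np.float32)  # dummy values
        fixed.append(vec)
    return np.stack(fixed) if fixed else np.zeros((0, COVAREP_DIM), dtype=np.float32)
```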

amirim commented 6 years ago

I was wondering about the number of different speakers in MOSI (93 people). Is that sufficient for training? The number of utterances is above 1,000, but still... I was thinking about a normalization method to address this problem.

Justin1904 commented 6 years ago

@amirim Indeed, 93 speakers is not a large number, though other similar trimodal datasets like POM and IEMOCAP are not much larger (if not smaller) than MOSI. There is actually research specifically aimed at the problem of having too few speakers in a dataset, for example https://arxiv.org/pdf/1609.05244.pdf. We're happy to see and discuss interesting ideas on that topic.