LeMei / UniMSE


Feature dimension mismatch #14

Closed sailist closed 1 year ago

sailist commented 1 year ago

In all your shared feature files (like meld_data_0610.pkl), the feature of a single utterance is two-dimensional, which means the feature of a dialogue will be three-dimensional and a batch will result in a four-dimensional tensor.

However, in the model's input part, I noticed that the visual and acoustic features are fed directly into the RNN model, which implies the batch-level visual and acoustic features are three-dimensional tensors. I did not find any dimensionality-reduction code in this repository. Could you please explain or update the code accordingly?
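A minimal sketch of the mismatch I mean (the shapes below are only illustrative, not the actual dimensions in the pickle files):

```python
import torch
import torch.nn as nn

# Illustrative shapes only (not the actual dims in meld_data_0610.pkl):
# per-utterance visual feature:      2-D (num_frames, feat_dim)
# one dialogue (utterances stacked): 3-D (dialog_len, num_frames, feat_dim)
# a batch of dialogues:              4-D (batch, dialog_len, num_frames, feat_dim)
batch, dialog_len, num_frames, feat_dim = 8, 10, 50, 709
visual = torch.randn(batch, dialog_len, num_frames, feat_dim)

# An RNN with batch_first=True expects a 3-D (batch, seq_len, input_size)
# tensor, so the 4-D batch above cannot be fed in directly.
rnn = nn.LSTM(input_size=feat_dim, hidden_size=128, batch_first=True)
print(visual.dim())  # 4 -- needs some reduction/reshaping before rnn(visual)
```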

sailist commented 1 year ago

A similar problem was raised in #6 and has not been properly addressed. The key issue is that the code and the shared feature files are inconsistent. Even if the purpose of the variables can be inferred from their names, the code cannot be reproduced without the correct files.

LeMei commented 1 year ago

> In all your shared feature files (like meld_data_0610.pkl), the feature of a single utterance is two-dimensional, which means the feature of a dialogue will be three-dimensional and a batch will result in a four-dimensional tensor.
>
> However, in the model's input part, I noticed that the visual and acoustic features are fed directly into the RNN model, which implies the batch-level visual and acoustic features are three-dimensional tensors. I did not find any dimensionality-reduction code in this repository. Could you please explain or update the code accordingly?

If the modal representation is missing, we pad it with a random initialization. For the video and audio modalities, we use .unsqueeze() to expand the dimension so that their sequence information is represented. Furthermore, we provide two kinds of RNN models: one requires sequential information, and the other does not.
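A rough sketch of what this means in code (simplified and paraphrased, not the exact repository code):

```python
import torch

feat_dim = 709  # illustrative feature size, not the real one

def prepare_modal(feature, feat_dim=feat_dim):
    """Simplified sketch: pad a missing modality with a random vector,
    then add a length-1 sequence dimension with .unsqueeze()."""
    if feature is None:                  # modality missing for this utterance
        feature = torch.randn(feat_dim)  # random initialization
    # (feat_dim,) -> (1, feat_dim): a pseudo "sequence" of length 1
    return feature.unsqueeze(0)

visual = prepare_modal(None)                      # missing -> randomly initialized
acoustic = prepare_modal(torch.randn(feat_dim))   # present -> only unsqueezed
print(visual.shape, acoustic.shape)               # torch.Size([1, 709]) twice
```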

LeMei commented 1 year ago

> A similar problem was raised in #6 and has not been properly addressed. The key issue is that the code and the shared feature files are inconsistent. Even if the purpose of the variables can be inferred from their names, the code cannot be reproduced without the correct files.

The files we shared are the feature files for the four datasets. The train, valid, and test files still need to be constructed; see data_processor.py, preprocess.py, and create_dataset.py. Note that we do not directly take the originally extracted features as the modal input. In the preprocessing phase we pad or cut the extracted modal features for special cases such as missing modalities or differing feature dimensions, and then feed them into the model. Furthermore, the variables whose names contain 'restaurant' or 'laptop' can be disabled with '//'; these two variables refer to the restaurant and laptop datasets. We tried to use them, but they did not work.
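As a rough illustration of the pad-or-cut step (a simplified sketch, not the exact code in preprocess.py or create_dataset.py):

```python
import numpy as np

def pad_or_cut(feature, target_dim):
    """Sketch: bring a 1-D modal feature to a fixed target_dim,
    zero-padding when it is too short and cutting when too long."""
    feature = np.asarray(feature, dtype=np.float32)
    if feature.shape[0] >= target_dim:
        return feature[:target_dim]
    pad = np.zeros(target_dim - feature.shape[0], dtype=np.float32)
    return np.concatenate([feature, pad])

print(pad_or_cut(np.ones(600), 709).shape)  # (709,) after padding
print(pad_or_cut(np.ones(800), 709).shape)  # (709,) after cutting
```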

sailist commented 1 year ago

Thanks for your kind reply. My problem may not be related to .unsqueeze(), and I would like to confirm three things:

  1. After all data processing, is the initial (batch-level) input of the audio and video features four-dimensional?
  2. If it is NOT four-dimensional, how is it processed given your shared feature files?
  3. If it IS four-dimensional, the RNN model does handle sequence information, but the required RNN input shape is (batch, seq_len, feature_dim), which is three-dimensional. How is that achieved?
sailist commented 1 year ago

I did see your padding operation, but none of those operations changes the dimension of the original features, including here in create_dataset.py and here in data_processor.py. I'm not sure if I have missed something.

LeMei commented 1 year ago

> Thanks for your kind reply. My problem may not be related to .unsqueeze(), and I would like to confirm three things:
>
>   1. After all data processing, is the initial (batch-level) input of the audio and video features four-dimensional?
>   2. If it is NOT four-dimensional, how is it processed given your shared feature files?
>   3. If it IS four-dimensional, the RNN model does handle sequence information, but the required RNN input shape is (batch, seq_len, feature_dim), which is three-dimensional. How is that achieved?

I think I understand your questions. First, your statement that "the feature of a single utterance is two-dimensional, which means the feature of a dialogue will be three-dimensional and a batch will result in a four-dimensional tensor" is not correct. It's three-dimensional: the feature of a single utterance is two-dimensional, so a batch of utterances is three-dimensional.

Maybe the misunderstanding is whether a sample in the dataset is utterance-level or dialogue-level. In our work, we take the utterance as the sample rather than the dialogue. This does not fully consider the whole dialogue context for an utterance, but it is simple.
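In other words, with utterance-level samples the batch is already three-dimensional and matches the RNN input (a small sketch with illustrative shapes):

```python
import torch
import torch.nn as nn

# Utterance-level sampling, illustrative shapes only:
# one utterance's acoustic feature: 2-D (seq_len, feat_dim)
# a batch of utterances:            3-D (batch, seq_len, feat_dim)
batch, seq_len, feat_dim = 8, 20, 709
acoustic = torch.randn(batch, seq_len, feat_dim)

rnn = nn.LSTM(input_size=feat_dim, hidden_size=128, batch_first=True)
out, _ = rnn(acoustic)   # 3-D input matches the RNN's expected shape
print(out.shape)         # torch.Size([8, 20, 128])
```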

For the second issue, we addressed it in the previous answer.

sailist commented 1 year ago

I totally understand, thanks a lot again!