georgesterpu / avsr-tf1

Audio-Visual Speech Recognition using Sequence to Sequence Models
GNU General Public License v3.0

KeyError: 'aus' when running run_audiovisual.py #26

Open clarahohohoho opened 3 years ago

clarahohohoho commented 3 years ago

Hi, I am getting a `KeyError: 'aus'` from the line `normed_aus = tf.clip_by_value(self._data.payload['aus'], 0.0, 3.0) / 3.0` in encoder.py. I preprocessed my data with extract_faces.py and write_records_tcd.py on the LRS3 data, and I realized that my `self._data.payload` is an empty dictionary. Any idea how to solve this error? Or is there another variable I could use in place of `self._data.payload['aus']`?

Any help is appreciated, thank you in advance!
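
For reference, this is how I checked which feature keys my generated tfrecords actually contain (a minimal sketch; the record path is a placeholder, and I am assuming the TF1-style `tf.python_io` API with SequenceExample records):

```python
# Minimal sketch: list the feature keys stored in a tfrecord.
# Assumptions: TF1 API (tf.python_io) and SequenceExample records; adjust the path.
import tensorflow as tf

record_path = '/path/to/output.tfrecord'  # placeholder
for raw in tf.python_io.tf_record_iterator(record_path):
    example = tf.train.SequenceExample.FromString(raw)
    print('context keys: ', list(example.context.feature.keys()))
    print('sequence keys:', list(example.feature_lists.feature_list.keys()))
    break  # the first record is enough to see the schema
```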

clarahohohoho commented 3 years ago

Hi, I found out that in the write_records_tcd.py step, the `write_bmp_records()` function in dataset_write.py defaults `append_aus` to False. I tried changing it to `append_aus=True`, but another issue surfaced: `ValueError: 'AU25_r' is not in list`. This is probably because the headers in my csv files are limited to `frame, face_id, timestamp, confidence, success, pose_Tx, pose_Ty, pose_Tz, pose_Rx, pose_Ry, pose_Rz`. Any idea how I can proceed? Thank you!
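
For what it's worth, this is the quick check I ran on the csv headers (a small sketch; the filename is a placeholder):

```python
# Quick check for AU intensity columns in an OpenFace csv.
import csv

with open('processed/speaker_video.csv') as f:  # placeholder filename
    header = [h.strip() for h in next(csv.reader(f))]

au_intensity_cols = [h for h in header if h.startswith('AU') and h.endswith('_r')]
print(au_intensity_cols or 'no AU intensity columns found')
```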

georgesterpu commented 3 years ago

Hi @clarahohohoho, aus stands for facial Action Units. We proposed regressing AUs from the video representations jointly with the speech decoding task, in order to overcome a learning issue of AV Align (the audio encoder attends to the video encoder) seen on a more challenging task than speaker-dependent TCD-TIMIT.

If you are running an experiment using the run_audiovisual.py script, please note the `regress_aus=True` parameter. When this flag is enabled, the tfrecord files are expected to contain a sequence of Action Unit intensities, so that the distance between these ground-truth values and the network's prediction can be computed. You may set this flag to False, depending on the goals of your research.
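
Roughly, the auxiliary objective amounts to the following (a simplified sketch, not the exact code in this repository; the placeholder shapes, the dense prediction head, and the MSE loss choice are assumptions):

```python
# Simplified sketch of the AU regression objective (TF1-style API).
# Shapes, the dense prediction head, and the MSE loss are illustrative assumptions.
import tensorflow as tf

num_aus = 17  # OpenFace regresses 17 AU intensities
aus = tf.placeholder(tf.float32, [None, None, num_aus])     # ground truth from the tfrecords
video_repr = tf.placeholder(tf.float32, [None, None, 256])  # video encoder outputs

normed_aus = tf.clip_by_value(aus, 0.0, 3.0) / 3.0          # squash intensities into [0, 1]
pred_aus = tf.layers.dense(video_repr, num_aus, activation=tf.nn.sigmoid)
au_loss = tf.losses.mean_squared_error(normed_aus, pred_aus)
# when regress_aus=True, au_loss is added to the speech decoding loss
```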

To generate the target AU intensities, we used the OpenFace toolkit. The extract_faces.py script is a wrapper that calls the OpenFace binaries from Python and generates the bmp and csv files in the format expected by the code in this repository. The Action Units are written to the csv when the `-aus` flag is appended; please see here for the complete set of CLI arguments. I realise now that the `-aus` flag is not used in the example pre-processing script, while `regress_aus` is set to True in the AV experiment launch script, so I'll correct this inconsistency.
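
Under the hood, the wrapper does something along these lines (a sketch; the binary path and file names are placeholders):

```python
# Sketch: invoking OpenFace's FeatureExtraction with AU output enabled.
import subprocess

subprocess.run([
    '/path/to/OpenFace/build/bin/FeatureExtraction',
    '-f', 'speaker_video.mp4',  # input video (placeholder)
    '-out_dir', 'processed/',   # where the csv (and aligned bmp crops) go
    '-simalign',                # write similarity-aligned face crops
    '-aus',                     # append Action Unit columns to the csv
], check=True)
```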

You may need to pre-process the video files again with the `-aus` flag set in extract_faces.py, then re-generate the tfrecords. For convenience, I stored a single set of tfrecord files with this metadata appended, and only enabled or disabled the AUs at runtime.

I hope this helps; please let me know if there is anything else to clarify.