Mislabeled dialog in SpokenWOZ test set?

AlibabaResearch / DAMO-ConvAI

DAMO-ConvAI: The official repository which contains the codebase for Alibaba DAMO Conversational AI.

MIT License


Mislabeled dialog in SpokenWOZ test set? #87

Open ArneNx opened 8 months ago

ArneNx commented 8 months ago

Taking a look at and listening to the test set you released on the website, I found that dialog SNG0646 is completely mislabeled. I assume this is a mix-up of some sort, since the annotations are completely wrong: slicing the audio with the speech start and end times from the annotation yields empty speech segments, which causes errors down the line. Can you confirm this? Was this also the case for the numbers you report on the website? How should this dialog be handled when reporting results?
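For anyone who wants to reproduce the symptom, here is a minimal sketch of the check, assuming each annotated utterance is available as a `(start_ms, end_ms)` pair and the audio length is known in samples. The function name and the input layout are hypothetical and not the actual SpokenWOZ annotation format.

```python
def empty_segments(segments_ms, num_samples, sample_rate):
    """Return the indices of segments that select no audio samples.

    segments_ms: list of (start_ms, end_ms) pairs (hypothetical layout).
    num_samples: total number of samples per channel in the audio file.
    sample_rate: sampling rate of the audio in Hz.
    """
    empty = []
    for idx, (start_ms, end_ms) in enumerate(segments_ms):
        # Convert milliseconds to sample offsets and clamp to the file length.
        start = max(0, int(start_ms * sample_rate / 1000))
        end = min(num_samples, int(end_ms * sample_rate / 1000))
        if end <= start:
            empty.append(idx)
    return empty
```

A segment that starts after the end of the file, or whose start and end coincide, is reported as empty. Many such segments in one dialog suggest that the annotation belongs to different audio.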

ArneNx commented 8 months ago

Looking into this a bit further, I found that the following dialogs are also faulty: SNG0885, SNG0653, SNG0897, SNG0890, SNG0601, SNG0877, SNG0901, SNG0903. There may be other dialogs with the same problem, but I'm not sure. Or am I missing something here?

S1s-Z commented 8 months ago

Thank you for being so interested in SpokenWOZ. We've updated the data on the leaderboard, feel free to contact sishuzheng@foxamail.com if you have any questions!

ArneNx commented 8 months ago

Hey, can you please give a few more details on what you changed? I downloaded the test data just now and the dialogs I mentioned above are still mislabeled. Am I missing something?

S1s-Z commented 8 months ago

Sorry for the mistake; we have now correctly uploaded the modified data. Specifically, because we lost the audio of the dialogues you mentioned, we used AliCloud TTS to generate replacement audio for them.

ArneNx commented 8 months ago

Thank you! I listened to the generated audio. Unfortunately, it sounds very different from the recorded dialogs (and is based on the noisy ASR transcription). That makes me wonder how comparable new scores still are to those on the leaderboard. However, I don't have a better suggestion if you don't want to recompute everything (maybe the new or old scores should be marked with an asterisk in the table once new scores come in).

Additionally, did you check the other dialogs beyond the ones I listed (e.g. by computing the WER with an ASR system to find mislabeled dialogs)?
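To make the suggestion concrete, here is a minimal sketch of such a check, assuming you already have one reference transcript and one ASR hypothesis per dialog as plain strings. The function names, the dictionary layout, and the 0.8 threshold are hypothetical choices for illustration and would need tuning against the known-bad dialogs.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # Dynamic programming over one row of the edit-distance table at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def flag_suspect_dialogs(references, hypotheses, threshold=0.8):
    """Return the sorted IDs of dialogs whose WER reaches the threshold.

    references / hypotheses: dicts mapping dialog ID to transcript text
    (hypothetical layout). A missing hypothesis counts as empty text.
    """
    suspects = []
    for dialog_id, ref_text in references.items():
        hyp_text = hypotheses.get(dialog_id, "")
        if word_error_rate(ref_text, hyp_text) >= threshold:
            suspects.append(dialog_id)
    return sorted(suspects)
```

Since the reference transcripts are themselves noisy, a moderate WER is expected everywhere; only dialogs with a WER far above the rest would be candidates for a mismatch.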

S1s-Z commented 8 months ago

Yes, this is because our new audio data was generated by a TTS (Text-to-Speech) tool. During the generation process, we used the noisy utterances as input text.

Unfortunately, we used the incorrect (mismatched) audio data in our evaluation, which may have affected the overall performance of the dual-modal baselines. This would make the reported values (e.g., JGA, SUCCESS) slightly too low.

Fortunately, however, the number of such dialogues is small (fewer than 10 out of 1,000 dialogues), and our dual-modal baseline was not carefully tuned (e.g., by carefully selecting hyperparameters), so we consider these errors acceptable.

Meanwhile, we will try to check our test set again in the coming weeks.

ArneNx commented 8 months ago

It's good to hear that so few dialogues are affected. Thank you for addressing this!

Unfortunately, using the TTS-generated dialogues turns out to be a bit problematic. The audio is single-channel, whereas the original recordings were dual-channel. Additionally, the word-level timestamps in the transcriptions are now outdated (because the audio changed). Together, this makes it impossible to isolate user utterances from system responses.
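The channel problem is easy to check for. Here is a minimal sketch using Python's standard `wave` module, assuming the audio files are uncompressed PCM WAV files. The helper names are hypothetical.

```python
import wave


def channel_count(wav_path: str) -> int:
    """Return the number of audio channels in a PCM WAV file."""
    with wave.open(wav_path, "rb") as wav_file:
        return wav_file.getnchannels()


def find_non_stereo(wav_paths):
    """Return the paths whose audio does not have exactly two channels."""
    return [path for path in wav_paths if channel_count(path) != 2]
```

Running `find_non_stereo` over the test-set audio should list exactly the regenerated dialogs if they are the only single-channel files.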

S1s-Z commented 8 months ago

We have now updated the relevant dialogue audio files and converted them to two-channel audio.

We have also modified the final transcriptions (you should probably use the newly updated test text data). The modifications may not be perfect, so there may still be some misalignments, but most of the word timestamps are correct.

ArneNx commented 8 months ago

Thank you very much! I'll test it out in the next few days. Could you please report here if you find any more faulty dialogs and replace anything in the dataset?

I very much appreciate your help with this.

S1s-Z commented 7 months ago

We will also work on improving the quality of SpokenWOZ in the future, thank you for your suggestion. Looking forward to your good news!