chuyq / MESC

About the dataset and source code #1

Open ffcarina opened 2 months ago

ffcarina commented 2 months ago

Hi! Thanks for your great work and for kindly sharing this valuable dataset.

After reading the paper and observing the dataset, I have a few questions to ask:

  1. How do you determine the boundaries for the utterances? I noticed that some complete sentences have been split into multiple utterances (as shown in the image below).
     [screenshot: Snipaste_2024-08-26_21-29-42]

  2. As the conversations in MESC often contain multiple system utterances, I am curious about how the data is input during training for the System Response Generation task. Specifically, do you treat each system utterance as a separate target response and input its historical utterances? In other words, does each conversation produce as many instances as there are system utterances?
     [screenshot: Snipaste_2024-08-26_21-38-31]

  3. Do you have any plans to release the source code for us to study and reproduce your results?

Thank you once again for your contributions. I look forward to your response.

chuyq commented 2 months ago

Thank you for your interest in our work. We will release the source code once the paper is accepted. In our approach, utterances are segmented according to the speaker talking in the video. To align with the video timestamps, an utterance might be divided into multiple segments within the file. As a result, the system's responses are also segmented in a similar manner to ensure proper alignment. Our method generates the responses based on conversational turns.
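
Roughly, consecutive segments from the same speaker can be merged back into one conversational turn, e.g. (a minimal sketch only; the field names are illustrative, not our exact file format):

```python
# Minimal sketch: merge consecutive segments from the same speaker into one turn.
# Field names ("speaker", "text") are illustrative assumptions about the layout.
def merge_into_turns(utterances):
    turns = []
    for utt in utterances:
        if turns and turns[-1]["speaker"] == utt["speaker"]:
            turns[-1]["text"] += " " + utt["text"]
        else:
            turns.append({"speaker": utt["speaker"], "text": utt["text"]})
    return turns
```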

ffcarina commented 2 months ago

@chuyq Thanks for your prompt response!

ffcarina commented 2 months ago

@chuyq Sorry to disturb you, but I’m still a bit confused about what you mentioned regarding “generates the responses based on conversational turns.” In the conversation shown below, utterances 10, 14, 15, and 16 all come from the Therapist. In your approach, do you input utterances 0-9 and generate utterance 10 as the response? For the consecutive utterances 14-16, do you input 0-13 and generate 14, then input 0-14 and generate 15, or do you input 0-13 and generate 14-16 together as the response?

[screenshot: Snipaste_2024-08-27_11-05-16]

chuyq commented 2 months ago

In our approach, when the input is utterances 0-9, we generate utterance 10 as the response. If the input is 0-13, we compare the generated response with utterances 14-16 to compute the generation metrics.
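
In other words, each Therapist turn yields one (history, reference) pair, and consecutive segments such as 14-16 are concatenated into a single reference, e.g. (sketch only, building on the turn-merging sketch above; the speaker label and field names are illustrative):

```python
# Sketch: one (history, reference) pair per system turn, with all preceding
# turns as context. "Therapist" is an illustrative speaker label.
def build_instances(turns, system_speaker="Therapist"):
    instances = []
    for i, turn in enumerate(turns):
        if turn["speaker"] == system_speaker and i > 0:
            instances.append({
                "history": [t["text"] for t in turns[:i]],
                "reference": turn["text"],
            })
    return instances
```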

ffcarina commented 2 months ago

@chuyq Thank you!

ffcarina commented 2 months ago

@chuyq Hi! May I ask which website you downloaded the raw videos of In Treatment from?

ffcarina commented 2 months ago

@chuyq Hi! We would like to reproduce the SMES framework described in your paper. However, we have a few questions regarding the implementation details:

  1. Could you please specify which model was used as the LLM backbone in the SMES framework?
  2. We would like to confirm whether the LLM generates the User Emotion, Strategy, System Emotion, and System Response all at once during training. Specifically, is the input the dialogue history, and is the output a single text sequence formatted as “(depression) (Open question) (neutral) Why do you hate yourself?”?

We would greatly appreciate your clarification.

chuyq commented 2 months ago

We select BlenderBot as the backbone model. The output is a text sequence in the order: user emotion, strategy prediction, system emotion, and the generated response.
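
For illustration only (the checkpoint name, history separator, and label strings below are placeholders, not our exact released configuration):

```python
# Rough sketch of the flattened target format with a small Hugging Face
# BlenderBot checkpoint as the backbone. Checkpoint, separator, and label
# strings are illustrative placeholders.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "facebook/blenderbot_small-90M"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

history = [
    "I just feel like nothing I do matters.",
    "Why do you say that?",
    "I hate myself.",
]
source = "  ".join(history)  # placeholder separator between turns
target = "(depression) (Open question) (neutral) Why do you hate yourself?"

inputs = tokenizer(source, return_tensors="pt", truncation=True)
labels = tokenizer(text_target=target, return_tensors="pt", truncation=True).input_ids
loss = model(**inputs, labels=labels).loss  # standard seq2seq cross-entropy
```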

ffcarina commented 2 months ago

@chuyq Did you use a larger version of BlenderBot, such as the 2.7B or 9.4B models? I tried using the BlenderBot-90M, but the results I obtained were significantly lower compared to what’s reported in your paper.

chuyq commented 2 months ago

If this is the case, please double-check your code to ensure that you're using the extracted emotional cues during model training.

ffcarina commented 2 months ago

> If this is the case, please double-check your code to ensure that you're using the extracted emotional cues during model training.

The results I obtained were much worse than those reported in your paper (e.g., my ROUGE-L score is only around 20, whereas your paper reports over 40). From TABLE VI, it seems that the video information alone wouldn't account for such a large discrepancy. Since your work and thu-coai/Emotional-Support-Conversation both use BlenderBot as the backbone model, I modified their source code to re-implement your method on MESC. Could the large difference in results be due to differences in training details or evaluation code?
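
For reference, this is the kind of ROUGE-L sanity check I'm running with the `rouge-score` package; whether your paper uses this implementation, stemming, or the same averaging is an assumption on my part:

```python
# Sentence-level ROUGE-L F1, averaged over the test set and scaled to 0-100.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def corpus_rouge_l(references, hypotheses):
    scores = [scorer.score(ref, hyp)["rougeL"].fmeasure
              for ref, hyp in zip(references, hypotheses)]
    return 100.0 * sum(scores) / max(len(scores), 1)

print(corpus_rouge_l(["why do you hate yourself ?"],
                     ["why would you hate yourself ?"]))
```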

SHIVAM3052 commented 2 weeks ago

> @chuyq Did you use a larger version of BlenderBot, such as the 2.7B or 9.4B models? I tried using the BlenderBot-90M, but the results I obtained were significantly lower compared to what’s reported in your paper.

Hi @chuyq @ffcarina, could you please share your source code with me? It would be a great help if you could. 👍

SHIVAM3052 commented 1 week ago

Hi @chuyq @ffcarina

Could you help me understand how you fine-tuned/trained the model with BlenderBot?