chuyq / MESC

About the dataset and source code #1

Open · ffcarina opened this issue 3 weeks ago

ffcarina commented 3 weeks ago

Hi! Thanks for your great work and for kindly sharing this valuable dataset.

After reading the paper and examining the dataset, I have a few questions:

  1. How do you determine the boundaries for the utterances? I noticed that some complete sentences have been split into multiple utterances (see screenshot: Snipaste_2024-08-26_21-29-42).

  2. Since conversations in MESC often contain multiple system utterances, I am curious how the inputs are constructed during training for the System Response Generation task. Specifically, do you treat each system utterance as a separate target response, with its preceding utterances as the input? In other words, does each conversation produce as many training instances as there are system utterances? (see screenshot: Snipaste_2024-08-26_21-38-31)

  3. Do you have any plans to release the source code for us to study and reproduce your results?

Thank you once again for your contributions. I look forward to your response.

chuyq commented 3 weeks ago

Thank you for your interest in our work. We will release the source code once the paper is accepted. In our approach, utterances are segmented according to the speaker talking in the video. To align with the video timestamps, a single utterance may be divided into multiple segments within the file, and the system's responses are segmented in the same way to preserve that alignment. Our method generates responses based on conversational turns.
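
For anyone trying to reconstruct full turns from the segmented files, here is a minimal sketch of collapsing consecutive same-speaker segments back into turns. The `speaker` and `text` field names are assumptions for illustration; the actual MESC schema may differ.

```python
# Hedged sketch: merge consecutive segments from the same speaker into one turn.
# Field names ("speaker", "text") are assumed, not taken from the released files.
def merge_segments(utterances):
    turns = []
    for utt in utterances:
        if turns and turns[-1]["speaker"] == utt["speaker"]:
            # Same speaker as the previous segment: extend that turn.
            turns[-1]["text"] += " " + utt["text"]
        else:
            turns.append({"speaker": utt["speaker"], "text": utt["text"]})
    return turns

segs = [{"speaker": "Therapist", "text": "Why do you"},
        {"speaker": "Therapist", "text": "hate yourself?"}]
print(merge_segments(segs))
# -> [{'speaker': 'Therapist', 'text': 'Why do you hate yourself?'}]
```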

ffcarina commented 3 weeks ago

@chuyq Thanks for your prompt response!

ffcarina commented 3 weeks ago

@chuyq Sorry to disturb you again, but I’m still a bit confused about what you mean by “generates responses based on conversational turns.” In the conversation shown below, utterances 10, 14, 15, and 16 all come from the Therapist. In your approach, do you input utterances 0-9 and generate utterance 10 as the response? For the consecutive utterances 14-16, do you input 0-13 and generate 14, then input 0-14 and generate 15, and so on, or do you input 0-13 and generate 14-16 together as the response?

(screenshot: Snipaste_2024-08-27_11-05-16)

chuyq commented 3 weeks ago

In our approach, when the input is utterances 0-9, we generate utterance 10 as the response. When the input is utterances 0-13, we compare the generated response against utterances 14-16 to compute the generation metrics.
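
In other words, each system turn yields one training instance, with all preceding utterances as the input and the (possibly multi-segment) turn as the target. A minimal sketch of this pairing, reusing the merged-turn representation from the earlier comment (not the released code):

```python
# Hedged sketch: build one (history, target) pair per system turn.
# Assumes merged turns with "speaker" and "text" fields, as sketched above.
def build_training_pairs(turns, system_speaker="Therapist"):
    pairs = []
    for i, turn in enumerate(turns):
        if turn["speaker"] == system_speaker and i > 0:
            history = [t["text"] for t in turns[:i]]  # e.g. utterances 0-9 or 0-13
            pairs.append((history, turn["text"]))     # e.g. utterance 10, or 14-16 merged
    return pairs
```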

ffcarina commented 3 weeks ago

@chuyq Thank you!

ffcarina commented 2 weeks ago

@chuyq Hi! May I ask which website you downloaded the raw videos of In Treatment from?

ffcarina commented 5 days ago

@chuyq Hi! We would like to reproduce the SMES framework described in your paper. However, we have a few questions regarding the implementation details:

  1. Could you please specify which model was used as the LLM backbone in the SMES framework?
  2. We would like to confirm whether the LLM generates the User Emotion, Strategy, System Emotion, and System Response all at once during training. Specifically, is the input the dialogue history, and is the output a single text sequence formatted as “(depression)(Open question) (neutral) Why do you hate yourself?”?

We would greatly appreciate your clarification.

chuyq commented 4 days ago

We use BlenderBot as the backbone model. The output is a single text sequence containing, in order, the user emotion, the predicted strategy, the system emotion, and the generated response.
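
For readers reproducing this, the confirmed ordering suggests a flat target string like the sketch below. The parenthesized layout follows the example quoted in the question; the exact delimiters used in the actual preprocessing are an assumption.

```python
# Hedged sketch of the flat target sequence: user emotion, strategy,
# system emotion, then the response. Delimiters are assumed, not confirmed.
def format_target(user_emotion, strategy, system_emotion, response):
    return f"({user_emotion})({strategy}) ({system_emotion}) {response}"

print(format_target("depression", "Open question", "neutral",
                    "Why do you hate yourself?"))
# -> (depression)(Open question) (neutral) Why do you hate yourself?
```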