Open ffcarina opened 3 weeks ago
Thank you for your interest in our work. We will release the source code once the paper is accepted. In our approach, utterances are segmented according to the speaker talking in the video. To align with the video timestamps, an utterance might be divided into multiple segments within the file. As a result, the system's responses are also segmented in the same manner to ensure proper alignment. Our method generates responses based on conversational turns.
@chuyq Thanks for your prompt response!
@chuyq Sorry to disturb you, but I’m still a bit confused about what you mentioned regarding “generates the responses based on conversational turns.” In the conversation shown below, utterances 10, 14, 15, and 16 all come from the Therapist. In your approach, do you input utterances 0-9 and generate utterance 10 as the response? For the consecutive utterances 14-16, do you input 0-13 and generate 14, then input 0-14 and generate 15, or do you input 0-13 and generate 14-16 together as the response?
In our approach, when the input is utterances 0-9, we generate utterance 10 as the response. If the input is 0-13, we compare the generated response with utterances 14-16 to compute the generation metrics.
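The turn-based setup described above could be sketched as follows. This is a hypothetical illustration, not the authors' released code; the `"Therapist"` speaker label follows the example in this thread, and the function simply pairs each block of consecutive system utterances with the dialogue history preceding it.

```python
# Hypothetical sketch of turn-based instance construction: for each block of
# consecutive system (Therapist) utterances, the input is every utterance
# before the block and the references are all utterances inside the block.

def build_instances(utterances):
    """utterances: list of (speaker, text) tuples in dialogue order.
    Returns a list of (history, references) pairs."""
    instances = []
    i = 0
    while i < len(utterances):
        if utterances[i][0] == "Therapist":
            # Collect the whole consecutive system turn (e.g. utterances 14-16).
            j = i
            while j < len(utterances) and utterances[j][0] == "Therapist":
                j += 1
            history = [text for _, text in utterances[:i]]
            references = [text for _, text in utterances[i:j]]
            instances.append((history, references))
            i = j
        else:
            i += 1
    return instances
```

Under this reading, one response is generated per system turn and scored against all reference utterances in that turn, rather than one instance per individual system utterance.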
@chuyq Thank you!
@chuyq Hi! May I ask which website you downloaded the raw videos of In Treatment from?
@chuyq Hi! We would like to reproduce the SMES framework described in your paper. However, we have a few questions regarding the implementation details:
We would greatly appreciate your clarification.
We select BlenderBot as the backbone model. The output is a single text sequence containing, in order: the user emotion, the strategy prediction, the system emotion, and the generated response.
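A flattened target sequence like the one described could look as follows. This is a speculative sketch: the bracketed marker tokens are illustrative placeholders, not the paper's actual special tokens, and the helper names are invented for this example.

```python
# Hypothetical sketch of a flattened seq2seq target combining the four
# outputs (user emotion, strategy, system emotion, response) into one string.
# The bracketed markers are illustrative, not the paper's exact tokens.
import re

def format_target(user_emotion, strategy, system_emotion, response):
    """Serialize the four prediction fields into a single target sequence."""
    return (f"[user_emotion] {user_emotion} [strategy] {strategy} "
            f"[system_emotion] {system_emotion} [response] {response}")

def parse_target(sequence):
    """Recover the four fields from a generated sequence, or None on mismatch."""
    pattern = (r"\[user_emotion\] (.*?) \[strategy\] (.*?) "
               r"\[system_emotion\] (.*?) \[response\] (.*)")
    match = re.fullmatch(pattern, sequence)
    return match.groups() if match else None
```

In such a scheme the backbone is trained on the full serialized sequence, and the individual predictions are recovered by parsing the generated text.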
Hi! Thanks for your great work and for kindly sharing this valuable dataset.
After reading the paper and observing the dataset, I have a few questions to ask:
How do you determine the boundaries for the utterances? I noticed that some complete sentences have been split into multiple utterances (as shown in the image below).
As the conversations in MESC often contain multiple system utterances, I am curious about how the data is input during training for the System Response Generation task. Specifically, do you treat each system utterance as a separate target response and input its historical utterances? In other words, does each conversation produce as many instances as there are system utterances?
Do you have any plans to release the source code for us to study and reproduce your results?
Thank you once again for your contributions. I look forward to your response.