ffcarina opened this issue 2 months ago
Thank you for your interest in our work. We will release the source code once the paper is accepted. In our approach, utterances are segmented according to the speaker talking in the video. To align with the video timestamps, an utterance may be split into multiple segments within the file; as a result, the system's responses are segmented in the same way to preserve alignment. Our method generates responses based on conversational turns.
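To make the turn-based setup concrete, here is a rough sketch (not the released code) of how consecutive same-speaker segments can be merged back into conversational turns; the `speaker`/`text` field names are only illustrative, not the dataset's actual schema.

```python
def merge_into_turns(segments):
    """Group consecutive segments from the same speaker into one turn."""
    turns = []
    for seg in segments:
        if turns and turns[-1]["speaker"] == seg["speaker"]:
            # Same speaker as the previous segment: extend the current turn.
            turns[-1]["text"] += " " + seg["text"]
        else:
            turns.append({"speaker": seg["speaker"], "text": seg["text"]})
    return turns

segments = [
    {"speaker": "Patient", "text": "I just feel stuck,"},
    {"speaker": "Patient", "text": "like nothing changes."},
    {"speaker": "Therapist", "text": "Can you tell me more about that?"},
]
print(merge_into_turns(segments))
# [{'speaker': 'Patient', 'text': 'I just feel stuck, like nothing changes.'},
#  {'speaker': 'Therapist', 'text': 'Can you tell me more about that?'}]
```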
@chuyq Thanks for your prompt response!
@chuyq Sorry to disturb you, but I’m still a bit confused about what you mentioned regarding “generates the responses based on conversational turns.” In the conversation shown below, utterances 10, 14, 15, and 16 all come from the Therapist. In your approach, do you input utterances 0-9 and generate utterance 10 as the response? For the consecutive utterances 14-16, do you input 0-13 and generate 14, then input 0-14 and generate 15, or do you input 0-13 and generate 14-16 together as the response?
In our approach, when the input is utterances 0-9, we generate utterance 10 as the response. If the input is 0-13, we compare the generated response against utterances 14-16 when computing the generation metrics.
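As an illustration of that setup, a minimal sketch (again, not the actual pipeline) that collects one evaluation instance per system turn could look like the following: everything before the first Therapist utterance of a turn forms the context, and all consecutive Therapist utterances (e.g. 14-16) are kept as references.

```python
def build_eval_instances(utterances, system_role="Therapist"):
    """One instance per system turn: context = history, references = that turn's utterances."""
    instances, i = [], 0
    while i < len(utterances):
        if utterances[i]["speaker"] == system_role:
            j = i
            while j < len(utterances) and utterances[j]["speaker"] == system_role:
                j += 1  # consume the whole run of consecutive system utterances
            instances.append({
                "context": [u["text"] for u in utterances[:i]],      # e.g. utterances 0-13
                "references": [u["text"] for u in utterances[i:j]],  # e.g. utterances 14-16
            })
            i = j
        else:
            i += 1
    return instances
```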
@chuyq Thank you!
@chuyq Hi! May I ask which website you downloaded the raw videos of In Treatment from?
@chuyq Hi! We would like to reproduce the SMES framework described in your paper. However, we have a few questions regarding the implementation details:
We would greatly appreciate your clarification.
We select BlenderBot as the backbone model. The output is a sequence of text formatted as user emotion, strategy prediction, system emotion, and response generation.
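Since the exact serialization and special tokens are not spelled out here, below is only a hedged sketch of how such a flattened target could be fed to a BlenderBot-style seq2seq model with Hugging Face Transformers; the `[USER_EMO]`/`[STRATEGY]`/`[SYS_EMO]`/`[RESPONSE]` markers and the example labels are placeholders for illustration, not necessarily the exact format used.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

def serialize_target(user_emotion, strategy, system_emotion, response):
    # Flatten the four sub-tasks into one target string, in the stated order.
    return (f"[USER_EMO] {user_emotion} [STRATEGY] {strategy} "
            f"[SYS_EMO] {system_emotion} [RESPONSE] {response}")

tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot_small-90M")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/blenderbot_small-90M")
# In practice the bracketed markers would likely be registered as special tokens.

context = "Patient: I just feel stuck, like nothing changes."
target = serialize_target("sadness", "Reflection of feelings",
                          "neutral", "It sounds like you feel trapped.")

inputs = tokenizer(context, truncation=True, return_tensors="pt")
labels = tokenizer(target, truncation=True, return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss  # standard seq2seq cross-entropy
```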
@chuyq Did you use a larger version of BlenderBot, such as the 2.7B or 9.4B models? I tried using the BlenderBot-90M, but the results I obtained were significantly lower compared to what’s reported in your paper.
If this is the case, please double-check your code to ensure that you're using the extracted emotional cues during model training.
The results I obtained were much worse than those reported in your paper (e.g., my ROUGE-L score is only 20+, whereas your paper reports over 40). From TABLE VI, it seems that the video information alone wouldn't account for such a large discrepancy. Since both your work and thu-coai/Emotional-Support-Conversation use BlenderBot as the backbone model, I modified their source code to re-implement your method on MESC. Could the large difference in results be due to differences in training details or evaluation code?
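One thing worth pinning down is whether both sides use the same ROUGE implementation and settings (stemming, tokenization, handling of multiple references), since those choices can shift the numbers. Here is a minimal check with the `rouge-score` package; the hypothesis/reference strings are toy examples, not real model outputs.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
hypothesis = "it sounds like you feel trapped"
reference = "it sounds as though you are feeling trapped"
score = scorer.score(reference, hypothesis)["rougeL"]  # signature: score(target, prediction)
print(f"P={score.precision:.3f} R={score.recall:.3f} F1={score.fmeasure:.3f}")
```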
Hi @chuyq @ffcarina, could you share your source code with me? It would be a great help 👍
Hi @chuyq @ffcarina
Could you walk me through how you fine-tuned/trained the model with BlenderBot?
Hi! Thanks for your great work and for kindly sharing this valuable dataset.
After reading the paper and observing the dataset, I have a few questions to ask:
How do you determine the boundaries for the utterances? I noticed that some complete sentences have been split into multiple utterances (as shown in the image below).
As the conversations in MESC often contain multiple system utterances, I am curious about how the data is input during training for the System Response Generation task. Specifically, do you treat each system utterance as a separate target response and input its historical utterances? In other words, does each conversation produce as many instances as there are system utterances?
Do you have any plans to release the source code for us to study and reproduce your results?
Thank you once again for your contributions. I look forward to your response.