facebookresearch / ParlAI

A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
https://parl.ai
MIT License

Cannot train for P(X2 | X1, Y1) instead of P(X2 | Y1) in X1 -> Y1 -> X2 -> Y2 conversation #3885

Closed · wise-east closed this 3 years ago

wise-east commented 3 years ago

Bug description

Given a conversation X1 -> Y1 -> X2 -> Y2, how can we make sure that when modeling X's responses, the first training sample is not simply Y1 -> X2, but also includes X1 as part of the context, so that the model is trained to generate X2 given both X1 and Y1?

Based on the implementation in parlai/core/teachers.py, it doesn't look like the ParlAIDialogTeacher with the ParlAI Dialog Format handles this case, nor does the ConversationTeacher with the Conversations format.

I'm looking at the episodes from the empathetic_dialogues dataset, and the second episode below starts without any context about the first turn of the conversation. Is it possible to provide the first turn as context, and if so, how can I do it with the ParlAI Dialog Format or the Conversations format? Even if I set opt['label_turn'] to 'both' for the ConversationTeacher, I think this implementation shows that the first turn will be dropped for the second speaker.

- - - NEW EPISODE: empathetic_dialogues - - -
I remember going to see the fireworks with my best friend. It was the first time we ever spent time alone together. Although there was a lot of people, we felt like the only people in the world.
   Was this a friend you were in love with, or just a best friend?
This was a best friend. I miss her.
   Where has she gone?
We no longer talk.
   Oh was this something that happened because of an argument?
- - - NEW EPISODE: empathetic_dialogues - - -
Was this a friend you were in love with, or just a best friend?
   This was a best friend. I miss her.
Where has she gone?
   We no longer talk.
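
For reference, in the ParlAI Dialog Format this second episode would be stored roughly as follows (a sketch; fields are tab-separated, and note that X1 never appears at all):

```
text:Was this a friend you were in love with, or just a best friend?	labels:This was a best friend. I miss her.
text:Where has she gone?	labels:We no longer talk.	episode_done:True
```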

I know this bug drops at most one turn (X1) per conversation when modeling X's responses, but X1 may often contain information that is important for the following responses. I would also like to know whether there is a way to train a model on only a subset of turns while still providing the full conversation history. For instance, if I have X1 -> Y1 -> X2 -> Y2 -> X3 -> Y3 and only want my model to learn Y3, is there a way to do that with ParlAI? More specifically, can I provide all the previous turns as context only, with the understanding that X1 and X2 came from speaker X and Y1 and Y2 came from speaker Y, without them also being used as training samples?

Reproduction steps

Command: parlai display_data -t empathetic_dialogues

Expected behavior

- - - NEW EPISODE: empathetic_dialogues - - -
    I remember going to see the fireworks with my best friend. It was the first time we ever spent time alone together. Although there was a lot of people, we felt like the only people in the world. -> given only as context, not as label text to be predicted
Was this a friend you were in love with, or just a best friend?
   This was a best friend. I miss her.
Where has she gone?
   We no longer talk.

Additional context

I want to make sure that the dialogue systems I train with ParlAI are, at each turn, trained with the full context that is available.

wise-east commented 3 years ago

Is the __SILENCE__ token on line 170 of teachers.py there specifically for this purpose, as mentioned in #2188?

stephenroller commented 3 years ago

We usually handle this by inserting a fake __SILENCE__ turn
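
For illustration, in the ParlAI Dialog Format that amounts to something like the following sketch for the second speaker's episode: X1 becomes the label of a fake __SILENCE__ context turn, so it stays in the dialogue history for all later turns (tab-separated fields):

```
text:__SILENCE__	labels:I remember going to see the fireworks with my best friend. ...
text:Was this a friend you were in love with, or just a best friend?	labels:This was a best friend. I miss her.
text:Where has she gone?	labels:We no longer talk.	episode_done:True
```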

wise-east commented 3 years ago

@stephenroller Thank you for the reply!

Does that mean __SILENCE__ is processed as a special token by all ParlAI models? And if we add a fake __SILENCE__ turn, doesn't that also make the model learn to generate X1 from __SILENCE__?

Could you also give me some insight on how I can do this, from the original post:

Can we train a model to learn only a subset of turns while providing the full conversation history? For instance, if I have X1 -> Y1 -> X2 -> Y2 -> X3 -> Y3 and only want my model to learn Y3, is there a way to do that with ParlAI? More specifically, can I provide all the previous turns as context only with an understanding that X1 and X2 came from speaker X and Y1, Y2 came from speaker Y without actually providing them also as training samples when learning how to generate Y3?

stephenroller commented 3 years ago

Yes, we teach the model p(x1|silence) in order to teach it x1.

wise-east commented 3 years ago

@stephenroller Thank you!

Could you give any insight on a related question: whether the current ParlAI framework allows training a model to learn only a subset of turns while providing the full conversation history?

For instance, if I have X1 -> Y1 -> X2 -> Y2 -> X3 -> Y3 and only want my model to learn Y3, is there a way to do that with ParlAI? More specifically, can I provide all the previous turns as context only with an understanding that X1 and X2 came from speaker X and Y1, Y2 came from speaker Y without actually providing them also as training samples when learning how to generate Y3?

stephenroller commented 3 years ago

Your best bet there is to flatten the dataset and only use the final turn. See the flatten mutator as a sketch of the solution (without the filtering).
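
As a rough illustration of that suggestion (not ParlAI's actual flatten mutator; the mutator name and the final-turn filtering here are hypothetical additions):

```python
import random

from parlai.core.mutators import ManyEpisodeMutator, register_mutator


@register_mutator('flatten_last_turn')  # hypothetical name
class FlattenLastTurnMutator(ManyEpisodeMutator):
    """Flatten an episode into a single one-turn episode keeping only the final turn."""

    def many_episode_mutation(self, episode):
        history = []
        flattened = []
        for message in episode:
            # fold the conversation so far into this message's context
            history.append(message.pop('text'))
            message['text'] = '\n'.join(history)
            flattened.append(message)
            # append the gold response to the running history
            # (assumes the train-time 'labels' field is present)
            history.append(random.choice(message['labels']))
        # keep only the last turn (e.g. Y3): full history as context,
        # a single training target
        yield [flattened[-1]]
```

Registered this way, it could then be used with something like parlai train_model -t empathetic_dialogues --mutators flatten_last_turn, assuming the module defining the mutator gets imported.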

wise-east commented 3 years ago

@stephenroller awesome, I'll take a look. thank you so much for your quick responses!