DeepPavlovAdmin / convai

Clarifying Questions on Summer Dataset #28

Open · myeomans opened this issue 5 years ago

myeomans commented 5 years ago

Hello again! We've made some great progress with the new summer dataset and we think it might be useful as an out-of-sample test case for some of our other work. In fact, I'm writing up the results and I want to make sure I describe your experiments accurately.

Do you mind if I ask a few clarifying questions? We don't have any concerns about the data, but we want to include some basic descriptives of the population in our write-up. People will be curious and I want to get it correct! Your documentation has been useful so far, but I am still wondering about a few items:

  1. What would be the right way to cite you all? Should we point to the repository? The 2018 ACL paper (Zhang et al.) doesn't cover the newer data; is there an update in the works?

  2. The crowd workers on Yandex.Toloka: is this similar to Mechanical Turk? Were they paid? Did you use any attention checks, worker qualifications, etc. (English proficiency?) to select people?

  3. Similarly, what were the humans' instructions? Were they asked to chat for a set amount of time, or a set number of turns? Were they incentivized for their responses at all, or to finish?

  4. The new single-question evaluation (on a five-point scale) is great, but do you have the word-for-word text of the question participants were asked?

  5. The last two items are "wants" but not "needs"... You didn't collect timestamps for each turn in the transcripts, did you? Likewise, do you know which bot comes from which team? It's clear there are a finite number of bots, each having several conversations. For example, some bots seem to always start with a very specific line (e.g. "i am a little tired from work"), while others break down in a consistent way (e.g. by printing "Traceback (most recent call last):"), which we scrubbed (a rough sketch of that step is below). We'd be curious to know which bot comes from which team, if that mapping can be shared.
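
For concreteness, here is a rough sketch of the kind of scrubbing I mean; the dialogue structure and field names below are just placeholders for however the transcripts are stored, not your actual schema.

```python
# Rough sketch of the scrubbing step: the "speaker"/"text" fields are
# placeholder names, not the dataset's actual schema.
from collections import Counter

ERROR_MARKER = "Traceback (most recent call last):"

def scrub_dialogue(dialogue):
    """Drop utterances that are just an error dump from a crashed bot."""
    return [u for u in dialogue if ERROR_MARKER not in u["text"]]

def opening_lines(dialogues):
    """Count how often each first bot utterance appears across dialogues,
    to spot bots that always open with the same canned line."""
    openers = Counter()
    for dialogue in dialogues:
        for u in dialogue:
            if u["speaker"] == "bot":
                openers[u["text"].strip().lower()] += 1
                break
    return openers
```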

madrugado commented 5 years ago

Hi, nice to hear from you!

  1. The dataset is not described properly yet; it will be described in a paper that will be included in the NIPS 2019 Competition Chapter. I think we still have no specific citation to offer. When do you need this?
  2. Yes, Yandex.Toloka is an analog of MTurk. Yes, the tolokers were paid. Yes, we used English proficiency tests and quality control to select workers. In addition, we have a separate pool of workers who check the produced dialogs (the quality of the human part of a dialog).
  3. Varvara might answer with more details than me. @varvara-l
  4. Could you please elaborate? I didn't get the question.
  5. We actually collect timestamps for all the utterances, but haven't published them. I think we could discuss publishing this info. We also haven't published the names of the bots because our challenge is not over yet; I think we could publish this after the challenge ends.

varvara-l commented 5 years ago

Hi Michael,

the data collection setting was the following. We told people that the aim of the conversation is to get to know their peer. They were encouraged to ask questions about the peer's hobbies, family, pets, etc. We didn't give any other instructions, but we provided some suggestions (e.g. "If your profile says that you love rock music, ask your peer about her favourite music and tell her about the music you like"); without these, users often didn't know what to say.

We set neither the duration of the dialogue nor the min/max number of turns. Users could finish a dialogue at any time; we didn't give any additional instructions about that. The task was simply "chat with a peer, learn something about her, rate her performance". However, we paid only for meaningful dialogues with >=3 utterances from every participant (dialogues were checked manually by other crowd workers). If a user stopped responding, we automatically finished the dialogue after 10 minutes.
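
To be concrete, that payment criterion amounts to something like the check below; this is only a sketch (the real check was done manually by crowd workers), and the dialogue format here is a placeholder.

```python
# Sketch of the ">= 3 utterances from every participant" criterion as a
# programmatic check. In practice the check was manual; the dialogue format
# (a list of utterance dicts with a "speaker" field) is a placeholder.
from collections import Counter

MIN_UTTERANCES = 3

def is_meaningful(dialogue, min_utterances=MIN_UTTERANCES):
    """True if every participant produced at least `min_utterances` turns."""
    counts = Counter(u["speaker"] for u in dialogue)
    return len(counts) >= 2 and all(n >= min_utterances for n in counts.values())
```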

myeomans commented 5 years ago

Hello, and thank you for the quick response! Let me respond to each item below.

  1. We are writing this up as a small part of an invited commentary, which has a hard deadline in December, so ideally we would have a proper pointer by then. For now I can cite the GitHub repository page as a placeholder, but this is not ideal in the long run (perhaps we could swap in the right pointer once you have it?). Or, if you already know the author/journal info for the upcoming chapter, we can list it as "forthcoming" in lieu of page numbers.

  2. Great!

  3. Great, thank you Varvara!

  4. Sorry for being unclear, let me explain. We want to know the exact text of the evaluation question they answered. I could imagine many valid ways to evaluate quality ("what was the quality of the conversation?", "how enjoyable was the conversation?", "how much did you enjoy talking to the bot?", "how much did you like the bot?", and so on). And then the five-point response scale might also have text labels on the two end points - such as {"not at all enjoyable" and "very enjoyable"}, or {"low-quality"/"high-quality"}. All of these are fine, we just want to know what question the users were answering!

  5. Thanks, makes sense. This might not fit our current timeline, then, but we will stay tuned in the future.

varvara-l commented 5 years ago

  4. The evaluation question was "Please evaluate the whole dialogue using one of the buttons below". There were 5 buttons with numbers from 1 to 5; we didn't provide any interpretation of the numbers. The profile selection question was "Select a profile which in your opinion belongs to your partner".
myeomans commented 5 years ago

Hi folks! I wanted to reach out again to see if you have any update on the proper citation for your NIPS 2019 Competition Chapter. Right now we have it as:

"Zhang et al., (2019) The Second Annual Conversational Intelligence Challenge at NIPS. Citation Forthcoming."

Even a placeholder with the right authors would be useful...

myeomans commented 5 years ago

Thanks, this was answered in another thread! -Mike

Hi, I want to follow up on this thread. Our paper has received an R&R (revise and resubmit), and one of our reviewers asked a specific question related to the thread above. We are wondering whether it would be possible to re-open this issue with you, now that the contest is over. Specifically, we would like to know which bots were participating in each conversation. We don't need identifiable names; rather, we simply want a hashed identifier for each bot, so that we can cluster our standard errors at the bot level and adjust for bot-level fixed effects. We are assuming the humans are all unique, as well? Please let me know if you think this data would be shareable here.
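
To illustrate, something as simple as the sketch below would be enough for our purposes; the salt and naming scheme are placeholders, and any stable, non-identifiable mapping would work just as well.

```python
# Sketch of the kind of anonymized bot identifier we have in mind: a salted
# hash of the bot/team name, so conversations can be grouped by bot without
# revealing which team is which. The salt and prefix are placeholders.
import hashlib

def anonymize_bot(name: str, salt: str = "convai-placeholder-salt") -> str:
    """Map a bot/team name to a stable, non-identifiable ID."""
    digest = hashlib.sha256((salt + name).encode("utf-8")).hexdigest()
    return "bot_" + digest[:8]

# Each conversation record would then carry anonymize_bot(team_name)
# instead of the team name itself.
```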