myeomans opened this issue 6 years ago
Hi, nice to hear from you!
Hi Michael,
The data collection setting was the following. We told people that the aim of the conversation is to get to know their peer. They were encouraged to ask questions about the peer's hobbies, family, pets, etc. We didn't give any other instructions, but we provided some suggestions (e.g. "If your profile says that you love rock music, ask your peer about her favourite music and tell her about the music you like") - without these, users often didn't know what to say.
We set neither the duration of the dialogue nor the min/max number of turns. Users could finish a dialogue at any time; we didn't give any additional instructions about that. The task was simply "chat with a peer, learn something about her, rate her performance". However, we paid only for meaningful dialogues with >=3 utterances from each participant (dialogues were checked manually by other crowd workers). If a user stopped responding, we automatically finished the dialogue after 10 minutes.
Hello, and thank you for the quick response! Let me respond to each item below.
We are writing this up as a small part of an invited commentary, which has a hard deadline in December, so ideally we could have a proper pointer by then. For now I can cite the GitHub repository webpage as a placeholder, but this is not ideal in the long run (perhaps we could swap in the right pointer once you have it?). Or, if you know all the author/journal info for the upcoming chapter, we can list it as "forthcoming" in lieu of page numbers.
Great!
Great, thank you Varvara!
Sorry for being unclear, let me explain. We want to know the exact text of the evaluation question they answered. I could imagine many valid ways to evaluate quality ("what was the quality of the conversation?", "how enjoyable was the conversation?", "how much did you enjoy talking to the bot?", "how much did you like the bot?", and so on). And then the five-point response scale might also have text labels on the two end points - such as {"not at all enjoyable" and "very enjoyable"}, or {"low-quality"/"high-quality"}. All of these are fine, we just want to know what question the users were answering!
Thanks, makes sense. This might not fit our current timeline, then, but we will stay tuned in the future.
Hi folks! I wanted to reach out again to see if you have any update on the proper citation for your NIPS 2019 Competition Chapter. Right now we have it as:
"Zhang et al., (2019) The Second Annual Conversational Intelligence Challenge at NIPS. Citation Forthcoming."
Even a placeholder with the right authors would be useful...
Thanks, this was answered in another thread! -Mike
Hi, I want to follow up on this thread - our paper has received an R&R, and one of our reviewers asked a specific question related to the discussion above. We are wondering whether it would be possible to re-open this issue with you, now that the contest is over? Specifically, we would like to know which bots were participating in each conversation. We don't need identifiable names - rather, we simply want a hashed identifier for each bot, so that we can cluster our standard errors at the bot level and adjust for bot-level fixed effects. We are assuming the humans are all unique as well? Please let me know if you think this data would be shareable here.
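To be concrete about what would satisfy us: any one-way hash of whatever internal bot name you already use would be plenty. Here's a minimal sketch of the kind of thing we have in mind (the function, salt, and bot name below are just placeholders, not a request for any specific scheme):

```python
import hashlib

def hash_bot_id(bot_name: str, salt: str = "convai") -> str:
    """Return a short, non-identifiable token for a bot (illustrative only)."""
    return hashlib.sha256((salt + bot_name).encode("utf-8")).hexdigest()[:8]

# e.g. hash_bot_id("team_07_bot") -> an opaque 8-character token,
# stable within the release but meaningless outside it
```

Anything along these lines, attached to each transcript, would let us cluster by bot without identifying the teams.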
Hello again! We've made some great progress with the new summer dataset and we think it might be useful as an out-of-sample test case for some of our other work. In fact, I'm writing up the results and I want to make sure I describe your experiments accurately.
Do you mind if I ask a few clarifying questions? We don't have any concerns about the data, but we want to include some basic descriptives of the population in our write-up. People will be curious and I want to get it correct! Your documentation has been useful so far, but I am still wondering about a few items:
What would be the right way to cite you all? Should we point to the repository? The 2018 ACL paper (Zhang et al.) doesn't cover the newer data - is there an update in the works?
The crowd workers on Yandex.Toloka - is this similar to Mechanical Turk? Were they paid? Did you use any attention checks, worker qualifications, etc. (English proficiency?) to select people?
Similarly - what were the humans' instructions? Were they asked to chat for a set amount of time, or a set number of turns? Were they incentivized for their responses at all, or for finishing?
The new single-question evaluation (on a five-point scale) is great, but do you have the word-for-word text of the question participants were asked?
The last two items are "wants" but not "needs"... You didn't collect time stamps for each turn in the transcripts, did you? Likewise, do you know which bot comes from which team? It's clear there are a finite number of bots, each having several conversations. For example, some bots seem to always start with a very specific line (e.g. " i am a little tired from work"), and others break down in a consistent way (e.g. search for "Traceback (most recent call last):"), which we scrubbed. We'd be curious to map these back to teams, if possible.
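In case it's useful context, the scrub we applied was nothing fancy - roughly a filter along these lines (a sketch only; the regex and function name here are stand-ins for what we actually ran):

```python
import re

# Bot turns that are crash output rather than dialogue contain a Python traceback header.
TRACEBACK_RE = re.compile(r"Traceback \(most recent call last\):")

def scrub_crash_turns(turns):
    """Drop any turn that looks like a traceback dump; keep real utterances."""
    return [turn for turn in turns if not TRACEBACK_RE.search(turn)]
```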