aliannejadi / ClariQ

ClariQ: SCAI Workshop data challenge on conversational search clarification.

ClariQ

Introduction

The main aim of conversational systems is to return an appropriate answer in response to user requests. However, some user requests might be ambiguous. In Information Retrieval (IR) settings, such a situation is handled mainly through the diversification of the search result page. It is, however, much more challenging in dialogue settings.

We release the ClariQ dataset [3, 4], aiming to study the following situation for dialogue settings:

The main research questions we aim to answer as part of the challenge are the following:

ConvAI3 Data Challenge

ClariQ was collected as part of the ConvAI3 (http://convai.io) challenge, which was co-organized with the SCAI workshop (https://scai-workshop.github.io/2020/). The challenge ran in two stages. In Stage 1 (described below), participants were provided with a static dataset consisting mainly of an initial user request, a clarifying question, and a user answer, which is suitable for initial training, validation, and testing. In Stage 2, we brought a human into the loop: the top 3 systems from Stage 1 were invited to develop systems that were exposed to human annotators.

Stage 1: initial dataset

Taking inspiration from the Qulac [1] dataset, we have crowdsourced a new dataset to study clarifying questions that is suitable for conversational settings. Namely, the collected dataset consists of:

For training, the collected dataset is split into training (187 topics) and validation (50 topics) sets. For testing, the participants are supplied with: (1) a set of user requests in conversational form and (2) a set of questions (i.e., a question bank) which contains all the questions that we have collected for the collection. Therefore, to answer our research questions, we suggest the following two tasks:

Stage 2: human-in-the-loop

The second stage of the ClariQ data challenge enables the top-performing teams of the first stage to evaluate their models with the help of human evaluators. To do so, we ask the teams to generate their responses in a given conversation and pass the results to human evaluators. We instruct the human evaluators to read and understand the context of the conversation and write a response to the system. The evaluator assumes that they are part of the conversation. We evaluate the performance of a system in two respects: (i) how much the conversation can help a user find the information they are looking for and (ii) how natural and realistic the conversation appears to a human evaluator.

ClariQ Dataset

We have extended the Qulac [1] dataset and base the competition mostly on the training data that Qulac provides. In addition, we have added some new topics, questions, and answers in the training set. The test set is completely unseen and newly collected. Like Qulac, ClariQ consists of single-turn conversations (initial_request, followed by clarifying question and answer). In addition, it comes with synthetic multi-turn conversations (up to three turns). ClariQ features approximately 18K single-turn conversations, as well as 1.8 million multi-turn conversations. Below, we provide a short summary of the data characteristics, for the training set:

ClariQ Train

| Feature | Value |
| --- | --- |
| # train (dev) topics | 187 (50) |
| # faceted topics | 141 |
| # ambiguous topics | 57 |
| # single topics | 39 |
| # facets | 891 |
| # total questions | 3,929 |
| # single-turn conversations | 11,489 |
| # multi-turn conversations | ~ 1 million |
| # documents | ~ 2 million |

Below, we provide a brief overview of the structure of the data.

Files

Below we list the files in the repository:

File Format

train.tsv, dev.tsv:

train.tsv and dev.tsv have the same format. They contain the topics, facets, questions, answers, and clarification need labels. These are considered to be the main files, containing the labels of the training set. Note that the clarification need labels are already explicitly included in the files. The question relevance labels for each topic can be extracted indirectly: each row only contains the questions that are considered to be relevant to a topic. Therefore, any other question is deemed irrelevant while computing Recall@k. In the train.tsv and dev.tsv files, you will find these fields:

Below, you can find a few example rows of train.tsv:

topic_id initial_request topic_desc clarification_need facet_id facet_desc question_id question answer
14 I'm interested in dinosaurs I want to find information about and pictures of dinosaurs. 4 F0159 Go to the Discovery Channel's dinosaur site, which has pictures of dinosaurs and games. Q00173 are you interested in coloring books no i just want to find the discovery channels website
14 I'm interested in dinosaurs I want to find information about and pictures of dinosaurs. 4 F0159 Go to the Discovery Channel's dinosaur site, which has pictures of dinosaurs and games. Q03021 which dinosaurs are you interested in im not asking for that i just want to go to the discovery channel dinosaur page
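As a minimal sketch of how the relevance labels can be extracted indirectly, the snippet below reads rows in the train.tsv layout and collects, per topic, the set of question IDs that appear with it. The helper name `relevant_questions_by_topic` is hypothetical and not part of the repository, and it assumes that the files are tab-separated and that `clarification_need` is an integer label. The two sample rows are the ones shown above.

```python
import csv
import io
from collections import defaultdict

HEADER = ["topic_id", "initial_request", "topic_desc", "clarification_need",
          "facet_id", "facet_desc", "question_id", "question", "answer"]
ROWS = [
    ["14", "I'm interested in dinosaurs",
     "I want to find information about and pictures of dinosaurs.", "4", "F0159",
     "Go to the Discovery Channel's dinosaur site, which has pictures of dinosaurs and games.",
     "Q00173", "are you interested in coloring books",
     "no i just want to find the discovery channels website"],
    ["14", "I'm interested in dinosaurs",
     "I want to find information about and pictures of dinosaurs.", "4", "F0159",
     "Go to the Discovery Channel's dinosaur site, which has pictures of dinosaurs and games.",
     "Q03021", "which dinosaurs are you interested in",
     "im not asking for that i just want to go to the discovery channel dinosaur page"],
]
SAMPLE_TSV = "\n".join("\t".join(row) for row in [HEADER] + ROWS) + "\n"


def relevant_questions_by_topic(tsv_file):
    """Return ({topic_id: set of relevant question_ids}, {topic_id: clarification_need})."""
    # QUOTE_NONE keeps quote characters inside free text from being treated as CSV quoting.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    relevant = defaultdict(set)
    need = {}
    for row in reader:
        topic_id = int(row["topic_id"])
        need[topic_id] = int(row["clarification_need"])  # assumed to be an integer label
        if row["question_id"]:
            relevant[topic_id].add(row["question_id"])
    return dict(relevant), need


relevant, need = relevant_questions_by_topic(io.StringIO(SAMPLE_TSV))
print(relevant, need)
```

With the real file, the `io.StringIO` object would be replaced by an opened file handle, for example `open("train.tsv", newline="", encoding="utf-8")`, where the path depends on the local checkout.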

test.tsv:

test.tsv only contains the list of test topics, as well as their IDs. Below we see some sample rows:

topic_id initial_request
201 I would like to know more about raspberry pi
202 Give me information on uss carl vinson.

question_bank.tsv:

question_bank.tsv consists of all the questions in the collection. So, all the questions that participants may re-rank and select for the test set are also included in this question bank. The TSV file has two columns: question_id, which is the unique ID of the question, and question, which is the text of the question. Below we see some example rows of the file:

question_id question
Q00001
Q02318 what kind of medium do you want this information to be in
Q02319 what kind of penguin are you looking for
Q02320 what kind of pictures are you looking for

Note: Question id Q00001 is reserved for cases when a model predicts that asking clarifying questions is not required. Therefore, selecting Q00001 means selecting no question.

dev_synthetic.pkl.tar.gz and train_synthetic.pkl.tar.gz:

These files contain dicts of synthetically built multi-turn conversations (up to three turns). We follow the same approach explained in [1] to generate these conversations. The format of these files is very similar to the format of the test file that will be fed to the system (see below), except that each record also contains the current question and answer of the conversation. Each record in this dict is identified by its topic, facet, conversation context, question, and answer. Below we see the dict structure:

{<record_id>: {'topic_id': <int>,
  'facet_id': <str>,
  'initial_request': <str>,
  'question': <str>,
  'answer': <str>,
  'conversation_context': [{'question': <str>,
   'answer': <str>},
  {'question': <str>,
   'answer': <str>}],
  'context_id': <int>},
  ...
  }

where
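To illustrate how a record in this structure can be read, here is a minimal sketch that turns one record into a readable dialogue. The helper `format_dialogue` and the values of `record_id` and `context_id` are hypothetical; the texts are taken from the train.tsv example rows above. Extracting the archive and unpickling the dict is not shown.

```python
def format_dialogue(record):
    """Flatten one synthetic record into a list of 'Speaker: text' lines."""
    lines = [f"User: {record['initial_request']}"]
    # Earlier turns of the conversation, oldest first.
    for turn in record["conversation_context"]:
        lines.append(f"System: {turn['question']}")
        lines.append(f"User: {turn['answer']}")
    # The current question and its answer close the conversation.
    lines.append(f"System: {record['question']}")
    lines.append(f"User: {record['answer']}")
    return lines


# Hypothetical record following the dict structure shown above.
sample_records = {
    0: {
        "topic_id": 14,
        "facet_id": "F0159",
        "initial_request": "I'm interested in dinosaurs",
        "question": "which dinosaurs are you interested in",
        "answer": "im not asking for that i just want to go to the discovery channel dinosaur page",
        "conversation_context": [
            {"question": "are you interested in coloring books",
             "answer": "no i just want to find the discovery channels website"},
        ],
        "context_id": 1,  # illustrative value only
    }
}

dialogue = format_dialogue(sample_records[0])
for line in dialogue:
    print(line)
```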

single_turn_train_eval.pkl and multi_turn_****_eval.pkl.tar.gz:

These files are dicts of pre-computed document relevance results after asking each question. The document relevance performance is calculated as follows:

As we see, one first has to identify the evaluation_metric they are interested in, followed by a context_id and question_id. Notice that here we report the retrieval performance both with and without considering the answer to the question. Furthermore, we also include two other values, namely, MAX and MIN. These refer to the maximum and minimum performance that the retrieval model achieves by asking the "best" and "worst" questions among the candidate questions. Below we see a sample of the data:

{ 'NDCG20': 
    { 
      'F0513': 
      {
        'Q00045' : 
         {
           'no_answer': 0.2283394055312402,
           'with_answer': 0.2233114358097999
         }
         , ... , 
         'MAX': 
          {
            'no_answer': 0.30202557044031736,
            'with_answer': 0.28863807501469424
          },
         'MIN':
          {
            'no_answer': 0.16989316652772574,
            'with_answer': 0.054861833842573086
          } 
      }
  },
  ...
}

Notice that this dict contains the following evaluation metrics:

Note: If a question that is not among the candidate questions is selected for a topic (and thus does not appear in single_turn_train_eval.pkl), the document relevance is assumed to be equal to MIN for the facet.

Note: The context_id in the multi-turn dictionaries is an int. The multi-turn dicts also contain single-turn dialogs. For those, the context_id equals the facet_id after removing the initial F and casting to int. On the other hand, for the single-turn dict, the context_id is actually facet_id.
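The lookup order and the two notes above can be sketched as follows. This is only an illustration under stated assumptions: the helper names `document_relevance` and `facet_to_context_id` are hypothetical, the sample dict reuses the values shown above, and applying the MIN fallback to both `with_answer` and `no_answer` is an assumption rather than a confirmed detail of the evaluation script.

```python
def document_relevance(eval_dict, metric, context_id, question_id, with_answer=True):
    """Look up metric -> context_id -> question_id, falling back to MIN."""
    per_context = eval_dict[metric][context_id]
    # Questions outside the candidate set get the facet's MIN performance.
    scores = per_context.get(question_id, per_context["MIN"])
    return scores["with_answer" if with_answer else "no_answer"]


def facet_to_context_id(facet_id):
    """Single-turn entries in the multi-turn dicts: drop the leading 'F', cast to int."""
    return int(facet_id[1:])


sample_eval = {
    "NDCG20": {
        "F0513": {
            "Q00045": {"no_answer": 0.2283394055312402,
                       "with_answer": 0.2233114358097999},
            "MAX": {"no_answer": 0.30202557044031736,
                    "with_answer": 0.28863807501469424},
            "MIN": {"no_answer": 0.16989316652772574,
                    "with_answer": 0.054861833842573086},
        }
    }
}

known = document_relevance(sample_eval, "NDCG20", "F0513", "Q00045")
unknown = document_relevance(sample_eval, "NDCG20", "F0513", "Q99999")  # not a candidate
print(known, unknown, facet_to_context_id("F0513"))
```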

top10k_docs_dict.pkl.tar.gz

top10k_docs_dict.pkl.tar.gz is a dict consisting of a list of document IDs for a given topic_id. In case one plans to use the contents of a document in their model, and does not have access to the ClueWeb09 or ClueWeb12 data collections, this dict is useful for having the list of top 10,000 documents as an initial ranking. The participants can use this list for two purposes:

train.qrel & dev.qrel

These files contain the relevance assessments of ClueWeb09 and ClueWeb12 collections for every facet in the train and dev sets, respectively. They follow the conventional TREC format for qrel files, that is:

<facet_id> 0 <document_id> <relevance_score>

Some sample lines of the train.qrel file are shown below:

F0001 0 clueweb09-en0038-74-08250 1
F0001 0 clueweb09-enwp01-17-11113 1
F0002 0 clueweb09-en0001-02-21241 1
F0002 0 clueweb09-en0006-52-11056 1
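A minimal sketch of reading this format into a nested dict, keyed by facet and then by document ID, is shown below. The function name `parse_qrels` is hypothetical, and skipping lines that do not have exactly four whitespace-separated fields is a choice made here for illustration.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Parse '<facet_id> 0 <document_id> <relevance_score>' lines."""
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        facet_id, _unused, doc_id, relevance = parts
        qrels[facet_id][doc_id] = int(relevance)
    return dict(qrels)


sample_qrel_lines = [
    "F0001 0 clueweb09-en0038-74-08250 1",
    "F0001 0 clueweb09-enwp01-17-11113 1",
    "F0002 0 clueweb09-en0001-02-21241 1",
    "F0002 0 clueweb09-en0006-52-11056 1",
]

qrels = parse_qrels(sample_qrel_lines)
print(sorted(qrels))
```

The same function accepts an open file handle, since iterating over a text file yields its lines.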

ClariQ Evaluation Script

We provide an evaluation script, called clariq_eval_tool.py, to evaluate submitted runs. We strongly recommend that participants evaluate their models on the dev set using this script before submitting their runs. clariq_eval_tool.py can be used to evaluate three subtasks:

Below, we see all the possible commands that one can pass to clariq_eval_tool.py:

usage: clariq_eval_tool.py [-h] --eval_task EVAL_TASK
                           [--experiment_type EXPERIMENT_TYPE]
                           [--data_dir DATA_DIR] --run_file RUN_FILE
                           [--out_file OUT_FILE] [--multi_turn]

And here is the full description if one passes the -h argument:

optional arguments:
  -h, --help            show this help message and exit
  --eval_task EVAL_TASK
                        Defines the evaluation task. Possible values: clarific
                        ation_need|document_relevance|question_relevance
  --experiment_type EXPERIMENT_TYPE
                        Defines the experiment type. The run file will be
                        evaluated on the data that you specify here. Possible
                        values: train|dev|test. Default value: dev
  --data_dir DATA_DIR   Path to the data directory.
  --run_file RUN_FILE   Path to the run file.
  --out_file OUT_FILE   Path to the evaluation output json file.
  --multi_turn          Determines if the results are on multi-turn
                        conversations. Conversation is assumed to be single-
                        turn if not specified.

As the description above is self-contained in most cases, we only add some additional remarks below:

Requirements

Examples

Below, we give some examples of how to use the script and what to expect as output:

python ./src/clariq_eval_tool.py --eval_task document_relevance \
                                 --data_dir ./data/ \
                                 --experiment_type dev \
                                 --run_file ./sample_runs/dev_best_q \
                                 --out_file ./sample_runs/dev_best_q.eval

This would produce the output below:

NDCG1: 0.3541666666666667
NDCG3: 0.33374776946106466
NDCG5: 0.3064048059484046
NDCG10: 0.26443649709165346
NDCG20: 0.22765633337753358
P1: 0.41875
P3: 0.37916666666666665
P5: 0.32875
P10: 0.256875
P20: 0.186875
MRR100: 0.4882460524507918

An example on question relevance:

python ./src/clariq_eval_tool.py --eval_task question_relevance \
                                 --data_dir ./data/ \
                                 --experiment_type dev \
                                 --run_file ./sample_runs/dev_bm25 \
                                 --out_file ./sample_runs/dev_bm25_question_relevance.eval

This would produce the output below:

Recall5: 0.3245570421150917
Recall10: 0.5638042646208281
Recall20: 0.6674997108155003
Recall30: 0.6912818698329535
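For intuition about what the Recall@k values measure, here is a minimal sketch of the standard per-topic definition: the fraction of a topic's relevant questions that appear among the top k ranked questions. This is not taken from clariq_eval_tool.py, whose exact computation and averaging across topics may differ. The function name `recall_at_k` and the relevant set used in the example are hypothetical.

```python
def recall_at_k(ranked_question_ids, relevant_question_ids, k):
    """Fraction of relevant questions found in the top-k of a ranking."""
    relevant = set(relevant_question_ids)
    if not relevant:
        return 0.0  # convention chosen here for topics without relevant questions
    hits = len(set(ranked_question_ids[:k]) & relevant)
    return hits / len(relevant)


# Ranked question IDs for one topic; the relevant set is made up for illustration.
ranked = ["Q00380", "Q02669", "Q03333"]
relevant = {"Q00380", "Q03333", "Q01111"}

recall_2 = recall_at_k(ranked, relevant, 2)
recall_3 = recall_at_k(ranked, relevant, 3)
print(recall_2, recall_3)
```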

Run file format

To evaluate a run using the evaluation script, each file should be formatted as follows. The following files can be evaluated using the script:

Below we explain how each file should be formatted.

Question ranking

This file is supposed to contain a ranked list of questions per topic. The number of questions per topic could be any number, but we evaluate only the top 30 questions. We follow the traditional TREC run format. Each line of the file should be formatted as follows:

<topic_id> 0 <question_id> <ranking> <relevance_score> <run_id>

Each line represents a relevance prediction. <relevance_score> is the relevance score that a model predicts for a given <topic_id> and <question_id>. <run_id> is a string indicating the ID of the submitted run. <ranking> denotes the ranking of the <question_id> for <topic_id>. Practically, the ranking is computed by sorting the questions for each topic by their relevance scores. Here are some example lines:

170 0 Q00380 1 6.53252 sample_run
170 0 Q02669 2 6.42323 sample_run
170 0 Q03333 3 6.34980 sample_run
171 0 Q03775 1 4.32344 sample_run
171 0 Q00934 2 3.98838 sample_run
171 0 Q01138 3 2.34534 sample_run

This run file will be used to evaluate both question relevance and document relevance. Sample runs can be found in the ./sample_runs/ directory.
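A run file in this format can be produced from per-topic relevance scores along the following lines. This is a sketch, not the organizers' code: the function name `to_trec_run_lines`, the five-decimal score formatting, and the truncation to 30 questions per topic (matching the evaluation cutoff) are choices made here for illustration.

```python
def to_trec_run_lines(scores_by_topic, run_id, max_rank=30):
    """Build '<topic_id> 0 <question_id> <ranking> <relevance_score> <run_id>' lines."""
    lines = []
    for topic_id in sorted(scores_by_topic):
        # Rank questions by descending relevance score.
        ranked = sorted(scores_by_topic[topic_id].items(),
                        key=lambda item: item[1], reverse=True)
        for rank, (question_id, score) in enumerate(ranked[:max_rank], start=1):
            lines.append(f"{topic_id} 0 {question_id} {rank} {score:.5f} {run_id}")
    return lines


# Scores copied from the example lines above, deliberately given out of order.
scores = {
    170: {"Q02669": 6.42323, "Q00380": 6.53252, "Q03333": 6.34980},
    171: {"Q01138": 2.34534, "Q03775": 4.32344, "Q00934": 3.98838},
}

run_lines = to_trec_run_lines(scores, "sample_run")
print("\n".join(run_lines))
```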

Clarification need

This file is supposed to contain the predicted clarification_need labels. Therefore, the file format is simply the topic_id and the predicted label. Sample lines can be found below:

171 1
170 3
182 4

Multi-turn Input/Output

Each team in the second stage must submit a system that accepts the conversation in the following format, and produces output as described.

Input format

{<record_id>: {'topic_id': <int>,
  'facet_id': <str>,
  'initial_request': <str>,
  'conversation_context': [{'question': <str>,
   'answer': <str>},
  {'question': <str>,
   'answer': <str>}],
  'context_id': <int>},
  ...
  }

where

Output format

The system output should be submitted in a single file per set (dev and test) in the following format:

<context_id> 0 "<question_text>" <ranking> <relevance_score> <run_id>

Participants may submit more than one response per context_id; however, we only evaluate the first response in the ranked list per context_id. <question_text> must be quoted. An empty string ("") as the value of <question_text> indicates that a system asks no question for a given context (i.e., Q00001). This could be the case where the system predicts that no further improvement can be achieved by asking clarifying questions, or that no further clarification is required. We mark an empty question as the end of a conversation, and count the number of turns based on that.

Notice that <question_text> must be a str containing the question. As participants are allowed to select a question from the question_bank or generate clarifying questions, we only take full text strings as input. In case a question is selected from the question_bank, simply quote the text of the question. An example generated output can be found below:

784 0 "are you looking for reviews related to the pampered chef" 0 13 bestq_multi_turn
785 0 "" 0 13 bestq_multi_turn
813 0 "are you interested in a current map of the united states" 0 17 bestq_multi_turn
820 0 "are you looking for a specific type of solar panels" 0 10 bestq_multi_turn
841 0 "" 0 15 bestq_multi_turn

Baselines

BM25 Ranker

BERT-based Ranker

We have trained a BERT-based model for the question_relevance task. The model fine-tunes BERT to retrieve relevant questions for a given topic. The model is tested on two different evaluation setups, i.e., question reranking and question ranking. The reranking model takes the top 30 predictions of BM25 and reranks them, while the full ranking model ranks all the questions available in the question bank. The results of the two models can be found in the leaderboard. Special thanks to Gustavo Penha, who kindly developed the models based on the Transformer Rankers library, and shared the code in a Google Colab Notebook.

Citing

@inproceedings{aliannejadi2021building,
    title={Building and Evaluating Open-Domain Dialogue Corpora with Clarifying Questions},
    author={Mohammad Aliannejadi and Julia Kiseleva and Aleksandr Chuklin and Jeff Dalton and Mikhail Burtsev},
    year={2021},
    booktitle={{EMNLP}}  
}

Acknowledgments

The challenge is organized as a joint effort by the University of Amsterdam, Microsoft, Google, University of Glasgow, and MIPT. We would like to thank Microsoft for their generous support of data annotation costs. We would also like to thank the Webis Group for giving us access to the ChatNoir search API. We appreciate Gustavo Penha's efforts in the development of the BERT-based baselines for the task. Thanks to the crowd workers for their invaluable help in annotating ClariQ.

References