StonyBrookNLP / musique

Repository for MuSiQue: Multi-hop Questions via Single-hop Question Composition, TACL 2022
Creative Commons Attribution 4.0 International

Where can I find the original single hop questions? #5

Closed sagnik closed 10 months ago

sagnik commented 10 months ago

Hi

Thanks for the great work!

I am looking for the original single-hop questions that were combined. For example, in the first answerable validation question, we have this:

 'question_decomposition': {'id': [460946, 294723],
  'question': ['Green >> performer', '#1 >> spouse'],
  'answer': ['Steve Hillage', 'Miquette Giraudy'],
  'paragraph_support_idx': [10, 5]},

Can we use these ids to trace back the original questions from the datasets?
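For readers who want to work with the decomposition shown above, here is a minimal sketch that turns the column-oriented dict into one record per single-hop step and substitutes `#1`-style placeholders with the answer of the referenced step. The helper names `to_steps` and `resolve_references` are hypothetical, and reading `#k` as "the answer of step k" is an assumption based on this example, not a documented API.

```python
import re

# Decomposition copied from the example above (one list per field).
decomposition = {
    "id": [460946, 294723],
    "question": ["Green >> performer", "#1 >> spouse"],
    "answer": ["Steve Hillage", "Miquette Giraudy"],
    "paragraph_support_idx": [10, 5],
}


def to_steps(decomp):
    """Turn the dict-of-lists into a list with one dict per single-hop step."""
    keys = list(decomp)
    return [dict(zip(keys, values)) for values in zip(*(decomp[k] for k in keys))]


def resolve_references(steps):
    """Replace '#k' with the answer of step k (1-based).

    Assumption: '#k' always points at an earlier step's answer.
    """
    resolved = []
    for step in steps:
        question = re.sub(
            r"#(\d+)",
            lambda m: steps[int(m.group(1)) - 1]["answer"],
            step["question"],
        )
        resolved.append({**step, "question": question})
    return resolved


steps = resolve_references(to_steps(decomposition))
for step in steps:
    print(step["id"], "|", step["question"], "->", step["answer"])
```

Each resulting step carries its own `id`, question, answer, and supporting paragraph index, which is the information a lookup against the source datasets would start from.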

HarshTrivedi commented 10 months ago

These IDs refer to the unique identifiers in our compiled dataset (box 2 in Figure 2), not those in the original datasets. However, if you really want, it is possible to find the original QA instance in the source datasets using a similarity-based match on the question, paragraph text, and answer string from the decomposition step. I am curious, though: why do you need it, given that you already have the Q, C, and A from the decomposition step?

On a related note, if you are planning to train models on single-hop questions from the constituent datasets, please make sure to read the Usage Caution note in the README.
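The similarity-based match mentioned above could be sketched roughly as follows. This is only an illustration using the standard library's `difflib`. The record fields (`source`, `question`, `paragraph`, `answer`), the weights, the threshold, and the toy candidates are all assumptions made for the example and do not come from the released files or any source dataset.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score(query, candidate, weights=(0.4, 0.4, 0.2)):
    """Weighted similarity over question, paragraph, and answer (weights are arbitrary)."""
    fields = ("question", "paragraph", "answer")
    return sum(w * similarity(query[f], candidate[f]) for w, f in zip(weights, fields))


def best_match(query, candidates, threshold=0.8):
    """Return the highest-scoring candidate, or None if nothing clears the threshold."""
    best = max(candidates, key=lambda c: score(query, c), default=None)
    if best is None or score(query, best) < threshold:
        return None
    return best


# Query built from one decomposition step; the paragraph is placeholder text.
query = {
    "question": "Green >> performer",
    "paragraph": "Green is an album by Steve Hillage.",
    "answer": "Steve Hillage",
}

# Toy candidate pool standing in for records from the source datasets (hypothetical).
candidates = [
    {"source": "toy-a", "question": "Green >> performer",
     "paragraph": "Green is an album by Steve Hillage.", "answer": "Steve Hillage"},
    {"source": "toy-b", "question": "Who wrote the novel?",
     "paragraph": "An unrelated placeholder paragraph.", "answer": "Someone Else"},
]

match = best_match(query, candidates)
print(match["source"] if match else "no match")
```

On real data, exact string comparison on normalized text would likely catch most cases first, with fuzzy scoring as a fallback; the threshold would need tuning against a few manually checked matches.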

sagnik commented 10 months ago

> However, if you really want, it is possible to find the original QA instance in the source datasets using a similarity-based match on the question, paragraph text, and answer string from the decomposition step.

Thanks, I imagined doing something like this, but wanted to confirm first. You do have the single-hop questions for the dev and test sets (which I imagine will reduce the search space a bit). Any chance you have them for the training set?

> I am curious, though: why do you need it, given that you already have the Q, C, and A from the decomposition step?

I want to only look at the multihop questions generated from SQuAD. If there is an indicator of the original dataset somewhere, that would solve my problem for now.

Also, off the topic, is this HF dataset an official release? If yes, this probably can be included in the README.

HarshTrivedi commented 10 months ago

> I want to only look at the multihop questions generated from SQuAD

Got it. I have dataset-level information available in the internal, non-released version of the dataset files. If you shoot me an email, I can send it to you.

> Also, off the topic, is this HF dataset an official release? If yes, this probably can be included in the README.

No, it's not official. I intended to put it on HF datasets at some point, but never got to it.

sagnik commented 10 months ago

Sent.

HarshTrivedi commented 10 months ago

Shared the files. Closing this now. Feel free to reopen if you have more questions.