Hi Shulin, thanks for your interest! Here are the HPs:
For IRCoT,
For OneR,
Here K is the number of paragraphs to retrieve at each step, and M is the number of distractor paragraphs.
They are also mentioned on page 5 (right side).
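To make the two knobs concrete, here is a minimal sketch of how I understand them to be used; the retriever object and the function names below are illustrative placeholders, not the repo's actual API:

```python
import random

def retrieval_step(retriever, query, K):
    # At each retrieval step of inference, fetch the top-K paragraphs
    # for the current query.
    return retriever.retrieve(query, top_k=K)

def demo_context(gold_paras, distractor_pool, M, seed=0):
    # For an in-context demonstration, the paragraphs shown to the model are
    # its gold supporting paragraphs plus M randomly sampled distractors.
    rng = random.Random(seed)
    paras = gold_paras + rng.sample(distractor_pool, M)
    rng.shuffle(paras)
    return paras
```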
PS: I'll get back to your other email about MuSiQue once I hear back from my collaborators.
Hi Harsh,
Thank you for your quick response!
For MuSiQue, K (and M) was not set as one exact number for all the experiments, because the "training" examples were randomly chosen from the 20 demonstrations, so the values vary with each sampling.
Hope I am right (^_^)
No issues!
K (and M) was not set as one exact number ...
Not quite right. Let me clarify the process.
For each dataset, we prepare 20 demonstrations and then make 3 samples, each of size 15 (let's say S1, S2, S3). We then do HP tuning on the dev set using S1 and pick K and M. Using those K and M, we run the inference for each of S1, S2, and S3 on the test set. So, K (and M) are picked once for each dataset (HotpotQA, 2Wiki, MuSiQue, IIRC) and model (OneR, IRCoT) pair, and then used for all test evaluations for that pair.
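In rough pseudocode, the procedure is something like the sketch below; `tune_hps` and `run_inference` are placeholders for the actual tuning and inference code, not functions from the repo:

```python
import random

NUM_DEMOS = 20      # demonstrations prepared per dataset
SAMPLE_SIZE = 15    # each of S1, S2, S3 holds 15 of them
NUM_SAMPLES = 3

def make_samples(demonstrations, seed=0):
    """Draw 3 random samples of 15 demonstrations each from the pool of 20."""
    assert len(demonstrations) == NUM_DEMOS
    rng = random.Random(seed)
    return [rng.sample(demonstrations, SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]

def tune_then_evaluate(demonstrations, dev_set, test_set, tune_hps, run_inference):
    s1, s2, s3 = make_samples(demonstrations)
    # K and M are picked once per (dataset, model) pair: tuned on dev using S1 only.
    K, M = tune_hps(s1, dev_set)
    # The same K and M are then reused for test inference with all three samples.
    return [run_inference(sample, test_set, K=K, M=M) for sample in (s1, s2, s3)]
```

So the only thing that differs across S1, S2, and S3 at test time is the demonstration set itself, not the hyperparameters.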
Note that although I said we have 15 demonstrations for each inference, we only keep as many demonstrations as fit within the max token limit (6K for Flan-T5-* and 8K for Codex). Demonstrations that appear first are kept and the rest are dropped. The number kept varies, since the length of the test instance changes with K and M.
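Concretely, the truncation rule works roughly like this sketch; `count_tokens` stands in for whatever tokenizer the model uses, and the names are illustrative rather than the repo's actual code:

```python
MAX_TOKENS = {"flan-t5": 6000, "codex": 8000}   # per-model prompt limits

def fit_demonstrations(demonstrations, test_instance_text, model, count_tokens):
    # The test instance (whose length grows with K and M) is always included,
    # so the remaining budget determines how many demonstrations survive.
    budget = MAX_TOKENS[model] - count_tokens(test_instance_text)
    kept, used = [], 0
    for demo in demonstrations:      # earlier demonstrations get priority
        demo_len = count_tokens(demo)
        if used + demo_len > budget:
            break                    # drop this one and everything after it
        kept.append(demo)
        used += demo_len
    return kept
```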
Hope this clarifies.
I see, that clarifies it!
@HarshTrivedi I have one question. First of all, great work! I was trying to adapt the repo for evaluating my pipeline.
Thanks!
Hi Harsh, I am wondering, for the 4 datasets, what are the values of K (the number of paragraphs to retrieve at each step) and M (the number of distractor paragraphs) for IRCoT? Could you please provide the details? Thanks!