Hi Shulin, thanks for your interest! Here are the HPs:
For IRCoT,
For OneR,
Here K is the number of paragraphs to retrieve at each step, and M is the number of distractor paragraphs.
They are also mentioned on page 5 (right side).
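To make the two knobs concrete, here is a minimal sketch of how I understand them to be used; the retriever object and the function names below are illustrative placeholders, not the repo's actual API:

```python
import random

def retrieval_step(retriever, query, K):
    # At each retrieval step of inference, fetch the top-K paragraphs
    # for the current query.
    return retriever.retrieve(query, top_k=K)

def demo_context(gold_paras, distractor_pool, M, seed=0):
    # For an in-context demonstration, the paragraphs shown to the model are
    # its gold supporting paragraphs plus M randomly sampled distractors.
    rng = random.Random(seed)
    paras = gold_paras + rng.sample(distractor_pool, M)
    rng.shuffle(paras)
    return paras
```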
PS: I'll get back to your other email about MuSiQue once I hear back from my collaborators.
Hi Harsh,
Thank you for your quick response!
For MuSiQue, K (and M) was not set as one exact number for all the experiments, because the "training" examples were randomly chosen from the 20 demonstrations, so the values vary with each sampling.
Hope I am right (^_^)
No issues!
K (and M) was not set as one exact number ...
Not quite right. Let me clarify the process.
For each dataset, we prepare 20 demonstrations and then make 3 samples, each of size 15 (let's say S1, S2, S3). We then do HP tuning on the dev set using S1 and pick K and M. Using those K and M, we run the inference for each of S1, S2, and S3 on the test set. So, K (and M) are picked once for each dataset (HotpotQA, 2Wiki, MuSiQue, IIRC) and model (OneR, IRCoT) pair, and then used for all test evaluations for that pair.
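In rough pseudocode, the procedure is something like the sketch below; `tune_hps` and `run_inference` are placeholders for the actual tuning and inference code, not functions from the repo:

```python
import random

NUM_DEMOS = 20      # demonstrations prepared per dataset
SAMPLE_SIZE = 15    # each of S1, S2, S3 holds 15 of them
NUM_SAMPLES = 3

def make_samples(demonstrations, seed=0):
    """Draw 3 random samples of 15 demonstrations each from the pool of 20."""
    assert len(demonstrations) == NUM_DEMOS
    rng = random.Random(seed)
    return [rng.sample(demonstrations, SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]

def tune_then_evaluate(demonstrations, dev_set, test_set, tune_hps, run_inference):
    s1, s2, s3 = make_samples(demonstrations)
    # K and M are picked once per (dataset, model) pair: tuned on dev using S1 only.
    K, M = tune_hps(s1, dev_set)
    # The same K and M are then reused for test inference with all three samples.
    return [run_inference(sample, test_set, K=K, M=M) for sample in (s1, s2, s3)]
```

So the only thing that differs across S1, S2, and S3 at test time is the demonstration set itself, not the hyperparameters.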
Note that although I said we have 15 demonstrations for each inference, we only keep as many demonstrations as fit within the max token limit (6K for Flan-T5-* and 8K for Codex). Demonstrations that appear first are kept and the rest are dropped. The number kept varies, since the length of the test instance changes with K and M.
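Concretely, the truncation rule works roughly like this sketch; `count_tokens` stands in for whatever tokenizer the model uses, and the names are illustrative rather than the repo's actual code:

```python
MAX_TOKENS = {"flan-t5": 6000, "codex": 8000}   # per-model prompt limits

def fit_demonstrations(demonstrations, test_instance_text, model, count_tokens):
    # The test instance (whose length grows with K and M) is always included,
    # so the remaining budget determines how many demonstrations survive.
    budget = MAX_TOKENS[model] - count_tokens(test_instance_text)
    kept, used = [], 0
    for demo in demonstrations:      # earlier demonstrations get priority
        demo_len = count_tokens(demo)
        if used + demo_len > budget:
            break                    # drop this one and everything after it
        kept.append(demo)
        used += demo_len
    return kept
```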
Hope this clarifies.
I see, that clarifies it!
@HarshTrivedi I have one question. First of all, great work! I was trying to adapt the repo for evaluating my pipeline.
Thanks!
Hi Harsh, I am wondering, for the 4 datasets, what are the values of K (the number of paragraphs to retrieve at each step) and M (the number of distractor paragraphs) for IRCoT? Could you please provide the details? Thanks!