bigscience-workshop / promptsource

Toolkit for creating, sharing and using natural language prompts.
Apache License 2.0

Writing: final 28 hour tasks #493

Closed srush closed 2 years ago

srush commented 3 years ago
awebson commented 2 years ago

Finished a very rough draft of the Results section, erring on the side of writing down more information than we need, since you can always remove/comment out things. The current narrative is a little too obsessed with the horserace against FLAN, and I feel like I haven't sufficiently analyzed our own good results.

RQ2 figures are mostly ready in Colab; I just need to rearrange the layouts and make other small tweaks.

I agree that we should ideally formalize the notion of robustness, not just variance. I wrote down a compressed version of your and Colin's small debate during our call, but this is only meant to be a work in progress. I don't know if we will converge on a precise definition of robustness by the deadline, but I will definitely keep trying.

I think the ultimate question I want to answer is: given a task and all the "reasonable" prompts you could give to a human, which ones do models do well vs. poorly on, and why? I don't know what the correct summary statistic is yet, but "robustness" was meant to be (I'm not saying it is) a proxy measurement of how many prompts models don't "understand".
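To make that a bit more concrete, here is a rough sketch of a few candidate summary statistics over per-prompt scores. The variable names and the 0.1 threshold are just placeholders I made up for illustration, not anything we've agreed on:

```python
# Rough sketch: candidate summary statistics over per-prompt accuracies
# for one (model, dataset) pair. Numbers below are made up.
import numpy as np

per_prompt_acc = {
    "prompt_a": 0.62,
    "prompt_b": 0.58,
    "prompt_c": 0.35,  # a prompt the model does poorly on
    "prompt_d": 0.60,
}

scores = np.array(list(per_prompt_acc.values()))

median = np.median(scores)                                        # central tendency across prompts
iqr = np.percentile(scores, 75) - np.percentile(scores, 25)       # spread that ignores extreme prompts
variance = scores.var(ddof=1)                                     # the statistic we currently report

# One possible proxy for "how many prompts the model doesn't understand":
# the fraction of prompts whose accuracy falls well below the median.
threshold = median - 0.1  # arbitrary cutoff, purely illustrative
frac_failed = float((scores < threshold).mean())

print(f"median={median:.3f} IQR={iqr:.3f} var={variance:.4f} frac_failed={frac_failed:.2f}")
```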

Also, with such large variances, it is likely that on some datasets T0+/++ are simply not statistically significantly different from T0, which is scientifically important even if in practice people will use the one with the higher median. We don't have enough time to code up and write up the Wilcoxon tests now, but they are a high priority for me for the second draft.
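For the second draft, something along these lines should be enough (a minimal sketch with made-up accuracies, pairing the two models' per-prompt scores on the same dataset):

```python
# Sketch: paired Wilcoxon signed-rank test comparing two models over the
# same set of prompts on one dataset. Accuracy lists below are fabricated;
# plug in the real per-prompt numbers.
from scipy.stats import wilcoxon

t0_acc     = [0.61, 0.58, 0.35, 0.60, 0.55]  # T0, one accuracy per prompt
t0plus_acc = [0.63, 0.59, 0.34, 0.62, 0.57]  # T0+, same prompts, same order

stat, p_value = wilcoxon(t0_acc, t0plus_acc)
print(f"W={stat:.1f}, p={p_value:.3f}")
# If p is large, we can't claim T0+ is significantly different from T0 on this
# dataset, even if its median across prompts is higher.
```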

VictorSanh commented 2 years ago

I took a stab at the experimental details.