bigscience-workshop / promptsource

Toolkit for creating, sharing and using natural language prompts.
Apache License 2.0

Writing: final 28 hour tasks #493

Closed srush closed 2 years ago

srush commented 3 years ago
awebson commented 2 years ago

Finished a very rough draft of the Results section, erring on the side of writing down more information than we need, since you can always remove/comment out things. The current narrative is a little too obsessed with the horserace against FLAN, and I feel like I haven't sufficiently analyzed our own good results.

RQ2 figures are mostly ready in Colab; I just need to rearrange the layouts and make other small tweaks.

I agree that we should ideally formalize the notion of robustness, not just variance. I wrote down a compressed version of your and Colin's small debate during our call, but this is only meant to be a work in progress. I don't know if we will converge on a precise definition of robustness by the deadline, but I will definitely keep trying.

I think the ultimate question I want to answer is: given a task and all the "reasonable" prompts you could give to a human, which ones do models do well vs. poorly on, and why? I don't know what the correct summary statistic is yet, but "robustness" was meant to be (I'm not saying it is) a proxy measurement of how many prompts models don't "understand".
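To make that a bit more concrete, here is a rough sketch of a few candidate summary statistics over per-prompt scores. The variable names and the 0.1 threshold are just placeholders I made up for illustration, not anything we've agreed on:

```python
# Rough sketch: candidate summary statistics over per-prompt accuracies
# for one (model, dataset) pair. Numbers below are made up.
import numpy as np

per_prompt_acc = {
    "prompt_a": 0.62,
    "prompt_b": 0.58,
    "prompt_c": 0.35,  # a prompt the model does poorly on
    "prompt_d": 0.60,
}

scores = np.array(list(per_prompt_acc.values()))

median = np.median(scores)                                        # central tendency across prompts
iqr = np.percentile(scores, 75) - np.percentile(scores, 25)       # spread that ignores extreme prompts
variance = scores.var(ddof=1)                                     # the statistic we currently report

# One possible proxy for "how many prompts the model doesn't understand":
# the fraction of prompts whose accuracy falls well below the median.
threshold = median - 0.1  # arbitrary cutoff, purely illustrative
frac_failed = float((scores < threshold).mean())

print(f"median={median:.3f} IQR={iqr:.3f} var={variance:.4f} frac_failed={frac_failed:.2f}")
```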

Also, with such large variances, it is likely that on some datasets T0+/++ are simply not statistically significantly different from T0, which is scientifically important even if in practice people will use the one with the higher median. We don't have enough time to code up and write up the Wilcoxon tests now, but they are a high priority for me for the second draft.
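For the second draft, something along these lines should be enough (a minimal sketch with made-up accuracies, pairing the two models' per-prompt scores on the same dataset):

```python
# Sketch: paired Wilcoxon signed-rank test comparing two models over the
# same set of prompts on one dataset. Accuracy lists below are fabricated;
# plug in the real per-prompt numbers.
from scipy.stats import wilcoxon

t0_acc     = [0.61, 0.58, 0.35, 0.60, 0.55]  # T0, one accuracy per prompt
t0plus_acc = [0.63, 0.59, 0.34, 0.62, 0.57]  # T0+, same prompts, same order

stat, p_value = wilcoxon(t0_acc, t0plus_acc)
print(f"W={stat:.1f}, p={p_value:.3f}")
# If p is large, we can't claim T0+ is significantly different from T0 on this
# dataset, even if its median across prompts is higher.
```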

VictorSanh commented 2 years ago

I took a stab at the experimental details.