The phrase "1M Qs" (1 million questions answered correctly) is in the Colabs and paper 2. Paper 2 gives the impression that, although we have successfully run 1M Qs, we expect we could successfully run 10M Qs. One reviewer called "1M Qs" arbitrary and railed against the assumption of complete accuracy in a model.
Some of the newly added models (using different seeds) have failed on 1 or 2 questions in 1M. The 1M run stops after the first error. This 1 question could be found after say 200K questions, not giving a good impression of how accurate/inaccurate the model is.
The IT industry measures reliability, e.g. data-centre uptime, using terms like "five 9s", meaning 99.999%. This is a well-known industry term and makes explicit that "five 9s" still allows 0.001% downtime. We should move to this terminology/approach for model accuracy rather than continue to use our own invented term. Specifically:
Change the Colab text to use "six 9s" instead of "1M Qs" when describing empirically measured model accuracy.
Change the 1M Q Colab code so that it doesn't stop on the first error. Instead, have it measure the number of fails per million questions. If a model has 0 or 1 fail per million questions, its accuracy is "six 9s"; if it has <= 10 fails per million, its accuracy is "five 9s"; and so on. The Paper 1 model is 99% accurate, so it is "two 9s". (A sketch of this mapping appears after this list.)
Update Paper 2 to use "99.9999% (aka 'six 9s')" instead of "1M Qs" as the empirical prerequisite.
Update the Paper 2 stats on the various models' accuracy, currently given as "fails after 900K Qs", to show "#fails/1M" and the corresponding label, e.g. "five 9s".
Remove from Paper 2 any references to "perfectly accurate", "completely accurate", or similar.
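As a rough illustration of the counting and labelling change for the Colab, here is a minimal Python sketch. The names (`nines_label`, `measure_accuracy`, `model_answers_correctly`) are hypothetical placeholders, not the actual Colab code.

```python
def nines_label(fails: int, total: int = 1_000_000) -> str:
    """Map a failure count over `total` questions to an 'N 9s' accuracy label."""
    # With 1M questions we cannot claim better than six 9s, so treat 0 fails as "at most 1".
    effective_fails = max(fails, 1)
    nines = 0
    threshold = total // 10  # one 9 (90% accuracy) allows up to total/10 fails
    while effective_fails <= threshold:
        nines += 1
        threshold //= 10     # each extra 9 allows 10x fewer fails
    words = ["zero", "one", "two", "three", "four", "five", "six"]
    return f"{words[min(nines, len(words) - 1)]} 9s"


def measure_accuracy(model, questions) -> str:
    """Run every question (no early stop), count fails, and report the 'N 9s' label."""
    # model_answers_correctly() is a stand-in for however the Colab checks one answer.
    fails = sum(0 if model_answers_correctly(model, q) else 1 for q in questions)
    return f"{fails} fails / {len(questions)} questions -> {nines_label(fails, len(questions))}"


# Examples matching the rule above:
#   nines_label(0)      -> 'six 9s'   (0 or 1 fail per 1M)
#   nines_label(1)      -> 'six 9s'
#   nines_label(10)     -> 'five 9s'  (<= 10 fails per 1M)
#   nines_label(10_000) -> 'two 9s'   (99% accurate, like the Paper 1 model)
```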
Claiming "five 9s" or "six 9s" accuracy for a transformer model is still very impressive.
The phrase "1M Qs" (1 million questions answered correctly) is in the Colabs and paper 2. Paper 2 gives the impression that, although we have successfully run 1M Qs, we expect we could successfully run 10M Qs. One reviewer called "1M Qs" arbitrary and railed against the assumption of complete accuracy in a model.
Some of the newly added models (using different seeds) have failed on 1 or 2 questions in 1M. The 1M run stops after the first error. This 1 question could be found after say 200K questions, not giving a good impression of how accurate/inaccurate the model is.
IT industry measures say data-centre uptime reliability using terms like "five 9s" which is 99.999%. This is a known industry term and makes clear that there is "five 9s" has 0.001 of downtime. We should move to this terminology/approach for model accuracy- rather than continue to use our own invented term. Specifically:
Claiming a model has "five 9s" or "six 9s" accuracy is still a very impressive accuracy for a transformer model.