annargrs / blog


2019/leaderboards/ #1

Open utterances-bot opened 5 years ago

utterances-bot commented 5 years ago

How the Transformers broke NLP leaderboards - Hacking semantics

With the huge Transformer-based models such as BERT, GPT-2, and XLNet, are we losing track of how the state-of-the-art performance is achieved?

https://hackingsemantics.xyz/2019/leaderboards/

xiaoyunwu commented 5 years ago

This is a great writeup, and it is true that while we are validating that the big models do something good, we should keep in mind that 1. there are cases where we cannot use them because of inference cost; 2. we have not solved the problem entirely.

jabowery commented 4 years ago

The most rigorous model selection criterion is the size of a self-extracting archive of the benchmark corpus. That approximation of Kolmogorov Complexity has been used in the Hutter Prize for Lossless Compression of Human Knowledge since 2006. With the much greater resources being thrown at language models nowadays, a larger benchmark corpus is in order, as is a much higher resource limit.

Perplexity is less rigorous because it leaves more parameters unspecified than Kolmogorov Complexity, which has only one:

The choice of Universal Turing Machine (UTM) used to execute the program whose length is the definition of Kolmogorov Complexity. That program has a simple job: output the data being modeled (i.e., it is an executable archive of the data). As long as the algorithmic resources provided by the UTM platform (i.e., libraries such as TensorFlow, etc.) available to the self-extracting archive remain constant, the benchmark is both fair and rigorous.
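A minimal sketch of that scoring principle, assuming zlib as a stand-in for a real learned compressor and hypothetical file names (the Hutter Prize's actual rules, corpus, and resource limits differ): the score is the size of the compressed data plus the size of the decompressor program, and the archive must reproduce the corpus byte for byte.

```python
# Sketch only: score a "model" by total description length =
# size of compressed output + size of the decompressor program,
# with an exact-reconstruction check. zlib stands in for a learned
# compressor; the file paths are hypothetical.
import os
import zlib

def description_length(corpus_path: str, decompressor_path: str) -> int:
    """Return total size in bytes: compressed corpus + decompressor code."""
    with open(corpus_path, "rb") as f:
        corpus = f.read()

    compressed = zlib.compress(corpus, level=9)

    # Losslessness check: the archive must reproduce the corpus exactly.
    assert zlib.decompress(compressed) == corpus

    # The decompressor's own length counts against the score, so model
    # parameters cannot be hidden outside the measurement.
    return len(compressed) + os.path.getsize(decompressor_path)

print(description_length("enwik9", "decompressor.py"))  # hypothetical files
```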

Perplexity not only fails to specify the choice of UTM upon which the language model runs, it also fails to specify:

1) The length, in bits, of the program that interprets the language model.
2) The length of a "word" (or other lexical entity used as the basic metric). It is inadequate to specify this as the average length of a word, since other statistical moments (such as standard deviation, skewness, and kurtosis) may affect the measurement, as does the precise definition of what a word comprises (illustrated in the sketch after this list).
3) How non-alpha characters, such as punctuation, are to be measured/handled.
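To illustrate point 2: perplexity is the exponential of the average negative log-likelihood per unit, so the same total model loss yields different numbers depending on what counts as a unit. The sentence, subword split, and loss value below are made-up placeholders, not taken from any benchmark.

```python
# Same total loss, different "word" definitions -> different perplexities.
import math

def perplexity(total_nll_nats: float, num_units: int) -> float:
    """Perplexity = exp(average negative log-likelihood per unit)."""
    return math.exp(total_nll_nats / num_units)

total_nll = 42.0  # hypothetical total NLL of one sentence, in nats

words    = "the transformers broke nlp leaderboards".split()   # 5 units
subwords = ["the", "transform", "ers", "broke", "nl", "p",
            "leader", "boards"]                                 # 8 units

print(perplexity(total_nll, len(words)))     # ~4447 per word
print(perplexity(total_nll, len(subwords)))  # ~191 per subword
```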

As for generality, Kolmogorov Complexity is so general that it is the basis for one of the primary definitions of Artificial General Intelligence: Solomonoff Induction's unification with Sequential Decision Theory in AIXI.

Frankly, I'm quite perplexed by Google's 2013 decision to go with Ciprian Chelba's use of perplexity in their "One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling". Seven years earlier, Kolmogorov Complexity had been adopted as the benchmark for language modeling by the originator of AIXI in the Hutter Prize for Lossless Compression of Human Knowledge. Not only that, but Hutter's PhD students went on to start DeepMind in 2010.

VERY strange!

Isinlor commented 3 years ago

IMO we just need harder benchmarks. The MATH dataset by Hendrycks et al. is a perfect example: to get to human performance, a model would need 10^35 parameters, assuming a log-linear scaling trend. Quoting the paper:

While enormous Transformers pretrained on massive datasets can now solve most existing text-based tasks, this low accuracy indicates that our MATH dataset is distinctly harder. Accuracy also increases only modestly with model size: assuming a log-linear scaling trend, models would need around 10^35 parameters to achieve 40% accuracy on math, which is impractical. Instead, to make large strides on the MATH dataset with a practical amount of resources, we will need new algorithmic advancements from the broader research community.
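For a sense of where a figure like 10^35 comes from, here is a rough sketch of that kind of extrapolation: fit accuracy linearly against log10(parameter count) and invert the fit for a 40% target. The two anchor points are hypothetical placeholders, not the paper's measurements, so the resulting exponent differs from the paper's own fit.

```python
# Rough sketch of a log-linear scaling extrapolation (placeholder numbers).
import math

# (parameter count, accuracy) anchor points -- hypothetical
n1, acc1 = 1e8, 0.03   # ~100M-parameter model, 3% accuracy
n2, acc2 = 1e9, 0.04   # ~1B-parameter model, 4% accuracy

# Fit accuracy = slope * log10(N) + intercept through the two points.
slope = (acc2 - acc1) / (math.log10(n2) - math.log10(n1))
intercept = acc1 - slope * math.log10(n1)

# Invert the fit for a 40% accuracy target.
target = 0.40
log10_n = (target - intercept) / slope
print(f"Parameters needed for {target:.0%}: ~10^{log10_n:.0f}")  # ~10^45 with these placeholders
```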

I would be very happy to see SoTA chasing on benchmarks from the Hendrycks group: MATH, APPS, CUAD, Massive Multitask, etc. I would be OK with paying even $10 for model work equivalent to an hour of a human PhD student's work, especially if the model were significantly faster than the PhD student.

In other words, the problem is not with models being too big. The problem is with models not being as good as human experts.