LifeIsStrange / CollectionOfInterestingBrains

Hi :) please create a new issue with your name so that I can find you later and we can have interesting technical conversations/debates.

NLP/NLU/Singularity discussions #3

Open LifeIsStrange opened 2 years ago

LifeIsStrange commented 2 years ago

Hi @tomaarsen, welcome to my collection of interesting brains: you have been recognized for your interesting and useful project on word inflection generation. Since GitHub does not provide a chat, if you wanna share thoughts/facts about NLP/NLU/information extraction/other topics, feel free to do it here :)

tomaarsen commented 2 years ago

We're reaching a bit of a new era since 2018's BERT. Since then, there have been massive improvements on many NLP tasks, mostly using Transformers in some way. Very exciting, but sadly still a bit out of my ballpark. I'm excited to jump into it, but the lack of explainability that deep neural networks give is bothersome. That said, the future is neural, and most actually interesting new developments use some (huggingface) neural model. https://huggingface.co/ is super interesting, and I'd love to work there, to be honest.
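
To give an idea of what I mean by "some (huggingface) neural model", here is a minimal sketch (assuming the `transformers` library is installed; the checkpoint name is just a commonly available sentiment model, picked purely for illustration):

```python
# A minimal sketch, assuming the Hugging Face `transformers` library (and PyTorch)
# are installed; the checkpoint is an illustrative, publicly available model.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

# Returns a list of {'label': ..., 'score': ...} dicts for the given texts.
print(classifier("The future of NLP looks neural."))
```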

LifeIsStrange commented 2 years ago

Welcome back @tomaarsen ! :) Before beginning, let me tell you that I have huge respect for you being NLTK's maintainer! I only have experience with spaCy; however, NLTK has been and still is a cornerstone of the NLP world. Also, here's a song for the mood: https://www.youtube.com/watch?v=UlFNy9iWrpE

> We're reaching a bit of a new era since 2018's BERT. Since then, there have been massive improvements on many NLP tasks, mostly using Transformers in some way. Very exciting, but sadly still a bit out of my ballpark. I'm excited to jump into it, but the lack of explainability that deep neural networks give is bothersome. That said, the future is neural, and most actually interesting new developments use some (huggingface) neural model.

Shall we begin: haha, this is a nice, enthusiastic comment about the state of machine learning; however, I have to half-agree, half-disagree with many of the points or quantifiers in your comment.

Watching the evolution of the state of the art in NLP has been a pleasure. I have always been mentally addicted to the idea of progress, whether through humanism, rationalism, or technological advances; few things have ever given me as much fascination and hope-timism as watching the leaderboards evolve on paperswithcode.com or on NLP-progress.

On transformers and progress

As you say, transformers and, more generally, pre-trained language models have been disruptive, enabling significant F1/accuracy gains almost universally. I have actively followed SOTA progress on most NLP/NLU tasks for ~5 years, and here is my salient analysis: the number of BERT clones/variants is astonishing, but essentially disappointing. The iterations over BERT (RoBERTa, DistilBERT, etc.) have only led to very minor, incremental accuracy gains. Some have achieved impressive progress on scaling the number of parameters, on the speed and scalability of training, or on lowering the need for high-end hardware (mobilenets), but on what matters the most at the end of the day, accuracy, progress has plateaued. In fact it has plateaued since 2019, 3 years ago already.

IMO scientists are allocating their mental resources very inefficiently, as most of them build transformer variants of the base BERT architecture. There is one single transformer that generally outperforms all the others, and it has been consistently ignored by research. I am talking about XLNet. It was published in 2019, i.e. very early, and yet not a single new transformer variant has been built upon it... Probably because the author is Chinese and not from Google. Probably also because researchers themselves are often unaware of what constitutes the actual state of the art in a given task, although the rising popularity of paperswithcode.com might slowly change this. What is certain is that this model holds a distinctive number of first places, and nobody has ported to it the 2 major accuracy advances made on BERT variants that I can identify: 1) the SpanBERT innovation, and 2) scaling it to many more parameters (like GPT-3).

Besides this failure at the exploration-exploitation trade-off in the transformer research space, namely the BERT-variant monoculture, I also think that transformers are a local minimum and that research into alternative neural architectures for NLP is underfunded (economics) and underfocused (psycho-economics, AKA mindshare). That is the end of the first part of my analysis; there remain at least 2 other parts to be written, but that is for another day.
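
To make the monoculture point concrete, here is a minimal sketch of how interchangeable these encoders are in practice (assuming the Hugging Face `transformers` library and PyTorch are installed; the checkpoint names are simply the standard public ones on the Hub, not anything specific to this thread):

```python
# A minimal sketch: the same two lines load BERT, RoBERTa or XLNet, which is
# part of why the incremental differences between variants feel underwhelming.
from transformers import AutoModel, AutoTokenizer

for checkpoint in ("bert-base-uncased", "roberta-base", "xlnet-base-cased"):
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)
    inputs = tokenizer("Most variants iterate on the same recipe.",
                       return_tensors="pt")
    hidden = model(**inputs).last_hidden_state  # contextual token embeddings
    print(checkpoint, tuple(hidden.shape))
```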

Edit: Oh wow, I have searched for language models that would be XLNet variants and, surprisingly, I've managed to find one! The explanation for my ignorance of it is that its results have not been added to paperswithcode.com... https://github.com/microsoft/MPNet It is extremely interesting, as the model design is sound and simple: it successfully attempts to combine the best of BERT and XLNet. The resulting model, MPNet, is systematically 1-2% more accurate than RoBERTa, BERT and XLNet, and achieves +5% accuracy gains on key tasks, without any clever fine-tuning. It even manages to outperform ELECTRA by >=0.7%. Now one could wonder whether the ELECTRA advances could be combined with MPNet, since MPNet's PLM (permuted language modeling) module seems like an orthogonal, complementary add-on you can plug and play into any existing language model. However, MPNet's existence seems to have been largely ignored by research, and it's been 2 years.
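
The released checkpoint is on the Hugging Face Hub, so trying it out is essentially a one-line swap compared to the sketch above (a minimal sketch; "microsoft/mpnet-base" is the public base checkpoint, the example sentence is made up):

```python
# A minimal sketch, assuming the `transformers` library and PyTorch;
# "microsoft/mpnet-base" is the publicly released MPNet checkpoint on the Hub.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/mpnet-base")
model = AutoModel.from_pretrained("microsoft/mpnet-base")

inputs = tokenizer("MPNet combines masked and permuted language modeling.",
                   return_tensors="pt")
hidden = model(**inputs).last_hidden_state
print(tuple(hidden.shape))  # (batch, tokens, 768) for the base model
```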

tomaarsen commented 2 years ago

Woah, thanks for the info. My assumption was that BERT-based models were still constantly improving, given the large number of BERT clones (as you mentioned). I also had not heard about XLNet yet!

As for NLTK: the work is very interesting to me. It's essentially a large collection of small programs supplied by hundreds of contributors over the years, which weren't really maintained much beyond that. To me, it's great and fun practice in solving issues in a somewhat neglected codebase. Modernising, bug fixing, optimizing, all while learning about some of the core NLP tasks, is just a lot of fun.

That said, NLTK has algorithmic solutions, and it's clear that some of these are starting to fall far behind SOTA. For example, the NLTK sentence tokenizer will frequently split sentences after "i.e.", while sentences essentially never end there. I've noticed many issues like this, but my enthusiasm to try and solve them by creating a new sentence tokenizer is frequently thwarted by the knowledge that any much better algorithmic sentence tokenizer I could create would still be handily outperformed by some (neural) model doing the same.

I recently developed something to try and figure out which NLTK functionality actually gets used, which can be found on my site: https://tomaarsen.com/projects/nltk/usage/plot. From this, it's apparent that NLTK is used ~25% for its tokenization, ~25% for its corpora, ~10% for its stemming, and the remainder for a bunch of other stuff. From this, I think only the corpora section has a future and should continue being updated; the rest will be replaced by stronger alternatives. Sadly, in the current implementation, NLTK is just a very bulky module to include. In a perfect world I would like to cull some (rarely or) unused functionality and change the importing structure to be even lazier. That way, NLTK can continue as a corpus host.
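
To illustrate the kind of failure mode I mean with the sentence tokenizer, a minimal sketch (the example sentence is made up, and whether Punkt actually mis-splits here depends on its learned parameters):

```python
# A minimal sketch, assuming NLTK and its "punkt" resource are installed.
import nltk

nltk.download("punkt", quiet=True)

text = ("Some components are algorithmic, i.e. rule- or statistics-based. "
        "They can still be useful.")

# Depending on Punkt's parameters, a spurious sentence boundary may be inserted
# after "i.e.", even though sentences essentially never end there.
for sentence in nltk.sent_tokenize(text):
    print(repr(sentence))
```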

I have great respect for spaCy, and have recently been trying to learn more about it. For example, I did some work on PyTextRank, a spaCy pipeline extension. I am even considering trying to do an internship or something at Explosion (the company behind spaCy). That said, I'm unsure whether I'm sufficiently qualified.
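
For anyone curious, this is roughly how PyTextRank plugs into a spaCy pipeline (a minimal sketch, assuming spaCy 3.x, pytextrank and the en_core_web_sm model are installed; the example text is made up):

```python
import spacy
import pytextrank  # registers the "textrank" pipeline component with spaCy

# Assumes `python -m spacy download en_core_web_sm` has been run.
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")  # append the PyTextRank stage after the built-ins

doc = nlp("NLTK and spaCy are two widely used natural language processing "
          "libraries, each with its own strengths and trade-offs.")

# PyTextRank exposes ranked key phrases on the Doc extension `doc._.phrases`.
for phrase in doc._.phrases[:5]:
    print(f"{phrase.rank:.4f}  {phrase.text}")
```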