axa-group / nlp.js

An NLP library for building bots, with entity extraction, sentiment analysis, automatic language identification, and more
MIT License
6.22k stars 616 forks

Has anyone predigested a large training set? #842

Closed ninjamoba closed 1 year ago

ninjamoba commented 3 years ago

Has anyone tried training with this dataset: https://pile.eleuther.ai/ ?

Perhaps we can start a shared library of pretrained corpora built from this set as a general starting point?

Any suggestions about the best way to use the set above? Would this be performant? Could it scale to GPT-3 scope?

Or does this defeat the intended purpose of this repository as a "light" NLP library?

jesus-seijas-sp commented 3 years ago

NLP.js is a set of libraries for doing NLP in JavaScript, mainly intended for building conversational AI. You can also do a lot of more general-purpose things with NLP.js: normalize, tokenize, stem, calculate frequencies, build n-grams, and so on. But it is clearly not GPT-3: the training cost of GPT-3 is around $4,600,000 (https://www.reddit.com/r/MachineLearning/comments/h0jwoz/d_gpt3_the_4600000_language_model/)
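
As a rough illustration of the general-purpose operations listed above, here is a minimal sketch in plain JavaScript. It does not use the NLP.js API; the function names (`normalize`, `tokenize`, `frequencies`, `ngrams`) are hypothetical, and stemming is left out because it is language-specific.

```javascript
// Illustrative only: plain JavaScript, not the NLP.js API.
// All function names here are hypothetical.

// Lowercase the text and strip diacritics (e.g. "é" becomes "e").
function normalize(text) {
  return text
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '')
    .toLowerCase();
}

// Split on any run of characters that are not letters or digits.
function tokenize(text) {
  return normalize(text)
    .split(/[^\p{L}\p{N}]+/u)
    .filter((token) => token.length > 0);
}

// Count how many times each token occurs.
function frequencies(tokens) {
  const counts = new Map();
  for (const token of tokens) {
    counts.set(token, (counts.get(token) || 0) + 1);
  }
  return counts;
}

// Build contiguous n-grams, joined with a single space.
function ngrams(tokens, n) {
  const result = [];
  for (let i = 0; i + n <= tokens.length; i += 1) {
    result.push(tokens.slice(i, i + n).join(' '));
  }
  return result;
}

const tokens = tokenize('The café is open; the CAFÉ is busy.');
console.log(tokens.length);                   // 8
console.log(frequencies(tokens).get('cafe')); // 2
console.log(ngrams(tokens, 2)[0]);            // "the cafe"
```

The real libraries handle many more cases (per-language tokenizers and stemmers, for instance); this only shows the shape of the operations.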

My poor laptop does not even have enough disk space for these 800GB of data, and I can't even imagine how to handle that much data in memory. For a dataset of this size, the infrastructure cost is something to take into account. So I'm sorry, but I will not even try :(

ninjamoba commented 3 years ago

OK, so maybe not on your laptop ;) You and NLP.js are such a beacon of hope. If this is only a horsepower issue, I think we can figure out a way to get some of these libraries digested, even just to experiment. You know GPT-3 is trained on really dirty data, and these training libraries seem legit. From your response I can see it's not a bad project, and we can eat the elephant bite by bite. :)
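
One way to picture the "bite by bite" idea is to fold the corpus into running statistics one chunk at a time, so the whole dataset never has to sit in memory. The sketch below is only an assumption about how that might look: `updateCounts`, `countStream`, and `fakeCorpus` are hypothetical names, and the generator stands in for a real reader that would yield one document or line at a time from disk.

```javascript
// Illustrative only: incremental token counting over a stream of chunks.
// All names are hypothetical; nothing here is part of NLP.js.

// Fold one chunk of text into the running token counts.
function updateCounts(counts, chunk) {
  for (const token of chunk.toLowerCase().split(/\W+/)) {
    if (token) counts.set(token, (counts.get(token) || 0) + 1);
  }
  return counts;
}

// Consume any iterable of text chunks, holding one chunk at a time.
function countStream(chunks) {
  const counts = new Map();
  for (const chunk of chunks) {
    updateCounts(counts, chunk);
  }
  return counts;
}

// Stand-in for a reader that yields documents lazily.
function* fakeCorpus() {
  yield 'the quick brown fox';
  yield 'the lazy dog';
}

const total = countStream(fakeCorpus());
console.log(total.get('the')); // 2
```

Memory use here grows with the number of distinct tokens rather than with the size of the corpus, though on 800GB of text the vocabulary itself can still become large.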

aigloss commented 1 year ago

Closing due to inactivity