flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

Clarification on purpose of word embedding #846

Closed Meai1 closed 4 years ago

Meai1 commented 5 years ago

So I've been reading up the entire day on Flair and these concepts. Am I correct in thinking that if I embed a sentence like "The sky is green", Flair would in reality split it into multiple words and calculate a word vector for each, thereby readjusting the weights in the built-in word embeddings like GloVe a bit towards the sky being similar/related to the color green?

So now I have to create a large annotated corpus, painstakingly built up with my own custom tags, right? In my case it would be for text classification, but I gather that if I also want NER tagging I would need to create an entirely new corpus, train a second time, and predict a second time, right? Anyway, suppose I train on that corpus: with what embeddings? How would I make that decision? Do I have to look at each one and figure out whether its generic word relations, trained on the similarities in e.g. random Wikipedia text, would happen to work with my input text just because it's all English? I suppose it's for convenience, because otherwise I would have to assemble my own generic English word-similarity dataset; that's the purpose of these Wikipedia-pretrained word vectors, right?

Secondly, trained word vectors are essentially storage for word similarities, right? But am I not already doing that when, for example, I annotate with NER tags, or with labels for text classification? Those too are trying to establish similarity between words and, in the case of text classification, entire sentences. What does an embedding do in addition to that? Does it just "help" somehow with more general English language understanding, as I guessed at the start? And in Flair, can I help it along a bit more by stacking a few custom embeddings onto the more generic GloVe and news-forward etc. sets?
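For reference: embedding a sentence in Flair only looks up or computes a vector per token; by itself it does not readjust the pretrained GloVe weights. A minimal sketch of stacking GloVe with a contextual Flair embedding (exact class names as in recent Flair releases; details may vary across versions):

```python
from flair.data import Sentence
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings

# classic GloVe word vectors plus a contextual character-LM embedding;
# StackedEmbeddings concatenates the two vectors per token
stacked = StackedEmbeddings([
    WordEmbeddings('glove'),
    FlairEmbeddings('news-forward'),
])

sentence = Sentence('The sky is green')
stacked.embed(sentence)  # fills token.embedding; pretrained weights are unchanged

for token in sentence:
    print(token.text, token.embedding.shape)  # one concatenated vector per token
```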

I'm trying to build a helpbot, so in my understanding I am in the area of text classification, and therefore I'll have to create a large corpus of FastText-formatted, label-annotated questions (and then create another corpus for NER tagging if I need it, and train that separately too). I could use the NER tagging approach instead, but then I would still have to make my own custom decisions based on the tags that get predicted, so in my view text classification is better for me.
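For context, a minimal sketch of such a text-classification setup in Flair, assuming a hypothetical data/helpbot folder with train.txt / dev.txt / test.txt in FastText format (a line like `__label__billing How do I update my credit card?` per question); exact signatures have shifted between Flair releases:

```python
from flair.datasets import ClassificationCorpus
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentRNNEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

corpus = ClassificationCorpus('data/helpbot',
                              train_file='train.txt',
                              dev_file='dev.txt',
                              test_file='test.txt')

# combine word-level embeddings into one vector per sentence via an RNN
document_embeddings = DocumentRNNEmbeddings([
    WordEmbeddings('glove'),
    FlairEmbeddings('news-forward'),
])

classifier = TextClassifier(document_embeddings,
                            label_dictionary=corpus.make_label_dictionary())

trainer = ModelTrainer(classifier, corpus)
trainer.train('resources/helpbot-classifier', max_epochs=10)
```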

Is that the right approach: I create my own corpus of questions -> answers, use a few recommended embeddings, and hope it is good enough? I would also like to know whether incremental training is possible in general, or available in Flair. Yes, I read that training is resumable, but what if I train something large for a long time on my CPU and then add a few more helpbot questions? It would be bad for me if I now had to train for another 10 hours before I could even see whether I made any improvement.
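For context, one way to avoid retraining from scratch is to load the already-trained model and continue training it on the extended corpus. A sketch, with hypothetical paths; note also that passing `checkpoint=True` to `trainer.train()` saves checkpoints so an interrupted run can be resumed:

```python
from flair.datasets import ClassificationCorpus
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

# load the previously trained classifier ...
classifier = TextClassifier.load('resources/helpbot-classifier/final-model.pt')

# ... and keep training it on the corpus extended with the new questions,
# typically with a smaller learning rate and fewer epochs than the first run
corpus = ClassificationCorpus('data/helpbot-extended')
trainer = ModelTrainer(classifier, corpus)
trainer.train('resources/helpbot-classifier-v2',
              learning_rate=0.05,
              max_epochs=3)
```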

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.