explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

where can I find Training documentation for tagger, parser, NER and word2vec #397

Closed Biswajit2902 closed 8 years ago

Biswajit2902 commented 8 years ago

Hi,

I want to know how I can train the NLP components (tagger, parser, NER, word2vec, doc2vec, etc.) using my own data. Unfortunately, I am not finding proper documentation. Can anyone help me with this?

I want to know: how do I train these systems? How much data is required to train them? Does the data need to be labeled? And most importantly, what format should the training data be in?

Thank you.

cmuell89 commented 8 years ago

@Biswajit2902 As with many statistical algorithms in NLP, you need fairly large data sets to achieve an acceptable level of model accuracy for most applications, or you need very distinct/robust features if your dataset is small. Ideally you want both a large amount of data and robust features! This is a pretty basic rule of thumb for most machine learning. Some of your questions are too broad to answer with specificity, but maybe these responses will help (I am not an expert in NLP by any means, so take what I say with a grain of salt).

Parsing and POS: The parser in SpaCy is based on MaltParser, which in turn was trained on the UPenn Treebank, a large data set of linguistic dependency and POS annotations of English-language corpora. https://www.cis.upenn.edu/~treebank/

NER: I've inquired about how to incorporate custom named entity recognition and was told by SpaCy that it's a work-in-progress feature. For very specific domains, gazetteers and chunking techniques often get you pretty far when it comes to extracting NEs. Read up on Probabilistic Context-Free Grammars if you're looking for a statistical approach to parsing rule-based grammar structures out of text (date phrases, etc.; see https://duckling.wit.ai/). To achieve general-purpose NER you once again need a large annotated data set to make use of the more robust ML algorithms, such as Conditional Random Fields. SpaCy's NER depends on good POS tagging and dependency parsing prior to identifying NEs, but I do not know which algorithm they have used.
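To make the gazetteer idea concrete, here is a minimal sketch of dictionary-based NE lookup with greedy longest-match over tokens. The gazetteer entries, label names, and whitespace tokenization are all illustrative assumptions, not spaCy's implementation:

```python
# Toy gazetteer: lowercase token tuples mapped to entity labels.
GAZETTEER = {
    ("new", "york"): "GPE",
    ("acme", "corp"): "ORG",
    ("john",): "PERSON",
}

def find_entities(tokens, gazetteer, max_len=3):
    """Greedy longest-match scan: at each position, try the longest
    span first and skip past any match that is found."""
    entities = []
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        match = None
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(lowered[i:i + n])
            if span in gazetteer:
                match = (i, i + n, gazetteer[span])
                break
        if match:
            entities.append(match)
            i = match[1]  # continue after the matched span
        else:
            i += 1
    return entities

tokens = "John moved to New York".split()
print(find_entities(tokens, GAZETTEER))
# → [(0, 1, 'PERSON'), (3, 5, 'GPE')]
```

Real systems would add case normalization, multi-word overlap handling, and context features on top of this, but even a plain lookup like this can be a strong baseline in a narrow domain.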

word2vec, doc2vec: SpaCy's tutorial on loading custom word vectors: https://spacy.io/docs/tutorials/load-new-word-vectors A tutorial on using Gensim to create word vectors: http://rare-technologies.com/word2vec-tutorial/
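As a rough illustration of what those tutorials work with, here is a small stdlib-only sketch of parsing the plain-text word2vec format (a header line `<vocab_size> <dim>`, then one `<word> <f1> ... <fdim>` per line). The sample data is made up; real vector files are large and usually loaded with Gensim or spaCy directly:

```python
import io

def read_word2vec_text(fh):
    """Parse the plain-text word2vec format into a dict of
    word -> list of floats, validating the declared dimensions."""
    n_words, dim = (int(x) for x in fh.readline().split())
    vectors = {}
    for line in fh:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(v) for v in parts[1:]]
        assert len(values) == dim, f"bad vector width for {word!r}"
        vectors[word] = values
    assert len(vectors) == n_words, "header/vocab size mismatch"
    return vectors

# A tiny in-memory stand-in for a real vectors file.
sample = io.StringIO("2 3\ncat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
vecs = read_word2vec_text(sample)
print(vecs["dog"])
# → [0.4, 0.5, 0.6]
```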

I think SpaCy's documentation might feel a bit sparse because their approach is to provide established NLP techniques that match the performance of the academic state of the art for commercial purposes. There is probably an expectation of some prior knowledge on the part of its users. Good luck!

Biswajit2902 commented 8 years ago

@cmuell89, Thank you for your valuable suggestion.

Actually, I am trying to set up a continuous, adaptable training process for all of the above.

But during adaptation I need to know what my existing labels/classes are (specifically for NER). How do I get the existing class labels in spaCy? I am not able to find this; if you know, it would be a great help.

Thanks.

ainewsbot commented 8 years ago

If you are looking for the predefined NE labels, look at spacy/data/en-1.1.0/ner/config.json.
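Once you know where the config lives, listing the labels is just a matter of loading the JSON. The schema below is an illustrative assumption (a top-level "labels" key); the actual layout of en-1.1.0's config.json may differ, so inspect the file first:

```python
import json

# Stand-in for json.load(open("spacy/data/en-1.1.0/ner/config.json"));
# the "labels" key here is a hypothetical schema for illustration.
sample_config = '{"labels": ["PERSON", "ORG", "GPE", "DATE"]}'

config = json.loads(sample_config)
print(sorted(config["labels"]))
# → ['DATE', 'GPE', 'ORG', 'PERSON']
```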

honnibal commented 8 years ago

Improved training API, and added some examples in 1.0.0. Tutorials should go up soon.

Biswajit2902 commented 8 years ago

Hi Honnibal,

It's great news. Congratulations on the version 1.0 release, and I appreciate the nice step of adding sentiment analysis using an LSTM.

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.