Yeah, you can train with just spaCy. How's your training data defined? Do you have the original text, or have you tokenized it?
Not sure what you mean by intent extraction.
Hey again! Thanks for the quickest reply ever, that's super awesome :+1: :).
The training data would be a lot of unrelated short phrases like "Buy me bananas in the fruit shop." or "Go to the zoo and steal animals.". I'm kind of trying to do something similar to wit.ai without using their services. I don't like to be locked in.
So "intent extraction" would be a term coined by wit.ai I I guess, meaning the extraction of 'what command the user wants to invoke' if I may describe it in my laymans-terms. So "intent extraction" is probably looking at the overall sentence structure and applies some machine learning, I'm not sure.
Do you believe implementing something like wit.ai is possible using spaCy? (Of course not "JUST" spaCy, but I'd love to have it in the toolchain.) If not, which additional tools would I need? Also: do you have experience with intent extraction?
Sure, I'd say you can do better than wit.ai, too :). I don't think their architecture is super sophisticated.
So, one awkwardness is that currently spaCy's parser is pretty crap on imperatives (e.g. "Go to the zoo"), because it has almost none of these in its training data. An awkward hack, if you know your data is always imperative, is to actually just transform it into a declarative sentence ("You must go to the zoo"). The syntax is analysed better this way. Hopefully soon I'll finish adding this transformation during the training process, so that you don't have to do that. But for now bear that limitation in mind.
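To make that hack concrete, here's a minimal sketch, assuming your inputs are always imperative (the `parse_imperative` helper is just an illustration, not a spaCy API):

```python
from spacy.en import English

nlp = English()

def parse_imperative(text):
    # Prepend a subject and modal so the parser sees a declarative
    # sentence ("You must go to the zoo"), then parse as usual.
    return nlp(u'You must ' + text)

doc = parse_imperative(u'go to the zoo')
for word in doc:
    if word.i < 2:  # skip the two inserted tokens ("You", "must")
        continue
    print(word.text, word.head.text, word.dep_)
```

The two prepended tokens are easy to skip when reading off the parse, since they always sit at positions 0 and 1.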
If your problem is always such that you receive a single piece of text, and must predict a single label, then I recommend using a linear model and adding a grab-bag of features that mostly consists of sections of the dependency parse.
For example:
Input: Set the volume to zero
Intent: mute_volume
Parse: http://spacy.io/displacy/?full=Set%20the%20volume%20to%20zero.
Example feature extraction:
```python
>>> from spacy.en import English
>>> nlp = English()
>>> doc = nlp(u'Set the volume to zero.')
>>> for word in doc:
...     print(word.text, word.head.text, word.dep_)
...
Set Set ROOT
the volume det
volume Set dobj
to volume prep
zero to pobj
. Set punct
>>> for word in doc:
...     for i, right in enumerate(word.rights):
...         print(word.text, i, right.tag_, right.dep_)
...
Set 0 NN dobj
Set 1 . punct
volume 0 IN prep
to 0 CD pobj
>>> for word in doc:
...     for i, left in enumerate(word.lefts):
...         print(word.text, i, left.tag_, left.dep_)
...
volume 0 DT det
```
For early experiments, I would make the features string-concatenations, and use spacy.strings.StringStore to map them to sequential integer IDs, so that it's easy to play with an external machine learning library. Once you want better performance, I would switch that part of the code to Cython, build an integer array for each feature, and then hash it. You can see an example of this in the averaged perceptron code I use for spaCy, which is in the library thinc.
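To make the feature-to-ID step concrete, here's a rough sketch (the feature templates are invented for illustration; the StringStore usage assumes the 1.x-era API, where looking up an unseen string assigns it a new sequential integer ID):

```python
from spacy.en import English
from spacy.strings import StringStore

nlp = English()
strings = StringStore()

def extract_features(text):
    # Build string-concatenation features from the dependency parse,
    # then map each one to an integer ID via the StringStore.
    doc = nlp(text)
    feats = []
    for word in doc:
        feats.append(u'head:%s|dep:%s' % (word.head.lemma_, word.dep_))
        for i, right in enumerate(word.rights):
            feats.append(u'r%d:%s|%s' % (i, right.tag_, right.dep_))
        for i, left in enumerate(word.lefts):
            feats.append(u'l%d:%s|%s' % (i, left.tag_, left.dep_))
    return [strings[f] for f in feats]

feature_ids = extract_features(u'Set the volume to zero.')
# feature_ids is a list of integers, ready to use as indices into a
# sparse feature vector for an external learner.
```

Each ID can then index into a sparse weight vector in whatever external learner you like; a simple averaged perceptron or logistic regression is plenty to start with.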
If your data has more complicated labellings, it's probably worth using something like a CRF. Personally I always prefer beam-search and a structured perceptron, because they're more efficient and more flexible. But it starts to depend on the specifics.
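No concrete CRF implementation is named above; purely as an illustration, here's roughly what token-level entity tagging could look like with the third-party sklearn-crfsuite package (the sentences, tag scheme, and features are all invented for the example):

```python
from spacy.en import English
import sklearn_crfsuite  # third-party: pip install sklearn-crfsuite

nlp = English()

def token_features(word):
    # Per-token features, again leaning on the dependency parse.
    return {
        'lower': word.lower_,
        'tag': word.tag_,
        'dep': word.dep_,
        'head_lemma': word.head.lemma_,
    }

# Toy training set with made-up BIO-style tags, one per token.
texts = [u'Buy me bananas in the fruit shop.', u'Set the volume to zero.']
tags = [
    [u'O', u'O', u'B-ITEM', u'O', u'O', u'B-PLACE', u'I-PLACE', u'O'],
    [u'O', u'O', u'B-DEVICE', u'O', u'B-LEVEL', u'O'],
]

X_train = [[token_features(w) for w in nlp(t)] for t in texts]

crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=50)
crf.fit(X_train, tags)
print(crf.predict([[token_features(w) for w in nlp(u'Buy me apples.')]]))
```

The CRF predicts one tag per token, so multiple entities per sentence fall out naturally from the BIO scheme.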
Thank you! <3
It will take me some time to get back to you with additional questions, your answer points me to a lot of destinations worth looking into.
Kind regards, Arno
Hey! Since I need to extract multiple entities as well, I might need something more sophisticated, like the technologies you listed in your last paragraph.
Could you point me to a place where I can read more about beam-search and the structured perceptron?