explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Support for multiple lemmata from Token #238

Closed bittlingmayer closed 8 years ago

bittlingmayer commented 8 years ago

Token.lemma right now returns only a single lemma, as far as I understand.

As you know there are always words in a language that are "clashing" surface forms of different lemmata:

coating -> coating (n)
coating -> coat (v)
drunk -> drunk (n)
drunk -> drink (v)
dove -> dove (n)
dove -> dive (v)
...

Of course in some other top languages it is even more common.

spaCy is smart about this:

>>> doc = nlp(u'I dove into the water.')
>>> list(doc.sents)[0][1].lemma_
u'dive'
>>> doc = nlp(u'The dove flies.')
>>> list(doc.sents)[0][1].lemma_
u'dove'

Is there or could there be a way for spaCy to tell us that there were multiple candidate lemmata, e.g. Token.lemmata? Otherwise, when the text is very short or otherwise ambiguous, it's basically flipping a coin.

(I know it's basically asking for all possible parses. I only ask because I believe you have everything necessary basically implemented already.)

honnibal commented 8 years ago

Look in spacy/morphology.pyx and spacy/lemmatizer.py. Preferred access is via nlp.vocab.morphology.

Basically you should be able to supply the (text, tag) pair yourself and get back the set of WordNet lemmas. It would've been nice to use the probability distribution spaCy has loaded instead of the coin-flip, but it's actually very rare in English to have a relevant ambiguity here.

To get a mistake, you would have to have two words that are homographic in their inflected forms but derive from different lemmas. It's true that this is unlikely to hold up as well when we deal with other languages.
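
For example, something along these lines should work (a rough sketch against the API described above, not a documented recipe; nlp.vocab.strings is just used here to map between strings and integer IDs, and the u'dive' / u'dove' outputs are what the earlier example suggests rather than guaranteed values):

>>> from spacy.parts_of_speech import NOUN, VERB
>>> orth = nlp.vocab.strings[u'dove']
>>> nlp.vocab.strings[nlp.vocab.morphology.lemmatize(VERB, orth)]
u'dive'
>>> nlp.vocab.strings[nlp.vocab.morphology.lemmatize(NOUN, orth)]
u'dove'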

bittlingmayer commented 8 years ago

Thanks, for others:

>>> from spacy.parts_of_speech import ADV, VERB, NOUN, PUNCT, ADJ
>>> nlp.vocab.morphology.lemmatize(NOUN, 67276)
67276
>>> nlp.vocab.morphology.lemmatize(VERB, 67276)
39818
>>> nlp.vocab.morphology.lemmatize(PUNCT, 67276)
67276

I agree that ambiguities are rare for longer sentences, but for shorter phrases (user-generated text, headlines and subject lines, software strings) they happen at a steady rate. Take a real example that's causing us pain, like https://api.spacy.io/displacy/index.html?full=Ships%20fast! I'm inclined to say this happens at least as much in English as in other languages. The probabilities of the two interpretations were probably near even (and we would love to know that).

Right now ship (n) and ship (v) are not considered distinct lemmata; that's another, more English-specific issue.

Re the possibility of mistakes at the word level, ignoring ship (n) and ship (v), clashes are common enough in English, because often token1.orth_ == token2.lemma_ (i.e. the inflected form of token1 is the same as the base form of token2). So words like coating (n), defining (adj) or dated (adj) "hide" the other lemma, since they were derived using inflectional suffixes. For those, for now we can do something like:

if lemmatize(NOUN, orth) != lemmatize(VERB, orth):
...
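
A runnable version of that check might look like the following sketch (candidate_lemmata is just an illustrative helper built on the morphology API shown above, and the POS set is limited to NOUN/VERB/ADJ for demonstration):

from spacy.parts_of_speech import NOUN, VERB, ADJ
import spacy.en

def candidate_lemmata(nlp, word):
    # Lemmatize the same surface form under several POS hypotheses and
    # collect the distinct results: roughly the Token.lemmata idea above.
    orth = nlp.vocab.strings[word]
    lemma_ids = set(nlp.vocab.morphology.lemmatize(pos, orth) for pos in (NOUN, VERB, ADJ))
    return set(nlp.vocab.strings[lemma_id] for lemma_id in lemma_ids)

nlp = spacy.en.English()
candidates = candidate_lemmata(nlp, u'coating')
if len(candidates) > 1:
    # More than one candidate: the surface form "hides" another lemma.
    print(candidates)  # expected to include u'coating' and u'coat'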

I appreciate that it's tricky business too, since people can make up new nouns and verbs. (spaCy is very smart about this: it correctly parses "The pinging of the server has not stopped." and even "The server is fhytoging me.", although fhytoging is not lemmatised to fhytog.)

honnibal commented 8 years ago

Just so we can talk a little more clearly about this: what we're really discussing here are part-of-speech ambiguities, which are then causing you errors from the deterministic lemmatization process, which takes the POS tag as an input. The lemmatization in itself is correct --- it's just that it's getting incorrect input from the tagger.

spaCy's models are all trained with a loss function that's optimized for 1-best accuracy. This leads to sparser solutions, which reduce memory requirements and loading times. But the downside is that you can't really ask meaningful questions about how confident the model is. It might be worth talking about your requirements a bit when you get to Berlin. It sounds to me like a beam-search version of spaCy that does joint POS tagging and parsing would help you a lot. I could prepare that for you over a short consulting engagement.

If you want to dig around the scores of the current model, here's how. I haven't documented this API yet, but you can consider it "legit" --- it will be supported going forward.

The code below shows how to gather the scores assigned to the 'winning' move at each state of the parse. It also shows one way to force a particular tag to be assigned to a word prior to parsing: we tell the tagger to assign a sequence of tags from a list of strings, swapping the tag we want into its list of predictions. This is admittedly awkward, and we should support assigning to the .tag attribute of tokens directly.

I've shown two ways to use the scores: by summing the scores across the parse, and by taking the minimum score assigned to a move. The minimum score possibly makes more sense, but neither is really a good indication of parse quality. Again, that wasn't part of the objective function of the current model. We can either train the parser's decision model in a different way to be better at that task, or we can train a distinct model to estimate P(parse | sentence), or we can train a model to estimate P(sentence, parse), for situations where you want to know whether the input sentence was ungrammatical or noisy.


import plac
import spacy.en

def get_scores(nlp, text, force_tag=None):
    # Parse `text` and collect the score of the highest-scoring (i.e. chosen)
    # move at each state of the transition-based parse.
    probs = []
    tokens = nlp.tokenizer(text)
    nlp.tagger(tokens)
    if force_tag:
        # Replace the first token's predicted tag with `force_tag`, keeping
        # the tagger's predictions for the rest of the sentence.
        tags = [force_tag] + [w.tag_ for w in tokens[1:]]
        nlp.tagger.tag_from_strings(tokens, tags)
    # Step through the parse manually, recording the top score at each state
    # before applying the predicted action.
    with nlp.parser.step_through(tokens) as state:
        while not state.is_final:
            action = state.predict()
            probs.append(max(state.eg.scores))
            state.transition(action)
    return tokens, probs

def main():
    nlp = spacy.en.English()
    # Force the first token to NN, VB and NNP in turn, and compare the
    # minimum and summed scores of the resulting parses.
    toks, probs = get_scores(nlp, u'Communicate your ideas clearly.', force_tag='NN')
    print([w.tag_ for w in toks])
    print(min(probs), sum(probs))
    toks, probs = get_scores(nlp, u'Communicate your ideas clearly.', force_tag='VB')
    print(min(probs), sum(probs))
    print([w.tag_ for w in toks])
    toks, probs = get_scores(nlp, u'Communicate your ideas clearly.', force_tag='NNP')
    print(min(probs))
    print([w.tag_ for w in toks])

if __name__ == '__main__':
    plac.call(main)

Produces:

[u'NN', u'PRP$', u'NNS', u'RB', u'.']
(231.49497985839844, 2340.7755584716797)
(334.1315612792969, 2806.1406860351562)
[u'VB', u'PRP$', u'NNS', u'RB', u'.']
252.046356201
[u'NNP', u'PRP$', u'NNS', u'RB', u'.']
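
For a case like the "Ships fast!" example above, a rough way to use these scores is to force each candidate tag on the first token and compare the totals. This is only a sketch building on get_scores above (compare_forced_tags is an illustrative name, and as noted, neither the sum nor the minimum is a calibrated probability):

def compare_forced_tags(nlp, text, tag_a, tag_b):
    # Force the first token's tag each way, reparse, and return whichever
    # forced tag yields the higher summed scores.
    _, probs_a = get_scores(nlp, text, force_tag=tag_a)
    _, probs_b = get_scores(nlp, text, force_tag=tag_b)
    return tag_a if sum(probs_a) >= sum(probs_b) else tag_b

# e.g. noun vs. verb reading of the first word:
# compare_forced_tags(nlp, u'Ships fast!', 'NNS', 'VBZ')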

bittlingmayer commented 8 years ago

This is very helpful for our purposes, thank you.

I am curious if these types of signals are useful to other clients. (I can understand the appeal of a minimal library, not just in terms of pruning/performance but in terms of the APIs.)

Re whether this is a question of parsing or of lemmatisation, yes, pardon. I was trying to take on the more straightforward topic, but a bit more parse functionality will indeed obviate the need for more lemmatisation functionality.

Unfortunately we are a little too early-stage right now to take you up on the consulting offer. If things go well then there will be plenty down the road.

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.