explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

💫 Train parser and NER with regression objective, to make scores express expected parse quality #881

Closed · honnibal closed this issue 4 years ago

honnibal commented 7 years ago

More and more people have been asking about confidence scores for the parser and NER. The current model can't answer this, so I decided to dust off some almost-complete research from last year to fix this.

~~This work is almost complete, and should be up on master within a day or two. :tada: Here's how it works.~~ Edit: I spoke too soon. The problem was that the regression loss objective I describe here produced extremely non-sparse solutions with the linear model. It should be possible to find a good compromise with L1 regularisation, but I switched efforts to the v2 experiments instead.

Edit 2: spaCy 2 uses neural networks, so the sparsity isn't a problem. But I haven't been able to get the regression loss working well at all. I think something's wrong with my implementation.

v2 now has beam parsing implemented, which supports one way to get quality estimates for parses (see below). However, I'd like to resume work on the regression loss objective; I think there's a bug in the current implementation of that loss function, also discussed below.

Currently the parser and NER are trained with a hinge-loss objective (specifically, using the averaged perceptron update rule). At each word, the model asks "What's the highest scoring action?". It makes its prediction, and then it asks the oracle to assign a cost to each action, where the cost represents the number of new errors that will be introduced if that action is taken. For instance, if we're at the start of an ORG entity, and we perform the action O, we introduce two errors: we miss the entity, and we miss the label. The actions B-PER and U-ORG each introduce one, and the action B-ORG introduces zero. If our predicted action isn't zero-cost, we update the weights such that in future this action will score a bit lower for this example, and the best zero-cost action will score a bit higher.
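
To make the oracle costs and the update concrete, here's a minimal sketch in plain numpy. Everything in it (the action list, the feature vector, the weight matrix) is invented for illustration; it isn't spaCy's internal parser code.

import numpy as np

# Toy oracle costs for four NER transition actions at the start of an ORG
# entity: O misses both the entity and the label (2 errors), B-PER and U-ORG
# each introduce one error, B-ORG introduces none.
actions = ['O', 'B-PER', 'U-ORG', 'B-ORG']
costs = np.array([2.0, 1.0, 1.0, 0.0])

# Hypothetical feature vector and per-action weight rows (not spaCy internals).
features = np.random.randn(50)
weights = np.zeros((len(actions), 50))

scores = weights @ features
guess = int(scores.argmax())                                 # highest-scoring action
best = int(np.where(costs == 0, scores, -np.inf).argmax())   # best zero-cost action

if costs[guess] != 0:
    # Perceptron-style update: the costly guess scores a bit lower next time,
    # the best zero-cost action a bit higher.
    weights[guess] -= features
    weights[best] += features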

If we're only looking at the quality of the output parse, this setup performs well. But it means the scores on the actions have no particular interpretation. We don't force them into any useful scale, and we don't train them to reflect the wider parse quality. If the parser is in a bad state, it's not trained to give uniformly lower scores. It's trained to make the best of bad situations.

The changes I'm merging improve this in two ways. They're looking forward to the spaCy 2.0 neural network models, but they're useful with the current linear models too, so I decided to get them in early.

1. Beam search with global objective

This is the standard solution: use a global objective, so that the parser model is trained to prefer parses that are better overall. Keep N different candidates, and output the best one. This can be used to estimate confidence by looking at the alternative analyses in the beam: if an entity occurs in every analysis, the NER is more confident that it's correct.

2. Optimize the negative cost explicitly (i.e. do numeric regression, not multiclass classification)

This idea has been kicking around for a while. I think a few people have tried it with negative results. It was first raised to me in 2015 by Mark Johnson. I guess to a lot of folks it's obvious.

The idea is this: we have an oracle that tells us the number of errors an action will introduce. Instead of arbitrary high/low scores, we try to make the model output a score that matches the oracle's output. This means that if an action would introduce 2 errors, we want to predict "2". We don't just want it to score lower than some other action that would introduce 0 errors. It's handy to flip the sign on this, so that we're still taking an argmax to choose the action.

In my previous experiments, this regression loss produced parse accuracies that were very slightly worse --- the difference in accuracy was 0.2%. In parsing research, this is indeed a negative result :).

However, this difference in accuracy doesn't matter at all --- and the upside of the regression setup is quite significant! With the regression model, the scores output by the parser have a meaningful interpretation: the sum of the scores is the expected number of errors in the analysis. This is exactly what people are looking for, and it comes with no increase in complexity or run-time. It's just a change to the objective used to train the model.
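
As a rough sketch with invented numbers (not spaCy code): the oracle cost becomes the regression target, the sign is flipped so argmax still selects the action, and summing the negated scores of the actions actually taken gives the expected number of errors.

import numpy as np

# One parser state: oracle costs for each action (invented numbers).
costs = np.array([2.0, 1.0, 1.0, 0.0])
targets = -costs                            # flip the sign so argmax still works

# Hypothetical model outputs for the same state.
scores = np.array([-1.8, -0.9, -1.2, -0.1])
loss = ((scores - targets) ** 2).sum()      # squared-error regression loss
action = int(scores.argmax())               # still choose the action by argmax

# Interpretation over a whole parse: negate and sum the scores of the chosen
# actions to get the expected number of errors in the analysis.
chosen_scores = [-0.1, -0.05, -0.3]         # one predicted score per transition
expected_errors = -sum(chosen_scores)       # ~0.45 expected errors
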

vhwen commented 7 years ago

Could you please give an example of how to get the score? I couldn't find anything in the documentation. @honnibal @ines

fansg commented 6 years ago

Was this implemented, and how can we use it? @honnibal

honnibal commented 6 years ago

I never shipped the linear-model regression loss because I couldn't get the memory use under control -- the loss produced very non-sparse solutions, and it was taking too many experiments to find the right regularisation.

Instead I focussed on the experiments for spaCy 2. Current versions of spaCy 2 support beam search decoding, which lets you get probabilities by asking how many beam parses the entity occurred in. We don't have a model trained with the beam objective online yet, so the probabilities aren't especially well calibrated. You'll have to try and see. Here's a current example.

import spacy
from collections import defaultdict

# Number of alternate analyses to consider. More is slower, and not
# necessarily better -- you need to experiment on your problem.
beam_width = 16
# This clips solutions at each step. We multiply the score of the top-ranked
# action by this value, and use the result as a threshold. This prevents the
# parser from exploring options that look very unlikely, saving a bit of
# efficiency. Accuracy may also improve, because we've trained with a greedy
# objective.
beam_density = 0.0001
nlp = spacy.load('en_core_web_sm')

# texts is assumed to be an iterable of strings to process.
with nlp.disable_pipes('ner'):
    docs = list(nlp.pipe(texts))
beams = nlp.entity.beam_parse(docs, beam_width=beam_width, beam_density=beam_density)

for doc, beam in zip(docs, beams):
    entity_scores = defaultdict(float)
    for score, ents in nlp.entity.moves.get_beam_parses(beam):
        for start, end, label in ents:
            entity_scores[(start, end, label)] += score

honnibal commented 6 years ago

Btw, on the off-chance anyone reading this could tell me what's wrong with the regression loss here: https://github.com/explosion/spaCy/blob/cdb2d83e168b8602bcaa98b1a8f2842908eaad49/spacy/syntax/nn_parser.pyx#L216

I could rerun the regression-loss experiments using the neural network, where the sparsity problem wouldn't be an issue.

What we want is an output vector of scores of length N, for our N parser/NER transition actions. Each unit scores[i] should reflect the cost of taking that action, where the cost is defined as the number of newly unreachable gold-standard arcs or entities. The costs are passed into the function as an array.

The function should be computing the gradient of the loss of this regression problem. Only some actions are valid. The gradient for an invalid action should always be 0.
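
For reference, here's a minimal numpy sketch of what that gradient might look like under a plain squared-error loss, with invalid actions masked to zero. This is only an illustration of the objective as described above, not the actual nn_parser.pyx code.

import numpy as np

def regression_loss_gradient(scores, costs, is_valid):
    """d_loss/d_scores for a squared-error regression on negated costs.

    scores:   (N,) model outputs, one per transition action
    costs:    (N,) oracle costs (number of newly unreachable arcs/entities)
    is_valid: (N,) boolean mask of actions that are legal in this state
    """
    targets = -costs              # flip the sign so argmax still picks the action
    d_scores = scores - targets   # gradient of 0.5 * (scores - targets) ** 2
    d_scores[~is_valid] = 0.0     # invalid actions always get zero gradient
    return d_scores

# Toy example with invented numbers.
scores = np.array([0.3, -0.8, -1.5, 0.1])
costs = np.array([2.0, 1.0, 1.0, 0.0])
is_valid = np.array([True, True, False, True])
print(regression_loss_gradient(scores, costs, is_valid))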

jorgeaguiar commented 6 years ago

Hi! I tried to implement your example in spaCy 2.0.3 but got

AttributeError: 'spacy.pipeline.EntityRecognizer' object has no attribute 'get_beam_parses'

on the line nlp.entity.get_beam_parses(beam). Is the example still valid for current versions of spaCy? Or am I missing something here?

honnibal commented 6 years ago

@jorgeaguiar I had a problem in my example -- fixed. Instead of nlp.entity.get_beam_parses it's nlp.entity.moves.get_beam_parses().

jorgeaguiar commented 6 years ago

@honnibal thanks! There are still two things that don't seem to add up, though:

Maybe it's easier to explain with some code...

import spacy
import sys
from collections import defaultdict

nlp = spacy.load('en')
text = u'Japan in the European Union ?'
doc = nlp(text)
for ent in doc.ents:
    print '%d %s %s' % (ent.start_char, ent.text, ent.label_)

docs = list(nlp.pipe(list(text), disable=['ner']))
(beams, somethingelse) = nlp.entity.beam_parse(docs, beam_width=16, beam_density=0.0001)

for beam in beams:
    for score, ents in nlp.entity.moves.get_beam_parses(beam):
        print score, ents
        entity_scores = defaultdict(float)
        for start, end, label in ents:
            entity_scores[(start, end, label)] += score

This will yield

0 Japan GPE
9 the European Union ORG
0.999939530063 []
4.54607413676e-05 [(0, 1, u'CARDINAL')]
5.81728399866e-06 [(0, -1, u'PRODUCT')]
3.61413102284e-06 [(0, -1, u'ORG')]
2.50448729089e-06 [(0, -1, u'DATE')]
1.84200212209e-06 [(0, -1, u'PERSON')]
8.25288554732e-07 [(0, -1, u'CARDINAL')]
1.2167600756e-07 [(0, -1, u'GPE')]
5.78341708966e-08 [(0, 1, u'PERSON')]
4.83908104929e-08 [(0, -1, u'PERCENT')]
(...)

No apparent connection to the detected entities... Any hints? Thanks!

Zhenshan-Jin commented 6 years ago

@jorgeaguiar by adding a little code at the end of your script, like this:

from pprint import pprint

for doc, beam in zip(docs, beams):
    entity_scores = defaultdict(float)
    for score, ents in nlp.entity.moves.get_beam_parses(beam):
        for start, end, label in ents:
            entity_scores[(doc, start, end, label)] += score
    if doc.text:
        print(doc.text)
        pprint(entity_scores)

I get the probability of each entity for each character (see the attached image). Then I guess there should be some way to further process this character-based NER result to get the real NER probability, but I'm still not sure how.

Any suggestions on further processing this result, @honnibal? Thanks!

jorgeaguiar commented 6 years ago

@Zhenshan-Jin I think I'm getting somewhere now. This works:

import spacy
import sys
from collections import defaultdict

nlp = spacy.load('en')
text = u'Will Japan join the European Union ?'
doc = nlp(text)

print '--- Tokens ---'
for tok in doc:
    print tok.i, tok   
print ''

print '--- Entities (detected with standard NER) ---'
for ent in doc.ents:
    print '%d to %d: %s (%s)' % (ent.start, ent.end - 1, ent.label_, ent.text)
print ''

# notice these 2 lines - if they're not here, standard NER
# will be used and all scores will be 1.0
with nlp.disable_pipes('ner'):
    doc = nlp(text)

(beams, somethingelse) = nlp.entity.beam_parse([ doc ], beam_width = 16, beam_density = 0.0001)

entity_scores = defaultdict(float)
for beam in beams:
    for score, ents in nlp.entity.moves.get_beam_parses(beam):
        for start, end, label in ents:
            entity_scores[(start, end, label)] += score

print '--- Entities and scores (detected with beam search) ---'
for key in entity_scores:
    start, end, label = key
    print '%d to %d: %s (%f)' % (start, end - 1, label, entity_scores[key])

The only problem now is that my models are trained with the standard NER objective and, probably because of that, most entities detected with beam search are wrong. @honnibal any hints on how to properly train a model for beam search, like you mentioned in https://support.prodi.gy/t/accessing-probabilities-in-ner/94/2 ?

xu-neva commented 6 years ago

Instead of training on the regression objective or with a beam search algorithm, a second-pass calibration could help to determine the mapping between the scores and precision probabilities. For example, a precision-coverage curve drawn on a test set could tell you that, say, any parse with a score higher than 0.005 has an 80% chance of being correct. The parse can then be tagged with 0.8 instead of 0.005 as the probability output, to indicate the confidence.

In one of my use cases, I would like to set a very high precision, e.g. 95%, and ignore any example with a parsing score lower than the threshold for 95% precision. I might still get a pretty good coverage of my data, e.g. 50%, but with high-quality parses.
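
A minimal sketch of that kind of calibration, assuming you have (score, is_correct) pairs from a held-out set; the helper name and the numbers below are invented for illustration.

import numpy as np

def threshold_for_precision(scores, is_correct, target_precision=0.95):
    """Find the lowest score threshold whose predictions reach the target precision."""
    order = np.argsort(scores)[::-1]              # sort predictions by score, best first
    correct = np.asarray(is_correct)[order]
    precision = np.cumsum(correct) / np.arange(1, len(correct) + 1)
    ok = np.where(precision >= target_precision)[0]
    if len(ok) == 0:
        return None                               # target precision never reached
    cutoff = ok[-1]                               # largest prefix that still meets the target
    coverage = (cutoff + 1) / len(correct)
    return np.asarray(scores)[order][cutoff], coverage

# Toy dev-set scores and correctness flags (invented numbers).
scores = [0.9, 0.8, 0.7, 0.05, 0.04, 0.01]
is_correct = [1, 1, 1, 1, 0, 0]
print(threshold_for_precision(scores, is_correct, target_precision=0.95))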

usamec commented 6 years ago

Also, the code suggested here produces memory leaks. You should do cleanup like this: https://github.com/explosion/spaCy/blob/master/spacy/syntax/nn_parser.pyx#L383

Globegitter commented 5 years ago

@jorgeaguiar thanks for that snippet. It does seem to work for me most of the time, but sometimes I get negative end values -- any idea how to interpret these? Also, how come you use end - 1 rather than just end?

Also @honnibal, does this beam parsing only work on single words, or also on compound words? E.g. for new york it seems to give me confidence values for new and for york, but not for new york, if I am interpreting the results correctly. But maybe that has something to do with me getting negative values for end?

rbhambriiit commented 5 years ago

I think we should use end instead of end - 1; those are the correct indices, matching the vanilla end attribute on doc.ents.

The only inconsistency in the picture is that this beam decoding shows some entities with probabilities around 0.9 which are not predicted by the NER model otherwise.

Also, there are entities which are predicted but have a very low probability.

Maybe this has to do with what @jorgeaguiar mentioned: "The only problem now is that my models are trained with the standard NER objective and, probably because of that, most entities detected with beam search are wrong."

elbaulp commented 5 years ago

@jorgeaguiar

The only problem now is that my models are trained with the standard NER objective and, probably because of that, most entities detected with beam search are wrong.

For me it seems to be working well, but, as you said, I am using a custom NER model, trained from a blank one.

elbaulp commented 5 years ago

What should I do if I want to add the entities above a certain threshold, say those with probs > 70%? I am doing it this way:

from spacy.tokens import Span

for pred in preds:
    pick_from_probs = get_probabilities...  # entities above the threshold
    for p in pick_from_probs:
        indexes = [set(range(ent.start, ent.end)) for ent in pred.ents]
        start, end = pick_from_probs[p][0], pick_from_probs[p][1]
        # Make sure the new span does not overlap any current one
        if not any([x.intersection(range(start, end)) for x in indexes]):
            span = Span(pred, start, end, label=p)
            pred.ents = pred.ents + (span,) # IS THIS RIGHT?
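
For what it's worth, here's a minimal sketch of one way to do this, assuming doc and the entity_scores dict of (start, end, label) -> score from the earlier snippets; the threshold value is arbitrary and this isn't an official spaCy recipe.

from spacy.tokens import Span

threshold = 0.7
new_ents = list(doc.ents)
taken = set()
for ent in doc.ents:
    taken.update(range(ent.start, ent.end))

for (start, end, label), score in entity_scores.items():
    # Keep only well-formed, confident spans that don't overlap an existing entity.
    if end > start and score > threshold and not taken.intersection(range(start, end)):
        new_ents.append(Span(doc, start, end, label=doc.vocab.strings[label]))
        taken.update(range(start, end))

doc.ents = new_ents
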
honnibal commented 4 years ago

I'm going to close this enhancement issue, because the regression objective idea just doesn't work. Confidence-sensitive NER is still a nice idea and we should investigate other ways of achieving it, but the discussion in this issue is old and kind of misleading now.

lock[bot] commented 4 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

github-actions[bot] commented 2 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.