explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License
29.76k stars 4.36k forks source link

Incorrect sentence segmentation when supplying tokenization #752

Closed smartschat closed 7 years ago

smartschat commented 7 years ago

When supplying spaCy with pre-computed tokenization I sometimes get weird sentence segmentation results.

An example:

import spacy

nlp = spacy.load("en")

tokens = ['Dr', 'Conrad', 'Murray', ',', 'who', 'police', 'say', 'is', 'not', 'a', 
          'suspect', ',', 'was', 'at', 'Jackson', "'s", 'mansion', 'and', 'tried', 
          'to', 'revive', 'him', 'before', 'he', 'died', '.', 'Assistant', 'Special', 
          'Agent', 'Michael', 'Flanigan', 'of', 'the', 'Las', 'Vegas', 'branch', 'of', 
          'the', 'Drug', 'Enforcement', 'Agency', '-LRB-', 'DEA', '-RRB-', 'said', 
          'the', 'operation', 'was', 'likely', 'to', 'take', '``', 'a', 'couple', 
          'of', 'hours', "''", '.']

doc = spacy.tokens.Doc(nlp.vocab, words=tokens)

nlp.parser(doc)

for sent in doc.sents:
    print(sent)

The output then is:

Dr Conrad Murray , who police say is not a suspect , was at Jackson 's mansion and tried to revive him before he died . Assistant Special Agent Michael Flanigan of the
Las Vegas branch of
the Drug Enforcement Agency -LRB-
DEA -RRB-
said the
operation was likely to take
`` a couple of hours ''
.

I'm running spaCy 1.6 with Python 3.5.2 on Debian GNU/Linux 8 (jessie) 64-bit.

honnibal commented 7 years ago

spaCy speaks text, not Treebankese :).

You need to unescape your -RRB-, -LRB-, '', `` etc. These aren't words, so spaCy isn't trained on them, and the model has no idea what they might mean.

smartschat commented 7 years ago

Thank you for the fast reply! However, even after replacing the bracket and quotation tokens the parsing/sentence splitting produces similar output.

In contrast, when running

doc = nlp("Dr Conrad Murray, who police say is not a suspect, was at Jackson's mansion and tried to revive  him before he died. Assistant Special Agent Michael Flanigan of the Las Vegas branch of the Drug Enforcement Agency -LRB- DEA -RRB- said the operation was likely to take ``a couple of hours''.")

the sentence splitting is correct.

honnibal commented 7 years ago

Oh right. You didn't run the tagger.

nlp.tagger(doc)

honnibal commented 7 years ago

Btw, are you interested in getting cort ported to spaCy? I would definitely support that :)

honnibal commented 7 years ago

There's really not so much to do...I even have a beam parser in another branch:

Parser: https://github.com/explosion/spaCy/blob/july16/spacy/syntax/beam_parser.pyx

Structured Averaged Perceptron and structured MLP: https://github.com/explosion/spaCy/blob/july16/spacy/syntax/_neural.pyx

The parser uses an imitation learning strategy with a dynamic oracle, so the path through the transition-space is left latent. I've had good success with using the maximum violation update with beam-search to train this. The code above also supports an inexact memoisation mechanism: a hash of each state is calculated, including the history, so that strictly dominated states can be pruned from the search.

smartschat commented 7 years ago

Thanks, that resolved the issue!

I am very interested in getting cort ported to spaCy! I'll contact you.

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.