explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Sentence segmentation silently fails with no POS tagger #1458

Closed: DeNeutoy closed this issue 6 years ago

DeNeutoy commented 6 years ago

When a document is processed by a Language object that was loaded without a POS tagger, sentence segmentation returns a single unsegmented Span covering the whole document instead of raising an error.

Environment

Trying to do sentence segmentation without the parser throws an interpretable error 👍

import spacy

# Parser disabled: asking for sentences fails loudly.
nlp = spacy.load("en", parser=False, vectors=False)
x = nlp("This is a sentence. Here is another one. This is another sentence.")
list(x.sents) # Throws error requiring dependency parser. Good.

These two cases silently return a single spacy.tokens.span.Span object consisting of the entire document.

# Case 1: tagger disabled, parser left at its default.
nlp = spacy.load("en", tagger=False, vectors=False)
x = nlp("This is a sentence. Here is another one. This is another sentence.")
list(x.sents) # == ["This is a sentence. Here is another one. This is another sentence."]

# Case 2: tagger disabled, parser explicitly enabled.
nlp = spacy.load("en", tagger=False, parser=True, vectors=False)
x = nlp("This is a sentence. Here is another one. This is another sentence.")
list(x.sents) # == ["This is a sentence. Here is another one. This is another sentence."]

It seems like this could be fixed by simply requiring that Doc.is_tagged == True here. I'm happy to submit a PR for this, but it looks like a fix that may break things, so I thought I'd check here first.
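To make the proposal concrete, here is a minimal user-side sketch of the same guard. It is not the library change itself: `checked_sents` and `MissingTagsError` are hypothetical names, and the only things assumed about the document object are the `is_tagged` flag mentioned above and the `sents` attribute used in the examples.

```python
class MissingTagsError(ValueError):
    """Hypothetical error type: sentences were requested from an untagged doc."""


def checked_sents(doc):
    """Return the sentences of `doc` as a list, refusing untagged input.

    `doc` only needs an `is_tagged` flag and a `sents` iterable, so any
    object with those two attributes can be passed in.
    """
    # Treat a missing flag the same as an untagged document.
    if not getattr(doc, "is_tagged", False):
        raise MissingTagsError(
            "doc.sents was requested, but the document has no POS tags "
            "(is_tagged is False); load the tagger before segmenting."
        )
    return list(doc.sents)
```

With this in place, calling `checked_sents(nlp(text))` in the two cases above would raise instead of quietly returning one Span. The trade-off is the one already raised: any caller that currently relies on the silent behaviour would start seeing an exception.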

honnibal commented 6 years ago

Hey,

The document is being parsed. The problem is that the parser is basically useless if you don't run the tagger, because the input it receives at run-time is so different from the training data.

This is fixed in v2, because the parser no longer uses the POS tags as features.

We're pushing another release tonight or tomorrow, but you can already try it with pip install spacy-nightly. The docs are at https://alpha.spacy.io

We'll be pushing a release candidate for v2 as soon as we get the models retrained and finish the rest of the tests. All the target features are now implemented, and there are currently 0 open bugs on the repository 🎉

DeNeutoy commented 6 years ago

Awesome, looking forward to v2. spaCy is 🔥.
