explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

[Enhancement request] Leverage pySBD for finding sentence boundaries #10207

Closed · LifeIsStrange closed this issue 2 years ago

LifeIsStrange commented 2 years ago

https://github.com/nipunsadvilkar/pySBD The benchmark shows that its accuracy significantly outperforms spaCy's: https://github.com/nipunsadvilkar/pySBD/blob/master/artifacts/pysbd_poster.png

Since it is a bit slower, it could make sense to make it deactivatable via a flag.
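A rough sketch of what that could look like with spaCy v3's existing pipeline controls; the component name "sbd" is hypothetical and stands in for a bundled pySBD-based segmenter:

import spacy
from spacy.language import Language

@Language.component("sbd")  # stand-in for a hypothetical bundled pySBD segmenter
def sbd_stub(doc):
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("sbd")

# switch the segmenter off temporarily for a block of processing
with nlp.select_pipes(disable=["sbd"]):
    doc = nlp("Sentence one. Sentence two.")

# or leave it out entirely when loading a packaged pipeline:
# nlp = spacy.load("some_packaged_pipeline", exclude=["sbd"])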

pmbaumgartner commented 2 years ago

Hey @LifeIsStrange - Thanks for the issue.

It looks like this is available as a third-party component, but it hasn't been updated for spaCy v3. I saw there was an open PR on the repository; I'd recommend looking at that code (link) to see how to implement this as a component with v3 and then using it in your pipeline. I'll also put a slightly modified version of that approach below:


import pysbd
import spacy
from spacy.language import Language

text = "My name is Jonas E. Smith.          Please turn to p. 55."
nlp = spacy.blank("en")

@Language.component("sbd")
def pysbd_sentence_boundaries(doc):
    # segment the raw text with pySBD, keeping character offsets for each sentence
    seg = pysbd.Segmenter(language="en", clean=False, char_span=True)
    sents_char_spans = seg.segment(doc.text)
    # map pySBD's character spans onto spaCy token spans
    char_spans = [doc.char_span(sent_span.start, sent_span.end, alignment_mode="contract") for sent_span in sents_char_spans]
    # character offsets of the first token in each sentence
    start_token_ids = [span[0].idx for span in char_spans if span is not None]
    # mark sentence starts so that doc.sents follows pySBD's segmentation
    for token in doc:
        token.is_sent_start = token.idx in start_token_ids
    return doc

# add as a spaCy pipeline component
nlp.add_pipe("sbd", first=True)

doc = nlp(text)
for sent in doc.sents:
    print(sent.text)

If you want to place it at a specific position in an existing pipeline, relative to its other components, there's also an example of that here.
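For illustration, a minimal sketch of that kind of placement, assuming the "sbd" component registered above and an installed pretrained pipeline such as en_core_web_sm:

import spacy

# assumes the "sbd" component from the snippet above has already been registered
nlp = spacy.load("en_core_web_sm")

# insert the pySBD component before the parser; the parser will respect
# sentence boundaries that have already been set on the tokens
nlp.add_pipe("sbd", before="parser")
print(nlp.pipe_names)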

LifeIsStrange commented 2 years ago

@pmbaumgartner Thanks for the help with setting it up :) I know it can be used alongside spaCy 3; however, the goal of this issue is to bundle it by default in order to improve sentence boundary detection for everyone, not just for academic nerds.

pmbaumgartner commented 2 years ago

Thanks for the suggestion. Since there's a working third-party solution, this can stay as a third-party component. pySBD includes more than just the sentence segmenter, so I don't think it's as simple as bundling what currently exists into spaCy. Generally, we prefer the flexibility of custom components for specific problems over incorporating and maintaining new functionality.

LifeIsStrange commented 2 years ago

Well, I'd argue the core functionality is already performed by the base spaCy parser and that there is room for accuracy improvements. However, the task may be non-trivial, and given that a third-party solution exists, I understand it is not a priority, but it still seems like a worthwhile improvement long term.

Feel free to close the "issue"

github-actions[bot] commented 2 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.