allenai / scispacy

A full spaCy pipeline and models for scientific/biomedical documents.
https://allenai.github.io/scispacy/
Apache License 2.0

Sentencizer error for a particular abstract #207

Closed kyleclo closed 3 years ago

kyleclo commented 4 years ago

On the following abstract:

Evolutionary algorithms (EAs) form a popular optimisation paradigm inspired by natural evolution. In recent years the field of evolutionary computation has developed a rigorous analytical theory to analyse their runtime on many illustrative problems. Here we apply this theory to a simple model of natural evolution. In the Strong Selection Weak Mutation (SSWM) evolutionary regime the time between occurrence of new mutations is much longer than the time it takes for a new beneficial mutation to take over the population. In this situation, the population only contains copies of one genotype and evolution can be modelled as a (1+1)-type process where the probability of accepting a new genotype (improvements or worsenings) depends on the change in fitness.  We present an initial runtime analysis of SSWM, quantifying its performance for various parameters and investigating differences to the (1+1)EA. We show that SSWM can have a moderate advantage over the (1+1)EA at crossing fitness valleys and study an example where SSWM outperforms the (1+1)EA by taking advantage of information on the fitness gradient.

Calling nlp(text) results in this error:

~/.conda/envs/transformers/lib/python3.7/site-packages/spacy/language.py in __call__(self, text, disable, component_cfg)
    433             if not hasattr(proc, "__call__"):
    434                 raise ValueError(Errors.E003.format(component=type(proc), name=name))
--> 435             doc = proc(doc, **component_cfg.get(name, {}))
    436             if doc is None:
    437                 raise ValueError(Errors.E005.format(name=name))

~/.conda/envs/transformers/lib/python3.7/site-packages/scispacy/custom_sentence_segmenter.py in combined_rule_sentence_segmenter(doc)
     55             built_up_sentence = token.text_with_ws
     56             segment_index += 1
---> 57             current_segment = segments[segment_index]
     58         else:
     59             built_up_sentence += token.text_with_ws

IndexError: list index out of range
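For illustration, here is a toy version of this kind of segment-alignment loop (a simplified sketch, not scispacy's actual code): the index into `segments` advances each time the accumulated token text completes the current segment, so if the segmenter's boundaries drift out of sync with the tokens, the final lookup overruns the list.

```python
def align_tokens_to_segments(tokens, segments):
    """Toy alignment loop: start a new sentence whenever the text built up
    so far has consumed the current segment (hypothetical simplification)."""
    sentences = []
    segment_index = 0
    current_segment = segments[segment_index]
    built_up_sentence = ""
    for token in tokens:
        if built_up_sentence.strip() == current_segment:
            sentences.append(built_up_sentence.strip())
            built_up_sentence = token
            segment_index += 1
            current_segment = segments[segment_index]  # IndexError when out of sync
        else:
            built_up_sentence += token
    if built_up_sentence:
        sentences.append(built_up_sentence.strip())
    return sentences

# In sync: one segment per sentence works fine.
print(align_tokens_to_segments(["One. ", "Two."], ["One.", "Two."]))
# → ['One.', 'Two.']

# Out of sync (e.g. a segment boundary fell inside a token, so the segment
# list is exhausted while tokens keep matching): IndexError, as above.
# align_tokens_to_segments(["One. ", "Two. ", "Three."], ["One.", "Two."])
```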

This is using:

import spacy
from spacy.language import Language
from scispacy.custom_sentence_segmenter import combined_rule_sentence_segmenter

nlp = spacy.load('en_core_sci_sm', disable=["tagger", "parser", "textcat", "ner"])
Language.factories['combined_rule_sentence_segmenter'] = lambda nlp, **cfg: combined_rule_sentence_segmenter
nlp.add_pipe(nlp.create_pipe('combined_rule_sentence_segmenter'), first=True)

on version 0.2.4.

dakinggg commented 4 years ago

Yeah, Tom ran into this as well. It turns out that pysbd can segment a sentence in the middle of a spaCy token, and I didn't handle that case; I'm not even sure what the right thing to do there would be. There seem to be a fair number of different patterns that can trigger it, and issues in pysbd haven't been fixed for quite a while. So at this point I think you should either 1) run pysbd on its own and then run scispacy over the individual sentences (not sure if there are other pysbd problems that might make this hard), 2) use the built-in dependency-parser-based splitter, or 3) use the built-in rule-based splitter (https://spacy.io/api/sentencizer).
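Option 3 can be sketched as follows (a minimal example of spaCy's built-in Sentencizer; the try/except covers both the spaCy 2.x API that scispacy 0.2.4 runs on and the spaCy 3.x string-based API):

```python
import spacy

# Blank English pipeline with only the rule-based Sentencizer, which splits
# on sentence-final punctuation instead of pysbd's segments.
nlp = spacy.blank("en")
try:
    nlp.add_pipe(nlp.create_pipe("sentencizer"))  # spaCy 2.x
except ValueError:
    nlp.add_pipe("sentencizer")                   # spaCy 3.x

doc = nlp("Evolutionary algorithms (EAs) form a popular optimisation "
          "paradigm. Here we apply this theory to a simple model.")
sentences = [sent.text for sent in doc.sents]
```

This avoids the alignment problem entirely, since the Sentencizer operates directly on spaCy tokens and can never place a boundary inside one.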

DeNeutoy commented 3 years ago

This should be fixed now because we've upgraded pysbd 👍