fnl / syntok

Text tokenization and sentence segmentation (segtok v2)
MIT License

Parenthesis at the end of input causes IndexError #19

Closed by windreamer 2 years ago

windreamer commented 2 years ago

Hi folks, I like this cool segmenter for quality and speed, but something is a bit weird.

from syntok.segmenter import analyze
text = '''Alexandri Aetoli Testimonia et Fragmenta. Studi e Testi 15. (1999)'''

for p in analyze(text):
    for s in p:
        print(' '.join(str(t) for t in s))

I got:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-15-1217f364130d> in <module>
      1 for p in analyze(text):
----> 2     for s in p:
      3         print(' '.join(str(t) for t in s))
      4

~/Codebase/toolchain/__pypackages__/3.9/lib/syntok/segmenter.py in segment(tokens, bracket_skip_len)
    106         State.max_bracket_skipping_length = int(bracket_skip_len)
    107
--> 108     for state in Begin(tokens):
    109         if state.at_sentence:
    110             history = state.collect_history()

~/Codebase/toolchain/__pypackages__/3.9/lib/syntok/_segmentation_states.py in __iter__(self)
    128         while state is not None:
    129             yield state
--> 130             state = next(state, None)
    131
    132     @abstractmethod

~/Codebase/toolchain/__pypackages__/3.9/lib/syntok/_segmentation_states.py in __next__(self)
    468                 return Terminal(self._stream, self._queue, self._history)
    469
--> 470             self._move()  # Do not skip parenthesis if they open the sentence.
    471
    472             if self.next_is_a_terminal:

~/Codebase/toolchain/__pypackages__/3.9/lib/syntok/_segmentation_states.py in _move(self)
    324     def _move(self) -> bool:
    325         """Advance the queue, storing the old value in history."""
--> 326         self.__history.append(self.__queue.pop(0))
    327
    328         if not self.__queue:

IndexError: pop from empty list

Can anyone help me with this?
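
The last frame of the traceback shows the immediate cause: `pop(0)` is called on a queue that is already empty. The following is a toy sketch of that failure mode only, not syntok's code or its eventual fix; the function names `move` and `move_guarded` and the plain-list queue are assumptions made for illustration.

```python
def move(queue, history):
    """Unguarded advance, mirroring the failing line in the traceback."""
    # list.pop(0) raises IndexError when the list is empty.
    history.append(queue.pop(0))


def move_guarded(queue, history):
    """Hypothetical guarded variant: report whether a token was moved."""
    if not queue:
        # Nothing left to consume; let the caller decide how to terminate.
        return False
    history.append(queue.pop(0))
    return True


if __name__ == "__main__":
    queue, history = [")"], []
    print(move_guarded(queue, history))  # the closing parenthesis is consumed
    print(move_guarded(queue, history))  # the queue is now empty, no exception
```

Whether a guard like this or a different state transition is the right remedy depends on the state machine's invariants, which only the library's own code can settle.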

fnl commented 2 years ago

Looks like a regression from my latest update on handling parentheses. Your phrase probably needs to be converted to a test case, analyzed, and fixed. Can you confirm whether any 1.3 version works?

windreamer commented 2 years ago

@fnl syntok==1.3.3 works fine:

➜  test pdm run python test.py
Alexandri  Aetoli  Testimonia  et  Fragmenta .
 Studi  e  Testi  15 .
 ( 1999 )
➜  test pdm list --freeze
regex==2022.1.18
syntok==1.3.3

fnl commented 2 years ago

This was a regression introduced by 1.4.1. Thank you for pointing out the issue and helping in its review, @windreamer. The issue is fixed in the latest release v1.4.2.

windreamer commented 2 years ago

Thanks, @fnl, for the quick fix!