bab2min / kiwipiepy

Python API for Kiwi

Segmentation fault with long repetitive sequences #158

Closed batterseapower closed 7 months ago

batterseapower commented 8 months ago

If I run this:

from kiwipiepy import Kiwi
len(Kiwi().tokenize('보통' * 40000))

Then Python dies with a segmentation fault. Running it under gdb, the backtrace looks like:

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007ffff6a16594 in unsigned long kiwi::splitByTrie<(kiwi::ArchType)5, false>(std::vector<kiwi::KGraphNode, mi_stl_allocator<kiwi::KGraphNode> >&, kiwi::Form const*, unsigned long const*, kiwi::utils::FrozenTrie<char16_t, kiwi::Form const*, int, kiwi::utils::detail::HasSubmatch<kiwi::Form const*, void> > const&, nonstd::sv_lite::basic_string_view<char16_t, std::char_traits<char16_t> >, unsigned long, kiwi::Match, unsigned long, unsigned long, float, kiwi::PretokenizedSpanGroup::Span const*&, kiwi::PretokenizedSpanGroup::Span const*) ()

If I change 40000 to 20000, the program works, though the tokenization takes about 10 seconds.

I'm using version 0.16.2.
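
On an affected version, one possible workaround is to feed the text to the tokenizer in smaller pieces. The following is only a minimal sketch: `tokenize_in_chunks` and its `chunk_size` default are hypothetical names and values, not part of kiwipiepy, and cutting at fixed character offsets can split a word at a chunk boundary, so the tokens near each cut may differ from what a single call would return. Any per-token offsets would also be relative to the chunk, not to the full text.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def tokenize_in_chunks(
    text: str,
    tokenize: Callable[[str], Sequence[T]],
    chunk_size: int = 10_000,  # assumed value, chosen to stay well below the crashing input size
) -> List[T]:
    """Call `tokenize` on fixed-size slices of `text` and concatenate the results.

    Hypothetical helper: it cuts purely by character count, so a word that
    straddles a boundary is tokenized as two separate fragments.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    tokens: List[T] = []
    for start in range(0, len(text), chunk_size):
        tokens.extend(tokenize(text[start:start + chunk_size]))
    return tokens


# Stand-in tokenizer (one token per character) so the sketch runs without kiwipiepy.
demo = tokenize_in_chunks("보통" * 5, list, chunk_size=4)
print(len(demo))

# With kiwipiepy installed, the call from this report would become (not run here):
# from kiwipiepy import Kiwi
# tokens = tokenize_in_chunks('보통' * 40000, Kiwi().tokenize)
```

For real text, cutting at sentence or whitespace boundaries instead of fixed offsets would avoid splitting words, at the cost of a slightly more involved splitter.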

bab2min commented 8 months ago

Hi @batterseapower, thank you for reporting the bug. I'll examine it.

bab2min commented 7 months ago

@batterseapower As of v0.17.0, the bug is fixed and performance on long inputs has improved.