Closed batterseapower closed 7 months ago
If I run this:
    from kiwipiepy import Kiwi
    len(Kiwi().tokenize('보통' * 40000))
Then Python dies with a segmentation fault. Running it in gdb, the backtrace looks like:
    Thread 1 "python" received signal SIGSEGV, Segmentation fault.
    0x00007ffff6a16594 in unsigned long kiwi::splitByTrie<(kiwi::ArchType)5, false>(std::vector<kiwi::KGraphNode, mi_stl_allocator<kiwi::KGraphNode> >&, kiwi::Form const*, unsigned long const*, kiwi::utils::FrozenTrie<char16_t, kiwi::Form const*, int, kiwi::utils::detail::HasSubmatch<kiwi::Form const*, void> > const&, nonstd::sv_lite::basic_string_view<char16_t, std::char_traits<char16_t> >, unsigned long, kiwi::Match, unsigned long, unsigned long, float, kiwi::PretokenizedSpanGroup::Span const*&, kiwi::PretokenizedSpanGroup::Span const*) ()
If I change 40000 to 20000 then the program works OK, though it takes about 10 seconds to run the tokenization.
I'm using version 0.16.2
Hi @batterseapower, thank you for reporting the bug. I'll examine it.
@batterseapower The bug was fixed in v0.17.0, and performance on long inputs was improved as well.
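For anyone landing here later, a minimal sketch of a version check may help before feeding very long strings to the tokenizer. It compares the installed kiwipiepy version against 0.17.0, the release named above as containing the fix. The helper names `parse_version` and `may_be_affected` are hypothetical, and treating every release below 0.17.0 as possibly affected is an assumption: the thread only confirms the crash on 0.16.2.

```python
from importlib import metadata

# Release named in this thread as containing the fix.
FIXED_IN = (0, 17, 0)


def parse_version(version):
    """Turn a string such as '0.16.2' into a tuple such as (0, 16, 2).

    Only the leading digits of the first three dot-separated pieces are
    used, so a suffix like 'rc1' is ignored.
    """
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def may_be_affected(version):
    """Assumption: any release older than 0.17.0 might show the crash."""
    return parse_version(version) < FIXED_IN


if __name__ == "__main__":
    try:
        installed = metadata.version("kiwipiepy")
    except metadata.PackageNotFoundError:
        print("kiwipiepy is not installed")
    else:
        if may_be_affected(installed):
            print(f"kiwipiepy {installed}: consider upgrading to 0.17.0 or later")
        else:
            print(f"kiwipiepy {installed}: includes the fix from this thread")
```

The check reads the version through `importlib.metadata` rather than importing the package, so it does not load the native extension.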