IlyaGrebnov / libsais

libsais is a library for linear-time suffix array, longest common prefix array, and Burrows-Wheeler transform construction based on the induced sorting algorithm.
Apache License 2.0
180 stars 22 forks

Crash for file size close to 2GB #8

wupengcheng6819 closed this issue 2 years ago

wupengcheng6819 commented 2 years ago

As a long-time libdivsufsort user looking to switch, I noticed that libsais crashes as the file size approaches 2GB (with or without extra space), while divsufsort does not as long as the file size is strictly under 2GB (2*1024*1024*1024 bytes). What is the maximum size that works without switching to the 64-bit version?

IlyaGrebnov commented 2 years ago

The issue is likely due to 32-bit arithmetic overflow (somewhere in the look-ahead / prefetch logic). Note that the maximum theoretical size would be 2GB - 1 (due to the EOF / sentinel symbol), but I recommend switching to libsais64 a few KB before the 2GB limit. For my own compressors (bsc and bsc-m03) I limit the block size to 2047 MB.
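The recommendation above can be sketched as a small size check. This is only an illustration: `MAX_32BIT_INPUT` and `should_use_64bit` are hypothetical names, not part of libsais, and the 2047 MB threshold is an assumption borrowed from the block-size limit mentioned for bsc and bsc-m03.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed threshold: the 2047 MB block limit mentioned for bsc / bsc-m03. */
#define MAX_32BIT_INPUT ((uint64_t)2047 * 1024 * 1024)

/* The threshold must leave headroom below the signed 32-bit maximum. */
static_assert(MAX_32BIT_INPUT < INT32_MAX, "threshold must fit in int32_t");

/* Hypothetical helper (not a libsais function): returns 1 when an input
   of n bytes should be handed to the 64-bit build, 0 otherwise. */
int should_use_64bit(uint64_t n)
{
    return n > MAX_32BIT_INPUT;
}
```

A caller would check the input length first and only then pick the 32-bit or the 64-bit entry point, so that the length is never narrowed to `int32_t` before it is known to fit.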

wupengcheng6819 commented 2 years ago

As it turns out, the program crashes only with extra space. In my case the file size is 2047MB, which runs fine with 0 extra space. The extra space is around 1GB when it crashes, and the debugger reports:

Program received signal SIGSEGV, Segmentation fault.
0x00000000004098bd in libsais_compact_unique_and_nonunique_lms_suffixes_32s ()

IlyaGrebnov commented 2 years ago

Thank you. Now I know where the problem is: the value n + fs overflows a signed 32-bit integer. I will add a fix in the next few days by capping the fs parameter to the correct range. That said, for large files (>100MB) you typically do not need any extra free space, as libsais should be able to carve out enough unused space inside the suffix array itself.
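For illustration, the overflow and the idea of capping fs can be sketched as follows. This is a minimal sketch and not the actual change made in libsais: `cap_free_space` and `sum_overflows_int32` are hypothetical helper names, and the only assumption is that n and fs are signed 32-bit values, as in the 32-bit API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of libsais): clamp the requested extra
   space fs so that n + fs still fits in a signed 32-bit integer. */
int32_t cap_free_space(int32_t n, int32_t fs)
{
    assert(n >= 0);
    if (fs < 0) { return 0; }

    /* Safe subtraction: n >= 0, so INT32_MAX - n cannot overflow. */
    int32_t max_fs = INT32_MAX - n;
    return fs > max_fs ? max_fs : fs;
}

/* Checks in 64-bit arithmetic whether n + fs would exceed INT32_MAX. */
int sum_overflows_int32(int32_t n, int32_t fs)
{
    return (int64_t)n + (int64_t)fs > (int64_t)INT32_MAX;
}
```

With the numbers from this report (n = 2047 MB, fs around 1 GB), the sum is well above INT32_MAX, so it wraps when computed in 32-bit arithmetic. After capping, fs shrinks to just under 1 MB and the sum stays in range.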

IlyaGrebnov commented 2 years ago

Fixed in 2.6.5 (Capped free space parameter to avoid crashing due to 32-bit integer overflow).

wupengcheng6819 commented 2 years ago

Thanks for the fix, it worked like a charm!