asarubbo commented 1 year ago

Hello,

I reported multiple vulnerabilities in bzip3, and some of these looks to be related to libsais.

bzip3 uses its own version of libsais, but you may want to see the report to understand if those issues affect your libsais as well.

As a side note: If you have C tests that uses libsais to parse untrusted input, I can help discover bugs with fuzzing.

kspalaiologos commented 1 year ago

According to Ilya:

This input is not valid BWT string. I attempted to decode it with different algorithm (see below) and it failed as well. libsais uses variation of BWT with sentinel symbol ($) as oppose to variation with cyclic permutation (like in bzip2). This BWT variations are not compatible and behavior of libsais_unbwt is not defined for such inputs. (#10)

Hence I doubt that the undefined behaviour in libsais on invalid inputs will be fixed.

IlyaGrebnov commented 1 year ago

I am currently dedicating my efforts on implementing further enhancements for libcubwt. However, I would like to assure that I will investigate this issue, likely within the next few weeks.

IlyaGrebnov commented 1 year ago

@kspalaiologos and @asarubbo I would like to provide an update on the fuzz testing using libFuzzer I conducted on libsais.

Correctness Issues

After extensive testing, no correctness issues were identified for libsais. It was able to successfully encode and decode any arbitrary string generated by the fuzzer.

Undefined Behaviors

However, during the testing, I did identify two undefined behaviors with libsais:

Invalid input for BWT Decoding: Not all string plus index pairs are valid inputs for BWT decoding because there are no strings that could generate them. As a result, libsais will crash on this inputs.
Uninitialized read errors: Sometimes, libsais attempts to read and prefetch look-ahead values that might not be populated yet. While this is not a correctness issue since prefetch instructions are non-faulting and merely a hardware hint, memory sanitizer triggers uninitialized read errors.

Recommendation

Based on fuzzy testing, both of these undefined behaviors could be eliminated by clearing the input SA and A arrays before calling libsais.

Investigation into bzip3 Source Code

In addition to the above, I also looked into the bzip3 source code to determine the root cause of the discovered bugs. After analyzing the code, it appears that calls to libsais are made with less allocated space than required by the algorithm. Specifically, the SA array needs to be at least n bytes and A array at least n + 1 bytes. However, from my investigation, it seems that arrays are allocated by block_size + 128 bytes, while the lzp_size is bound by input_size + input_size / 50 + 32 bytes. Assuming that block size and input size are the same, the LZP could inflate the input by 2%, which leads to the inflate size being larger than the allocated space for libsais (input_size + 2% > block_size + 128), ultimately resulting in the crash.

I hope this details is sufficient to improve the reliability of bzip3. Please let me know if you have any questions or concerns.

kspalaiologos commented 1 year ago

Thanks for all the valuable input! I have implemented your recommendations and now I am fuzzing bzip3 with AFL on my machines. If there's anything left related to libsais I will make sure to let you know.

asarubbo commented 1 year ago

@kspalaiologos and @asarubbo I would like to provide an update on the fuzz testing using libFuzzer I conducted on libsais.

Thanks for the update and for you valuable work. If you want to share the code used for fuzz testing, I (and other people) can use it too to report further issues, or you can consider https://github.com/google/oss-fuzz

IlyaGrebnov / libsais

multiple vulnerabilities in bzip3 #13

Correctness Issues

Undefined Behaviors

Recommendation

Investigation into bzip3 Source Code