Closed asarubbo closed 1 year ago
According to Ilya:
This input is not valid BWT string. I attempted to decode it with different algorithm (see below) and it failed as well. libsais uses variation of BWT with sentinel symbol ($) as oppose to variation with cyclic permutation (like in bzip2). This BWT variations are not compatible and behavior of libsais_unbwt is not defined for such inputs. (#10)
Hence I doubt that the undefined behaviour in libsais on invalid inputs will be fixed.
I am currently dedicating my efforts on implementing further enhancements for libcubwt. However, I would like to assure that I will investigate this issue, likely within the next few weeks.
@kspalaiologos and @asarubbo I would like to provide an update on the fuzz testing using libFuzzer I conducted on libsais.
After extensive testing, no correctness issues were identified for libsais. It was able to successfully encode and decode any arbitrary string generated by the fuzzer.
However, during the testing, I did identify two undefined behaviors with libsais:
Invalid input for BWT Decoding: Not all string plus index pairs are valid inputs for BWT decoding because there are no strings that could generate them. As a result, libsais will crash on this inputs.
Uninitialized read errors: Sometimes, libsais attempts to read and prefetch look-ahead values that might not be populated yet. While this is not a correctness issue since prefetch instructions are non-faulting and merely a hardware hint, memory sanitizer triggers uninitialized read errors.
Based on fuzzy testing, both of these undefined behaviors could be eliminated by clearing the input SA and A arrays before calling libsais.
In addition to the above, I also looked into the bzip3 source code to determine the root cause of the discovered bugs. After analyzing the code, it appears that calls to libsais are made with less allocated space than required by the algorithm. Specifically, the SA array needs to be at least n bytes and A array at least n + 1 bytes. However, from my investigation, it seems that arrays are allocated by block_size + 128 bytes, while the lzp_size is bound by input_size + input_size / 50 + 32 bytes. Assuming that block size and input size are the same, the LZP could inflate the input by 2%, which leads to the inflate size being larger than the allocated space for libsais (input_size + 2% > block_size + 128), ultimately resulting in the crash.
I hope this details is sufficient to improve the reliability of bzip3. Please let me know if you have any questions or concerns.
Thanks for all the valuable input! I have implemented your recommendations and now I am fuzzing bzip3 with AFL on my machines. If there's anything left related to libsais I will make sure to let you know.
@kspalaiologos and @asarubbo I would like to provide an update on the fuzz testing using libFuzzer I conducted on libsais.
Thanks for the update and for you valuable work. If you want to share the code used for fuzz testing, I (and other people) can use it too to report further issues, or you can consider https://github.com/google/oss-fuzz
Hello,
I reported multiple vulnerabilities in bzip3, and some of these looks to be related to libsais.
bzip3 uses its own version of libsais, but you may want to see the report to understand if those issues affect your libsais as well.
As a side note: If you have C tests that uses libsais to parse untrusted input, I can help discover bugs with fuzzing.