Arpan3323 commented 1 week ago

Hi, I have confirmed the existence of an addition of unsigned offset runtime error at this line (527): int offset = positions[i-1] + sizeS2; due to invalid index when i is 0. There are a few redundancies in this function as well. https://github.com/3dem/relion/blob/f2e59d6ec61d3f92df31cebbb7402f1012b17a9e/src/strings.cpp#L487

line 504: if (newPos < 0) could also be problematic and a better approach may be to use std::string::npos

Crash report

Below is the crash report generated on fuzz testing this function in its current state:

INFO: Running with entropic power schedule (0xFF, 100).
INFO: Seed: 1375119981
INFO: Loaded 1 modules   (1219 inline 8-bit counters): 1219 [0x5573a256f980, 0x5573a256fe43), 
INFO: Loaded 1 PC tables (1219 PCs): 1219 [0x5573a256fe48,0x5573a2574a78), 
INFO: -max_len is not provided; libFuzzer will not generate inputs larger than 4096 bytes
INFO: A corpus is not provided, starting from an empty corpus
#2  INITED cov: 2 ft: 2 corp: 1/1b exec/s: 0 rss: 35Mb
/usr/bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/bits/stl_vector.h:1129:34: runtime error: addition of unsigned offset to 0x5020000003b0 overflowed to 0x5020000003ac
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/bits/stl_vector.h:1129:34 
=================================================================
==40219==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x5020000003ac at pc 0x5573a251d74e bp 0x7ffd08b450f0 sp 0x7ffd08b450e8
READ of size 4 at 0x5020000003ac thread T0
    #0 0x5573a251d74d in splitString(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>>&, bool) strings.cpp:527:16
    #1 0x5573a2523a47 in LLVMFuzzerTestOneInput fuzzer.cpp:19:5
    #2 0x5573a241ede4 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) (stringsFuzzer_splitString+0x57de4) (BuildId: 19fc222ea9715e7610501988538579823e16b7f1)
    #3 0x5573a241e4d9 in fuzzer::Fuzzer::RunOne(unsigned char const*, unsigned long, bool, fuzzer::InputInfo*, bool, bool*) (stringsFuzzer_splitString+0x574d9) (BuildId: 19fc222ea9715e7610501988538579823e16b7f1)
    #4 0x5573a241fcc5 in fuzzer::Fuzzer::MutateAndTestOne() (stringsFuzzer_splitString+0x58cc5) (BuildId: 19fc222ea9715e7610501988538579823e16b7f1)
    #5 0x5573a2420825 in fuzzer::Fuzzer::Loop(std::vector<fuzzer::SizedFile, std::allocator<fuzzer::SizedFile>>&) (stringsFuzzer_splitString+0x59825) (BuildId: 19fc222ea9715e7610501988538579823e16b7f1)
    #6 0x5573a240daff in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) (stringsFuzzer_splitString+0x46aff) (BuildId: 19fc222ea9715e7610501988538579823e16b7f1)
    #7 0x5573a2438186 in main (stringsFuzzer_splitString+0x71186) (BuildId: 19fc222ea9715e7610501988538579823e16b7f1)
    #8 0x7f38810d11c9  (/lib/x86_64-linux-gnu/libc.so.6+0x2a1c9) (BuildId: 6d64b17fbac799e68da7ebd9985ddf9b5cb375e6)
    #9 0x7f38810d128a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2a28a) (BuildId: 6d64b17fbac799e68da7ebd9985ddf9b5cb375e6)
    #10 0x5573a2402ae4 in _start (stringsFuzzer_splitString+0x3bae4) (BuildId: 19fc222ea9715e7610501988538579823e16b7f1)

0x5020000003ac is located 4 bytes before 4-byte region [0x5020000003b0,0x5020000003b4)
allocated by thread T0 here:
    #0 0x5573a2511511 in operator new(unsigned long) (stringsFuzzer_splitString+0x14a511) (BuildId: 19fc222ea9715e7610501988538579823e16b7f1)
    #1 0x5573a252289c in std::__new_allocator<int>::allocate(unsigned long, void const*) /usr/bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/bits/new_allocator.h:151:27
    #2 0x5573a252289c in std::allocator_traits<std::allocator<int>>::allocate(std::allocator<int>&, unsigned long) /usr/bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/bits/alloc_traits.h:482:20
    #3 0x5573a252289c in std::_Vector_base<int, std::allocator<int>>::_M_allocate(unsigned long) /usr/bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/bits/stl_vector.h:381:20
    #4 0x5573a252289c in void std::vector<int, std::allocator<int>>::_M_realloc_insert<int const&>(__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int>>>, int const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/bits/vector.tcc:459:33
    #5 0x5573a251c960 in std::vector<int, std::allocator<int>>::push_back(int const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/bits/stl_vector.h:1292:4
    #6 0x5573a251c960 in splitString(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>>&, bool) strings.cpp:514:13
    #7 0x5573a2523a47 in LLVMFuzzerTestOneInput fuzzer.cpp:19:5
    #8 0x5573a241ede4 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) (stringsFuzzer_splitString+0x57de4) (BuildId: 19fc222ea9715e7610501988538579823e16b7f1)
    #9 0x5573a241e4d9 in fuzzer::Fuzzer::RunOne(unsigned char const*, unsigned long, bool, fuzzer::InputInfo*, bool, bool*) (stringsFuzzer_splitString+0x574d9) (BuildId: 19fc222ea9715e7610501988538579823e16b7f1)
    #10 0x5573a241fcc5 in fuzzer::Fuzzer::MutateAndTestOne() (stringsFuzzer_splitString+0x58cc5) (BuildId: 19fc222ea9715e7610501988538579823e16b7f1)
    #11 0x5573a2420825 in fuzzer::Fuzzer::Loop(std::vector<fuzzer::SizedFile, std::allocator<fuzzer::SizedFile>>&) (stringsFuzzer_splitString+0x59825) (BuildId: 19fc222ea9715e7610501988538579823e16b7f1)
    #12 0x5573a240daff in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) (stringsFuzzer_splitString+0x46aff) (BuildId: 19fc222ea9715e7610501988538579823e16b7f1)
    #13 0x5573a2438186 in main (stringsFuzzer_splitString+0x71186) (BuildId: 19fc222ea9715e7610501988538579823e16b7f1)
    #14 0x7f38810d11c9  (/lib/x86_64-linux-gnu/libc.so.6+0x2a1c9) (BuildId: 6d64b17fbac799e68da7ebd9985ddf9b5cb375e6)
    #15 0x7f38810d128a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2a28a) (BuildId: 6d64b17fbac799e68da7ebd9985ddf9b5cb375e6)
    #16 0x5573a2402ae4 in _start (stringsFuzzer_splitString+0x3bae4) (BuildId: 19fc222ea9715e7610501988538579823e16b7f1)

SUMMARY: AddressSanitizer: heap-buffer-overflow strings.cpp:527:16 in splitString(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>>&, bool)
Shadow bytes around the buggy address:
  0x502000000100: fa fa 00 fa fa fa 01 fa fa fa 00 fa fa fa 00 00
  0x502000000180: fa fa 00 fa fa fa fd fa fa fa fd fa fa fa fd fd
  0x502000000200: fa fa fd fa fa fa fd fa fa fa 00 00 fa fa fd fa
  0x502000000280: fa fa fd fa fa fa fd fa fa fa fd fa fa fa fd fa
  0x502000000300: fa fa fd fa fa fa 04 fa fa fa fd fa fa fa fd fa
=>0x502000000380: fa fa 04 fa fa[fa]04 fa fa fa fa fa fa fa fa fa
  0x502000000400: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x502000000480: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x502000000500: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x502000000580: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x502000000600: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==40219==ABORTING
MS: 5 ShuffleBytes-ChangeBit-CopyPart-ChangeBit-CopyPart-; base unit: adc83b19e793491b1c6ea0fd8b46cd9f32e592fc
0xb,0xb,0x3,0x3,
\013\013\003\003
artifact_prefix='./'; Test unit written to ./crash-33a99f8b9e1d7bdb67edef16b34ca98e166e032f
Base64: CwsDAw==
stat::number_of_executed_units: 12
stat::average_exec_per_sec:     0
stat::new_units_added:          0
stat::slowest_unit_time_sec:    0
stat::peak_rss_mb:              37

I can submit a pull request that fixes this issue and potentially simplifies the splitting algorithm. Please let me know if you have any questions or would like to fix this in a specific way :)

biochem-fan commented 1 week ago

Thank you very much for your report.

Could you please show me an example input that causes the problem? It is possible that the function is broken but the issue is never triggered by valid input files.

In general, we do not guarantee the program does not crash when given a corrupted input. We also assume the input files can be trusted. i.e., we cannot guarantee the program is fully guarded and safe against malicious attackers. Certainly this is far from ideal, but given our limited resource and the nature of the program (scientific program used only by experts), we cannot put a high priority on fully validating and fixing existing codes.

Arpan3323 commented 1 week ago

Below are two examples that can trigger this bug:

This example does not use any random inputs provided by libFuzzer
```
#include <cstdint>
#include <cstring>
#include <string>
```

include "src/strings.h"

extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) { if(Size < 4) {return 0;} std::string input = "one,two,three,a,b,c,1,2,3"; std::string delimiter = ","; std::vector results = {"a", "b", "c"}; splitString(input, delimiter, results, true); return 0; }

2. This example is how I originally found the bug. libFuzzer provides me with a byte pointer `Data` and the `Size` of the block that `Data` is pointing to. I use these to construct null-terminated, valid but random strings.
```C++
#include <cstdint>
#include <cstring>
#include <string>

#include "src/strings.h"

extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size)
{
    if(Size < 4) {return 0;}
    std::string input(reinterpret_cast<const char*>(Data), Size);
    std::string delimiter(reinterpret_cast<const char*>(Data), Size*0.5);
    std::string res1(reinterpret_cast<const char*>(Data), Size*0.75);
    std::string res2(reinterpret_cast<const char*>(Data), Size*0.5);
    std::string res3(reinterpret_cast<const char*>(Data), Size*0.25); 
    std::vector<std::string> results = {res1, res2, res3};
    splitString(input, delimiter, results, true);
    return 0;
}

If you would like, I can also explain how you can link and execute the fuzzer.

I think the thing to notice here are these 2 lines from the crash report:

/usr/bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/bits/stl_vector.h:1129:34: runtime error: addition of unsigned offset to 0x5020000003b0 overflowed to 0x5020000003ac SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/bits/stl_vector.h:1129:34
SUMMARY: AddressSanitizer: heap-buffer-overflow strings.cpp:527:16 in splitString(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>>&, bool)

if we look at strings.cpp:527:16 we can justify why we are getting these two lines in the crash report. If we give valid inputs so that checks verifying the size of inputs and number of delimiters in the string, do not exit the function and the final loop that pushes tokens into the results vector is executed, a crash is inevitable because line int offset = positions[i-1] + sizeS2; will do an out-of-bounds lookup in std::vector< int > positions. Below is the loop where problem occurs:

    for (int i = 0; i <= static_cast< int >(positions.size()); i++) // Changing this to: for (int i = 1; i <= static_cast< int >(positions.size()); i++) is one way to avoid this but assignment of s will have to be updated like so if (i == 1) s = input.substr(0, positions[0]);
    {
        std::string s("");
        if (i == 0)
            s = input.substr(i, positions[i]);
        int offset = positions[i-1] + sizeS2; // When i == 0, positions[i-1] == positions[-1] and this index is out-of-bounds   
        if (offset < isize)
        {
            if (i == positions.size())
                s = input.substr(offset);
            else if (i > 0)
                s = input.substr(positions[i-1] + sizeS2,
                                 positions[i] - positions[i-1] - sizeS2);
        }
        if (includeEmpties || s.size() > 0)
            results.push_back(s);
    }

I agree and understand that it is very challenging to maintain legacy code with limited resources. I also want to say that relion is an extremely useful software and plays a key role in helping experts understand the structure of proteins with precision and clarity but I digress. I believe that with some code-review from you, I can propose a fix and potentially fix this issue. Let me know if you have any more questions :)

biochem-fan commented 1 week ago

Thank you very much for the details. Please give me a few days, because I am busy with other things at the moment.

biochem-fan commented 6 days ago

Just an update: I reproduced the issue in valgrind.

biochem-fan commented 6 days ago

Hi, does this fix the issues?

{
        results.clear();
        size_t iPos, newPos;
        size_t sizeS2 = delimiter.size();
        size_t isize = input.size();

        if (isize == 0 || sizeS2 == 0)
                return 0;

        std::vector<size_t> positions;
        newPos = input.find(delimiter);

        // No delimiter
        if (newPos == std::string::npos)
        {
                results.push_back(input);
                return 1;
        }

        int numFound = 0;
        while (newPos != std::string::npos)
        {
                numFound++;
                positions.push_back(newPos);
                iPos = newPos;
                newPos = input.find(delimiter, iPos + sizeS2);
        }

        for (size_t i = 0; i <= numFound; i++)
        {
                std::string s;

                // First element; no delimiter at the beginning
                if (i == 0)
                {
                        s = input.substr(i, positions[i]);
                }
                else
                {
                        int offset = positions[i - 1] + sizeS2;
//                      std::cout << "i = " << i << " positions[i - 1] = " << positions[i - 1] << " offset = " << offset << std::endl;

                        // The last element
                        if (i == numFound)
                                s = input.substr(offset);
                        else
                                s = input.substr(offset, positions[i] - offset);
                }

                if (includeEmpties || s.size() > 0)
                        results.push_back(s);
        }
        return numFound;
}

Arpan3323 commented 6 days ago

Hi, I have checked your latest update of the algorithm and this does fix the original issue of out of bounds access.

I may be understanding this incorrectly but there seems to be an issue with returned value. For example:

if input is: "one", the function returns 1 and there is now 1 token in std::vector<std::string> results
if input is: "one, two", the function still returns 1 but there are now 2 tokens in std::vector<std::string> results
if input is: "one,two,three,a,b,c,1,2,3" the function returns 8 but there are now 9 tokens in std::vector<std::string> results

There seems to be an inconsistency in the returned value. If we consider the return value to be a 0-indexed value, it would be correct only when std::vector<std::string> results has more than one token (ex: returns 3 when there are 4 tokens) but this consistency is broken when there is only one token (returns 1 instead of 0 when there is 1 token).

I think this is because how numFound is being used after we find a delimiter. The simplest fix to this is to: return ++numFound as opposed to return numFound on the last line. This allows the returned value to correctly indicate the number of tokens in std::vector<std::string> results (returns X for X tokens in std::vector<std::string> results). Let me know if you think I am mistaken.

biochem-fan commented 5 days ago

You are correct. This inconsistent behavior is the same as the original (broken) code.

Most of the client codes do not use the return value (assigned to a variable but never referenced) but there is one place this is actually used:

https://github.com/3dem/relion/blob/f2e59d6ec61d3f92df31cebbb7402f1012b17a9e/src/pipeliner.cpp#L817-L826

Apparently the code assumes the return value is the number of tokens (i.e. the same as results.size()). This assumption is wrong. Unless fn_jobids has only one job, the last job is ignored!

@scheres I suggest changing the function to void and updating the pipeliner code to use jobids.size() instead. Do you agree?

biochem-fan commented 5 days ago

In fix-splitString branch, 4e41e0d fixes the memory access bug and d00de11 removes the broken return value and fixes pipeliner. @scheres, please check the latter commit carefully, because it touches several programs.

scheres commented 5 days ago

This looks good to me!

biochem-fan commented 5 days ago

@scheres Thanks for confirmation. I merged this into ver4.0 and ver5.0.

@Arpan3323 Thank you very much again for finding and investigating this bug.

scheres commented 4 days ago

Thanks s lot, Takanori!

3dem / relion

Bugs found in `int splitString(const std::string& input, const std::string& delimiter, ...)` in src/strings.cpp #1203

Crash report

include "src/strings.h"