Closed Arpan3323 closed 5 days ago
Thank you very much for your report.
Could you please show me an example input that causes the problem? It is possible that the function is broken but the issue is never triggered by valid input files.
In general, we do not guarantee that the program will not crash when given corrupted input. We also assume that input files can be trusted; i.e., we cannot guarantee that the program is fully guarded against malicious attackers. This is certainly far from ideal, but given our limited resources and the nature of the program (a scientific program used only by experts), we cannot put a high priority on fully validating and fixing existing code.
Below are two examples that can trigger this bug:
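1. This example calls the function with a fixed, well-formed input.
```C++
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>
#include "src/strings.h"
extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size)
{
    if(Size < 4) {return 0;}
    std::string input = "one,two,three,a,b,c,1,2,3";
    std::string delimiter = ",";
    // Assumed remainder (the snippet is truncated after `std::vector`):
    // the same call as in example 2.
    std::vector<std::string> results;
    splitString(input, delimiter, results, true);
    return 0;
}
```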
2. This example is how I originally found the bug. libFuzzer provides me with a byte pointer `Data` and the `Size` of the block that `Data` is pointing to. I use these to construct null-terminated, valid but random strings.
```C++
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>
#include "src/strings.h"
extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size)
{
    if(Size < 4) {return 0;}
    // Build the input, the delimiter and pre-existing results from the fuzzer bytes.
    std::string input(reinterpret_cast<const char*>(Data), Size);
    std::string delimiter(reinterpret_cast<const char*>(Data), Size*0.5);
    std::string res1(reinterpret_cast<const char*>(Data), Size*0.75);
    std::string res2(reinterpret_cast<const char*>(Data), Size*0.5);
    std::string res3(reinterpret_cast<const char*>(Data), Size*0.25);
    std::vector<std::string> results = {res1, res2, res3};
    splitString(input, delimiter, results, true);
    return 0;
}
```
If you would like, I can also explain how you can link and execute the fuzzer.
I think the things to notice here are these two lines from the crash report:
/usr/bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/bits/stl_vector.h:1129:34: runtime error: addition of unsigned offset to 0x5020000003b0 overflowed to 0x5020000003ac
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/bits/stl_vector.h:1129:34
SUMMARY: AddressSanitizer: heap-buffer-overflow strings.cpp:527:16 in splitString(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>>&, bool)
If we look at `strings.cpp:527:16`, we can see why these two lines appear in the crash report. If the inputs are valid, so that the checks on the input size and the number of delimiters do not exit the function early, the final loop that pushes tokens into the `results` vector is executed. A crash is then inevitable, because the line `int offset = positions[i-1] + sizeS2;` performs an out-of-bounds lookup in `std::vector<int> positions`. Below is the loop where the problem occurs:
```C++
// One way to avoid this is to change the loop to:
//   for (int i = 1; i <= static_cast< int >(positions.size()); i++)
// but then the assignment of s will have to be updated like so:
//   if (i == 1) s = input.substr(0, positions[0]);
for (int i = 0; i <= static_cast< int >(positions.size()); i++)
{
    std::string s("");
    if (i == 0)
        s = input.substr(i, positions[i]);
    int offset = positions[i-1] + sizeS2; // When i == 0, positions[i-1] == positions[-1] and this index is out-of-bounds
    if (offset < isize)
    {
        if (i == positions.size())
            s = input.substr(offset);
        else if (i > 0)
            s = input.substr(positions[i-1] + sizeS2,
                             positions[i] - positions[i-1] - sizeS2);
    }
    if (includeEmpties || s.size() > 0)
        results.push_back(s);
}
```
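To make the failure mode concrete, here is a minimal sketch of what the index `i - 1` turns into when `i == 0`. The helper `previousIndexIsValid` is hypothetical and not part of RELION; it only computes the index that `positions[i-1]` would use, without performing the access.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of RELION): reports whether positions[i - 1]
// would be a valid access, without actually performing it.
bool previousIndexIsValid(const std::vector<int>& positions, int i)
{
    assert(i >= 0); // the loop in question only produces non-negative i

    // operator[] takes an unsigned size_type, so -1 is converted the same way here.
    const std::size_t index = static_cast<std::size_t>(i - 1);

    // For i == 0 the index wraps around to the largest size_t value,
    // which is always >= positions.size().
    return index < positions.size();
}
```

With two stored positions, the check fails for `i == 0` and succeeds for `i == 1` and `i == 2`, which matches the iteration in which the sanitizers report the overflow.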
I agree and understand that it is very challenging to maintain legacy code with limited resources. I also want to say that RELION is extremely useful software and plays a key role in helping experts understand the structure of proteins with precision and clarity, but I digress. I believe that, with some code review from you, I can propose a fix for this issue. Let me know if you have any more questions :)
Thank you very much for the details. Please give me a few days, because I am busy with other things at the moment.
Just an update: I reproduced the issue in `valgrind`.
Hi, does this fix the issues?
```C++
{
    results.clear();
    size_t iPos, newPos;
    size_t sizeS2 = delimiter.size();
    size_t isize = input.size();
    if (isize == 0 || sizeS2 == 0)
        return 0;
    std::vector<size_t> positions;
    newPos = input.find(delimiter);
    // No delimiter
    if (newPos == std::string::npos)
    {
        results.push_back(input);
        return 1;
    }
    int numFound = 0;
    while (newPos != std::string::npos)
    {
        numFound++;
        positions.push_back(newPos);
        iPos = newPos;
        newPos = input.find(delimiter, iPos + sizeS2);
    }
    for (size_t i = 0; i <= numFound; i++)
    {
        std::string s;
        // First element; no delimiter at the beginning
        if (i == 0)
        {
            s = input.substr(i, positions[i]);
        }
        else
        {
            int offset = positions[i - 1] + sizeS2;
            // std::cout << "i = " << i << " positions[i - 1] = " << positions[i - 1] << " offset = " << offset << std::endl;
            // The last element
            if (i == numFound)
                s = input.substr(offset);
            else
                s = input.substr(offset, positions[i] - offset);
        }
        if (includeEmpties || s.size() > 0)
            results.push_back(s);
    }
    return numFound;
}
```
Hi, I have checked your latest update of the algorithm, and it does fix the original out-of-bounds access.
I may be misunderstanding this, but there seems to be an issue with the returned value. For example:
- For `"one"`, the function returns `1` and there is now 1 token in `std::vector<std::string> results`.
- For `"one, two"`, the function still returns `1` but there are now 2 tokens in `std::vector<std::string> results`.
- For `"one,two,three,a,b,c,1,2,3"`, the function returns `8` but there are now 9 tokens in `std::vector<std::string> results`.

There seems to be an inconsistency in the returned value. If we treat the return value as 0-indexed, it is correct only when `std::vector<std::string> results` has more than one token (e.g., it returns `3` when there are 4 tokens), but this consistency is broken when there is only one token (it returns `1` instead of `0` when there is 1 token).
I think this is because of how `numFound` is used after we find a delimiter. The simplest fix is to `return ++numFound` instead of `return numFound` on the last line. This allows the returned value to correctly indicate the number of tokens in `std::vector<std::string> results` (it returns X for X tokens in `std::vector<std::string> results`). Let me know if you think I am mistaken.
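The discrepancy described above can be reproduced with a small standalone sketch. The function name `splitStringSketch` and its signature are assumptions chosen for illustration; the body follows the logic of the posted fix (count delimiters, then cut the input at each recorded position), not necessarily the exact code in the repository.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Sketch mirroring the posted fix; name and signature are assumptions.
int splitStringSketch(const std::string& input, const std::string& delimiter,
                      std::vector<std::string>& results, bool includeEmpties)
{
    results.clear();
    if (input.empty() || delimiter.empty())
        return 0;

    // Record the start position of every delimiter occurrence.
    std::vector<std::size_t> positions;
    for (std::size_t pos = input.find(delimiter); pos != std::string::npos;
         pos = input.find(delimiter, pos + delimiter.size()))
        positions.push_back(pos);

    if (positions.empty())
    {
        results.push_back(input);
        return 1; // one token, zero delimiters
    }

    // N delimiters delimit N + 1 tokens (some of which may be empty).
    std::size_t start = 0;
    for (std::size_t i = 0; i <= positions.size(); i++)
    {
        assert(start <= input.size());
        const std::size_t end = (i < positions.size()) ? positions[i] : input.size();
        const std::string s = input.substr(start, end - start);
        if (includeEmpties || !s.empty())
            results.push_back(s);
        start = end + delimiter.size();
    }
    return static_cast<int>(positions.size()); // delimiter count, as in the posted fix
}
```

Calling it with `","` as the delimiter and `includeEmpties == true` gives a return value of 1 with 1 token for `"one"`, 1 with 2 tokens for `"one, two"`, and 8 with 9 tokens for `"one,two,three,a,b,c,1,2,3"`.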
You are correct. This inconsistent behavior is the same as the original (broken) code.
Most client code does not use the return value (it is assigned to a variable but never referenced), but there is one place where it is actually used:
Apparently the code assumes the return value is the number of tokens (i.e. the same as `results.size()`). This assumption is wrong. Unless `fn_jobids` has only one job, the last job is ignored!
@scheres I suggest changing the function to `void` and updating the pipeliner code to use `jobids.size()` instead. Do you agree?
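As a rough illustration of that suggestion, the sketch below shows a `void` splitter whose callers rely only on the size of the output vector. The names `splitStringNoReturn` and `lastToken`, the space delimiter, and the job-id strings in the usage note are all hypothetical; the real pipeliner code is not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical void variant: fills `results` and returns nothing.
void splitStringNoReturn(const std::string& input, const std::string& delimiter,
                         std::vector<std::string>& results, bool includeEmpties)
{
    results.clear();
    if (input.empty() || delimiter.empty())
        return;

    std::size_t start = 0;
    while (true)
    {
        assert(start <= input.size());
        const std::size_t pos = input.find(delimiter, start);
        // Take everything up to the next delimiter, or the rest of the input.
        const std::size_t len =
            (pos == std::string::npos) ? std::string::npos : pos - start;
        const std::string token = input.substr(start, len);
        if (includeEmpties || !token.empty())
            results.push_back(token);
        if (pos == std::string::npos)
            break;
        start = pos + delimiter.size();
    }
}

// Hypothetical caller: the loop bound comes from size(), so the last
// token is visited as well.
std::string lastToken(const std::string& joined)
{
    std::vector<std::string> ids;
    splitStringNoReturn(joined, " ", ids, false);
    std::string last;
    for (std::size_t i = 0; i < ids.size(); i++)
        last = ids[i];
    return last;
}
```

For a made-up input such as `"job001 job002 job003"`, the vector holds three entries and the loop reaches `"job003"`. Using `size()` also stays correct when `includeEmpties` is false and empty tokens are dropped, a case where a delimiter count would not match the number of stored tokens.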
In the `fix-splitString` branch, 4e41e0d fixes the memory access bug and d00de11 removes the broken return value and fixes the pipeliner. @scheres, please check the latter commit carefully, because it touches several programs.
This looks good to me!
@scheres Thanks for the confirmation. I merged this into `ver4.0` and `ver5.0`.
@Arpan3323 Thank you very much again for finding and investigating this bug.
Thanks a lot, Takanori!
Hi, I have confirmed the existence of an `addition of unsigned offset` runtime error at this line (527): `int offset = positions[i-1] + sizeS2;` due to an invalid index when `i` is 0. There are a few redundancies in this function as well. https://github.com/3dem/relion/blob/f2e59d6ec61d3f92df31cebbb7402f1012b17a9e/src/strings.cpp#L487
Line 504: `if (newPos < 0)` could also be problematic, and a better approach may be to use `std::string::npos`.
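The concern about `newPos < 0` can be sketched as follows. `std::string::find` returns an unsigned `size_type`, so if the result is kept in an unsigned variable the comparison `< 0` can never be true, and if it is stored in an `int` the check depends on how `npos` narrows. The type of `newPos` in the original function is not shown here, so which case applies is an assumption; the helper `containsDelimiter` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

// find() reports "not found" through an unsigned type, never a negative value.
static_assert(std::is_unsigned<std::string::size_type>::value,
              "std::string::size_type is unsigned");

// Hypothetical helper: tests the result of find() against npos directly.
bool containsDelimiter(const std::string& input, const std::string& delimiter)
{
    assert(!delimiter.empty()); // precondition chosen for this sketch
    const std::string::size_type newPos = input.find(delimiter);
    return newPos != std::string::npos;
}
```

Comparing against `std::string::npos` keeps the check independent of the integer type used to hold the position.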
Crash report
Below is the crash report generated by fuzz testing this function in its current state:
I can submit a pull request that fixes this issue and potentially simplifies the splitting algorithm. Please let me know if you have any questions or would like to fix this in a specific way :)