Closed etremel closed 3 years ago
The alloc behavior was specified to match the behavior of malloc/calloc/alloca, which also return null in the OOM case. That being said, users commonly ignore this possibility. I vote switching to an exception.
Yes, the idea that the alloc functions return null on failure makes sense because they are malloc-like; I think I remember you explaining that when we first added them. The problem is that our current usage of out_alloc
in RPCManager doesn't check for a null pointer before using the return value. If you agree that there's no obvious penalty to throwing an exception from these functions, and there's no particular reason to stick to the malloc interface, I think it would make sense to switch to using an exception to indicate failure. That way we don't have to keep remembering to check the result for null, which people never do anyway even with malloc.
C++ exceptions are cheap so long as you're not expecting anyone to catch them.
I've implemented the exception-throwing fix in the branch fix_message_overflow, where I also fixed a bug I discovered in the out_alloc
functions themselves (they did not compare against the right buffer size before deciding to return null/throw an exception). However, it turns out that throwing an exception from these allocator functions does not cause a crash in quite the right place. Since the allocator is called within the receive_call
function of RemoteInvocable, the buffer_overflow_exception it throws is caught by this try-catch block:
This means an empty reply with the "exception bit" set is sent back to the calling node, in lieu of the reply that would have been too big for the buffer. When the calling node receives the reply, it will then get an exception when it tries to read the result of the RPC call (from QueryResults), which will cause that node to crash with remote_exception_occurred if it doesn't handle exceptions when reading from QueryResults.
On one hand, this does eliminate the undefined behavior and create a well-defined error condition when a P2P reply is too big for the configured buffer size. On the other hand, I was hoping the error message I wrote ("Size of a P2P reply exceeds the maximum P2P message size") would be displayed in the crash, and this currently won't happen because the actual exception that occurred on the remote node is not conveyed back to the sending node.
Should I try to create a system for reporting more information about the remote exception back to the sending node? Or perhaps change the try-catch block so that it does not catch a buffer_overflow_exception and allows it to crash the receiving node? Or is this fix good enough?
Update: I ended up choosing the "send information about the remote exception back to the client" option, and it works fine as long as the P2P reply buffer size is not ridiculously small. I also added a configuration "sanity check" that reports an error if the user has configured a P2P reply buffer smaller than 128 bytes. These changes have been tested and work correctly, so I'm going to merge this branch to master.
When RPCManager processes a message, it calls the
receive_message
function with a function parameter namedout_alloc
that is responsible for allocating memory for the message's reply. This parameter is always a lambda function that will return either a pointer into a reply buffer (e.g.connections->get_sendbuffer_ptr()
), ornullptr
if the requested amount of memory exceeds the size of a reply buffer. However, whenreceive_message
usesout_alloc
(in its own lambda on line 98) it does not check to see whetherout_alloc
has returned a valid pointer before doing pointer arithmetic on the result:https://github.com/Derecho-Project/derecho/blob/358b52191d75b5b84f73228af67c92d20b9694a9/src/core/rpc_manager.cpp#L95-L99
This means that if the RPC function referenced by
receiver_function_entry
tries to create a reply that exceeds the size of the reply buffer, it will get a pointer into arbitrary memory, which will cause undefined behavior. Hopefully it will cause a segmentation fault fairly quickly, but even if it does, the crash might be hard to diagnose.We should either change the
out_alloc
lambdas to throw an exception instead of returningnullptr
when they are asked for too large of a buffer, or change thereceive_message
function (and the inner RPC function that creates the reply, i.e.RemoteInvocable::receive_call
) to useout_alloc
more defensively and assume it could always return a null pointer. Either way, if a user misconfigures a subgroup to have amax_reply_payload_size
that is too small for the replies being generated, they should get a clear error, not undefined behavior. Note that we already do this for the sending half of RPC messages, since thep2p_send
andordered_send
functions throw an exception if they are asked to send a message that is larger than the maximum message size.