RPC replies that exceed the buffer size cause undefined behavior

etremel commented 3 years ago

When RPCManager processes a message, it calls the receive_message function with a function parameter named out_alloc that is responsible for allocating memory for the message's reply. This parameter is always a lambda function that will return either a pointer into a reply buffer (e.g. connections->get_sendbuffer_ptr()), or nullptr if the requested amount of memory exceeds the size of a reply buffer. However, when receive_message uses out_alloc (in its own lambda on line 98) it does not check to see whether out_alloc has returned a valid pointer before doing pointer arithmetic on the result:

https://github.com/Derecho-Project/derecho/blob/358b52191d75b5b84f73228af67c92d20b9694a9/src/core/rpc_manager.cpp#L95-L99

This means that if the RPC function referenced by receiver_function_entry tries to create a reply that exceeds the size of the reply buffer, it will get a pointer into arbitrary memory, which will cause undefined behavior. Hopefully it will cause a segmentation fault fairly quickly, but even if it does, the crash might be hard to diagnose.

We should either change the out_alloc lambdas to throw an exception instead of returning nullptr when they are asked for too large of a buffer, or change the receive_message function (and the inner RPC function that creates the reply, i.e. RemoteInvocable::receive_call) to use out_alloc more defensively and assume it could always return a null pointer. Either way, if a user misconfigures a subgroup to have a max_reply_payload_size that is too small for the replies being generated, they should get a clear error, not undefined behavior. Note that we already do this for the sending half of RPC messages, since the p2p_send and ordered_send functions throw an exception if they are asked to send a message that is larger than the maximum message size.

mpmilano commented 3 years ago

The alloc behavior was specified to match the behavior of malloc/calloc/alloca, which also return null in the OOM case. That being said, users commonly ignore this possibility. I vote switching to an exception.

etremel commented 3 years ago

Yes, the idea that the alloc functions return null on failure makes sense because they are malloc-like; I think I remember you explaining that when we first added them. The problem is that our current usage of out_alloc in RPCManager doesn't check for a null pointer before using the return value. If you agree that there's no obvious penalty to throwing an exception from these functions, and there's no particular reason to stick to the malloc interface, I think it would make sense to switch to using an exception to indicate failure. That way we don't have to keep remembering to check the result for null, which people never do anyway even with malloc.

mpmilano commented 3 years ago

C++ exceptions are cheap so long as you're not expecting anyone to catch them.

etremel commented 3 years ago

I've implemented the exception-throwing fix in the branch fix_message_overflow, where I also fixed a bug I discovered in the out_alloc functions themselves (they did not compare against the right buffer size before deciding to return null/throw an exception). However, it turns out that throwing an exception from these allocator functions does not cause a crash in quite the right place. Since the allocator is called within the receive_call function of RemoteInvocable, the buffer_overflow_exception it throws is caught by this try-catch block:

https://github.com/Derecho-Project/derecho/blob/8738763eb5f237fb7fd141f73a564c744c3b72b9/include/derecho/core/detail/remote_invocable.hpp#L294-L309

This means an empty reply with the "exception bit" set is sent back to the calling node, in lieu of the reply that would have been too big for the buffer. When the calling node receives the reply, it will then get an exception when it tries to read the result of the RPC call (from QueryResults), which will cause that node to crash with remote_exception_occurred if it doesn't handle exceptions when reading from QueryResults.

On one hand, this does eliminate the undefined behavior and create a well-defined error condition when a P2P reply is too big for the configured buffer size. On the other hand, I was hoping the error message I wrote ("Size of a P2P reply exceeds the maximum P2P message size") would be displayed in the crash, and this currently won't happen because the actual exception that occurred on the remote node is not conveyed back to the sending node.

Should I try to create a system for reporting more information about the remote exception back to the sending node? Or perhaps change the try-catch block so that it does not catch a buffer_overflow_exception and allows it to crash the receiving node? Or is this fix good enough?

etremel commented 3 years ago

Update: I ended up choosing the "send information about the remote exception back to the client" option, and it works fine as long as the P2P reply buffer size is not ridiculously small. I also added a configuration "sanity check" that reports an error if the user has configured a P2P reply buffer smaller than 128 bytes. These changes have been tested and work correctly, so I'm going to merge this branch to master.

Derecho-Project / derecho

RPC replies that exceed the buffer size cause undefined behavior #198