HMIProject / open62541

9 stars 7 forks source link

Unsubscribing from server node in UA Expert / UPC UA Client causes segmentation fault #125

Open LeanderGlanda opened 2 months ago

LeanderGlanda commented 2 months ago

I just found a bug, which is either caused by the open62541 server itself or this crate. As the segmentation fault happens in the crate's code, this is probably the right place.

How to reproduce the error?

If you add multiple subscriptions to the node, then only the last removal will cause the segmentation fault.

I already spent a while debugging the issue. The segmentation fault is caused in src/Server.rs > build > destructor_c. The log message will be printed with the node id, and then the segmentation fault happens, likely by the NodeContext::consume(node_context) call.

In the attached log, you can see that before the Rust code runs, the DeleteSubscriptionsRequest gets processed. I looked in the open62541 source and found out, that the memory pointed to by the node pointer gets freed, but it seems that the pointer isn't set to null. This probably causes the Rust code to think that the pointer isn't null and then it tries to free the memory a second time.

Temporary solution: Remove the NodeContext::consume(...). This gets rid of the segmentation fault, but may cause a memory leak somewhere else.

important_part_of_the_log.txt

LeanderGlanda commented 1 month ago

The destructor_c() in src/server.rs->build() seems to be the problem. When debugging, I saw the following: SIGNAL SIGSEGV: ADDRESS NOT MAPPED TO OBJECT (FAULT ADDRESS: 0X0) To reproduce, probably just run an example with a debugger attached.

Just wanted to add this as I just saw it during debugging my own changes.

sgoll commented 1 month ago

Thank you, I was able to reproduce the issue. I'll start working on a fix.

sgoll commented 1 month ago

It seems that our destructor callback is called for unrelated nodes that have node_context set. I was under the impression that only explicitly constructed nodes with a custom node_context (e.g. UA_Server_addDataSourceVariableNode()) get this argument set in the destructor. I'll have to investigate further.

sgoll commented 1 month ago

For reference, I couldn't figure out where the unexpected node_context was coming from. Along with many nodes that are part of the default server node tree (Types/, Views/, Objects/Server/), there are nodes that are created and destroyed when client subscriptions are established and torn down. These carry unexpected node_context but I do not know if there is a way to influence this or identify those nodes some other way.

See https://github.com/open62541/open62541/issues/6598.

mati865 commented 1 month ago

I think I can help you a bit with the cause of this issue. In our case we are creating the server with function UA_Server_newWithConfig and I can see you do it the same way. So the problem is: when you create server's configuration on the stack and call UA_Server_newWithConfig you'd expect config struct to be safe to drop at the end of the scope (that would be quite sensible, wouldn't it?). Then to your surprise a wild segfault occurs when a client opens and closes a connection (at least in our case, here OP apparently has to subscribe and unsubscribe). The problems begin when you call UA_Server_newWithConfig but remain hidden from the sight. This function will perform full copy of the configuration (including pointers) but won't update all the pointers as it can be seen here: https://github.com/open62541/open62541/issues/5778 or https://github.com/open62541/open62541/issues/5675 (there are more cases). Now pair it with the fact the configuration structure contains pointers to itself and you get dangling pointers that turn into use-after-free as soon as you drop config structure from the stack.

In our case (with home-grown bindings to open62541) it was easier to notice because we are using original open62541 logger and I observed logs being present in the code but not in the output, also the crash in our case was happening in the logger. So I started with debugging the logger and found missing logs to be called with context pointing to NULL (UA_Server_newWithConfig zeroes original config after reading it) and the crashing log invocation with context pointer that surely wasn't pointing to logger context.

Once I realised this must be use after free, I allocated the config on the heap and leaked it (Box::leak) and unsurprisingly the server is no longer crashing. Some of the logs that are called with context pointing to the original config allocation are still absent but they weren't important anyway.

I suppose your issue might have similar cause, maybe it's dangling pointer from config as well? In that case leaking the config and reverting https://github.com/HMIProject/open62541/pull/137 shouldn't crash. If it still does that will likely be a different dangling pointer.

sgoll commented 1 month ago

@mati865 Thank you for your input, it is greatly appreciated.

I gave your suggestion a try; I reactivated the call to NodeContext::consume() that was disabled in #137, and I leaked the config onto the heap using ~Box::new()~ Box::leak() before passing it to UA_Server_newWithConfig().

Unfortunately, it did not solve the issue. I had looked deeper into the issue; it turns out we're trying to free pointers that we have not put into the node_context in the first place. This was confirmed in https://github.com/open62541/open62541/issues/6598#issuecomment-2243007802.

So this is not so much a matter of dangling pointers, but trying to use data (and interpret it as a pointer to a data structure) that does not match our expectations.

We will still need to keep an eye open for the missing deep copies in UA_ServerConfig that you mentioned in your post.[^1]

[^1]: For now, we don't use the data structures that are involved, though.

mati865 commented 1 month ago

I leaked the config onto the heap using Box::new()

You probably meant Box::leak() as Box::new() doesn't leak the memory.

Anyway, good to know. I think you might encounter issues with dangling pointers from the config if you configure security and certificates, but I debugged it a long time ago, so I don't remember the details.

sgoll commented 1 month ago

You probably meant Box::leak() as Box::new() doesn't leak the memory.

Yes, obviously. Sorry for the misleading typo 😓