Closed csmith-rtr closed 1 year ago
@csmith-rtr thx for opening this issue. I ran a quick test using your example program on foxy, galactic and humble/ubuntu22. In all cases a similar invalid pointer stacktrace was observed.
I was able to debug this and determine that our RclTake() function is receiving a RCL_RET_SUBSCRIPTION_INVALID (400) code from the rcl_take() function. We can see in the screenshot that any rcl_take() result code other than OK or SUBSCRIPTION_TAKE_FAILED results in an error being thrown.
I propose we handle the SUBSCRIPTION_INVALID return code similar to SUBSCRIPTION_TAKE_FAILED at line 635 and let it fall thru to return undefined. Thoughts?
update: I made the small change I proposed above to RclTake() and @csmith-rtr example program runs with no failures.
It seems a similar change should be applied to our RclTakeRaw()
function. Our RclTakeRequest()
and RclTakeResponse()
functions are not affected and return undefined
when a rcl returns any return code other than RCL_RET_OK.
I'll open a PR shortly with these changes.
@csmith-rtr I reviewed the 3 scenarios you provided and am able to trace the segfault for node.spin()
to this level atm. A key difference between node.spin()
and node.spinOnce()
is that node.spin()
creates a background uv thread for the node that async handles io. node.spinOnce()
immediately handles available incoming msgs on the main thread. The nextTick
pushes the sub1 deletion and creation outside of the sub2 callback. I suspect if you used a promise in place of nextTick you would also not experience the segfault. Lastly, I need to dig into the rcl code but I seem to recall reading somewhere a warning in RCL to avoid create/destroy entities from within a callback.
I'm curious, do you have a usecase that creates and destroys entities from a callback (e.g., subscription, service, client)?
We have a topic we're subscribing to with some fields that indicate that other resources have either become available or unavailable so we create/destroy subscriptions when that happens. We don't necessarily know ahead of time what topics these resources will publish on. We've had this behavior for a long time but recently added some new topics like this and started noticing the crash - I'm getting around it the same way the example does.
I'm not sure if I've tried this in rclcpp
or other client libraries so I don't know what their behavior is... If you're using an async executor you have no idea when it will necessarily be safe to destroy a subscriber unless you temporarily stop spinning.
@csmith-rtr @wayneparrott sorry for the delay, I will take a look soon, thanks!
Hi @csmith-rtr thanks for your deep investigation of this issue, some initial thoughts
Chatter1 got: Chatter1 0
Chatter2 got: Chatter2 1
Chatter2 got: Chatter2 3
rclnodejs/lib/node.js:191
let success = rclnodejs.rclTake(subscription.handle, message);
^
Error: subscription pointer is invalid, at /tmp/binarydeb/ros-foxy-rcl-1.1.14/src/rcl/subscription.c:461
at rclnodejs/lib/node.js:191:35
at Node._runWithMessageType (rclnodejs/lib/node.js:1627:5)
at rclnodejs/lib/node.js:188:12
at Array.forEach (<anonymous>)
at Node.execute (rclnodejs/lib/node.js:180:24)
at Node.spinOnce (rclnodejs/lib/node.js:440:11)
Per this error, I can confirm that it's caused by the c pointer of this entity having been deleted, but its corresponding JS object doesn't get marked as invalid. We should leverage kind of flag to mark it.
maybe we can
entity.handle = null;
after it's destroyed and validate the handle
before using it? e.g.,
https://github.com/RobotWebTools/rclnodejs/blob/7fd0ea0083271a6d41a8ee5467e44afe1e5f9bd4/lib/node.js#L191
@minggangw good suggestion that should work. Do you have time to submit a PR? If no I would be glad to impl your solution in the next day or 2.
@minggangw I took a few mins to impl your proposed solution. This revealed a 2nd issue, a SEGFAULT that happens here deep in the stack of adding the handle to the array: https://github.com/RobotWebTools/rclnodejs/blob/7fd0ea0083271a6d41a8ee5467e44afe1e5f9bd4/src/shadow_node.cpp#L118
Using the example program above I believe a handle which has already been released() and set to null in JS is in the vector handles
. I suspect it has been GC'ed. I came to this conclusion by implementing the code below on Entity which runs with no SEGFAULT. Alternatively I could be totally off on this issue. I'll open a PR for review and comment asap.
destroy() {
if (this.isDestroyed()) return;
this._isDestroyed = true;
this.handle.release();
// use setTimtout() instead of setImmediate() to allow the handle to be GC'ed
// later in the event-loop
setTimeout(() => {
this._handle = null;
}, 0);
}
isDestroyed() {
return this._isDestroyed;
}
Thanks!
Description Presumably this occurs destroying any entity, but
rclnodejs
may crash if you destroy a subscription. Example below0.21.4
foxy
20.04
I'm using cyclonedds if that's relevantSteps To Reproduce
Expected Behavior
rclnodejs
doesn't crash when destroying entitiesActual Behavior
rclnodejs
may crash if entities are destroyed. My guess is that both subscriptions have data available and get marked as such. In the first callback the second subscription is destroyed. The executor then tries to execute the callback for a subscription that no longer exists.There are a few comments in the code about small tweaks that change the outcome 1) using
node.spin
- this always ends in a segfault for me2) use
spin(node)
and destroy the subscription withoutprocess.nextTick
- this almost immediately crashes with this error3) use
spin(node)
and destroy the subscription inprocess.nextTick
- this never crashes.