RobotWebTools / rclnodejs

Node.js version of ROS 2.0 client
https://docs.ros.org/en/jazzy/Concepts/Basic/About-Client-Libraries.html#community-maintained
Apache License 2.0
335 stars 72 forks source link

Crash when destroying subscription #896

Closed csmith-rtr closed 1 year ago

csmith-rtr commented 1 year ago

Description Presumably this occurs destroying any entity, but rclnodejs may crash if you destroy a subscription. Example below

Steps To Reproduce

const rclnodejs = require('rclnodejs');

function createSub1(node) {
  return node.createSubscription('std_msgs/msg/String', '/chatter1', msg => {
    console.log(`Chatter1 got: ${msg.data}`);
  });
}

function createSub1And2(node) {
  let sub1 = createSub1(node);

  let sub2 = node.createSubscription(
    'std_msgs/msg/String',
    '/chatter2',
    msg => {
      console.log(`Chatter2 got: ${msg.data}`);
      // process.nextTick(() => {
        // uncommenting the nextTick call (while still destroying/recreating) may change the behavior
        node.destroySubscription(sub1);
        sub1 = createSub1(node);
      // });
    },
  );
}

function spin(node) {
  setInterval(() => {
    node.spinOnce();
  }, 1);
}

async function main() {
  await rclnodejs.init();
  const node = new rclnodejs.Node('test_node');
  // node.spin();  // changing how spin is implemented changes the error
  spin(node);

  createSub1And2(node);

  const pub1 = node.createPublisher('std_msgs/msg/String', '/chatter1');
  const pub2 = node.createPublisher('std_msgs/msg/String', '/chatter2');

  let i = 0;
  setInterval(() => {
    pub1.publish({ data: 'Chatter1 ' + i++ });
    pub2.publish({ data: 'Chatter2 ' + i++ });
  }, 1);
}

main();

Expected Behavior rclnodejs doesn't crash when destroying entities

Actual Behavior rclnodejs may crash if entities are destroyed. My guess is that both subscriptions have data available and get marked as such. In the first callback the second subscription is destroyed. The executor then tries to execute the callback for a subscription that no longer exists.

There are a few comments in the code about small tweaks that change the outcome 1) using node.spin - this always ends in a segfault for me

// ... more message logs above
Chatter2 got: Chatter2 167
Chatter2 got: Chatter2 169
Chatter2 got: Chatter2 171
Segmentation fault (core dumped)

2) use spin(node) and destroy the subscription without process.nextTick- this almost immediately crashes with this error

Chatter1 got: Chatter1 0
Chatter2 got: Chatter2 1
Chatter2 got: Chatter2 3
rclnodejs/lib/node.js:191
          let success = rclnodejs.rclTake(subscription.handle, message);
                                  ^

Error: subscription pointer is invalid, at /tmp/binarydeb/ros-foxy-rcl-1.1.14/src/rcl/subscription.c:461
    at rclnodejs/lib/node.js:191:35
    at Node._runWithMessageType (rclnodejs/lib/node.js:1627:5)
    at rclnodejs/lib/node.js:188:12
    at Array.forEach (<anonymous>)
    at Node.execute (rclnodejs/lib/node.js:180:24)
    at Node.spinOnce (rclnodejs/lib/node.js:440:11)

3) use spin(node) and destroy the subscription in process.nextTick - this never crashes.

wayneparrott commented 1 year ago

@csmith-rtr thx for opening this issue. I ran a quick test using your example program on foxy, galactic and humble/ubuntu22. In all cases a similar invalid pointer stacktrace was observed.

wayneparrott commented 1 year ago

I was able to debug this and determine that our RclTake() function is receiving a RCL_RET_SUBSCRIPTION_INVALID (400) code from the rcl_take() function. We can see in the screenshot that any rcl_take() result code other than OK or SUBSCRIPTION_TAKE_FAILED results in an error being thrown.

I propose we handle the SUBSCRIPTION_INVALID return code similar to SUBSCRIPTION_TAKE_FAILED at line 635 and let it fall thru to return undefined. Thoughts?

image

wayneparrott commented 1 year ago

update: I made the small change I proposed above to RclTake() and @csmith-rtr example program runs with no failures. It seems a similar change should be applied to our RclTakeRaw() function. Our RclTakeRequest() and RclTakeResponse() functions are not affected and return undefined when a rcl returns any return code other than RCL_RET_OK.

I'll open a PR shortly with these changes.

wayneparrott commented 1 year ago

@csmith-rtr I reviewed the 3 scenarios you provided and am able to trace the segfault for node.spin() to this level atm. A key difference between node.spin() and node.spinOnce() is that node.spin() creates a background uv thread for the node that async handles io. node.spinOnce() immediately handles available incoming msgs on the main thread. The nextTick pushes the sub1 deletion and creation outside of the sub2 callback. I suspect if you used a promise in place of nextTick you would also not experience the segfault. Lastly, I need to dig into the rcl code but I seem to recall reading somewhere a warning in RCL to avoid create/destroy entities from within a callback.

I'm curious, do you have a usecase that creates and destroys entities from a callback (e.g., subscription, service, client)?

image

csmith-rtr commented 1 year ago

We have a topic we're subscribing to with some fields that indicate that other resources have either become available or unavailable so we create/destroy subscriptions when that happens. We don't necessarily know ahead of time what topics these resources will publish on. We've had this behavior for a long time but recently added some new topics like this and started noticing the crash - I'm getting around it the same way the example does.

I'm not sure if I've tried this in rclcpp or other client libraries so I don't know what their behavior is... If you're using an async executor you have no idea when it will necessarily be safe to destroy a subscriber unless you temporarily stop spinning.

minggangw commented 1 year ago

@csmith-rtr @wayneparrott sorry for the delay, I will take a look soon, thanks!

minggangw commented 1 year ago

Hi @csmith-rtr thanks for your deep investigation of this issue, some initial thoughts

Chatter1 got: Chatter1 0
Chatter2 got: Chatter2 1
Chatter2 got: Chatter2 3
rclnodejs/lib/node.js:191
          let success = rclnodejs.rclTake(subscription.handle, message);
                                  ^

Error: subscription pointer is invalid, at /tmp/binarydeb/ros-foxy-rcl-1.1.14/src/rcl/subscription.c:461
    at rclnodejs/lib/node.js:191:35
    at Node._runWithMessageType (rclnodejs/lib/node.js:1627:5)
    at rclnodejs/lib/node.js:188:12
    at Array.forEach (<anonymous>)
    at Node.execute (rclnodejs/lib/node.js:180:24)
    at Node.spinOnce (rclnodejs/lib/node.js:440:11)

Per this error, I can confirm that it's caused by the c pointer of this entity having been deleted, but its corresponding JS object doesn't get marked as invalid. We should leverage kind of flag to mark it.

minggangw commented 1 year ago

https://github.com/RobotWebTools/rclnodejs/blob/7fd0ea0083271a6d41a8ee5467e44afe1e5f9bd4/lib/node.js#L450-L455

maybe we can

entity.handle = null;

after it's destroyed and validate the handle before using it? e.g., https://github.com/RobotWebTools/rclnodejs/blob/7fd0ea0083271a6d41a8ee5467e44afe1e5f9bd4/lib/node.js#L191

wayneparrott commented 1 year ago

@minggangw good suggestion that should work. Do you have time to submit a PR? If no I would be glad to impl your solution in the next day or 2.

wayneparrott commented 1 year ago

@minggangw I took a few mins to impl your proposed solution. This revealed a 2nd issue, a SEGFAULT that happens here deep in the stack of adding the handle to the array: https://github.com/RobotWebTools/rclnodejs/blob/7fd0ea0083271a6d41a8ee5467e44afe1e5f9bd4/src/shadow_node.cpp#L118 Using the example program above I believe a handle which has already been released() and set to null in JS is in the vector handles. I suspect it has been GC'ed. I came to this conclusion by implementing the code below on Entity which runs with no SEGFAULT. Alternatively I could be totally off on this issue. I'll open a PR for review and comment asap.

destroy() {
    if (this.isDestroyed()) return;

    this._isDestroyed = true;
    this.handle.release();
    // use setTimtout() instead of setImmediate() to allow the handle to be GC'ed
    // later in the event-loop
    setTimeout(() => { 
      this._handle = null;
    }, 0);
  }

  isDestroyed() {
    return this._isDestroyed;
  }
csmith-rtr commented 1 year ago

Thanks!