ICube-Robotics / ethercat_driver_ros2

Hardware Interface for EtherCAT module integration with ros2_control
https://icube-robotics.github.io/ethercat_driver_ros2/
Apache License 2.0
125 stars 32 forks source link

Fix Startup Crash Caused by Read() and Write() on Un-activated Ethercat #120

Closed tony-laza closed 1 month ago

tony-laza commented 2 months ago

On our system (multiple CiA 402 motor drivers), we have found there are frequent crashes (at least 25% of the time) at startup caused by accesses to the domain_info_ map before it is populated. We traced this to ros2_control calling the read() and write() functions on the ethercat interface before on_activate() completed. This PR is our fix for this, which applies:

  1. Implement a lock and introduce the activated_ variable into the hardware interface class, preventing access before activation.
  2. Have the lower-level activate() function return a boolean false if it fails to succeed, and propagate this back to ros2_control.
  3. Use at() instead of direct access for the domain_info_ map so it will throw an exception on out-of-bounds access instead of undefined behavior.

I'm happy to re-write this as the team sees fit, or drop this if it is unhelpful.

yguel commented 1 month ago

Many thanks for your report and your PR. At first read, your fixes sound good. I am currently reviewing the PR. My plan is to devise a unit test that reproduces the problem to validate the solution.

tony-laza commented 1 month ago

I have looked into writing a test for this, I think it would have to be hardware in the loop. There doesn't seem to be existing software-only tests that start a hardware_interface. I have started looking into if this is possible with the etherlab codebase.

yguel commented 1 month ago

I have looked into writing a test for this, I think it would have to be hardware in the loop. There doesn't seem to be existing software-only tests that start a hardware_interface. I have started looking into if this is possible with the etherlab codebase.

We have work in progress towards unit testing:

  1. a real hardware platform (we can expect it running tests during summer)
  2. a mock ethercat master to simulate data exchange (we can expect it ok at end of may)
  3. a simulation of the ethernet card (can test the interactions with the ethercat master, but not the ethernet communications)
  4. a simulation of ethercat slaves (we have no realistic schedule yet, in particular we have no good solution yet to simulate an ethercat slave) that will run tests as part of the CI/CD.

But I was not counting on those to create the unit test I had in mind. I am planning to mock anything to make the ethercat_driver achieve «inactive» state (pass on_init() and then configure()) and then call read, write, prepare_, perform_, etc. on it. That is before the activate() call is performed. It will obviously fail at the current state of the code since the ethercat communication would not be enabled (that is also why this test does not really require hardware). Unfortunately, this will not really testing your case, however it should pass with your PR.

In the meantime, I will merge your PR (you can expect the merge to be done by the end of the week).

The problem, you had, shows however a bigger problem of our design. It shows that our design does not follow the hardware interface lifecycle (https://control.ros.org/master/_images/hardware_interface_lifecycle.png)

Read and write calls are legit after the on_configure stage, just «no movement» should happen. Since for the moment we have neither the concept of a «motion enabled» nor the concept of a «motion disabled» state for our ethercat slaves, it seemed more «simple» to setup the ethercat communication in the activate() method, but we should change that.

Our proposal is to have something that emulate a write_inactive() and a write_active() method for each slave that will be called by the write() method according to the hardware interface state.

For the generic slave interface, we could add a new parameter to the channels providing values for the writes in the inactive mode:

As it will break the existing mode of operation, however being more respectful of the ros2 control specifications, we will increment the major version of the lib, making it an api breaking change.