Behavior Driven Testing Support for verification of production ROS nodes

Ryanf55 commented 10 months ago

ROS 2 Production Task Proposal

Proposal Description:

This proposal is to support testing ROS nodes with behavior-driven testing following behavior-driven development practices.

Essentially, starting from system and integration requirements, you can write gherkin-syntax tests for how a ROS node is supposed to behave. These can be significantly easier to write and develop than current methods, and are much easier for systems engineers to understand that don't want to get lost in the syntax of pytest or gtest.

A good background on this is here.

The scope of the proposal includes:

Perform a down-select on which library to use as the BDD runner
- behave
- pytest-bdd-ng
- others?
Writing a colcon extension to allow colcon test to drive the BDD runner
- Automatic feature discovery
- Generate XML reports
- Supports composition/re-use of test fixtures
Creating or add to an example repository:
- Contains a simple ROS package and node
- Includes a few Gherkin-syntax tests
- Uses colcon to run the tests
- Includes a Gitlab CI that runs the tests and reports results on PR's
- Adding a tutorial to the ROS Tutorials->Intermediate->Testing page on using BDD testing
Example Gherkin Syntax

For the add_two_ints_server.cpp that adds two integers with the AddTwoInts.srv inteface.
- GIVEN the add_two_ints "ServerNode" is running
- WHEN an AddTwoInts request with contents {a: 2, b: 3} on the add_two_ints topic is requested
- THEN an AddTwoInts response is expected with contents {sum: 5} within timeout 1 second(s)
Example colcon syntax

colcon test --packages-select demo_nodes_cpp

Estimated Effort:

2-3 weeks
- Downselect - 3 days
- Colcon extension - 1 week?
- Example Repository - 2 days

Area of Impact:

Testing, Docs, CI/CD

Related Works

Ryanf55 commented 10 months ago

Feedback from the production working group

We should add scope for contributing tests to upstream packages
There may be C++ runners too, could these be explored?
From Dogan: This still doesn't solve the problem of gherkin syntax not being a good mapping. Perhaps this is a good scope for now, but we should also investigate other approaches:
- https://github.com/Copilot-Language/copilot
- https://github.com/NASA-SW-VnV/fret
From Nacho: We should be careful selecting a tool and make sure it's the right one. Think back to python linter selection and how many of the linters are now unmaintained or legacy.
From Alberto: Getting it merged into mainline colcon will take community adoption, especially to be part of colcon-common-extensions. For now, it can be managed in its own repo and exposed to the community in ROS Discourse to get better options.

Next steps:

Let this sit on GitHub for a week to get further input here, then make a ROS discourse post to gather more input.
Research the alternative approaches to gherkin to see if's worthwhile to just drop this and use a different format.
Get resource(s) to develop this and a timeline approved.

doganulus commented 10 months ago

I like to discuss this issue/proposal more generally under the title of declarative and executable test specifications. Gherkin syntax and tools like Cucumber/Behave/SpecFlow are niche examples for that but are mainly developed mostly for web/mobile applications. Robotic applications require more complex specification languages.

Consider an abstract example in Gherkin syntax

GIVEN g1
GIVEN g2
WHEN w1
THEN t1

This scenario essentially express the Boolean logic formula g1 && g2 && w1 -> t1 where g1, g2, w1, and t1 are predicates (boolean-valued functions). Nice and more readable but hardly a significant technical advantage over standard unit testing. These predicates still need to be defined in code, called step executors. The tool should execute these steps and check the example. There are a few points when bringing these concepts into robotics.

First, the concept of time. Originally and widely used in web applications, the step execution assumes a very sparse and often event-based interaction with the environment. However, many robotic applications require much denser interaction, especially at lower levels. This means that all Gherkin predicates, such as GIVEN ServerNode is running, should ideally be checked at every single point over the timeline. Existing BDD executors are not designed for such temporal use cases. At least, I am not aware of any mature implementation doing that. So the checkers must be aware of the temporal nature and must be able to operate on data streams. For ROS, this may mean that the executor should be implemented as an observer/monitor/checker ROS node from the specification. We study these applications under the field of (specification-based) runtime verification. Gherkin-like specifications can be useful in principle but should be at least enriched with temporal/timing constraints. The example in the original post already contains a within <time-interval> keyword. When the interaction is dense, timers to check such constraints become too complicated and inefficient.
Second, a natural language may be too ambiguous to specify complex real-time properties. It works adequately for web/mobile apps, but it is harder to explain robotic system requirements in unrestricted natural languages. Therefore, safety standards like ISO26262 usually suggest the use of structured English, semi-formal, and/or formal specifications (thus logical formulas). I like Gherkin's structured way, and we should ensure that any specification can be translated into an equivalent logical formula. These formulas must be defined over atomic predicates (for ROS, topic names are atomic functions). So I think GIVEN /turtle/server_node/ is "RUNNING" may be better and more automatizable compared to proper English grammar.
Third, example-based tests (input-output test pairs) are never enough for complex real-time systems. Normally, BDD or Gherkin do not dictate example-based tests, but that's the most common case in practice. A more beneficial approach for complex systems is specification-based tests (defining rules between input and output). This approach is also called property-based testing and is implemented in some tools like QuickCheck/Rapidcheck/Hypothesis, but these tools also lack the temporal aspect, as explained previously.

For the next steps, I first would like to collect behavioral specifications from real industrial studies. These specifications can be in plain English. Companies are sometimes reluctant to share their system requirements for various reasons. Still, building a repository of example robotic requirements would be great, especially at the system level.

Ryanf55 commented 8 months ago

For the next steps, I first would like to collect behavioral specifications from real industrial studies. These specifications can be in plain English.

Great ideas here on an alternative approach. Were you willing and able to collect some of these industrial studies and share them here?

For reference, we are currently using python behave for a couple of hundred test steps and it's working ok.

doganulus commented 8 months ago

I always wished to collect more actual system requirements from the industry as collecting diverse requirements is helpful for tool developers. At the basic level, these requirements should look like such sentences:

If the value in /namespace/topic_name is greater than 0, then /namespace/another_topic_name will be greater than 5 within 10 milliseconds.
Given /namespace/topic_name equals to ENUM_VALUE and /namespace/another_topic_name is greater than 20, then /namespace/topic3 is always true.

Also I wonder how you would compare a property-based testing tool like Hypothesis with your Behave use cases?

bpwilcox / ros2-production-working-group