Better handling of RNodeState::Running

samouwow commented 4 months ago

Kudos on the work you've done here, this library has heaps of really useful features.

I've run into a pretty major roadblock that I've traced back to how the library handles RNodeState::Running during the ticking of both decorator and flow node types, and ultimately leaf nodes too.

To the best of my investigation abilities, the root causes seem to boil down to:

Decorators don't consider their previous state when preparing themselves in the Ready state
Sequences and Fallbacks don't reset their P_CURSOR when returning a non-Running state
Reactive sequences and fallbacks don't alter their children if they're interrupted while still running

The below sections outline some of the specific issues these present

I'm imaging this issue could be used to track a number of PRs, which I'll pre-emptively summarise here:

67
- [x] timeout doesn't reset on running
- [x] timeout uses milliseconds
- [x] delay doesn't reset on running
- [x] retry and repeat reset when not returning running
- [x] sequence and fallback reset P_CURSOR when returning not-running
- [x] Partially addresses Investigate running handling of m_sequence and parallel (m_sequence updated)
69
- [x] timeout resets when halted
- [x] delay resets when halted
- [x] retry and repeat reset when halted
- [x] sequence and fallback reset P_CURSOR when halted
- [x] Investigate running handling of m_sequence
- [x] Implement halting in the main tick logic
- [x] Update sync actions to include a halt function
Not yet address by PR
- [ ] delay doesn't block reactive checks
- [ ] Investigate running handling of parallel
- [ ] Update async actions to include a halt function
- [ ] Update remote actions to include a halt function

Decorators

Timeout

Timeout recalculates the timeout system clock stamp every time prepare() is called, which means unless the timeout is 0 or 1, so that it fails immediately (i.e. curr - start >= timeout is true), it will never fail.

This can be seen in the below example tree. success() should be called twice, but instead is called 5 times, as timeout never expires.

root main {
    timeout(2) repeat(5) delay(1000) success()
}

I'd propose timeout considered its return value for the previous tick, and if that was RUNNING it didn't recalculate the timeout system clock stamp. The exception to this would also be to reset the timestamp if it was interrupted, see the section on reactive flow below.

On an unrelated note, the documentation for timeout says it uses milliseconds, but it actually uses seconds. My personal vote would be to update it to use milliseconds, since seconds are a little coarse.

Delay

Delay works during the prepare() function as well, but doesn't consider its previous value, so it runs every time it's visited.

This can be seen in the below example, where the tree should take about 1 second to complete, but instead takes 5.

root main {
    delay(1000) retry(5) fail_empty()
}

I'd propose that delay doesn't do anything in prepare() if it previously returned RUNNING. It should also reset if halted, see below.

Additionally, delay is implemented with a thread sleep, which blocks reactive sequences and fallbacks from interrupting it. I don't have an example for this, but it should be pretty straight forwards to recreate.

I'd propose delay recorded a timestamp like timeout and didn't pass on to its child until that timeout was reached. It could return running in the meantime.

Reset and Retry

Reset and retry both have a similar issue, wherein they don't explicitly set their attempt counts to 0. (They also seem to use len as their attempt counter for 1, which seems to accidentally work since len is always 1). This means that a retry will only work as expected the first time. If you retry-a-retry, every cycle after the first will only execute one loop.

For example, the below loop should call fail_empty() 9 times, but actually only does it 5 times.

root main {
    retry(3) retry(3) fail_empty()
}

Behind the scenes, this is because you're seeing:

retry 1
   - retry 1
   - retry 2
   - retry 3
retry 2
   - retry 3
retry 3
   - retry 3

I'd propose retry and reset explicitly set their counters to 0 during prepare(), but only if they didn't return RUNNING last tick. This should also be reset if they're halted.

Flow

Sequences and Fallbacks

Sequences and fallbacks never reset their P_CURSOR, so a repeat of a sequence that had children that returned running will forever restart from that running child.

The below should call success 8 times, or 6 times accounting for the reset-repeat bug mentioned above. However, it only calls success 5 times, since the sequence restarts on the last child, as P_CURSOR is never reset. Note the repeat(2) success() is used to generate a RUNNING status to trigger sequence to set its P_CURSOR.

root main {
    repeat(2) sequence {
        success()
        success()
        repeat(2) success()
    }
}

I'd propose the P_CURSOR gets reset if sequence or fallback return a non-running result. That way you'd enter them "fresh" if you try to repeat or retry them. They should also reset their P_CURSOR if halted, see below.

M_Sequence and Parallel

I haven't investigated these, but since I can't find anywhere that explicitly sets P_CURSOR to 0 I assume these guys have similar problems.

Halting running nodes on an interrupt

As touched on above, all of these nodes will behave unexpectedly if they're interrupted by a reactive sequence or fallback node.

This is hard to demo with the above bugs present, but consider the below tree:

root main {
    retry(2) r_sequence {
        return_success_then_fail_then_successes_there_after()
        repeat(2) success()
    }
}

By my reasoning, success() should be called three times, but would only be called 2. This is because repeat was never notified that it was interrupted by the reactive sequence, so on the second retry it picks up from where it left off, rather than restarting its repeat count from 2.

The solution to this seems pretty involved. All nodes need the concept of being halted, so that if they're currently RUNNING they tidy themselves up neatly and start properly next time. Behavior Tree CPP has this concept, so hopefully it could be used for inspiration on how to implement this.

Leaf / user nodes

This concept of halting should be extended to user implemented nodes, so that their logic can clean themselves up if they too are interrupted.

I haven't thought too much about how to implement this, potentially via a new halt function next to tick in the impl trait. This could have a default implementation so that existing code / synchronous nodes didn't need to worry about it.

besok commented 4 months ago

Thank you for the awesome description. I will take a look later at the weekend.

samouwow commented 4 months ago

Thanks for the quick merging of that first PR.

Main still isn't building on my machine, is it building for you?

besok commented 4 months ago

It is building on my machines (I checked on Linux, Mac and Windows). The problems were on the Windows machine but now it works. Can you describe a problem?

samouwow commented 3 months ago

I've opened an issue with the cargo output of the failing build to avoid polluting this issue.

Please see #68

forester-bt / forester