algesten / str0m

A Sans I/O WebRTC implementation in Rust.
MIT License
317 stars 49 forks source link

`IceAgent`'s `poll_timeout`and `handle_timeout` disagree by `TIMING_ADVANCE` #514

Closed thomaseizinger closed 4 months ago

thomaseizinger commented 4 months ago

str0m currently has a hard-coded value of 50ms that will apply to several return paths of poll_timeout. This means, there is a discrepancy between poll_timeout and handle_timeout.

If I add candidates to the IceAgent it will form pairs and it will send STUN requests to those after calling handle_timeout. However, poll_timeout adds an additional 50ms to the last known now value, meaning 50ms first have to pass before handle_timeout will actually start sending STUN requests.

Based on the docs, it appears that these 50ms actually come from the ICE spec but it also says that we MAY use a different value. Would there be a problem in removing these 50ms?

This problem came up because I am currently writing property-based tests for our entire state machine (https://proptest-rs.github.io/proptest/proptest/state-machine.html) and there, I am advancing time using a separate state transition from the packets that are being sent. In this test, I have a loop that exhausts the current state machine which in this case doesn't work: We get stuck half-way through establishing a connection because we'd need to advance time by an additional 50ms before str0m would start sending STUN requests.

algesten commented 4 months ago

Hi @thomaseizinger (welcome back! 😉)

So far, the 50ms seems to work well for our real-world use cases. If I understand this correctly, the problem arises in test cases, rather than real world scenarios?

I'm happy to take a PR that makes the timing advance configurable. I don't know what happens if you set that to 0, but if you make it configurable, you can of course experiment without affecting anything else.

thomaseizinger commented 4 months ago

Hi @thomaseizinger (welcome back! 😉)

Thank you! :)

So far, the 50ms seems to work well for our real-world use cases. If I understand this correctly, the problem arises in test cases, rather than real world scenarios?

It manifests in the tests yes. In reality, 50ms are so short that it probably doesn't make a difference.

What I don't really understand is what it achieves? My expectation would be that, if handle_timeout has "work" to do, then poll_timeout should return last_now and not last_now + 50ms. What do we gain from deferring things we could be doing immediately by another 50ms?

I'm happy to take a PR that makes the timing advance configurable. I don't know what happens if you set that to 0, but if you make it configurable, you can of course experiment without affecting anything else.

That sounds good, I'll send one!

algesten commented 4 months ago

What I don't really understand is what it achieves?

I guess the Ice Agent was among the first things I implemented in str0m. With time being driven from outside, I remember having a thought that I needed time to always go forward. This has been abandoned in the rest of the code base however.

Be interesting to know what happens if you set it to 0.

thomaseizinger commented 4 months ago

What I don't really understand is what it achieves?

I guess the Ice Agent was among the first things I implemented in str0m. With time being driven from outside, I remember having a thought that I needed time to always go forward. This has been abandoned in the rest of the code base however.

Be interesting to know what happens if you set it to 0.

I ran all the tests with 0 and they pass.

From my own experience, it is easy to create bugs with the SANS-IO model where poll_timeout keeps returning the same instant even after handle_timeout is called. That then leads to a busy-loop of the eventloop and everything gets stuck. Such a bug gets avoided / masked if we always increment time in at least 50ms steps.

I don't think we need it though. I don't have anything like that in our system and the amount of sans-io code there keeps growing :)

I'll set it to 0 using the linked patch and report back. If it doesn't cause any issues, we can perhaps delete it?

algesten commented 4 months ago

I'll set it to 0 using the linked patch and report back. If it doesn't cause any issues, we can perhaps delete it?

Yeah, maybe. I'd need to test this in str0m (rtc) to ensure we don't rely on this being the main "driver of time forward". That responsibility could of course move up the stack (probably into Rtc).