Healthy connections get destroyed during ice candidate negotiation

welldan97 commented 4 years ago

Looks like this line here sometimes destroys even healthy connections!

https://github.com/feross/simple-peer/blob/cb73f00eca78054fe57047b2bf12eb2560f25ed3/index.js#L655

It happens when connection temporary fails and ICE is still negotiating candidates.

A little intro

1) Connection in "failed" state stands for:

One or more of the ICE transports on the connection is in the "failed" state. (https://developer.mozilla.org/en-US/docs/Web/API/RTCPeerConnection/connectionState)

But it actually also happen when iceConnectionState is "disconnected".

2) iceConnectionState in "disconnected" state means:

Checks to ensure that components are still connected failed for at least one component of the RTCPeerConnection. This is a less stringent test than "failed" and may trigger intermittently and resolve just as spontaneously on less reliable networks, or during temporary disconnections. When the problem resolves, the connection may return to the "connected" state. (https://developer.mozilla.org/en-US/docs/Web/API/RTCPeerConnection/iceConnectionState)

And what I've realized right now, is that actually even "failed" iceConnectionState can be recovered. So may be this line could be also somehow improved as well: https://github.com/feross/simple-peer/blob/cb73f00eca78054fe57047b2bf12eb2560f25ed3/index.js#L676

How to reproduce

So I used chrome & safari browsers. And checked connection from 2 macbooks, one connected to 3G another to wifi. I pass signals(offers/answers) manually via messenger

I've used turn& stun servers from Twilio as configuration

Ok. So first I create Peer with initiator:true, trickle: false, on first mac. I get offer with candidates.

On another mac then I create Peer with initiator: false, trickle: false. And then I pass offer this.peer.signal(offer). If I'll pass answer back to initiator fast enough, connection get created. But I wait for around 15 seconds & then connection get destroyed, because iceConnectionState switches to "disconnected" or to "failed". And that's what is fishy. The interesting thing is that if I comment out mentioned lines then I am able to create connection even when it's in "failed" state.

Thoughts

So this makes me think that "failed" or "disconnected" states for iceCandidates and "failed" state for connection itself don't mean too much. It's not a failure yet, I guess, since it can be reestablished so easily, it's more "on hold" type of state.

I'm not really sure how the issue can be fixed, though since I'm pretty new to WebRTC myself. And all these states are confusing. It needs further experimentation

Thank you!

nazar-pc commented 4 years ago

Can you try to replicate this between Chrome/Chrome Chrome/Firefox Firefox/Firefox? I have a sneaking suspicion this is one more quirk in Safari implementation, but would be nice to have more data.

welldan97 commented 4 years ago

Ok! So I've tried chrome to chrome connection. And it behaves almost the same. Also fails if to wait too long. And it can be connected if comment out these destruction lines.

This time I have different errors though.

So for chrome(initator) <-> chrome connection the error is: [18997b5] destroy (error: Connection failed.)

The state right before the fail:

{
  iceGatheringState: "complete",
  iceConnectionState: "disconnected",
  connectionState: "failed",
  signalingState: "stable"
}

For chrome(initiator) <-> safari the error is:

[4c824a5] destroy (error: Ice connection failed.)

The state before fail:

{
  iceGatheringState: "complete",
  iceConnectionState: "failed",
  connectionState: "connecting",
  signalingState: "stable"
}

I saw both these problems in chrome/safari pair though, as per issue.

welldan97 commented 4 years ago

I've made a fork with a condition for this._connected https://github.com/welldan97/simple-peer/commit/7a8cbe28d17fd5f4d79cc24bc64ffa321c802b29

This works for me, so far. I bet this would be also better for many cases.

But probably it doesn't cover all the cases good enough: now I think that even once connected it could reach these strange recoverable states.

nazar-pc commented 4 years ago

The latest information I know is that SDP is only valid for about 30 seconds if the spec didn't change. So you can't really store it and if you wait for too long connection will fail and this is expected.

@t-mullen do you remember why those were added? It is possible there was some kind of quick in one of the browsers that it was necessary to fail the whole connection.

Usually though this will happen automatically and you can retry quickly, so shouldn't be a big issue. But the problem is interesting nonetheless.

Also did you verify that connection actually works, meaning you can send audio, video or other data over it?

t-mullen commented 4 years ago

connectionState == 'failed' implies iceConnectionState == 'failed'. These signal fatal errors in one of the underlying ICE transports. Some parts of your connection may still work (since you can have multiple ICE transports). but the overall peer is broken.

These states are equivalent. The reason we listen for both is due to a bug in Chrome causing the oniceconnectionstatechange event not being fired.

connectionState == 'disconnected' is different and isn't fatal. We don't tear the peer down for this. This state doesn't seem relevant to your issue.

Are you sure the connection is usable when iceConnectionState == 'failed'?

t-mullen commented 4 years ago

ICE failures are sometimes recoverable using an ICE restart, a procedure not implemented in simple-peer yet. See #579

Performing an ICE restart is recommended when iceConnectionState transitions to "failed".

If you can confirm the peer datachannel and any audio/video streams still work after ICE failure, I can look more into this.

welldan97 commented 4 years ago

@nazar-pc So today I've made some experiments.

Unfortunately this time 4g Mac could not connect to wifi Mac at all. That was strange.

Then I've tried to connect to my friend in Russia. So the set up was like this:

Initiator: Me(Chrome, Macbook, Portugal) Opponent: My friend(Chrome, Macbook, Russia)

So we tried several experiments, each time same scenario: 1) I initiate connection, 2) I send offer via messenger. 3) And then he sends me answer back 4) I wait for X min 5) and then I establish connection. 6) try sending messages via chat

These are the results:

X/delay - result:

≈0 min - success, ≈0.5 min - success, ≈1 min - success, ≈2 min - failure, ≈2.5 min - failure, ≈5 min - failure,

I've tried each delay only once.

Another thing I've tried is waiting before sending the offer. I've waited for 2 mins after it creation and it was still successful connection.

Thanks for inputs @t-mullen . I'm happy to learn more about WebRTC, and simple-peer As for now, though, I'm more interested in establishing connection first.

So regarding this, to summarize. According my experiments Peer.js kills connection in around 15 seconds, when the connection officially dies. But if peer.js would not kill a connection,a user could still recover connection if they connect within 1 min. And that's a huge difference when one send signaling manually via messenger!

I see that with signaling server it's a no issue at all. But, I'm building an experimenting project, and I want to see how far the p2p can be pushed with WebRTC. I want to have an ability to establish connection manually - this way one need even less servers.

I think that could be useful to other users as well, may be as opt-in.

What do you think?

t-mullen commented 4 years ago

Having long delays in signalling is not something that can be done reliably. The ICE candidates contained in the signal messages are time-sensitive.

The reason for this isn't due to simple-peer or even WebRTC, but your router. Most routers have timeouts on address mappings, which means the connection information in the server-reflexive ICE candidates expires.

https://en.wikipedia.org/wiki/Network_address_translation https://www.rfc-editor.org/rfc/pdfrfc/rfc4787.txt.pdf

The only guarantee the spec gives you is this (keep in mind not every router follows the spec).

A NAT UDP mapping timer MUST NOT expire in less than two minutes.

Although it would be nice if we could just connect whenever we wanted with an IP and port, the reality of NATs means that's impossible in the general case.

fippo commented 4 years ago

Most routers have timeouts on address mappings, which means the connection information in the server-reflexive ICE candidates expires.

WebRTC implementations should periodically ping the STUN server to keep the mapping open, see https://tools.ietf.org/html/rfc5389#section-6. If that isn't the case, please file a bug against the implementation!

michaelpalumbo commented 4 years ago

Most routers have timeouts on address mappings, which means the connection information in the server-reflexive ICE candidates expires.

WebRTC implementations should periodically ping the STUN server to keep the mapping open, see https://tools.ietf.org/html/rfc5389#section-6. If that isn't the case, please file a bug against the implementation!

I'm also experiencing the issue that @welldan97 has laid out. If I wanted to ping the STUN server in the simple-peer implementation, how would I do so?

kk-007 commented 4 years ago

it looks funny but i have the same issue and i solved it by changing internet connection. when i switch to mobile data from a broadband project run well and when i again switch back to broadband it gives that error. my project: https://github.com/kk-007/webrtc-video-call

feross / simple-peer