awslabs / amazon-kinesis-video-streams-webrtc-sdk-c

Amazon Kinesis Video Streams Webrtc SDK is for developers to install and customize realtime communication between devices and enable secure streaming of video, audio to Kinesis Video Streams.
https://awslabs.github.io/amazon-kinesis-video-streams-webrtc-sdk-c/group__PublicMemberFunctions.html
Apache License 2.0
1.04k stars 313 forks source link

[QUESTION] TURN candidates never work with ICE_TRANSPORT_RELAY setting #1851

Closed mehmetcy84 closed 9 months ago

mehmetcy84 commented 11 months ago

We are using this repo and want to ensure connection stability via usage of TURN candidates for worst case scenario. But TURN candidates never work. If ICE_TRANSPORT_RELAY is chosen in samples/Common.cpp instead of ICE_TRANSPORT_ALL, connection is never established.

We have been using Master 1.7.3. And in some of our networks, webtrc connections are achieved while in others it is not. In order to ensure maximum stability, we would like to make sure TURN candidates are used for the last line of defense. But even in our networks where STUN works, TURN candidates never work. We have observed this on Windows builds as well as cross compiled Android builds. Updated our repos to the latest master 1.9, but the problem persists.

In order to ensure we didn't introduce any artifacts, I have cloned the 1.9 master, only applied the following two changes to avoid build errors, other than that the following logs are captured on vanilla kvs sample (kvsWebRTCClientViewer.exe) built for Windows run against the KVS WebRTC Test Page.

Steps to reproduce:

  1. Clone the repo (kvs Master v1.9) in Windows 10
  2. Change the following line in /CMake/Dependencies/libwebsockets-CMakeLists.txt
    • -DLWS_WITH_MINIMAL_EXAMPLES=1 // default causing build error
    • -DLWS_WITH_MINIMAL_EXAMPLES=0 // modified
  3. Change the following line in /CMakeLists.txt -option(ENABLE_AWS_SDK_IN_TESTS "Enable support for compiling AWS SDKs for tests" ON) +option(ENABLE_AWS_SDK_IN_TESTS "Enable support for compiling AWS SDKs for tests" OFF)
  4. Run ..github\build_windows.bat
  5. Set up the Kinesis WebRTC Test Page with region, creds, channel name. Make sure "Send Audio" is ticked. (The viewer sample default gets and reports audio frames)
  6. Run the kvsWebRTCClientViewer.exe with the channel name.

Expected Result: "Audio Frame received" prints displayed. Actual Result: "Audio Frame received" prints displayed.

  1. In /samples/Common.cpp InitializePeerConnection function, make the following change: -configuration.iceTransportPolicy = ICE_TRANSPORT_POLICY_ALL; +configuration.iceTransportPolicy = ICE_TRANSPORT_POLICY_RELAY;
  2. Run the kvsWebRTCClientViewer.exe with the channel name.

Expected Result: "Audio Frame received" prints displayed. Actual Result: Frames are not received, and the console is flooded with TURN error messages.

The following logs are captured for "VERBOSE" log setting. Sensitive info censored. For clarity additional logs are manually added (ones which start with "mehmetcy") in turnConnectionTimerCallback function, where the errors are seen. logs.txt

niyatim23 commented 11 months ago

Hi @mehmetcy84, we have a Windows CI in our repository now. Please refer to the steps here so that you can build with PThreads without modifying anything related to libwebsockets. Can you please check if this issue persists after you use PThreads?

mehmetcy84 commented 11 months ago

I am not sure if I am following correctly. I haven't disabled pthreads.

Setting this value to 0 -DLWS_WITH_MINIMAL_EXAMPLES=0 // modified in /CMake/Dependencies/libwebsockets-CMakeLists.txt

Would this change affect the usage of pthreads? Isn't this flag just for building some samples?

mehmetcy84 commented 11 months ago

I see that in the workflow, build_windows_openssl.bat is used instead of build_windows.bat. And this file has additional configuration lines for pthread. Now my confusion about disabling pthread is resolved.

I have reverted all my changes, so that my git head is at kvs master branch SHA-1: 0ae9ec6b80deff9df724555927fb6595a23054f1 again.

I made sure the pthread folder is where it is expected, and ran build_windows_openssl.bat and the build succeeded without any errors.

I ran the kvsWebrtcClientViewer.exe again, and receive audio frames. Then I set the policy to "RELAY ONLY", and the client again cannot make peer connection. So the issue persists. kinesisLogs.txt The logs are attached.

niyatim23 commented 11 months ago

Hi @mehmetcy84, I see the error STATUS_TURN_CONNECTION_ALLOCAITON_FAILED in your logs. Is us-east-2 geographically closer to you than any other AWS region? Can you check the connection establishment with ICE_TRANSPORT_ALL instead? The sample application is already configured to use relay candidates with useTurn in createSampleConfiguration

mehmetcy84 commented 11 months ago

I am using the credentials and region information my company is currently using, I am not sure if they make use of any other regions but I will check.

In its default configuration, useTURN is true in createSampleConfiguration, and ICE_TRANSPORT_POLICY_ALL is selected. In its default configuration, I can make the webRTC connection, but only because srflx candidates are the ones that succeed all the times. And when my co-worker runs the code, they can never connect because for some reason srflx candidates do not work for them.

We were expecting to see that relay candidates are selected as the last line of defense, as the worst case scenario happens and no host or srflx candidates work. But this does not happen.

So seeing my coworker not being able to connect at all, and me connecting via srflx candidate all the time made me realise: TURN candidates never work for us.

And that's why I wanted to test TURN candidates specifically and selected ICE_TRANSPORT_POLICY_RELAY, to be able to demonstrate that relay candidates do not work.

So, what might be the reason we get STATUS_TURN_CONNECTION_ALLOCAITON_FAILED? Thanks and best regards,

mehmetcy84 commented 11 months ago

Just checked and verified that regions are not configurable for our project. So I would like to find out if there is a solution for making use of TURN without changing regions.

mehmetcy84 commented 11 months ago

I have run the sample viewer client exe from our UK office via remote access, and TURN candidates which are of [us-east-#] region do work fine, and WebRTC connection now succeeds.

But what would be the solution to enable access to our server with a fixed region, worldwide via TURN?

niyatim23 commented 11 months ago

@mehmetcy84, can you please help me understand your setup?

  1. Where is your master?
  2. Where is your viewer?
  3. Are they in the same region?
  4. Are you located in us-east-2? If not, where are you trying to access us-east-2 from?
  5. What is the network like on the master and the viewer? Are there any firewalls enabled? If yes, we would recommend turning them off before trying to establish the connection

STATUS_TURN_CONNECTION_ALLOCATION_FAILED happens when the turn allocation is tried the max number of times and fails. From your logs it seems like it continues to try with the same candidate and fails the turn state machine:

2023-11-27 10:39:38.528 VERBOSE incomingRelayedDataHandler(): Candidate id: 4+0Mpjz6i
2023-11-27 10:39:38.555 VERBOSE turnConnectionStepState(): Updated turn allocation request credential after receiving 401
2023-11-27 10:39:38.555 DEBUG   turnConnectionStepState(): TurnConnection state changed from TURN_STATE_GET_CREDENTIALS to TURN_STATE_ALLOCATION
2023-11-27 10:39:38.589 VERBOSE incomingRelayedDataHandler(): Candidate id: 4+0Mpjz6i
2023-11-27 10:39:38.653 VERBOSE incomingRelayedDataHandler(): Candidate id: 4+0Mpjz6i
2023-11-27 10:39:38.715 VERBOSE incomingRelayedDataHandler(): Candidate id: 4+0Mpjz6i
2023-11-27 10:39:38.740 ERROR   turnConnectionStepState(): operation returned status code: 0x5a00002a

I see that a new candidate is available later but I see that it results in a similar failure:

2023-11-27 10:39:38.970 VERBOSE incomingRelayedDataHandler(): Candidate id: fLNl1mV8S
2023-11-27 10:39:38.985 ERROR   turnConnectionTimerCallback(): operation returned status code: 0x00000001
2023-11-27 10:39:38.985 ERROR   timerQueueExecutor(): operation returned status code: 0x00000001
2023-11-27 10:39:38.985 VERBOSE turnConnectionStepState(): Updated turn allocation request credential after receiving 401
2023-11-27 10:39:38.985 DEBUG   turnConnectionStepState(): TurnConnection state changed from TURN_STATE_GET_CREDENTIALS to TURN_STATE_ALLOCATION
2023-11-27 10:39:39.023 VERBOSE incomingRelayedDataHandler(): Candidate id: fLNl1mV8S
2023-11-27 10:39:39.048 ERROR   turnConnectionTimerCallback(): operation returned status code: 0x00000001
2023-11-27 10:39:39.048 ERROR   timerQueueExecutor(): operation returned status code: 0x00000001
2023-11-27 10:39:39.093 VERBOSE incomingRelayedDataHandler(): Candidate id: fLNl1mV8S
2023-11-27 10:39:39.109 ERROR   turnConnectionTimerCallback(): operation returned status code: 0x00000001
2023-11-27 10:39:39.109 ERROR   timerQueueExecutor(): operation returned status code: 0x00000001
2023-11-27 10:39:39.147 VERBOSE incomingRelayedDataHandler(): Candidate id: fLNl1mV8S
2023-11-27 10:39:39.171 ERROR   turnConnectionTimerCallback(): operation returned status code: 0x00000001
2023-11-27 10:39:39.171 ERROR   timerQueueExecutor(): operation returned status code: 0x00000001
2023-11-27 10:39:39.171 ERROR   turnConnectionStepState(): operation returned status code: 0x5a00002a
2023-11-27 10:39:39.171 DEBUG   turnConnectionStepState(): TurnConnection state changed from TURN_STATE_ALLOCATION to TURN_STATE_FAILED
mehmetcy84 commented 11 months ago

Hello, thanks for the reply.

1.Where is your master?

The WebRTC Test Page Master I am trying to make a WebRTC connection to is located in the UK and/or in Turkiye. Both cases fail.

  1. Where is your viewer?

The Viewer client is run in Turkiye. (Fails)

3.Are they in the same region?

Both master and client are configured with region us-east-1 or us-east-2. (Our actual aim is to connect to some cameras which makes use of regions us-east-1 and us-east-2 along with eu-west-2. So we have to make sure we are able to connect to us-east-#)

4.Are you located in us-east-2? If not, where are you trying to access us-east-2 from?

Geographically, I am with my failing client located in Turkiye, and making use of access/secret key pair coupled with us-east-1&2 regions.

5.What is the network like on the master and the viewer? Are there any firewalls enabled? If yes, we would recommend turning them off before trying to establish the connection

I will check this, but I thought TURN would be a solution for problematic network cases or firewalls. I will try disabling firewalls and try again.

mehmetcy84 commented 11 months ago

Ok, so we have discussed the possible impact of my geographical location to describe the failure of TURN connection.

Here is a contradictary observation to note: I have a set of Access/Secret Key and my region: us-east-1

I use the same Access/Secret Key and the same region:us-east-1

So does this not eliminate the possible effect of geographical location?

mehmetcy84 commented 11 months ago

P.S. I have now my firewall disabled. On my PC, I run the KVS WebRTC Test Page running as master (TURN ONLY) on my browser and the kvsWebRTCClientViewer.exe (forced to work with TURN ONLY) on the same PC. And the WebRTC connection still fails.

mehmetcy84 commented 11 months ago

Ok, so we have discussed the possible impact of my geographical location to describe the failure of TURN connection.

Here is a contradictary observation to note: I have a set of Access/Secret Key and my region: us-east-1

  • I have a PC in UK. I run the KVS WebRTC Test Page as Master. I choose TURN ONLY.
  • I have a PC in Turkiye. I run the KVS WebRTC Test Page as Viewer. I choose TURN ONLY. -> Here the WebRTC connection succeeds, and via chrome://webrtc-internals, I verify that the succeeded connection uses relay candidates for both ends.

I use the same Access/Secret Key and the same region:us-east-1

  • I have the same PC in UK. I run the KVS WebRTC Test Page as Master. I choose TURN ONLY.
  • I now instead run the unmodified Kinesis C SDK Sample viewer client exe. ( the only modification being the forcing TURN ONLY in samples/common.c) -> WebRTC connection fails, and I receive those logs where TURN connection is being reported to fail.

So does this not eliminate the possible effect of geographical location?

@niyatim23 Can you think of anyway to verify why WebRTC Test Page viewer might be able make the TURN connection with the same credentials/region info, but not the C SDK sample viewer?

stefankiesz commented 10 months ago

Hi @mehmetcy84,

Looking at the following excerpt from your logs, I see the first three incomingRelayedDataHandler attempts for a candidate fail, but the fourth one is successful as logged by turnConnectionHandleStun.

2023-11-27 10:39:38.555 DEBUG   turnConnectionStepState(): TurnConnection state changed from TURN_STATE_GET_CREDENTIALS to TURN_STATE_ALLOCATION
2023-11-27 10:39:38.589 VERBOSE incomingRelayedDataHandler(): Candidate id: 4+0Mpjz6i
2023-11-27 10:39:38.653 VERBOSE incomingRelayedDataHandler(): Candidate id: 4+0Mpjz6i
2023-11-27 10:39:38.715 VERBOSE incomingRelayedDataHandler(): Candidate id: 4+0Mpjz6i
2023-11-27 10:39:38.740 ERROR   turnConnectionStepState(): operation returned status code: 0x5a00002a
2023-11-27 10:39:38.740 DEBUG   turnConnectionStepState(): TurnConnection state changed from TURN_STATE_ALLOCATION to TURN_STATE_FAILED
2023-11-27 10:39:38.778 VERBOSE incomingRelayedDataHandler(): Candidate id: 4+0Mpjz6i
2023-11-27 10:39:38.778 DEBUG   turnConnectionHandleStun(): TURN Allocation succeeded. Life time: 600 seconds. Allocation expiration epoch 170108217877807
2023-11-27 10:39:38.802 WARN    turnConnectionStepState(): TurnConnection in TURN_STATE_FAILED due to 0x5a00002a. Aborting TurnConnection
2023-11-27 10:39:38.802 DEBUG   turnConnectionStepState(): TurnConnection state changed from TURN_STATE_FAILED to TURN_STATE_CLEAN_UP

The turnConnectionStepState invocations check on an atomic boolean hasAllocation to determine whether the allocation was successful. If unsuccessful, the state will not change from TURN_STATE_ALLOCATION. The calling of turnConnectionStepState is handled by a timer on a 50ms interval, so we check for this hasAllocation every 50ms, but only up to DEFAULT_TURN_ALLOCATION_MAX_TRY_COUNT, which is currently set to 3. Looking at the timestamps associated with incomingRelayedDataHandler, we see it takes more than 50ms in between each of the four calls, so that's why it never gets to handle the success case on the fourth invocation as it has surpassed the 150ms "time out".

Can you please try to set the DEFAULT_TURN_ALLOCATION_MAX_TRY_COUNT to a higher number here, say 10 for now in the below line in turnConnection.h and see if that succeeds past the allocation state?

#define DEFAULT_TURN_ALLOCATION_MAX_TRY_COUNT 3

Furthermore, I'd also like to investigate why it fails the first three times. That hasAllocation boolean is only ever set to TRUE upon turnConnectionHandleStun receiving a successful allocation response; turnConnectionHandleStun gets invoked by turnConnectionIncomingDataHandler. We know turnConnectionIncomingDataHandler gets invoked all four times since the "Candidate id: 4+0Mpjz6i" line gets logged just before its invocation. Can we add logs into turnConnectionIncomingDataHandler, especially just before the turnConnectionHandleStun invocation here to confirm turnConnectionHandleStun is invoked? And within turnConnectionHandleStun, let's add logs to understand why we don't get to this line which sets hasAllocation to true.

mehmetcy84 commented 10 months ago

Hi @stefankiesz, Thanks for the reply. I will be testing with what you have suggested and return with an answer soon. Best regards,

mehmetcy84 commented 10 months ago

H @stefankiesz, Currently I am occupied with some other things to attend, but I quickly gave it a try to make the following change: #define DEFAULT_TURN_ALLOCATION_MAX_TRY_COUNT 10 This didn't resolve the issue. I am planning to provide logs and more details, and will capture the additional logs you had asked for.

I will be able to return with these logs in a week. Is it ok if the the issue is kept open until I find the time to respond again with more detail, as I am not authorized to re-open a closed issue with a new comment, and I do not wish to lose this thread. Best regards,

disa6302 commented 10 months ago

We will keep this issue open.

tplusy commented 10 months ago

I found this issues too.

mehmetcy84 commented 9 months ago

Thanks for keeping this issue open.

Hi @stefankiesz, I have added the logs you had asked for.

Btw, I have updated my branch to latest master 1.9.1 (commit id: bfa6667e2f2eeb800a0edd6e6e4745b4faf34536)

In the following tests, the only changes in the code are debug prints and forcing of RELAY_ONLY candidates. In src/common.c initializePeerConnection function:

// Set the ICE mode explicitly
//configuration.iceTransportPolicy = ICE_TRANSPORT_POLICY_ALL;
configuration.iceTransportPolicy = ICE_TRANSPORT_POLICY_RELAY;

Test case 1: RetryParamDefault3WithLogs.txt

define DEFAULT_TURN_ALLOCATION_MAX_TRY_COUNT 3

The logs show that both a) At turnconnection.c file, in turnConnectionIncomingDataHandler function, turnConnectionHandleStun is called.

DLOGW("mehmetcy2>> turnConnectionIncomingDataHandler calling turnConnectionHandleStun");
CHK_STATUS(turnConnectionHandleStun(pTurnConnection, pCurrent, processedDataLen));

b) At turnconnection.c file, in turnConnectionHandleStun function, hasAllocation is set to TRUE.

DLOGW("mehmetcy2>> turnConnectionHandleStun-2 case STUN_PACKET_TYPE_ALLOCATE_SUCCESS_RESPONSE setting hasAllocation TRUE");
ATOMIC_STORE_BOOL(&pTurnConnection->hasAllocation, TRUE);

Nevertheless, turn state is declared as TURN_STATE_FAILED.

Test case 2: RetryParamMade10WithLogs.txt

The following change is applied: //#define DEFAULT_TURN_ALLOCATION_MAX_TRY_COUNT 3

define DEFAULT_TURN_ALLOCATION_MAX_TRY_COUNT 10

Following TURN state changes are observed in the logs: TURN_STATE_NEW -> TURN_STATE_CHECK_SOCKET_CONNECTION TURN_STATE_CHECK_SOCKET_CONNECTION -> TURN_STATE_GET_CREDENTIALS TURN_STATE_GET_CREDENTIALS -> TURN_STATE_ALLOCATION TURN_STATE_ALLOCATION -> TURN_STATE_CREATE_PERMISSION TURN_STATE_CREATE_PERMISSION -> TURN_STATE_BIND_CHANNEL TURN_STATE_BIND_CHANNEL -> TURN_STATE_READY TURN_STATE_READY -> TURN_STATE_CLEAN_UP

And the program exits. Both logs are attached. If you need, I can add more logs to help understand why TURN fails with retry param set to 10. Best regards,

disa6302 commented 9 months ago

Looking at the logs, it seems with the change, the relay candidates are created successfully, but connectivity checks are failing i.e, no candidate pairs are found within 10 seconds. An observation that is strange:

grep "New remote ice candidate discovered" RetryParamMade10WithLogs.txt
2024-01-19 14:05:11.452 DEBUG   iceAgentLogNewCandidate(): New remote ice candidate discovered. Id: fZ+4XiMkI. Ip: 192.168.1.104:65450. Type: host. Protocol: udp.
2024-01-19 14:05:11.502 DEBUG   iceAgentLogNewCandidate(): New remote ice candidate discovered. Id: fZ+4XiMkI. Ip: 192.168.56.1:65451. Type: host. Protocol: udp.
2024-01-19 14:05:11.641 DEBUG   iceAgentLogNewCandidate(): New remote ice candidate discovered. Id: fZ+4XiMkI. Ip: 78.170.129.117:65450. Type: srflx. Protocol: udp.
2024-01-19 14:05:11.820 DEBUG   iceAgentLogNewCandidate(): New remote ice candidate discovered. Id: fZ+4XiMkI. Ip: 52.91.222.140:53758. Type: relay. Protocol: udp.
2024-01-19 14:05:11.848 DEBUG   iceAgentLogNewCandidate(): New remote ice candidate discovered. Id: fZ+4XiMkI. Ip: 3.94.126.20:53871. Type: relay. Protocol: udp.
2024-01-19 14:05:12.267 DEBUG   iceAgentLogNewCandidate(): New remote ice candidate discovered. Id: fZ+4XiMkI. Ip: 52.91.222.140:56089. Type: relay. Protocol: udp.
2024-01-19 14:05:12.300 DEBUG   iceAgentLogNewCandidate(): New remote ice candidate discovered. Id: fZ+4XiMkI. Ip: 52.91.222.140:62506. Type: relay. Protocol: udp.
2024-01-19 14:05:12.378 DEBUG   iceAgentLogNewCandidate(): New remote ice candidate discovered. Id: fZ+4XiMkI. Ip: 3.94.126.20:59005. Type: relay. Protocol: udp.
2024-01-19 14:05:12.380 DEBUG   iceAgentLogNewCandidate(): New remote ice candidate discovered. Id: fZ+4XiMkI. Ip: 3.94.126.20:59687. Type: relay. Protocol: udp.

See how all the remote candidates have the same ID? This shouldnt be the case since we generate a random JSON string. Wondering if there is any code modification here because I see the same ID in both the logs files you attached too.

A few questions:

  1. When you tried the JS Master (from UK to us-east-1) and JS viewer (from Turkiye to us-east-1), how fast was the connection establishment?
  2. Just curious if you have tried JS Viewer and C Master and if you have had success with it.
  3. I understand JS worked with using us-east-1, but can you try a region that is closer to master and run the same test and see what happens? Or select a region that is central for Turkiye and UK - maybe eu-central-1?
mehmetcy84 commented 9 months ago

Apart from the common.c change:

// Set the ICE mode explicitly
//configuration.iceTransportPolicy = ICE_TRANSPORT_POLICY_ALL;
configuration.iceTransportPolicy = ICE_TRANSPORT_POLICY_RELAY;

and the newly added debug prints, there are no changes at all in the code. It is identical to 1.9.1.

Answers:

  1. I have attached the "Debug" level logs from the WebRTC Test Page console for UK JS Master and Turkiye JS Viewer. The connection was established quite fast as can be seen from the logs too. Both Master and Viewer are set as "Relay Only". jKVSClient.txt jKVSMaster.txt
  2. I will give this a try next week.
  3. Our application's region is fixed and we are curious if there could be a workaround for our current fixed regions. Thanks and best regards,
niyatim23 commented 9 months ago

Hi @mehmetcy84, I tried ICE_TRANSPORT_POLICY_RELAY on a Windows EC2 instance with v1.9.1 and it works without any issues for me. I could see the stream on the JS test page with no issues. Also, while trying this, I'm physically located in a region different from the one in which my stream exists

2024-02-06 18:20:13.017 PROFILE iceAgentReadyStateSetup(): Selected pair RkoWQGk0u_fZ+4XiMkI, local candidate type: relay. remote candidate type: relay. Round trip time 265 ms. Local candidate priority: 16777215, ice candidate pair priority: 72057589826734591

Do you have any updates on this issue?