awslabs / amazon-kinesis-video-streams-webrtc-sdk-c

Amazon Kinesis Video Streams Webrtc SDK is for developers to install and customize realtime communication between devices and enable secure streaming of video, audio to Kinesis Video Streams.
https://awslabs.github.io/amazon-kinesis-video-streams-webrtc-sdk-c/group__PublicMemberFunctions.html
Apache License 2.0
1.01k stars 304 forks source link

[Bug]: TURN_STATE_GET_CREDENTIALS fails after update from 1.7.2. to 1.10.2 of the SDK #2004

Open skndi opened 2 months ago

skndi commented 2 months ago

Please confirm you have already done the following

Please answer the following prompt

Describe the bug

The state TURN_STATE_GET_CREDENTIALS of the TurnConnectionStateMachine fails due to error 0x58000022.

Expected Behavior

The expected behavior is to get the credentials successfully and discover a viable TURN ICE candidate.

Current Behavior

2024-06-03 12:20:26.645 DEBUG socketConnectionClosed(): Close socket 15
2024-06-03 12:20:26.645 DEBUG connectionListenerReceiveDataRoutine(): recvfrom() failed with errno Connection reset by peer for socket 15 2024-06-03 12:20:26.653 VERBOSE stepStateMachine(): [TURN-0x7f6c054d00] State Machine - Current state: 0x0000000000000004, Next state: 0x0000000000000004, Current local state retry count [1] , Max local state retry count [0], State transition wait time [0] ms 2024-06-03 12:20:26.653 VERBOSE stepStateMachine(): [TURN-0x7f6c0863c0] State Machine - Current state: 0x0000000000000004, Next state: 0x0000000000000004, Current local state retry count [1] , Max local state retry count [0], State transition wait time [0] ms 2024-06-03 12:20:26.654 WARN socketConnectionSendData(): Warning: Failed to send data. Socket closed already 2024-06-03 12:20:26.654 ERROR iceUtilsSendData(): operation returned status code: 0x58000022 2024-06-03 12:20:26.654 ERROR iceUtilsSendStunPacket(): operation returned status code: 0x58000022 2024-06-03 12:20:26.654 VERBOSE stepStateMachine(): [TURN-0x7f6c0863c0] State Machine - Current state: 0x0000000000000004, Next state: 0x0000000000000100, Current local state retry count [0] , Max local state retry count [0], State transition wait time [0] ms 2024-06-03 12:20:26.654 WARN executeFailedTurnState(): TurnConnection in TURN_STATE_FAILED due to 0x58000022. Aborting TurnConnection 2024-06-03 12:20:26.654 DEBUG stepTurnConnectionStateMachine(): [0x7f6c0863c0] Turn connection state changed from TURN_STATE_GET_CREDENTIALS to TURN_STATE_FAILED. 2024-06-03 12:20:26.654 VERBOSE stepStateMachine(): [TURN-0x7f6c0863c0] State Machine - Current state: 0x0000000000000100, Next state: 0x0000000000000080, Current local state retry count [0] , Max local state retry count [0], State transition wait time [0] ms 2024-06-03 12:20:26.654 DEBUG stepTurnConnectionStateMachine(): [0x7f6c0863c0] Turn connection state changed from TURN_STATE_FAILED to TURN_STATE_CLEAN_UP. 2024-06-03 12:20:26.654 VERBOSE stepStateMachine(): [TURN-0x7f6c0863c0] State Machine - Current state: 0x0000000000000080, Next state: 0x0000000000000100, Current local state retry count [0] , Max local state retry count [0], State transition wait time [0] ms 2024-06-03 12:20:26.654 WARN executeFailedTurnState(): TurnConnection in TURN_STATE_FAILED due to 0x58000022. Aborting TurnConnection 2024-06-03 12:20:26.654 DEBUG stepTurnConnectionStateMachine(): [0x7f6c0863c0] Turn connection state changed from TURN_STATE_CLEAN_UP to TURN_STATE_FAILED. 2024-06-03 12:20:26.654 VERBOSE stepStateMachine(): [TURN-0x7f6c0863c0] State Machine - Current state: 0x0000000000000100, Next state: 0x0000000000000100, Current local state retry count [1] , Max local state retry count [0], State transition wait time [0] ms 2024-06-03 12:20:26.654 WARN executeFailedTurnState(): TurnConnection in TURN_STATE_FAILED due to 0x58000022. Aborting TurnConnection

Reproduction Steps

AWS_KVS_LOG_LEVEL=1 ./kvsWebrtcClientMaster 0 opus h264 and the KVS WebRTC Test Page as viewer, with NAT traversal set to TURN only

WebRTC C SDK version being used

1.10.2

If it was working in a previous version, which one?

1.7.2

Compiler and Version used

GCC 10.2.1

Operating System and version

Embedded aarch64 Linux

Platform being used

aarch64

disa6302 commented 2 months ago

@skndi ,

Can you provide the complete logs? Has anything changed in you set up?

skndi commented 2 months ago

Hello @disa6302, of course, here are the logs https://gist.github.com/skndi/add4220e87540bf719601cd651c1d818 For your second question, we have small adaptations for our specific use case, but to eliminate the possibility of problems from them, I checked with the sample code - the 1.7.2 sample works as expected, the sample from 1.10.2 fails in the TURN_STATE_GET_CREDENTIALS state

disa6302 commented 2 months ago

@skndi ,

Thank you for the logs. Although the get-credentials (reason unknown at the moment), want to confirm if you see a connection establishment failure? There are 2 state machines that run in the TURN path, I see logs only for one of them and the other one seems incomplete. Do you more logs than this? Want to check if the ICE agent state machine fails too and what happens to the other TURN state machine loop.

skndi commented 2 months ago

Hi @disa6302,

The file is opened truncated by default from the gist link and shows only 7000 lines from ~50000, I think that's why you don't see the second TURN state machine logs. There is a link in the gist to the raw file - https://gist.githubusercontent.com/skndi/add4220e87540bf719601cd651c1d818/raw/4d03564900e68ed693bdb0a5d324aa95b1b96616/gistfile1.txt, can you check if it contains all logs you need? From my check I see that both state machines have entered the FAILED state, but they report different errors for it - 0x5a00002c, which is STATUS_TURN_CONNECTION_GET_CREDENTIALS_FAILED and the other one 0x58000022 which is STATUS_SOCKET_CONNECTION_CLOSED_ALREADY. Also at the end of the logs the ICE agent state machine fails with 0x5a00000d - STATUS_ICE_NO_CONNECTED_CANDIDATE_PAIR

skndi commented 2 months ago

Today I did a git bisect and the commit where the problem starts appearing is the one for the 1.8.0 release - e9f3cab63756669746cdf79a9254778b6c4cf11e, if that helps your investigation. My next step is to go on the develop branch and perform a bisect there, to see the exact commit which creates the issue.

disa6302 commented 2 months ago

@skndi ,

Thank you for the details. What appears to be happening is that the credentials could not be fetched in 5 seconds, which is the timeout on that state. But this timeout existed in v1.7.2. Could you confirm if anything changed for you in the platform?

To test it out, could you also try to increase the value here in 1.10.2 and see what happens?

skndi commented 2 months ago

Hi @disa6302,

Today I performed a bisect on https://github.com/awslabs/amazon-kinesis-video-streams-webrtc-sdk-c/pull/1776 which introduces the 1.8.0 release and have found the commit after which the sample stops working - ee5e345a179b48eaed6f7fa8784f1b785399fcad. I tried reverting it in the PR branch and the sample worked after that. On Monday I will try if reverting it on v1.10.2 works and will investigate what exactly in the changes makes it stop working on our system.

skndi commented 2 months ago

Hi,

I have investigated from our side and by a process of elimination reached the conclusion that not calling getaddrinfo in getIpWithHostName() later causes the TURN state machines to fail. In this function if I add a redundant call to getaddrinfo in the branch where parsing the IP address from the TURN server hostname is successful, then ICE succeeds and we get a working stream. I tried researching what getaddrinfo does under the hood and looked into the libc source code, but wasn't able to get an idea why this might be so, since it should only resolve hostnames to IP addresses. Our solution for the moment would be to get back to using getaddrinfo for address resolution. Do you have any insight why this might work that way, since to me it's quite confusing?

disa6302 commented 2 months ago

@skndi ,

Thanks for all the deep dive!. This is very interesting. I see this log line in your logs: ICE Server address for 3-74-151-217.t-7d7a0979.kinesisvideo.eu-central-1.amazonaws.com: 3.74.151.217 which indicates successful parsing of IP from the ICE server URL.

What is the IP address that gets logged if you use getaddrinfo? Is it different?

skndi commented 2 months ago

@disa6302 No, the address found by getaddrinfo is the same. The parsing of the address works fine, but for the later TURN state machine connections to work a call to getaddrinfo must be made, even if it's result is unused. As a test I used the following code:

STATUS getIpWithHostName(PCHAR hostname, PKvsIpAddress destIp)
{
    STATUS retStatus = STATUS_SUCCESS;
    INT32 errCode;
    UINT16 hostnameLen, addrLen;
    PCHAR errStr;
    struct addrinfo *res, *rp;
    BOOL resolved = FALSE;
    struct sockaddr_in* ipv4Addr;
    struct sockaddr_in6* ipv6Addr;
    struct in_addr inaddr;

    CHAR addr[KVS_IP_ADDRESS_STRING_BUFFER_LEN + 1] = {'\0'};

    CHK(hostname != NULL, STATUS_NULL_ARG);
    DLOGI("ICE SERVER Hostname received: %s", hostname);

    hostnameLen = STRLEN(hostname);
    addrLen = SIZEOF(addr);

    // Adding this check in case we directly get an IP address. With the current usage pattern,
    // there is no way this function would receive an address directly, but having this check
    // in place anyways
    if (isIpAddr(hostname, hostnameLen)) {
        MEMCPY(addr, hostname, hostnameLen);
    } else {
        retStatus = getIpAddrFromDnsHostname(hostname, addr, hostnameLen, addrLen);
    }

    // Verify the generated address has the format x.x.x.x
    if (!isIpAddr(addr, hostnameLen) || retStatus != STATUS_SUCCESS) {
        DLOGW("Parsing for address failed for %s, fallback to getaddrinfo", hostname);
        errCode = getaddrinfo(hostname, NULL, NULL, &res);
        if (errCode != 0) {
            errStr = errCode == EAI_SYSTEM ? (strerror(errno)) : ((PCHAR) gai_strerror(errCode));
            CHK_ERR(FALSE, STATUS_RESOLVE_HOSTNAME_FAILED, "getaddrinfo() with errno %s", errStr);
        }
        for (rp = res; rp != NULL && !resolved; rp = rp->ai_next) {
            if (rp->ai_family == AF_INET) {
                ipv4Addr = (struct sockaddr_in*) rp->ai_addr;
                destIp->family = KVS_IP_FAMILY_TYPE_IPV4;
                MEMCPY(destIp->address, &ipv4Addr->sin_addr, IPV4_ADDRESS_LENGTH);
                resolved = TRUE;
            } else if (rp->ai_family == AF_INET6) {
                ipv6Addr = (struct sockaddr_in6*) rp->ai_addr;
                destIp->family = KVS_IP_FAMILY_TYPE_IPV6;
                MEMCPY(destIp->address, &ipv6Addr->sin6_addr, IPV6_ADDRESS_LENGTH);
                resolved = TRUE;
            }
        }
        freeaddrinfo(res);
        CHK_ERR(resolved, STATUS_HOSTNAME_NOT_FOUND, "Could not find network address of %s", hostname);
    }

    else {

        /////////////////////////////////////////////////////////////
        errCode =  getaddrinfo(hostname, NULL, NULL, &res);
        freeaddrinfo(res);
        ////////////////////////////////////////////////////////////

        inet_pton(AF_INET, addr, &inaddr);
        destIp->family = KVS_IP_FAMILY_TYPE_IPV4;
        MEMCPY(destIp->address, &inaddr, IPV4_ADDRESS_LENGTH);
    }

CleanUp:
    CHK_LOG_ERR(retStatus);
    return retStatus;
}

The result of the getaddrinfo call in the else case is not used in any way, but with this code we have a successful stream establishment. So my guess is that getaddrinfo has some side effect, which affects later connections, but I wasn't able to find anything obvious to me in the source code of the function. It's possible there is some quirk in our system also. For our use case it's good enough to go back to just using getaddrinfo, so that's what we decided to do. I'm posting the info in case you want to further research the issue, and since it would be very interesting to me if someone discovered the cause for this.

m-j-ahmadi commented 1 month ago

Hi, We have encountered the same problem after updating the SDK from version 1.7.3 to 1.10.2. However, when we downgrade the SDK to 1.9.1, the problem is resolved. Any advice or further updates on this would be greatly appreciated.

disa6302 commented 1 month ago

@m-j-ahmadi , If 1.9.1 works, perhaps it is not the same issue since the commit @skndi is pointing to has been in the SDK since 1.8.0. Have you tried to increase the value here in 1.10.2 and see what happens? Also, is your system windows too?

@skndi , On that note, curious if you tested 1.9.1 as well and saw the same issue.

samanazadpour commented 1 month ago

Hi @disa6302 I am a colleague of @m-j-ahmadi we were thinking that 1.9.1 works but after testing few times, I could see that sometimes it works and most of the time it does not work. It seems like 1.9.1 does not work as well

hassanctech commented 1 month ago

@skndi Can you please note which hardware platform you're on? You mentioned embedded linux arm64, but a few more details would help are you running linux on an SoC / any further details you can provide?

samanazadpour commented 1 month ago

@hassanctech we are also experiencing the same issue, not possible to migrate from 1.7.3 to 1.10.2 we had to downgrade and stay with 1.7.3 until we find the solution. we tested on Jetson nano previous version, Jetson Orin new version, and X64 Microsoft Azure VM with Ubuntu 22

skndi commented 1 month ago

Hi @hassanctech, we are running on an SoC with an Ambarella CV2 chip, the Linux distro is also provided by them.

skndi commented 1 month ago

As a bit of extra info here is the user agent string user-agent -> AWS-WEBRTC-KVS-AGENT/1.10.2/1.5.2 GCC/10.2.1 Linux/5.4.215 aarch64