aws / aws-sdk-js-v3

Modularized AWS SDK for JavaScript.
Apache License 2.0

AWS Rekognition timeout error happening randomly #4292

Closed. IgorSamer closed this issue 4 months ago.

IgorSamer commented 1 year ago

Describe the bug

I have an application in Electron that does facial recognition of people to then decide whether or not they can enter the place and for that I'm using Amazon Rekognition.

Everything was working fine (for a few months) until, three days ago, a customer reported to me that the app was behaving strangely, like it wasn't responding to requests for facial recognition.

After several tests, I discovered that what is happening with it is a timeout error, which occurs in all API calls, whether they are looking for faces (SearchFacesByImage) or registering new faces (IndexFaces).

What intrigued me was the fact that everything was working fine, until this behavior just started happening (and I didn't make any code changes/updates to the app running on my client's computer).

And what makes me even more intrigued is that this behavior occurs completely randomly and only on the machine of that client in question. Sometimes the API calls work correctly (returning whether the person was recognized or not), but most of the time, the calls take about 90 seconds to return the timeout error. When executing the same code on my machine (same methods and same CollectionId) everything runs normally and there was no timeout error at any time - while at the exact same moment on my client's machine the behavior continues.

I was using aws-sdk and then switched to @aws-sdk/client-rekognition (thinking that could solve the problem) but the code only worked on a few of the first calls to the API and a few minutes later it got the timeout errors again.

Note that during all tests on my client's computer the internet connection was stable and working properly.

What is the best way to investigate and resolve this issue?

SDK version number

@aws-sdk/client-rekognition@^3.229.0

Which JavaScript Runtime is this issue in?

Node.js

Details of the browser/Node.js/ReactNative version

18.12.0

Reproduction Steps

The code I'm using to configure and make calls to Rekognition is basically this:

const { RekognitionClient, IndexFacesCommand, SearchFacesByImageCommand } = require('@aws-sdk/client-rekognition')

const rekognitionClient = new RekognitionClient({
    credentials: {
        accessKeyId: 'accessKeyId',
        secretAccessKey: 'secretAccessKey'
    },
    region: 'us-east-1'
})

const registerFaceOnRekognition = async (bytes, userId) => {
    const params = {
        CollectionId: 'collectionId',
        Image: { Bytes: bytes },
        ExternalImageId: userId,
        MaxFaces: 1,
        QualityFilter: 'HIGH'
    }

    const command = new IndexFacesCommand(params)

    try {
        const { FaceRecords } = await rekognitionClient.send(command)

        if (!FaceRecords.length) {
            console.log('No faces detected.')

            return
        }

        console.log('Face created:')
        console.log(FaceRecords[0].Face.FaceId)
    } catch (error) {
        console.error(error) // timeout error
    }
}

const searchFaceByImageOnRekognition = async (bytes) => {
    const params = {
        CollectionId: 'collectionId',
        Image: { Bytes: bytes },
        MaxFaces: 1,
        FaceMatchThreshold: 99,
        QualityFilter: 'HIGH'
    }

    const command = new SearchFacesByImageCommand(params)

    try {
        const { FaceMatches } = await rekognitionClient.send(command)

        if (!FaceMatches.length) {
            console.log('This face has not been registered yet')

            return
        }

        console.log('Face found:')
        console.log(FaceMatches[0].Face.ExternalImageId)
    } catch (error) {
        console.error(error) // timeout error
    }
}

// Method called through the renderer process that has a canvas where the webcam view is reproduced
const onTakePicture = (event, data) => {
    const bytes = Buffer.from(data.dataURL.replace('data:image/jpeg;base64,', ''), 'base64')

    // If there is a userId, register the face in the image
    if (data.userId) {
        registerFaceOnRekognition(bytes, data.userId)

        return
    }

    // Else, search for the face in the image
    searchFaceByImageOnRekognition(bytes)
}

Observed Behavior

{
    "message": "connect ETIMEDOUT 3.226.60.54:443",
    "errno": -4039,
    "code": "TimeoutError",
    "syscall": "connect",
    "address": "3.226.60.54",
    "port": 443,
    "time": "2022-12-14T13:50:10.909Z",
    "region": "us-east-1",
    "hostname": "rekognition.us-east-1.amazonaws.com",
    "retryable": true
}

Expected Behavior

No timeout errors.

Possible Solution

Could this behavior be related to some blocking of my client's IP by AWS after several requests made by him or something like that?

Additional Information/Context

No response

yenfryherrerafeliz commented 1 year ago

{ "message": "connect ETIMEDOUT 3.226.60.54:443", "errno": -4039, "code": "TimeoutError", "syscall": "connect", "address": "3.226.60.54", "port": 443, "time": "2022-12-14T13:50:10.909Z", "region": "us-east-1", "hostname": "rekognition.us-east-1.amazonaws.com", "retryable": true }

Hi @IgorSamer, thanks for opening this issue. Are you getting the data above by logging the response from a request? Have you confirmed that there are no connectivity issues on that machine? Is there any stack trace from the machine where the error is happening? Could you also enable debug logs and share them so we can investigate further?

Note: Please make sure you redact any sensitive information.

Thanks!
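
For readers wondering how to capture the debug logs requested above: the v3 clients accept a `logger` option in their configuration, and passing `console` is the simplest choice. The sketch below writes the same messages to a file instead, which is easier to collect from a customer's machine. The helper name `createFileLogger` and the file name `debug.log` are assumptions for illustration, not part of the SDK, and logged request parameters should still be redacted before sharing.

```javascript
const fs = require('node:fs')
const util = require('node:util')

// Hypothetical helper (not part of the SDK): returns an object exposing the
// trace/debug/info/warn/error methods that a v3 client calls on its `logger` option.
const createFileLogger = (filePath) => {
    const write = (level) => (...args) => {
        // util.inspect copes with nested objects and circular references
        const text = args
            .map((arg) => (typeof arg === 'string' ? arg : util.inspect(arg, { depth: 4 })))
            .join(' ')

        fs.appendFileSync(filePath, `${new Date().toISOString()} [${level}] ${text}\n`)
    }

    return {
        trace: write('TRACE'),
        debug: write('DEBUG'),
        info: write('INFO'),
        warn: write('WARN'),
        error: write('ERROR')
    }
}

// Wiring, assuming @aws-sdk/client-rekognition is installed:
// const { RekognitionClient } = require('@aws-sdk/client-rekognition')
// const rekognitionClient = new RekognitionClient({
//     region: 'us-east-1',
//     logger: createFileLogger('debug.log')
// })
```

Synchronous appends keep the sketch short and make sure a line is on disk before the next call; for a busy application an asynchronous write stream would be the gentler choice.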

IgorSamer commented 1 year ago

Hi @yenfryherrerafeliz! Yes, this is the return I got directly from my client's machine through the console.error(error) of the catch block (as shown in the code).

I can confirm that connectivity is OK on the machine, and my Sentry.io dashboard has just reported that the error was triggered again (as it was at various other times during the day):

Error: connect ETIMEDOUT 3.223.19.56:443
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1161:16) {
  errno: -4039,
  code: 'ETIMEDOUT',
  syscall: 'connect',
  address: '3.223.19.56',
  port: 443,
  name: 'TimeoutError',
  '$metadata': { attempts: 4, totalRetryDelay: 222 }
}

is there any stacktrace error from the machine where the error is happening?

No, the app just makes the calls normally and gets the responses about 90 seconds later with the timeout error.

could you please also enable debug logs and provide them to better investigate this

I'll enable debug logs and get back to you with more information as soon as possible.
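
A side note on the roughly 90-second wait described above: it does not fix the underlying problem, but the app can stop waiting much sooner. The sketch below is one possible approach, assuming a hypothetical helper named `withDeadline` and an arbitrary 10-second budget. It relies on the `abortSignal` option that `client.send()` accepts as part of its second argument.

```javascript
// Hypothetical helper: runs `task(signal)` and gives up after `ms` milliseconds,
// aborting the signal so the underlying request can be cancelled.
const withDeadline = async (task, ms) => {
    const controller = new AbortController()
    let timer

    const deadline = new Promise((_, reject) => {
        timer = setTimeout(() => {
            controller.abort()
            reject(new Error(`No response within ${ms} ms`))
        }, ms)
    })

    try {
        // Whichever settles first wins: the task or the deadline
        return await Promise.race([task(controller.signal), deadline])
    } finally {
        clearTimeout(timer)
    }
}

// Usage inside searchFaceByImageOnRekognition (10000 ms is an arbitrary budget):
// const { FaceMatches } = await withDeadline(
//     (abortSignal) => rekognitionClient.send(command, { abortSignal }),
//     10000
// )
```

With this in place the catch block receives the deadline error after the chosen budget instead of after all retries have run, so the UI can tell the operator to try again.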

IgorSamer commented 1 year ago

@yenfryherrerafeliz here it is! The debug.log file can be found at: https://gist.github.com/IgorSamer/4e58e09f3fa615401f85ca325b794245

In it, the first three requests (2022-12-16T13:48:45.932Z, 2022-12-16T13:53:20.325Z and 2022-12-16T14:19:12.479Z) complete normally. However, every request after that fails with the timeout error, and no data is returned after the [DEBUG] App: endpoints Resolved endpoint: step.

From yesterday until now, my dashboard has received this error 58 times and, as previously mentioned, the internet connection was working fine. I also managed to reproduce the error over remote access, which means the machine's internet connection was working at the time.

yenfryherrerafeliz commented 1 year ago

Error: connect ETIMEDOUT 3.223.19.56:443 at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1161:16) { errno: -4039, code: 'ETIMEDOUT', syscall: 'connect', address: '3.223.19.56', port: 443, name: 'TimeoutError', '$metadata': { attempts: 4, totalRetryDelay: 222 } }

@IgorSamer I see that the SDK tried 4 times to connect but was not able to reach the service endpoint. The fact that there is no requestId field in the metadata object also tells me that the request is not even leaving the machine. This seems to be a networking issue; maybe a firewall is blocking the outgoing requests, perhaps after a certain number of operations. Something you could try is to ping the host "rekognition.us-east-1.amazonaws.com" while the error is happening, just to see whether that host responds at that time. If it does not, something in the machine's network is causing the issue.

Looking forward to your response.

Thanks!
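
The reasoning in the comment above (a failed `connect` syscall plus no `requestId` in `$metadata`) can be turned into a small check for the catch block. This is only a sketch: `describeSdkError` is a hypothetical name, and the heuristic is the one stated in this thread rather than an official SDK rule.

```javascript
// Hypothetical helper: applies the thread's heuristic to an error thrown by client.send()
const describeSdkError = (error) => {
    const metadata = error.$metadata || {}

    // A requestId is only present once the service has answered
    const reachedService = Boolean(metadata.requestId)

    // 'connect' means the TCP connection itself could not be established
    const failedOnConnect = error.syscall === 'connect'

    return {
        reachedService,
        failedOnConnect,
        attempts: metadata.attempts,
        likelyNetworkIssue: failedOnConnect && !reachedService
    }
}

// The error reported in this thread, rebuilt as a plain object
const reportedError = {
    errno: -4039,
    code: 'ETIMEDOUT',
    syscall: 'connect',
    address: '3.223.19.56',
    port: 443,
    name: 'TimeoutError',
    $metadata: { attempts: 4, totalRetryDelay: 222 }
}

console.log(describeSdkError(reportedError))
```

Sending the `likelyNetworkIssue` flag to the error dashboard would separate connection failures from genuine Rekognition errors such as an invalid image.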

IgorSamer commented 1 year ago

@yenfryherrerafeliz first of all, thanks for the responses!

I had also considered this possibility, since it is a very specific case (it only happens on this machine and sometimes it works and sometimes it doesn't), but since I have no experience in networks, I didn't even know where to start.

I first ran ping rekognition.us-east-1.amazonaws.com on my own machine and every reply was "Request timed out", yet I don't experience the problem with the SDK on that machine: I just tested it, and the SDK works correctly while the ping command keeps returning "Request timed out".

How can I successfully run this test on my machine before trying it with my client who has the problem?
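
Since ping timed out even on a machine where the SDK works, ICMP does not seem to be a reliable signal for this endpoint. A check that is closer to what the SDK does is to open a TCP connection to port 443, which is the step that fails with ETIMEDOUT in the reported error. The sketch below uses only Node's built-in `net` module; `probeTcp` and the 5-second timeout are assumptions for illustration.

```javascript
const net = require('node:net')

// Hypothetical helper: resolves with the outcome of a single TCP connection attempt
const probeTcp = (host, port = 443, timeoutMs = 5000) =>
    new Promise((resolve) => {
        const startedAt = Date.now()
        const socket = net.connect({ host, port })

        const finish = (result) => {
            socket.destroy()
            resolve({ host, port, elapsedMs: Date.now() - startedAt, ...result })
        }

        // Fires if the connection has not been established within timeoutMs
        socket.setTimeout(timeoutMs)

        socket.once('connect', () => finish({ ok: true }))
        socket.once('timeout', () => finish({ ok: false, reason: 'timeout' }))
        socket.once('error', (error) => finish({ ok: false, reason: error.code }))
    })

// Usage: run on both machines while the SDK calls are failing on the client's side
// probeTcp('rekognition.us-east-1.amazonaws.com').then(console.log)
```

If the probe reports `ok: true` on one machine and `timeout` on the other at the same moment, the difference lies in the network path rather than in the application code.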

aBurmeseDev commented 4 months ago

Checking in here @IgorSamer - apologies for the long silence. Let us know if you're still experiencing this issue.

IgorSamer commented 4 months ago

After much investigation I contacted the ISP and for a week we worked together to find out what it could be. In the end, it was a problem caused by outdated equipment on the ISP's side and after the update on their part the API calls returned to normal.

github-actions[bot] commented 3 months ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs and link to relevant comments in this thread.