aws-amplify / amplify-js

A declarative JavaScript library for application development using cloud services.
https://docs.amplify.aws/lib/q/platform/js
Apache License 2.0

Clock Drift causing client to fire off infinite number of uncatchable 4xx errors to set up GraphQL #12450

Closed · gracektay closed this issue 1 year ago

gracektay commented 1 year ago


JavaScript Framework

React

Amplify APIs

Authentication, GraphQL API

Amplify Categories

auth, api

Environment information

```
System:
  OS: macOS 13.6
  CPU: (10) arm64 Apple M1 Max
  Memory: 2.52 GB / 32.00 GB
  Shell: 5.9 - /bin/zsh
Binaries:
  Node: 18.14.0 - ~/.nvm/versions/node/v18.14.0/bin/node
  npm: 9.5.0 - ~/.nvm/versions/node/v18.14.0/bin/npm
Browsers:
  Chrome: 118.0.5993.117
  Safari: 16.6
npmPackages:
  @osu/footer: ^2.1.0 => 2.0.0-upgrade.0
  @osu/navbar: 2.0.0 => 1.2.1-beta.0
  @osu/react-ui: ^2.2.0 => 2.1.1-beta.0
  @osu/wm-react-analytics: ^1.1.0 => 1.1.0
  @reduxjs/toolkit: ^1.9.5 => 1.9.5
  @reduxjs/toolkit-query: 1.0.0
  @reduxjs/toolkit-query-react: 1.0.0
  @testing-library/jest-dom: ^5.16.5 => 5.16.5
  @testing-library/react: ^13.4.0 => 13.4.0
  @testing-library/user-event: ^13.5.0 => 13.5.0
  aws-amplify: 5.3.3 => 5.3.10
  axios: ^1.3.4 => 1.3.4 (0.26.0)
  js-search: ^2.0.1 => 2.0.1
  react: ^18.2.0 => 18.2.0
  react-dom: ^18.2.0 => 18.2.0
  react-idle-timer: ^5.5.2 => 5.5.2 (5.7.2)
  react-redux: ^8.0.5 => 8.0.5
  react-router-dom: ^6.8.1 => 6.8.1
  react-scripts: 5.0.1 => 5.0.1
  redux: ^4.2.1 => 4.2.1
  redux-logger: ^3.0.6 => 3.0.6
  redux-thunk: ^2.4.2 => 2.4.2
  sass: ^1.58.1 => 1.68.0
  web-vitals: ^2.1.4 => 2.1.4
npmGlobalPackages:
  @aws-amplify/cli: 11.1.1
  @osu/react-ui: 2.2.0-beta.0
  corepack: 0.15.3
  npm: 9.5.0
```

Describe the bug

We have an Amplify project that allows auth via a user pool and via a public identity pool. We are experiencing a high rate of 4xx errors and are constantly in an error state. While I have seen a number of tickets related to clock drift, I have not seen one call out an infinite number of calls to the backend. While debugging on the very latest version, I set a debug flag on the supposed fixes for this issue, and that code just re-runs over and over without stopping. If you do not close the browser, it never stops firing, leading to 50,000 client calls and up. We've had to add a rate limiter just to try to manage this flood of errors.

We were on the very latest Amplify but downgraded to a version suggested in another issue (5.3.3), which helped slightly but not entirely. The image below shows some of what we have been dealing with; the spike was when we were on the newest Amplify version.

Expected behavior

A client should not be able to fire off the same failing request tens to hundreds of thousands of times; there should be a controllable number of retries. Once that specific 4xx error is identified, a refreshed, valid signature should be created.
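
As a rough illustration of what we mean by a controllable number of retries, something like the bounded guard below (a hypothetical sketch, not Amplify's actual retry code; isClockSkewError and MAX_RETRIES are placeholders):

```
const MAX_RETRIES = 3;

// Placeholder check: treat the AppSync signature-date BadRequestException as clock skew.
function isClockSkewError(error) {
  return /Signature not yet current|Signature expired/.test(error?.message ?? "");
}

async function sendWithBoundedRetry(sendSignedRequest) {
  let lastError;
  for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
    try {
      // sendSignedRequest is expected to re-sign with a refreshed Date header on each call.
      return await sendSignedRequest();
    } catch (error) {
      lastError = error;
      if (!isClockSkewError(error)) throw error;
    }
  }
  // Give up instead of retrying forever.
  throw lastError;
}
```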

Reproduction steps

On an Amplify project with GraphQL and a public identity pool, set your system clock ahead or behind by ten minutes.

Code Snippet

// Put your code below this line.

Log output

```
X-ray:
"cause": {
  "message": "Signature not yet current: 20231027T154000Z is still later than 20231027T140444Z (20231027T135944Z + 5 min.)"
}

Log insights:
Response Headers: {x-amzn-ErrorType=BadRequestException}
```

aws-exports.js

No response

Manual configuration

No response

Additional configuration

No response

Mobile Device

No response

Mobile Operating System

No response

Mobile Browser

No response

Mobile Browser Version

No response

Additional information and screenshots

One week of errors:

Screenshot 2023-10-27 at 10 00 41 AM

Zoom in on latest Amplify:

Screenshot 2023-10-27 at 10 13 02 AM

You can see a near mirror image between the errors and our calls. We also never should have had that many calls; it's just the client firing off an uncontrollable number of 4xx errors.

cwomack commented 1 year ago

Hello @gracektay, and thank you for opening this issue. This appears to be related to the clock drift error reported in #11675. Can you share the error you're experiencing or any details on the 4xx error response you're getting? And is there a specific API being called (either on the GraphQL or Auth side) that can be pinned down as the offender?

gracektay commented 1 year ago

@cwomack Hi! Hope this helps! Here is the error we see on the client side:

{
  "errors" : [ {
    "errorType" : "BadRequestException",
    "message" : "Signature not yet current: 20231027T153914Z is still later than 20231027T153419Z (20231027T152919Z + 5 min.)"
  } ]
}

Here are some pictures:

Screenshot 2023-10-27 at 11 33 54 AM
Screenshot 2023-10-27 at 11 33 46 AM
Screenshot 2023-10-27 at 11 34 11 AM
Screenshot 2023-10-27 at 11 34 02 AM

our call:

import { API } from "aws-amplify"; 
API.graphql({ authMode, query, variables });
gracektay commented 1 year ago

Also, if it's helpful to know, we had originally been unsure of the cause, and my team had a case opened, CASE 14087351881, which was forwarded to the AppSync team.

cwomack commented 1 year ago

@gracektay, thank you for the additional information and pics! We are currently investigating this and will update this issue as soon as we can.

david-mcafee commented 1 year ago

There are several previously closed issues related to this. In particular, this issue was resolved by both a library fix and a fix to the CLI. @gracektay, can you also share what version of the CLI you are using? Thank you!

cwomack commented 1 year ago

@gracektay, have you also enabled CORS and checked the IAM permissions for the public identity pool? We're having trouble reproducing the issue with GraphQL (but have reproduced it using Pinpoint, similar to #9281), and we're trying to see if we can get the same recurring error you see.

If you have a minimal sample app (or even what the frontend code looks like), that would be helpful too!

kvramyasri7 commented 1 year ago

👋 @gracektay, by any chance are you using Analytics and GraphQL together in the client-side app, or any API that uses a pre-signer?

kvramyasri7 commented 1 year ago

I can confirm: I ran into the same issue when hitting getUrl in Storage after modifying the clock. getUrl internally uses the pre-signer to get the URL of the object.
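
For reference, this is roughly the kind of call I mean (a minimal sketch assuming the newer getUrl Storage API; on v5 the equivalent would be Storage.get, and the key is just an example object):

```
// Minimal sketch of a pre-signed URL request that exercises the signer the same way.
import { getUrl } from 'aws-amplify/storage';

async function loadObjectUrl() {
  // getUrl signs the request locally, so a skewed client clock produces an
  // out-of-range signature date just like the GraphQL IAM case above.
  const { url } = await getUrl({ key: 'example-object.png' });
  return url;
}
```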

gracektay commented 1 year ago

I've set up this app, https://d3rue5wy9qeptt.cloudfront.net, using @aws-amplify/cli 12.7.1 and aws-amplify 5.3.11. Setting the time ten minutes forward and then back still triggers it. If it fails once, it can continue failing in the future. If you cannot get it to trigger, I can attach a video, but I confirmed just now that it still happens even with both of those versions.

@cwomack Is that enough for you to inspect further?

gracektay commented 1 year ago

This only impacts a couple of users (it's a handful of people responsible for 50k requests), and for the most part everything is working as expected (except when we switch the time back and forth), so my gut feeling is that it is not related to CORS or our permissions. If it were, I'd assume it would be broken for the hundred thousand other users.

Here are some snippets from our IAM permissions in case it's useful (with certain values removed just in case).

Trust relationship:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "cognito-identity.amazonaws.com"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    "cognito-identity.amazonaws.com:aud": "us-east-2:[[identifier-removed]]"
                },
                "ForAnyValue:StringLike": {
                    "cognito-identity.amazonaws.com:amr": "unauthenticated"
                }
            }
        }
    ]
}
permissions:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "mobileanalytics:PutEvents",
                "cognito-sync:*",
                "cognito-identity:*"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "appsync:GraphQL"
            ],
            "Resource": [
                "arn:aws:appsync:us-east-2:[[accountid]]:apis/[[apiid]]/types/*/fields/*",
                "arn:aws:appsync:us-east-2:[[accountid]]:apis/[[apiid]]"
            ],
            "Effect": "Allow"
        }
    ]
}
gracektay commented 1 year ago

@kvramyasri7 Catching up, there were lots of comments! Within the app, we leverage Hub, Auth, API, and Cache. Hub handles login, and the cache is for a different page in the app. We are not using Analytics. I think this specific part of the app would likely only touch the API and Auth imports. From the logs, it seems to happen when Amplify is attempting to configure GraphQL.

ashwinkumar6 commented 1 year ago

Thanks for sharing more details and the hosted app, @gracektay. It looks like, irrespective of my local time (behind/forward/current), the app always makes a network call every 1 min. If my time is incorrect (behind or forward), it fails with the BadRequestException error mentioned above.

Investigating the same.

gracektay commented 1 year ago

Hi @ashwinkumar6, I should have clarified this: we poll for emergency alerts. We removed that polling in our production environment due to the 4xx error issues. The polling is part of Redux Toolkit and works well for 99 percent of users. It checks whether there is an emergency on campus, so we want to check it regularly. (We plan to move to an EventBridge approach later, but polling was a nice feature we wanted to use; we can't right now due to this issue.) The polling itself is expected; it's the actual failures that are the problem. The polling does make the issue easier to see, so we turned it back on in dev-type environments, although to be clear, these issues happen with or without the polling on.

https://redux-toolkit.js.org/rtk-query/usage/polling

ashwinkumar6 commented 1 year ago

Ahh got it, that makes sense. Thanks for clarifying. So the issue isn't with the repeated calls (that's the polling mentioned above) but with how a clock drift failure event is handled. Let me discuss with the team and get back to you.

gracektay commented 1 year ago

@ashwinkumar6 To an extent that's true: the once-a-minute repeated successful calls are fine. But the clock drift failures also cause repeated calls, an infinite number of them, until the browser is force-closed.

gracektay commented 1 year ago

The polling just offers more opportunities for the infinite failure to occur.

ashwinkumar6 commented 1 year ago

Thank you, @gracektay. If possible, could you please also share the polling code snippet in the dev env that makes the GraphQL API call every 1 min (specifically the GraphQL API call)?

gracektay commented 1 year ago

import { useGetEmergencyAlertsQuery } from "../../services/graphql/api";
import { getEmergencyAlerts } from "../../services/data-sources/enums";
import { Mui, MuiIcons } from "@osu/react-ui";
import { Alerts } from "./Alerts";
import { Links } from "./Links";
import { useStyles } from "../hooks/useStyles";
export const ErrorAlert = () => {
  const { data } = useGetEmergencyAlertsQuery(null, {
    pollingInterval: 60000
  });
  const emergencyAlerts = data?.[getEmergencyAlerts]?.list ?? [];
  const { background } = useStyles()
  if (!emergencyAlerts?.length) {
    return false;
  }
  const links = emergencyAlerts.map(({ links = [] }) => links).flat(1);
  return (
    <Mui.Alert
      severity="error"
      role="none"
      iconMapping={{
        error: <MuiIcons.Notifications fontSize="inherit" />,
      }}
      sx={background}
    >
      <Mui.Typography variant="srOnly" component="h2">
        Buckeye Alert Message
      </Mui.Typography>
      <Alerts alerts={emergencyAlerts} />
      <Links links={links} />
    </Mui.Alert>
  );
};
gracektay commented 1 year ago

It'd be a bit difficult to share the entire code setup, but that's the component, and this is the base API call:


import { API, Auth } from "aws-amplify";
import { getSessionToken } from "../../util/functions";
import { AUTH_MODE } from "../../api-config";
import { isAfter } from 'date-fns'
export async function graphqlBaseQuery(args) {
  const { query, variables } = args;
  let authMode
  const sessionToken = await getSessionToken().catch(() => null) // swallow rejections so a missing user session falls through to IAM auth
  const { identityId, expiration } = await Auth.currentCredentials().catch(() => null) ?? {}
  const expirationDate = new Date(expiration)
  const nowDate = new Date()
  const isExpired = isAfter(nowDate, expirationDate)
  if(sessionToken) {
    authMode = AUTH_MODE.AMAZON_COGNITO_USER_POOLS
  } else if(identityId && !isExpired) {
    authMode = AUTH_MODE.AWS_IAM
  }
  let response = {}  
  try {
    if(!authMode) {
      return {
        error: {
          status: 500,
          data: 'Unable to authenticate credentials'
        }
      }
    }
    const result = await API.graphql({ authMode, query, variables });
    if (result?.data) {
      response = result;
    } else {
      throw new Error("Unexpected data format");
    }
  } catch (error) {
    response = { error: { status: 500, data: (error?.errors ?? error) } };
    if(error?.data) {
      response.data = error.data
    }
    return {
      error
    }
  }
  return response;
}

The isExpired check was an attempt to fix the issue, but it made no difference.

Reposting these docs in case it’s helpful to understand the polling: https://redux-toolkit.js.org/rtk-query/usage/polling
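
For completeness, here is a rough sketch of how that baseQuery gets wired into RTK Query with the generated polling hook (the endpoint shape, import paths, and query document are illustrative, not our exact code):

```
// Illustrative wiring only; our real api slice has more endpoints and config.
import { createApi } from "@reduxjs/toolkit/query/react";
import { graphqlBaseQuery } from "./baseQuery";

// Placeholder query document; the real one comes from our GraphQL schema.
const GET_EMERGENCY_ALERTS = /* GraphQL */ `
  query GetEmergencyAlerts {
    getEmergencyAlerts {
      list {
        title
        links
      }
    }
  }
`;

export const api = createApi({
  reducerPath: "api",
  baseQuery: graphqlBaseQuery,
  endpoints: (builder) => ({
    getEmergencyAlerts: builder.query({
      // The returned object is passed straight to graphqlBaseQuery above.
      query: () => ({ query: GET_EMERGENCY_ALERTS, variables: {} }),
    }),
  }),
});

// RTK Query generates this hook; the component above calls it with pollingInterval: 60000.
export const { useGetEmergencyAlertsQuery } = api;
```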

AllanZhengYP commented 1 year ago

Hi @gracektay

Thank you for sharing the detailed information; it really helped the investigation! I just checked the site you provided earlier, https://d3rue5wy9qeptt.cloudfront.net, and I can reproduce the issue with the following steps:

  1. Open the page and see a successful GraphQL call every 1 min.
  2. Set the local time to be off by 1 hour.
  3. The request fails with a BadRequest error, and the immediate retry succeeds with a corrected Date header. (The clock skew correction logic works as expected here.)
  4. After a couple of successful calls, once the local time is synced again, the errors start to show. Hundreds of calls are made almost without interval, and they all fail with the same BadRequest error.

I'll keep you posted as the investigation continues.

AllanZhengYP commented 1 year ago

Hi @gracektay

I just confirmed this is indeed a bug in our clock drift correction function. Here's where it goes wrong:

  1. When the server clock is 19:00:00 but the client clock is 18:00:00, a request is initiated and fails with a BadRequest error.
  2. The error triggers the drift correction, and the client offsets its clock by 1 hour. The retried request has a Date header of 19:00:00 (client clock + offset).
  3. When the client clock is corrected back to 19:00:00, the client offset is still 1 hour, so the subsequent request has a Date header of 20:00:00.
  4. That request fails with a BadRequest error and triggers the drift correction. The function compares the request date of 20:00:00 with the server clock of 19:00:00, sets the offset to -1 hour, and initiates a retry.
  5. The retried request has a local date of 19:00:00 and an offset of -1 hour, so its Date header is 18:00:00, and it fails with BadRequest. The drift correction then sets the offset back to 1 hour and triggers another retry, returning to the error in step 3.
  6. As a result, the request retries infinitely because the request Date flips back and forth between 18:00:00 and 20:00:00 while the server clock is 19:00:00.

We will implement a fix for this issue and let you know when it's ready.
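
To make the failure mode concrete, here is a simplified model of the correction behavior described above (illustrative only, not the actual library source; the fixed variant is just one way the oscillation could be avoided):

```
// Simplified model of the oscillation; not the actual Amplify implementation.
let clockOffsetMs = 0;

// The Date header sent with a request: local clock plus the stored offset.
function signedRequestDate(localNowMs) {
  return localNowMs + clockOffsetMs;
}

// Buggy correction: derives the new offset from the already-offset request
// date, so once the local clock is corrected the stale offset flips sign on
// every retry (+1h -> -1h -> +1h ...) and the retries never converge.
function correctDriftBuggy(localNowMs, serverNowMs) {
  clockOffsetMs = serverNowMs - signedRequestDate(localNowMs);
}

// One way to avoid the oscillation: derive the offset from the raw local
// clock, so a correct local clock yields an offset of ~0 and the retry succeeds.
function correctDriftFixed(localNowMs, serverNowMs) {
  clockOffsetMs = serverNowMs - localNowMs;
}
```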

gracektay commented 1 year ago

Thanks so much, @AllanZhengYP! I am happy to upgrade this CloudFront URL with any package versions you'd like to test. Please let me know if there is anything on my side that would be helpful.

AllanZhengYP commented 1 year ago

@gracektay Hi, a fix has been released in aws-amplify@5.3.12. Can you give it a try with this version?

gracektay commented 1 year ago

@AllanZhengYP Hi! I've updated https://d3rue5wy9qeptt.cloudfront.net. I have tested in Chrome and Safari; so far, so good. I see it error once, retry, and succeed. I'll let you know if I see more issues! Thanks SO much!

cwomack commented 1 year ago

@gracektay, glad to hear this is resolved with the fix that @AllanZhengYP introduced! We'll close this issue as resolved for now, but feel free to reply if there are further problems tied to this.