Issues with the refresh endpoint endlessly redirecting after signin

HudsonAkridge commented 2 years ago

We're using v2.1.0 (vanilla, nothing fancy/special). Occasionally, a user will get stuck in this redirect loop. I suspect it's when the JWT expires, and a lazy evaluation happens. Not all the time, but sometimes, it will put the browser into this constant back and forth redirect ping thing that keeps showing the message/image displayed below. If you look at the XHR request stuff, you'll see that it's just endlessly redirecting from the /refresh endpoint, to the cognito /auth and /login endpoints. I can post the XHR's (once I sanitize them), if needed.

This might be due to a refactor done recently for this version that has the JWT refreshed lazily perhaps?

I've tried with the JWT expiration set to 1 hour, 12 hours, 24 hours, and 5 minutes and the same thing. It just changes how fast the cycles go before this issue manifests/gets recreated.

For the end user, there's no solution other than clearing cookies to correct this problem. Once they sign in again, the problem goes away for several refreshes, some unknown number (can't recreate it consistently, it appears to be a timing thing) before it hits again.

I'm going to try reverting back to a previous version of the app (v2.0.19) to see if this has any positive impact.

ottokruse commented 2 years ago

Thanks for reporting this.

What are your values for parameters enableSPAmode and cookieSettings?

To debug, probably best to set loglevel to DEBUG for a bit, if that's acceptable to you (logs all requests, include sensitive data). That will show the responses generated by the Lambda@Edge functions––would be good to see if the checkAuth actually did set the nonce, and whether that nonce was sent by the browser to the refresh endpoint. You can also check that in the network panel.

I can post the XHR's (once I sanitize them), if needed.

That would be great.

HudsonAkridge commented 2 years ago

EnableSPAmode = true CookieSettings = { "idToken": null, "accessToken": null, "refreshToken": null, "nonce": null }

I do have some debug logs that should have been captured. Just for the /refreshauth lambda though. Let me see if I can get those (this has been a problem for the last week, we've been trying to figure out the cause).

Also, it'll take a few for me to get those XHR's, but will do. I'll post here with additional data.

So far, for what it's worth, I rolled back to v2.0.19 and haven't had this issue show up in the last 24 hours. Except the initial time when I rolled it back and changed the JWT/ID token expiration from 24 hours back to 60 minutes. But, since then, no issues reported.

ottokruse commented 2 years ago

Those settings are vanilla and should work. You mentioned "Occasionally" which is interesting -- any clue as to the precise scenarios where this does or does not occur?

HudsonAkridge commented 2 years ago

@ottokruse Re: Scenarios. Unfortunately there's not an easy way to recreate it. What I do know are the following based on pulling a few different levers:

It happens no matter what the JWT/ID expiration is set to in the Cognito client
If the JWT/ID token expiration is set to longer (e.g. 12 hours/720 minutes), it takes much longer to recreate the issue, but it always seems to happen
It appears to happen whenever the JWT/ID Token expires, when the user hasn't hit the site in a while. e.g. they were signed in yesterday morning, the token lasted 12 hours, and now they try and refresh the site the next morning, that seems to trigger it more often than other patterns.
If they're using the site continually, it doesn't seem to happen, it appears to mostly be after letting it sit for a bit past the JWT/ID token expiration.

HudsonAkridge commented 2 years ago

@ottokruse This took a lot longer than I expected to sanitize the .har file 🙂 but finally got it done. I've redacted anything that could identify us, but I don't think anything I've removed would make it more difficult for you to see the issue.

LMK if I'm wrong about that and you need some bit of data that was redacted. If it's not an identifying piece of data for the user or our app, I'd be happy to add it back.

I had to change the type to .txt for it to upload to here so you'll have to change it back to .har 🙂 redirects-issue-redacted.txt

HudsonAkridge commented 2 years ago

The largest issues with the redirect loop are:

A normal user is unlikely to notice this and it'll just keep redirecting in the background forever. Hundreds/thousands of times until they stop the cycle
The Sign-In button is broken and won't actually work to let them sign back in.
Clearing cookies is the only way to solve this right now we've found, which is obviously not ideal.

We've got a handful of power-users that are going to test our app with v2.0.19 over the weekend. I'll let you know if we see the same problem there. But so far, no issues. Only with v2.1.0 (which could make sense, as a lot changed with the refreshauth logic in there it looks like)

ottokruse commented 2 years ago

Thanks that was very helpful.

I can see the redirect loop, going back and forth to /refreshauth, until chrome throws net::ERR_TOO_MANY_REDIRECTS. That then leads to another trip to /refreshauth but this time without cookies. That would then show the error page you pasted the screen shot of above.

The root error is the redirect loop. Interestingly, from the HAR file the refresh seems to work okay:

User is redirected to /refreshauth (apparently because JWT expired)
User is redirected back to /, with new JWTs in the response.

All good, but then checkAuth decides again to send the user to /refreshauth.

Need to look closer into the reason for that. Best next step is to enable info logging, that will show the exact reason why checkAuth thinks the JWT is invalid in the logs: https://github.com/aws-samples/cloudfront-authorization-at-edge/blob/3a79edd78fc4646f7120088bab065e3a90dcbe8b/src/lambda-edge/check-auth/index.ts#L40 Have a look at what the error says exactly?

You didn't add the JWTs in the HAR (makes sense) but it would be good to have a look at them and maybe run them through jwt.io and verify with your own eyes that the JWT's in the cookies in the response from /refreshauth are good valid and not expired.

ottokruse commented 2 years ago

The Sign-In button is broken and won't actually work to let them sign back in.

Can you share more details on this too? Maybe there is a clue here. Just tested again on 2.1.0 and for me this button works fine: it signs you out under the hood and redirects to the Cognito Hosted UI to sign in again. Related: what setting do you have for parameter SignOutUrl (the default value is /signout)

HudsonAkridge commented 2 years ago

The Sign-In button is broken and won't actually work to let them sign back in.

Can you share more details on this too? Maybe there is a clue here. Just tested again on 2.1.0 and for me this button works fine: it signs you out under the hood and redirects to the Cognito Hosted UI to sign in again. Related: what setting do you have for parameter SignOutUrl (the default value is /signout)

The value is set to /signout, as the default. And that's what the Cognito HostedUI has. It shows: https://env.my-site.com/signout as the URL for the sign in button in the cognito UI, and this doesn't work at that point. However, the signout button/URL we have in the app that takes you to the same URL does work if you're not currently dealing with the refresh auth redirect issue.

We are using ReactJS and Amplify here, and I have the amplify config set to use cookies. Which does work, it bypasses the amplify sign in page.

I could turn off amplify from the UI if needed for testing.

After redeploying v2.0.19, we're still seeing the same thing eventually. Just had a lot of users experience this issue over the weekend and report it, it just seems to happen less frequently.

The scenario that caused it was one of our users had their computer update and reboot overnight, and when they opened up the page this morning, it triggered a "sea of redirect problems".

My logging was reverted after the 2.0.19 deploy, so I'll update it to at least the INFO level and see if I can't get you exactly what the errors are saying.

I'll also decode the JWTs from that redacted .har I uploaded and see if the timings for the refresh auth endpoint are not being expired prematurely.

ottokruse commented 2 years ago

My logging was reverted after the 2.0.19 deploy, so I'll update it to at least the INFO level and see if I can't get you exactly what the errors are saying.

Cool.

What cookie settings do you have in your Amplify config? This is a thing that changed in v2.1.0, recommend now to have config like this (to use host only cookies):

cookieStorage: {
      domain: " ", // Use a single space " " for host-only cookies
      expires: null, // null means session cookies
      path: "/",
      secure: true, // for developing on localhost over http: set to false
      sameSite: "lax"
    },

Also might be wise to have lines like this in your web app, to let Amplify do the refresh: https://github.com/aws-samples/cloudfront-authorization-at-edge/blob/3a79edd78fc4646f7120088bab065e3a90dcbe8b/src/cfn-custom-resources/react-app/react-app/src/App.js#L44-L46

bedaka commented 2 years ago

@HudsonAkridge could you finally solve this issue? As it seems like we're experiencing the same problem I would appreciate any hints how to approach this.

ottokruse commented 2 years ago

Hi @HudsonAkridge ! Did you ever get to the bottom of this? Did the logs reveal anything usefull?

ottokruse commented 2 years ago

Closing for now, can re-open if further info surfaces

HudsonAkridge commented 1 year ago

Hi @ottokruse , long time no chat 🙂

We had some other stuff come up and just got around to giving this another go, thinking maybe something updated in the last 6+ or so months to solve the problem.

Did another deploy with the latest (2.1.5) and ran into the same issue (see attached XHR/Network logs)

So, still not sure what's happening/going on here.

What I'm planning to do tomorrow is:

Do a full on vanilla deploy of this stack, without any parameters passed in other than what's strictly necessary
Let that create the Cognito pool and all other resources needed
Set up a few default/basic users
Use the default URIs and everything for doing the auth
Drop in our ReactJS app into the site's S3 folder instead of using the demo app
Try and resolve whatever CORS/config issues come up as a result of that
Test again

The only thing that seems non-vanilla about our setup, is our pre-existing custom configured Cognito pool and how we've got some custom domains set up with Amplify.

So the goal is to remove as many variables from our setup as possible to create a control and see if we can recreate the issue then. It just doesn't make any sense that we're the only ones running into this with (potentially) thousands of consumers of this system.

Thoughts?

ottokruse commented 1 year ago

Hi @HudsonAkridge :) Great to chat again though too bad the issue is still there then.

So the goal is to remove as many variables from our setup as possible to create a control and see if we can recreate the issue then.

Great plan. And do set logLevel to DEBUG then when you deploy the stack.

Also, share the Amplify config? Eg the stuff you put in Amplify.configure(...) for Auth? (redact sensitive parts)

HudsonAkridge commented 1 year ago

Hi @ottokruse Just as an update, I think we figured the issue out that was causing us so much pain and misery :)

I hope to have a write-up here soon to help the next person that comes along with a similar issue.

Right now, I have sort of a question from a little out in left field, which is: I'm thinking of getting rid of the AmplifyJS library used in the ReactJS app, as this repo's functionality gets me an idToken to grab and attach to our auth headers as well as restricts access to the site.

The only issue I'm having now is that once the idToken expires, it takes loading a route in the page/app again to hit the /refreshauth route which will refresh the idToken.

Now, I could do that/negotiate that with Cognito directly here, like the /refreshauth lambda does to get an updated version of the JWT, but that feels dirty.

So I was trying to figure out how to get the cookie updated in the background while the user is making API calls to a different sub-domained endpoint (e.g. if the app is www.example.com all of the API routes are api.example.com, and was wondering if you had any thoughts/ideas there?

So far, I've got:

Try loading some route in the ReactJS app with a call in the background about 30 seconds before the expires is set to fire off
Negotiate the tokens myself

Option 2 is the less attractive route as I would like to re-use the behavior your great library here is set up to use. We just need to force some page in the www.example.com space to load before the token expires, but in the background.

At least, that's my current thought, but very much open to input/suggestions here.

Thank you for all of your work/help with this, and I'm super excited it looks like we might finally get to use this for real (once we solve this problem that is).

HudsonAkridge commented 1 year ago

Here's what I've done for the moment to try and alleviate the issue. Added a check to do a background no-cache refresh of a simple empty JSON file with a cooldown of 15 seconds before I get the idToken value from Cognito. Example code:

import Cookies from "universal-cookie";

const cookies = new Cookies();
const getIdTokenCookieValue = (obj, filter) => {
    let key = [];
    for (key in obj){
        if (Object.prototype.hasOwnProperty.call(obj, key) && filter.test(key)) {
            return key;
      }
    }
    return null;
  };

const backgroundRefreshCooldownInMs = 15 * 1000; //15 seconds
let lastBackgroundRefreshTimestamp = null;

export const getUserToken = async () => {
    if(!lastBackgroundRefreshTimestamp || (Date.now() - lastBackgroundRefreshTimestamp) >= backgroundRefreshCooldownInMs)
    {
        //We want to load a potential background token refresh so the next token we get is updated
        lastBackgroundRefreshTimestamp = Date.now();
        let backgroundLoadResponse = await fetch("backgroundLoad.json", {cache: "no-store"});
    }

    let allCookies = cookies.getAll();
    let cognitoIdTokenCookieName = getIdTokenCookieValue(allCookies, /idToken/);
    let idToken = allCookies[cognitoIdTokenCookieName];

    return idToken;
};

I've got a pretty long expiration set for the idToken, so I'll report back in the morning if this works the way we want. If it does, people are welcome to use my code above for their needs if they're also getting rid of Amplify in their SPA.

My goal here is to get off Amplify completely for our ReactJS app. Amplify was nice as a boostrap in the beginning, but it's not particularly flexible and has some other weird oddities with it. It's simpler to just use this repo deployment and the above code to get the idToken and refresh in the background without needing the AmplifyJS dependency/complexity/UI.

ottokruse commented 1 year ago

Just as an update, I think we figured the issue out that was causing us so much pain and misery :)

🎉

I hope to have a write-up here soon to help the next person that comes along with a similar issue.

Awesome. I'm pretty curious.

Try loading some route in the ReactJS app with a call in the background about 30 seconds before the expires is set to fire off

I like that idea.

If backgroundLoad.json is a 0-byte file, then it may be okay enough to be very dumb and simple about it and just add:

// Do a dummy fetch every minute, to refresh tokens (which is a no-op if they don't need refresh)
setInterval(() => fetch("backgroundLoad.json", {cache: "no-store"}), 1 * 60 * 1000)

UPDATE: this leaves a chance of the tokens becoming expired within the first 1 minute, for returning visits where the user already has cookies that might almost expire (within 1 minute).

Set the cache-control header on backgroundLoad.json on S3 to e.g. public,max-age=31536000,immutable so CloudFront caches that file nice and long. Then the only price you're paying for the call every minute, is Lambda@Edge running––if that's bearable depends on the amount of users on your site.

But your approach of only refreshing if needed, is certainly more sophisticated.

ottokruse commented 1 year ago

One more thought.

If I understand your situation it is roughly this:

You host a website (SPA) using this Auth@Edge solution
Your website's JavaScript calls APIs for which it uses the JWTs from the cookies
The APIs are not fronted by the Auth@Edge CloudFront and to call them you need valid JWTs, therefore you're looking for a method to trigger their refresh

If that's indeed true, then a good way forward would be to change 3 above: front the APIs also with the Auth@Edge CloudFront. Then refreshes will be seamless our of the box, as you've seen Auth@Edge redirects automatically to the refresh endpoint and back to the requested location. So then you don't need to trigger token refresh manually in your SPA. Side benefit is that this also eliminates CORS preflight requests, so better latency.

jimboboliver commented 1 year ago

@HudsonAkridge I would love to hear about how you solved the problem because we've also been experiencing something similar to this for a long time. We have a similar setup to what you described with the pre-existing custom configured Cognito user pool being fed into the CloudFormation deployment.

HudsonAkridge commented 1 year ago

@ottokruse @james1050 I've verified that this works with everything, and I was able to remove Amplify from the configuration and it's all good!

I'll try and briefly type up the approach with a more detailed explanation later when I have more time. For now, this is some o fthe things done to make this work starting from a fresh deploy.

Set the ARN for cognito, add the ClientId of your app pool, and set the Auth URL for Cognito to whatever the auth deployment endpoint is, in this case something like auth.yourdomain.com works for me.
In the HttpHeaders section, REMOVE the Content-Security-Policy header section of the JSON completely. Leave the others in place.
For the cookieSettings, the cookies need to change to become:

  "cookieSettings": {
        "idToken": "Path=/; Secure; Domain=yourdomain.com; SameSite=Lax",
        "accessToken": "Path=/; Secure; Domain=yourdomain.com; SameSite=Lax",
        "refreshToken": "Path=/; Secure; Domain=yourdomain.com; SameSite=Lax",
        "nonce": "Path=/; Secure; HttpOnly; Domain=yourdomain.com; SameSite=Lax"
    }

This should be the setting for all configuration.json files after deploy. You might need to manually update the Lambdas that have this and redeploy (RefreshAuthHandler, ParseAuthHandler, SignOutHandler, HttpHeadersHandler, CheckAuthHandler) if it doesn't show as correct like the above.

Leave defaults for all other options.

Now, in the frontend client code. Remove Amplify. Like, all of it. You shouldn't need it because this repo @ottokruse and co have built pretty much does it all for you.

Create or modify the method used to get the JWT for your application's headers (in my case, I need the JWT to attach to a call to api.yourdomain.com with Authorization: "Bearer <jwt>"). Here's the final code that's worked well for me in testing/trialing so far:

import Cookies from "universal-cookie";
import jwt_decode from 'jwt-decode';

const cookies = new Cookies();
const getIdTokenCookieValue = (obj, filter) => {
    let key = [];
    for (key in obj){
        if (Object.prototype.hasOwnProperty.call(obj, key) && filter.test(key)) {
            return key;
      }
    }
    return null;
  };

const rateLimitInMs = 15 * 1000; //15 seconds
const refreshTokenWindowInTicks = 15; //15 seconds
let lastBackgroundRefreshTimestamp = null;

export const getUserToken = async () => {
    let allCookies = cookies.getAll();

    //Calculate the idToken expiration. If this fails for some reason, we should be reloading/re-signing in to the app completely
    // so an error at this point should halt execution
    let initialIdTokenCookieName = getIdTokenCookieValue(allCookies, /idToken/);
    let initialIdToken = allCookies[initialIdTokenCookieName];
    let decodedToken = jwt_decode(initialIdToken);
    let initialIdTokenExpires = decodedToken.exp;
    let currentTicks = Math.floor(Date.now() / 1000); //Need to convert JSdate to Unix Ticks for comparisons
    let remainingIdTokenExpirationTicks = initialIdTokenExpires - currentTicks;
    let isWithinTokenExpirationWindow = remainingIdTokenExpirationTicks <= refreshTokenWindowInTicks;

    //Rate limit calls to doing a background fetch so we don't hammer this in case someone is impatient and keeps clicking
    let isNotRateLimited = !lastBackgroundRefreshTimestamp || (Date.now() - lastBackgroundRefreshTimestamp) >= rateLimitInMs;

    if(isNotRateLimited && isWithinTokenExpirationWindow)
    {
        lastBackgroundRefreshTimestamp = Date.now();
        let backgroundLoadResponse = await fetch("backgroundLoad.json", {cache: "no-store"});
        console.log("Refreshed page in background.");
    }

    //This may be a different idToken cookie value based on the Cognito Hash, we can't be sure it will match the one we retrieved above
    let currentIdTokenCookieName = getIdTokenCookieValue(allCookies, /idToken/);
    let idToken = allCookies[currentIdTokenCookieName];

    return idToken;
};

I'm not saying that the code above is the most optimized way to do this, or the cleanest, just that it's fairly simple and it works.

Also, add the backgroundLoad.json file to your site, and include a .bundleignore file at the root of your project (if you're using Yarn that is) and just put a single entry in it for backgroundLoad.json. This should prevent it from bundling that JSON file in with other javascript/json stuff so your file size is kept at small as possible. You're only going to be refreshing that when it's close to time for the token to expire, so it's not going to be creating much traffic unless your token is set to expire insanely fast for some reason.

What should happen is that if a token is expired, it should do a fetch of that json file behind the scenes which will load the JSON file, hitting the route, firing off the /refreshauth lambda. Once that's complete, it'll fetch the updated idToken value and use that JWT value until it's set to expire or close to an expiration.

This prevents hammering anything on the server and only executes it once, when needed.

Occasionally an individual request will fail on the background fetch, and you just have to click a link somewhere else in your app again (assuming it's always calling this getUserToken method for every authentication related call), I haven't been able to totally solve for that yet, but I haven't had a lot of time to dig in to looking at it.

So far, so good.

I hope this helps someone else 🙂

ottokruse commented 1 year ago

Cheers and thanks for the explanation, happy it's sorted for you!

@james1050 if this doesn't help you, I can look along with you. Send me a har file of the situation at ottokrus at amazon dot nl

JaysenWankhade commented 5 months ago

I am too facing the same issue, I tried with the latest version of the code as well as version 2.0.7. My configuration is as follows. EnableSpaMode = false, Version = 2.07 as well as 2.0.9. We are hosting multiple static websites & serving it over the cloud front. When we Lauch multiple URL's in quick succession we get this error. Or sometimes even when we launch a standalone URL this happens. It says we can't refresh your sign-in automatically because of a technical problem also, redirected too many times, please try deleting your cookies. Please help!

ottokruse commented 5 months ago

When we Lauch multiple URL's in quick succession we get this error. Or sometimes even when we launch a standalone URL this happens.

Hi mate. Can you be a bit more exact? Can you explain the steps to reproduce?

And what is the technical problem? If you click on it, what does it say? Is it the same message reported at the start of this issue?

JaysenWankhade commented 5 months ago

When we Lauch multiple URL's in quick succession we get this error. Or sometimes even when we launch a standalone URL this happens.

Hi mate. Can you be a bit more exact? Can you explain the steps to reproduce?

And what is the technical problem? If you click on it, what does it say? Is it the same message reported at the start of this issue?

Thanks for quick reply, well here is what is happening.

2 years back we have deployed an instance of this solution in Static Website mode. In S3 we store multiple sites (Simple Static Html's). Each website can be then accessed in below manner.

myexampledomain.com/site1/index.html myexampledomain.com/site2/index.html . . . myexampledomain.com/siteN/index.html

Currently, we have deployed another instance of this solution with latest codebase in another AWS account. Current configuration is like as follows:

Version: 2.1.9 HttpHeaders: { "Content-Security-Policy": /our own applicable csp/ "Strict-Transport-Security": "max-age=31536000; includeSubdomains; preload", "Referrer-Policy": "same-origin", "X-XSS-Protection": "1; mode=block", "X-Frame-Options": "SAMEORIGIN", "X-Content-Type-Options": "nosniff", "Cache-Control": "max-age=300, must-revalidate" } EnableSPAMode: false CookieSettings: { "idToken": null, "accessToken": null, "refreshToken": null, "nonce": null } My App Client has ID token expiration of 5 minutes.

So, after launching the any of the URL mentioned (this happens randomly) after 5 mins of inactivity randomly one of the url fails to get the idToken refreshed it says "myexampledomain.com redirected you to many times, try deleting cookies" & eventually land on the error page saying "Your browser didn't send the nonce cookie along, but it is required for security (prevent CSRF). [log region: ap-southeast-1]" Sometimes this even causes a problem in an active session as well. Not able to understand what's going wrong here.

Also, I have one more question, in past versions of this codebase, Check Auth function used to have this piece of codebase. If the token has (nearly) expired and there is a refreshToken: refresh tokens const { exp } = decodeToken(idToken); if ((Date.now() / 1000) - 60 > exp && refreshToken) { Which is not there in latest code base, so not sure how refreshing token is being handled in latest code. Hoping for your reply :)

ottokruse commented 5 months ago

Can you share a HAR file perhaps so I can see in detail what goes on? Send it to me directly via email.

Which is not there in latest code base, so not sure how refreshing token is being handled in latest code.

JWT refreshing is now lazy, it's only done when the JWT actually expires. (reason was that with a small expiry window for JWTs, that some people configure, you could run into infinite refreshes with the previous code, that pre-emptively refreshed)

Anyways this is proving to be tricky problem and the solution will be to monitor the browser network tab (and record a HAR please) and looking at the CloudWatch logs at the same time, so you can see why they do what they do (set logging to DEBUG for this).

ottokruse commented 5 months ago

Looking closely at the network tab screenshot from @HudsonAkridge I can see that it is actually favicon.ico that triggers the cycle of redirects. That is of course silly (and maybe we should special case favicon.ico to not trigger refreshes to begin with).

If I look at the screenshot, the 1st request (is that in response to a redirect?) to refreshauth (first line in the screenshot) succeeds and comes back with HTTP 200. But, that means it is actually the error page, because if the JWTs would have been refreshed successfully, you'd see a 307 instead (easy to understand from the code here: https://github.com/aws-samples/cloudfront-authorization-at-edge/blob/master/src/lambda-edge/refresh-auth/index.ts).

What confuses me is that upon loading the error page, favicon.ico is apparently fetched, and it triggers the redirect to refreshauth also, which now works because I see a 307 (!)

But even though it works, it's cookies+JWTs are apparently not persisted, or something else is wrong, which is why the subsequent request to favicon.ico triggers the redirect to refreshauth again, which returns 307 (so tokens must again be refreshed) but still doesn't work, and so on until the browser has had enough of it.

This is one way to rack up the number of calls to Cognito! We should do something about it.

Questions I have, which can probably only be answered by going through the HAR file AND looking at the LambdaEdge logs side by side:

Why does the initial request to refreshauth return the error page, meaning the refresh didn't work? (Because all subsequent refreshauth requests succeed –– I see 307s there)
Why does the user navigate to refreshauth directly (1st request in the whole sequence)? I would never expect the first request to a website to go to refreshauth. You would only go to refreshauth because you get redirected there because of JWT expiry. (--> maybe this is the real culprit that we should fix)
Why are the cookies from all subsequent 307s not persisted? It's like cookies set in responses to favicon.ico requests are ignored by the browser? (maybe they have also special cased favicon.ico haha)

ottokruse commented 5 months ago

Users should never go directly to /refreshauth, they go to wherever they want to go, get redirected to /refreshauth where the JWT refresh magic happens, and then get redirected back to where they came from.

Users should never see the /refreshauth URL, if all goes well the redirect to there and back to where they came from is so fast they do not even notice it. But if the error page is displayed, then they would see it.

Maybe we should detect if the user goes to /refreshauth directly (without being redirected) and then redirect them back without trying the refresh. There is no foolproof way for this that I know of, but checking the absence of the Referer header could be pragmatic enough.

Also, I kinda like not triggering refresh for favicon.ico.

But let's not jump the gun and first get to the bottom of why this happens.

JaysenWankhade commented 5 months ago

Looking closely at the network tab screenshot from @HudsonAkridge I can see that it is actually favicon.ico that triggers the cycle of redirects. That is of course silly (and maybe we should special case favicon.ico to not trigger refreshes to begin with).

If I look at the screenshot, the 1st request (is that in response to a redirect?) to refreshauth (first line in the screenshot) succeeds and comes back with HTTP 200. But, that means it is actually the error page, because if the JWTs would have been refreshed successfully, you'd see a 307 instead (easy to understand from the code here: https://github.com/aws-samples/cloudfront-authorization-at-edge/blob/master/src/lambda-edge/refresh-auth/index.ts).

What confuses me is that upon loading the error page, favicon.ico is apparently fetched, and it triggers the redirect to refreshauth also, which now works because I see a 307 (!)

But even though it works, it's cookies+JWTs are apparently not persisted, or something else is wrong, which is why the subsequent request to favicon.ico triggers the redirect to refreshauth again, which returns 307 (so tokens must again be refreshed) but still doesn't work, and so on until the browser has had enough of it.

This is one way to rack up the number of calls to Cognito! We should do something about it.

Questions I have, which can probably only be answered by going through the HAR file AND looking at the LambdaEdge logs side by side:

Why does the initial request to refreshauth return the error page, meaning the refresh didn't work? (Because all subsequent refreshauth requests succeed –– I see 307s there)

Why does the user navigate to refreshauth directly (1st request in the whole sequence)? I would never expect the first request to a website to go to refreshauth. You would only go to refreshauth because you get redirected there because of JWT expiry. (--> maybe this is the real culprit that we should fix)

Why are the cookies from all subsequent 307s not persisted? It's like cookies set in responses to favicon.ico requests are ignored by the browser? (maybe they have also special cased favicon.ico haha)

Hi Mate,

I tried something, I took a pull of code where lazy refresh doesn't happen & just tried to update ChekAuthHandler with that piece of code & things seem to be working now. I suspect the issue has something to with the lazy refresh (although can't say for sure, but the behavior of things does align).

Why does the user navigate to refreshauth directly - I don't think user does that, In my case, after 5mins when the token expires & post that when I visit to the site, the issue occurs.

Hope this helps!

ottokruse commented 5 months ago

Yeah it is likely, the lazy refresh was introduced in v2.1.0, which is the 1st release in which this issue was reported.

But I still want to understand why this happens now with the lazy refresh before making any fixes.

ottokruse commented 5 months ago

Just tried to reproduce it myself (against a fresh and vanilla deployment out of the Serverless Repo) but the refresh works flawless for me:

Will be hard to fix this without a reliable way to reproduce it :|

@JaysenWankhade can you check if in your case also the 1st refresh goes wrong (shows error page) after which favicon.ico triggers the loop?

ottokruse commented 5 months ago

Yeah it is likely, the lazy refresh was introduced in v2.1.0, which is the 1st release in which this issue was reported.

Wait you mention also seeing this in 2.07 and 2.09, if so, it can't be the lazy refresh then

ottokruse commented 3 months ago

FYI tweaked the way refreshing works in #271 which is released as part of v2.3.0

There's a chance that fixes this issue as well.

vincentclee commented 2 months ago

After deploying v2.3.0, users are still seeing this issue occur:

ottokruse commented 2 months ago

TL;DR this is still an open issue that occurs sometimes for some users and is hard to reproduce in vanilla environments.

In order to debug this it would be crucial to have:

a HAR file that shows the flow of requests
the associated logs from the Lambda@Edge functions in DEBUG mode

If we have these side by side, the root cause should reveal itself!

If you can provide this info but are concerned with posting them here publicly feel free to email me directly please.

aws-samples / cloudfront-authorization-at-edge

Issues with the refresh endpoint endlessly redirecting after signin #190