AzureAD / microsoft-authentication-library-for-java

Microsoft Authentication Library (MSAL) for Java http://aka.ms/aadv2
MIT License
284 stars 142 forks source link

Correct password grant request gets throttled following an incorrect one #716

Closed akulyakhtin closed 10 months ago

akulyakhtin commented 11 months ago

The scenario is as follows:

1) Client uses MSAL to obtain an access token using ROPC public flow. It passes a correct username and an incorrect password.

2) Because the client is federated, MSAL eventually calls WSTrustRequest.execute(), passing username/password.

3) WSTrustRequest.execute() makes a SOAP request call to our server. It uses HttpHelper to send the request. The server sees an incorrect password and responds with HTTP response 500. (This might be an incorrect action on the server side but the issue concerns MSAL and not the server)

4) MSAL in HttpHelper.processThrottlingInstructions() sees response 500, creates request thumbprint and stores the thumbptint in ThrottlingCache and indicates that a request with the same thumbprint can only be performed after ThrottlingCache.DEFAULT_THROTTLING_TIME which is 120 sec.

5) Immediately after that the client issues another ROPC request, this time with the correct password.

6) This new ROPC request also results in the call to WSTrustRequest.execute() and MSAL then wants to make a call to the server using HttpHelper.

7) However, HttpServer.checkForThrottling() sees that the thumbprint of the new request is the same as that of the previous request made by WSTrustRequest.execute() (because urls are the same) and throws MsalThrottlingException and the correct request does not get sent.

We think it's a bug because in this scenario an earlier incorrect request prevents a successful execution of a later correct request .

bgavrilMS commented 11 months ago

Great analysis of the issue.

Indeed MSAL Java will throttle similar requests if the server responds with HTTP 5xx instead of the expected HTTP 4xx which denotes an invalid request.

We will not fix this. ROPC should not be used except in test scenarios, in which case you store the password in a vault and it's never wrong.

manuelschwarze commented 11 months ago

Could it be considered to workaround the described issue by allowing the application that uses MSAL to configure the default threshold value to be used when a server returns a 500er response? This is making the MSAL library unusable for the moment, unless we have our own fork and personal version, which I would really not like to do. I am not really sure what the library logic is trying to solve by having a fixed default for this situation. It cannot really be security driven, because if I am acting as a hacker, I just download the library, set the default threshold to 1 ms and are happy with hacking the system. What else is the reasoning behind it? BTW, I disagree with what you are saying about ROPC, @bgavrilMS. The testing scenario is only one of the scenarios, there are other scenarios in real life that are requiring ROPC until everything of the application landscape of a company has been fully migrated to use Azure AD OAuth, and this will take years and will not happen over night.

bgavrilMS commented 11 months ago

The threshold there is trying to stop clients who set aggressive retry policies from making calls to the server when the server replies with "I am busy" or "I am having problems", i.e. 5xx error codes. We can decrease that threshold, maybe to several seconds.

Can you elaborate on some scenarios for using ROPC ? We are putting a lot of effort into interactive based authentication, such as ensuring device ID gets injected into the session, conditional access policies etc. In fact dropping support for ROPC entirely is a frequent topic.

Ideally you'd fix your ADFS server to reply with 4xx in this case. Or create a PR with a fix.

manuelschwarze commented 11 months ago

I totally agree: Fixing the ADFS server is first priority, but it is out of our control when and if it will happen. I would be in favor of your suggested solution to reduce the default to a few seconds instead of 2 minutes. Making it somehow configurable is still the most flexible option in my opinion. You asked about ROPC scenarios: Imagine a larger enterprise with systems that are able to perform work for a user in many different scenarios (incl. unattended processing). Such a system - let's call it System A - takes in the "old world" the user credentials, saves them secured and uses them on behalf of the user, e.g. when the user scheduled some automated process on such a system. System A is calling into other systems, web services, databases using sometimes security tokens derived from the user's credentials, but sometimes also the plain credentials of the user. Let's call them System B, System C and System D. Now, imagine we want to change the landscape to Azure AD OAuth. For a large organization this cannot happen as a big bang approach, but needs to happen step by step. Just think of it not as 4 systems, but hundreds. At the very end, every system (in theory) could integrate Azure AD OAuth without using any username and password anymore. But on the way to this goal, it will still be necessary to collect username and password by System A as long as there is one system it has a dependency on that requires it. Some systems are vendor applications, where the enterprise does not have full control about the road-map - again a reason for delays on the way to be perfect. For this purpose ROPC is the (only reasonable and working) temporary solution in my opinion. So, while I am completely understanding Microsofts efforts to work towards ROPC alternatives, it is important that people who are working on cutting edge do not drift into a dream world, but always also consider real-life scenarios and challenges and migration paths.

bgavrilMS commented 11 months ago

Thanks for explaining the scenario @manuelschwarze

Adding public API for this might more flexible, but it also adds maintenance costs and developer fatigue. I mean you know what this does because you spent a long time looking at MSAL Java code, but a new developer looking into all the MSAL config options would quickly become overwhelmed.

Thinking about this some, I may have a better solution to your problem. All MSAL libraries, including MSAL Java, have extensibility for their HttpClient. This should allow you to monitor traffic and to change the HTTP status from 500 to 400 at the app / MSAL level. No changes in MSAL4J needed, this extensibility point exists today.

See https://learn.microsoft.com/en-us/entra/msal/java/advanced/configure-http-client

manuelschwarze commented 11 months ago

That's a valid alternative, but it would be a hack. :-) But as a workaround it would work: Catching the 500er response, switching it to a 401 response and setting the throttle default value that we define ourselves. I would still prefer the option that we change the default value in the library from 120 seconds down to a few seconds. How about 5 seconds?

In one of the regular scenarios I have to deal with I have a process that is shared by many (unattended) users - this process has secure access to their username and passwords. The request hash that the throttle detection system is using does not take into account the user id of the caller that failed authentication, but only authority and client id. This causes the issue, that if just 1 out of 10 users in that process updated his password on Windows, but forgot to update it in the scheduled unattended process, then the login will fail for that user, but afterwards all logins of the other 9 users would also fail with the MsalThrottlingException, which delays all the work by the throttling time that is defined.

Therefore, I would be in sync with your suggestion, @bgavrilMS, and also suggest to decrease the current default to 5 seconds. This is still a good and reasonable default throttling time to prevent unintended brute force attacks, but will not be a blocker for my scenario. Intended brute force attacks cannot be prevented anyway on client side as I explained above. Thanks for considering making that change.

bgavrilMS commented 11 months ago

Sounds good.

If you want to provide PR it will certainly accelerate things.

akulyakhtin commented 11 months ago

Hi @bgavrilMS When trying to submit a PR I'm getting 403 error 'Permission to AzureAD/microsoft-authentication-library-for-java.git denied to akulyakhtin.' Could you, please, refer me to the procedure of submitting a PR?

Instead of PR, I then have created a patch attached to this comment. default-throttling-time-5sec.patch

bgavrilMS commented 11 months ago

@akulyakhtin - I sent you an invite - please accept and try again.

Typically folks fork the repo and propose a PR from their repo..

akulyakhtin commented 11 months ago

Hi @bgavrilMS, I've submitted a PR https://github.com/AzureAD/microsoft-authentication-library-for-java/pull/721 I can see that the integration tests are failing in other PRs as well, so I believe those failures are not related to my changes.

Avery-Dunn commented 11 months ago

Thanks for setting up that PR, I just merged it in. It will be in our next release (likely sometime next week), and I'll update this thread once that's done.

akulyakhtin commented 10 months ago

Hi @Avery-Dunn, could you, possibly, update regarding the schedule for the release of this fix?

Avery-Dunn commented 10 months ago

@akulyakhtin : Sorry for the delays on this, we were blocked by some internal build issues. Your fix was merged in and just released as part of version 1.14.0, thanks again for the PR you made!