Closed fgrtl closed 5 months ago
the reason that the result can still be found in the database is a built in retry-strategy.
If either an OptimisticLockException
or StaleObjectStateException
is thrown during processing the RetryStrategy will open a new transaction and try to repeat the operation from start. This is done up to 3 times. If an error occurred in the previous calls it is still logged.
So in your case the second try of the operation will probably have succeeded.
Does this answer the problem?
Thanks for the answer but how is entra handling this with a 500 response? And how does this effect the entra state (watermark)?
Hmm... I am not able to reproduce this behaviour with a 500 response. I have some integrationtests that provoke this behaviour. A concurrent modification of the same resource in 100 requests and with an assertion that each request must return response-code 200. In cases in which I get a 500 response beause the retry-counter has reached its limit and the operation still fails, the operation is not saved in my case...
Could you give me some more details or maybe provide your mock-implementation? This might help to pinpoint your problem.
Sure! I have also tested it with KC18 and received a 500, but as you predicted, the user is not added to the group in KC18.
I checked the first comment again and I am a little bit confused. How are you testing with kc-18? The plugin from https://scim-for-keycloak.de does not have support for Keycloak version before 20.
The kc-18 plugin is still based on the OpenSource version from github that you can download here and this version does not have handling for such errors. So, I am not quite sure where this is heading now. From my point of view the error is handled to a certain degree in the scim-for-keycloak plugin at https://scim-for-keycloak.de. But the error cannot be handled absolutely reliably because I have limited the numbers of retries. It is just less error-prone because it retries it. The ideal way is telling the client not to send a lot of update-requests on a single resource. SCIM is defining the Bulk-Endpoint and allows several operations on patch to handle this.
But how does the kc-24.2.0.0 version into this? Do you have the exactly same problem with this version as with the OpenSource version? I cannot reproduce this with version kc-24-2.0.0
Sorry for the confusion. I have two separate KC instances running for testing, KC18 with the open source plugin and KC24 with the kc-24-2.0.0-enterprise plugin. I get the same 500 responses from both KC instances when testing the code above, but in KC18, the user is not added to the group. In KC24, on the other hand, the user is added to the group, but it still results in a 500 for me which might confuse leading IDPs.
The ideal way is telling the client not to send a lot of update-requests on a single resource. SCIM is defining the Bulk-Endpoint and allows several operations on patch to handle this.
I totally agree, but unfortunately, we might not be able to pressurize Microsoft into working with the SCIM standard.
I see. Thanks for the clarification. The enterprise version should indeed not return a 500 status code. I will look into this!
I extended my integration-test but I am not able to reproduce it. This is my code:
(This test has 1000 users created in the database that are all getting added to a single group in parallel patch-requests) version kc-23-2.0.0
dynamicNodes.add(DynamicTest.dynamicTest("Provoke OptimisticLockException on Groups-endpoint", () -> {
try
{
new ForkJoinPool(20).submit(() -> {
IntStream.range(0, users.size()).parallel().mapToObj(index -> {
final int userBulkId = 2 + index;
final User randomUser = users.get(String.valueOf(userBulkId));
Assertions.assertNotNull(randomUser, String.valueOf(userBulkId));
final String patchRequest;
patchRequest = """
{
"schemas": [
"urn:ietf:params:scim:api:messages:2.0:PatchOp"
],
"Operations": [
{
"op": "add",
"path": "members",
"value": {
"value": "%s",
"type": "%s"
}
}
]
}
""".formatted(randomUser.getId().get(), ResourceTypeNames.USER);
groupMembers.add(Member.builder()
.value(randomUser.getId().get())
.display(randomUser.getUserName().get())
.type(ResourceTypeNames.USER)
.build());
return Triple.of(index, patchRequest, randomUser.getId().get());
}).forEach(triple -> {
final String patchRequest = triple.getMiddle();
final String groupId = groups.get(String.valueOf(groupBulkId)).getId().get();
ServerResponse<Group> response = scimRequestBuilder.patch(Group.class, EndpointPaths.GROUPS, groupId)
.setPatchResource(patchRequest)
.sendRequest();
if (HttpStatus.NO_CONTENT == response.getHttpStatus() || HttpStatus.OK == response.getHttpStatus())
{
indexes.add(triple.getLeft());
}
else
{
ServerResponse<Group> groupResponse = scimRequestBuilder.get(Group.class, EndpointPaths.GROUPS, groupId)
.sendRequest();
Assertions.assertEquals(HttpStatus.OK, groupResponse.getHttpStatus());
Group group = groupResponse.getResource();
final String userMemberId = triple.getRight();
boolean wasNotAdded = group.getMembers()
.stream()
.noneMatch(member -> member.getValue().get().equals(userMemberId));
Assertions.assertTrue(wasNotAdded);
membersNotAddedDueToError.add(userMemberId);
}
});
}).get();
}
catch (Exception ex)
{
throw ex;
}
}));
/*
* this test will verify that the members that were added to the group in the previous were only added if no error was returned in the previous test
*/
dynamicNodes.add(DynamicTest.dynamicTest("Verify members successfully added to group", () -> {
{
ServerResponse<ListResponse<User>> response = scimRequestBuilder.list(User.class, EndpointPaths.USERS)
.filter("not(groups pr)")
.get()
.sendRequest();
Assertions.assertEquals(HttpStatus.OK, response.getHttpStatus());
log.warn("\n{}",
response.getResource()
.getListedResources()
.stream()
.map(user -> user.getUserName().get())
.collect(Collectors.joining("\n")));
}
{
Collections.sort(indexes);
ServerResponse<Group> response = scimRequestBuilder.get(Group.class,
EndpointPaths.GROUPS,
groups.get(String.valueOf(groupBulkId)).getId().get())
.sendRequest();
Assertions.assertEquals(HttpStatus.OK, response.getHttpStatus());
Group retrievedGroup = response.getResource();
Assertions.assertEquals(users.size() - membersNotAddedDueToError.size(), retrievedGroup.getMembers().size());
}
}));
I think this code is a pretty good example and it provokes a lot of OptimisticLockExceptions in the process. Approximately 1% of the requests end up in status code 500 and the test does then verify that the member was not added to the group which is what we expect. I am willing to fix the issue but I can't fix it if I cannot reproduce it...
Hey, I think I have found out the cause. If E-Tag support is enabled in the SCIM settings, the behavior is exactly as you predicted. If it is disabled, wrongfully 500 responses appear.
I created 1000 users parallel in KC24 and added them to the same group.
But I also checked the occurrences:
I received 131–500 responses on the PATCH call without any of the users being added to the group.
I received 177 - 500 responses to the PATCH call, with 169 users still added and 8 not added to the group.
My tests actually never had ETags activated to begin with but this would make the whole thing even more weird. ETags are calculated on the JSON response-body just after the response-body was generated. The ETag is a SHA-1 digest on the minified JSON-string. I have no idea how this should affect the whole thing.
But just to make sure, I tried both variations, with ETag and without and I get the same result in both cases... I still cannot reproduce this...
Which database are you using? Maybe we can find the problem there. I am testing with postgres
I totally get your concerns, and I am confused too. This problem showed up in my Docker setup locally (using KC23 with an h2 DB) and in my KC24 test setup with a Postgres DB. Could there be a problem with calculating the SHA-1 for the concurrent group resource when ETag is enabled?
Hey, I have some interesting news about the behavior of Entra for you. We have collaborated with a customer to run a few tests for Entra Keycloak24 SCIM provisioning, including 5232 users and 8 groups (the number of replaced users and groups does not match these numbers because of old linked replaced ids).
We started two initial and one incremental cycle due to a misconfiguration of a custom attribute mapping. No users or groups have been deleted in between any cycles.
In the first initial cycle, we tracked a few 500 responses when PATCHING group memberships. The good news is that Entra retried those requests and received a 200 response. This can be observed with Group: replacedGroupId11, first initial cycle: overall 993 calls, 35 x 500 responses, 958 x 200 responses, there should be 959 users in the group. The missing PATCH call depends on Entra. We will look into why that particular call is missing (it is likely because they messed up the state).
In the second initial cycle, each PATCH call received a 200-word response. This is because no user or group has been deleted in Keycloak, but the state is still reset. Entra does not check if a user is already assigned to a group and just makes a PATCH request. On the other hand, before provisioning groups and users, they make a GET call and check if the object exists and already matches in the source and target systems.
Here you can find our access_log: censored_access.log
Okay, I have a little bit of trouble following here, because I am not sure anymore if the error still seems to be on my side or if the error was found on Entra side:
Hey, with the help of our customer we could analyze and now explain all deviations.
--> Everything worked 100% on the Plugin-Side. Microsoft Entra has some specific not very intuitive behaviour.
Details / Facts:
1.) When there are a lot of Users and User-Group Membership, they spawn multiple Threads which PATCH the same group in a very tight interval. 2.) This leads to ~4% of Group Patches overwhelming even the 3 Re-Tries Cycle that was described above and return a 500 HTTP to Entra (but correctly NOT persisting the association in KC) 2b.) Thankfully, Entra also has a Retry Mechanism that fixes all those missing associations within the same provisioning cycle.
3.) What concerned us, was the small delta between "Group Memberships recorded by us in Keycloak", and "Group Memberships according to the Entra Admin UI", the latter showing 9 Users more.
Thankfully that delta is explainable now.
5 Users were inactive in Microsoft Entra when the initial provisioning cycle started. They were never provisioned. --> But did still count in the Entra UI.
Other Users were synced to Keycloak, but deleted on Entra in the meantime. --> Due to their "soft-deletion" implementation, these users were just disabled via provisioning in Keycloak (without group associations). Even tough on Entra, they were already fully deleted and invisible to the Enterprise Application UI there.
TLDR: In Microsoft Entra, the number of users that are shown as members of a group which is assigned to an Entra Enterprise Application, can differ from the number of users and group memberships that are actually provisioned to a third party.
Good to know. Thanks for sharing this . And I am glad that my plugin is not doing something terribly strange :-)
Hey, as I have encountered the same problem as mentioned in #90, my company has acquired the SCIM for Keycloak enterprise edition to see if we could fix this issue with the use of KC24.
Therefore, I created a test setup with the kc-24-2.0.0-enterprise plugin installed on KC24 and a small Java application that mocks Entra.
The Java application tries to parallel create multiple users and add them to the same group.
The sequence looks like this:
The result is the following: All threads die due to an optimistic locking exception in Keycloak except a single one. The weird thing is that the thread which gets the 500 internal server error still patched the user into the group in the same request.
Internal server error:
2024-04-04 10:25:27,775 WARN [io.agroal.pool] (executor-thread-22) Datasource 'I have also testet this with KC18 and got the exact same result.
Should the fix already be in the kc-24-2.0.0-enterprise plugin?
Thanks in advance!