camunda-community-hub / camunda-platform-7-keycloak

Camunda Keycloak Identity Provider Plugin
Apache License 2.0
132 stars 74 forks source link

KeycloakPlugin freezes Camunda Platform when user claims task #69

Closed Noordsestern closed 3 years ago

Noordsestern commented 3 years ago

Hi,

Running:

Camunda Platform Run 7.15.0 KeycloackIdentifyProviderPlugin 2.1.0 Keycloak 13

I am not sure anymore that the refresh token thing has something to do with this, so we can probably close #68 as it is only a finding while actually investigating this issue: Everything is fine, until I claim my first user task. The UI does not respond to the claim, but in database, I can verify that I am the assigne. Also in camunda cockpit I am the assignee: image

Also in Task Authorization, I can see that I have ALL rights to the task. image

But I cannot work on the task. From that time on, opening the task list crashes the database access after 30 seconds:

2021-07-14T19:17:46.097265933Z org.springframework.transaction.CannotCreateTransactionException: Could not open JDBC Connection for transaction; nested exception is java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30000ms.
2021-07-14T19:17:46.097270122Z  at org.springframework.jdbc.datasource.DataSourceTransactionManager.doBegin(DataSourceTransactionManager.java:309) ~[spring-jdbc-5.3.4.jar!/:5.3.4]
2021-07-14T19:17:46.097274053Z  at org.springframework.transaction.support.AbstractPlatformTransactionManager.startTransaction(AbstractPlatformTransactionManager.java:400) ~[spring-tx-5.3.4.jar!/:5.3.4]
2021-07-14T19:17:46.097277851Z  at org.springframework.transaction.support.AbstractPlatformTransactionManager.getTransaction(AbstractPlatformTransactionManager.java:373) ~[spring-tx-5.3.4.jar!/:5.3.4]
2021-07-14T19:17:46.097281325Z  at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:137) ~[spring-tx-5.3.4.jar!/:5.3.4]
2021-07-14T19:17:46.097284797Z  at org.camunda.bpm.engine.spring.SpringTransactionInterceptor.execute(SpringTransactionInterceptor.java:70) ~[camunda-engine-spring-7.15.0.jar!/:7.15.0]
2021-07-14T19:17:46.097288112Z  at org.camunda.bpm.engine.impl.interceptor.ProcessApplicationContextInterceptor.execute(ProcessApplicationContextInterceptor.java:70) ~[camunda-engine-7.15.0.jar!/:7.15.0]
2021-07-14T19:17:46.097291475Z  at org.camunda.bpm.engine.impl.interceptor.CommandCounterInterceptor.execute(CommandCounterInterceptor.java:35) ~[camunda-engine-7.15.0.jar!/:7.15.0]
2021-07-14T19:17:46.097294805Z  at org.camunda.bpm.engine.impl.interceptor.LogInterceptor.execute(LogInterceptor.java:33) ~[camunda-engine-7.15.0.jar!/:7.15.0]
2021-07-14T19:17:46.097310237Z  at org.camunda.bpm.engine.impl.IdentityServiceImpl.isReadOnly(IdentityServiceImpl.java:85) ~[camunda-engine-7.15.0.jar!/:7.15.0]
2021-07-14T19:17:46.097314066Z  at org.camunda.bpm.webapp.impl.engine.ProcessEnginesFilter.needsInitialUser(ProcessEnginesFilter.java:247) ~[camunda-webapp-7.15.0-classes.jar:7.15.0]
2021-07-14T19:17:46.097318850Z  at org.camunda.bpm.webapp.impl.engine.ProcessEnginesFilter.serveIndexPage(ProcessEnginesFilter.java:175) ~[camunda-webapp-7.15.0-classes.jar:7.15.0]
2021-07-14T19:17:46.097322545Z  at org.camunda.bpm.webapp.impl.engine.ProcessEnginesFilter.applyFilter(ProcessEnginesFilter.java:127) ~[camunda-webapp-7.15.0-classes.jar:7.15.0]
2021-07-14T19:17:46.097325836Z  at org.camunda.bpm.webapp.impl.filter.AbstractTemplateFilter.doFilter(AbstractTemplateFilter.java:58) ~[camunda-webapp-7.15.0-classes.jar:7.15.0]
2021-07-14T19:17:46.097329121Z  at org.camunda.bpm.spring.boot.starter.webapp.filter.LazyDelegateFilter.doFilter(LazyDelegateFilter.java:60) ~[camunda-bpm-spring-boot-starter-webapp-core-7.15.0.jar:7.15.0]

After a few minutes, camunda platform recovers, but as soon as I open the task list, the db connection crashes again.

If I remove the assigne from the task (either through direct DB access or cockpit), I can again open the complete tasklist. But as soon as I claim a task, weirdness starts all over.

And this only happens, when I have the KeycloakPlugin installed. It works fine without it. This is so confusing :confused:

Regards, Markus

VonDerBeck commented 3 years ago

@Noordsestern: Do you have any TaskListeners attached to the activity "Retoure korrigieren"? Any other cross cutting concepts like e.g. Spring Boot Event Listeners? This really sounds confusing. The Keycloak Plugin has nothing to do with transactions.

How many users do you have in Keycloak? Are you using the LDAP user federation? Anything else? Typically this performs well, even if you have > 10.000 users within Keycloak. In order to make sure that this has nothing to do with bad Keycloak performance: can you temporarily try the 2.2.0-SNAPSHOT version of the Keycloak Identity Provider Plugin and then set the cacheEnabled attribute to true?

VonDerBeck commented 3 years ago

... Sorry, 2.1.1-SNAPSHOT is the newest Snapshot version. https://app.camunda.com/nexus/repository/camunda-bpm-community-extensions-snapshots/org/camunda/bpm/extension/camunda-bpm-identity-keycloak-run/2.1.1-SNAPSHOT/camunda-bpm-identity-keycloak-run-2.1.1-20210713.140837-10.jar

VonDerBeck commented 3 years ago

Another hint: please set the authorizationCheckEnabled configuration attribute of the plugin to false and check whether this is different.

If this property is set to true, then authorization checks are performed when querying for users or groups. Otherwise authorization checks are not performed when querying for users or groups. Default: true. Note: If you have a huge amount of Keycloak users or groups we advise to set this property to false to improve the performance of the user and group query.

Noordsestern commented 3 years ago

Do you have any TaskListeners attached to the activity "Retoure korrigieren"? Any other cross cutting concepts like e.g. Spring Boot Event Listeners?

There are not listeners or anything attached. Only bare Camunda Run with KeycloakPlugin jar :

FROM camunda/camunda-bpm-platform:run-7.15.0

USER camunda
ADD camunda-bpm-identity-keycloak-run-2.1.0.jar /camunda/configuration/userlib/
ADD --chown=camunda:camunda default.yml /camunda/configuration/default.yml

ENTRYPOINT ["/sbin/tini", "--"]
CMD ["./camunda.sh"]

How many users do you have in Keycloak? Are you using the LDAP user federation?

I do not know how many users we have, but i assume it doesnt exceed hundreds. We do use User Federation.

please set the authorizationCheckEnabled

We did try authorizationCheckEnabled last week. But without success.

On the other hand, we were so desperate that we just tried any possible configuration, maybe we missed something. I will try 2.1.1-SNAPSHOT and the configuration this afternoon.

VonDerBeck commented 3 years ago

I do not know how many users we have, but i assume it doesnt exceed hundreds. We do use User Federation.

You should check the synchronisation settings of the user federation as well.

Noordsestern commented 3 years ago

The snapshot seems to have missing dependencies:

│ Caused by: java.lang.ClassNotFoundException: com.github.benmanes.caffeine.cache.Ticker                                                                                                                                                                                                                                    ││     at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471) ~[na:na]                                                                                                                                                                                                                                      ││     at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589) ~[na:na]                                                                                                                                                                                                                                           ││     at org.springframework.boot.loader.LaunchedURLClassLoader.loadClass(LaunchedURLClassLoader.java:151) ~[camunda-bpm-run-core.jar:7.15.0]                                                                                                                                                                               ││     at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522) ~[na:na]    
VonDerBeck commented 3 years ago

I will check that soon.

VonDerBeck commented 3 years ago

The snapshot seems to have missing dependencies:

│ Caused by: java.lang.ClassNotFoundException: com.github.benmanes.caffeine.cache.Ticker                                                                                                                                                                                                                                    ││     at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471) ~[na:na]                                                                                                                                                                                                                                      ││     at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589) ~[na:na]                                                                                                                                                                                                                                           ││     at org.springframework.boot.loader.LaunchedURLClassLoader.loadClass(LaunchedURLClassLoader.java:151) ~[camunda-bpm-run-core.jar:7.15.0]                                                                                                                                                                               ││     at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522) ~[na:na]    

Sorry for that. So we currently can't test it using the supposed new cache feature. The build process started to have problems with the maven shade plugin here on the community hub - which should not affect you and is a different story. So the cache feature is sadly not near to be released and will take some time. It would have been a quick identication that your Keycloak server is not that responsive...

VonDerBeck commented 3 years ago

If you're not tired of testing:

If you still want to check if this is a performance issue by activating a cache feature (it caches results of Keycloak queries for a certain time), attached you'll find the Keycloak Plugin. Build locally and thus free of the Community Hub specifics when deploying artifacts to Nexus. Tested with Camunda Run 7.15.3-ee

camunda-bpm-identity-keycloak-run-2.1.1-SNAPSHOT.jar.zip

Besides this small test, I would a) set the flag "authorizationCheckEnabled" of the Keycloak plugin to false and then b) look at the synchronization frequency between Keycloak & LDAP / Active Directory in the User Federation.

If you don't mind you can send me your default.yaml

Noordsestern commented 3 years ago

Snapshot contain dependencies now, thank you 👍 Unfortunately, the behaviour remains the same 😢

Our default.yml is:

# Externalized Keycloak configuration
keycloak:
  # Keycloak access for the Identity Provider plugin.
  url.plugin: ${ENV_KEYCLOAK_URL_PLUGIN:https://localhost:9001}

  # Keycloak Camunda Identity Client
  client.id: ${ENV_KEYCLOAK_CLIENT_ID:camunda-identity-service}
  client.secret: ${ENV_KEYCLOAK_CLIENT_SECRET:42xx42xx-a17b-c63d-e26f-42xx42xx42xx42xx}

# Camunda Keycloak Identity Provider Plugin
plugin.identity.keycloak:
  keycloakIssuerUrl: ${keycloak.url.plugin}/auth/realms/dpa
  keycloakAdminUrl: ${keycloak.url.plugin}/auth/admin/realms/dpa
  clientId: ${keycloak.client.id}
  clientSecret: ${keycloak.client.secret}
  useUsernameAsCamundaUserId: true
  useGroupPathAsCamundaGroupId: true
  administratorGroupName: DL - Camunda - Admins
  disableSSLCertificateValidation: true
  cacheEnabled: true
  authorizationCheckEnabled: false

camunda.bpm:
  # https://docs.camunda.org/manual/latest/user-guide/security/#authorization
  # https://docs.camunda.org/manual/latest/user-guide/process-engine/authorization-service/
  authorization.enabled: true
VonDerBeck commented 3 years ago

Hmmm.... Just thinking out aloud: It says "2021-07-14T19:17:46.097265933Z org.springframework.transaction.CannotCreateTransactionException: Could not open JDBC Connection for transaction; nested exception is java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30000ms." So there are no database connections available any more. The plugin does nothing special with transactions. All REST queries against Keycloak might be part of an overal Camunda transaction, ok. The plugin might query authorization information which is still part of the database and not Keycloak. By setting authorizationCheckEnabled to false this is disabled.

What database do you use for Camunda? Have you configured the connection pool? What is your min size and max size?

Setting

 logging.level.org.camunda.bpm.extension.keycloak: DEBUG
 logging.level.org.springframework.web.client.RestTemplate: DEBUG

might give you further debug info on the queries against the Identity Provider Plugin and the ones sent to Keycloak. Would be interesting to see what happens once you try to open the task list.

Noordsestern commented 3 years ago

Uffff... the worst possible outcome: when activating DEBUG it suddenly worked. Not at first, but once the first request failed, the caching must have kicked in and everything is fast and works properly.

We run Postgres12 DB with default settings on Camunda side. I will try to override the hikari settings and see if that has an effect. We assume now, that querying users from keycloak takes for some reason a looooong time. Because also listing users in Admin panel takes forever. When opening a task list, the tasks belonging to my user are fetched. If now the user query take forever it might exceed transaction timeout.

VonDerBeck commented 3 years ago

I think you are on the right track.

Typically Keycloak itself is very fast. I would definitely check the user federation settings. If Keycloak itself syncs user data with LDAP / Active Directory on each single request this might be the reason for Keycloak answering so slowly. Of course there might be other reasons for Keycloak being slow.

Noordsestern commented 3 years ago

Alright. Another day of wonders and learning:

It turns out, that the GET {realm]/users endpoint is extremely slow in our Keycloak. But we cannot figure out why. Sync wth LDAP should happen only every 12 hours. I limit the number of users imported from LDAP to 120, which improves performance only a bit.

Now, opening the task list containg at least 1 assigned user takes roughly 1 minute. But then caching seems to work and everything goes fast.

This is the log I get when opening the task list:

2021-07-20 14:11:08.298 DEBUG 10 --- [nio-8080-exec-5] org.camunda.bpm.extension.keycloak       : KEYCLOAK-01050 Keycloak group query results: [UserEntity[id=stah045, revision=0, firstName=Markus, lastName=Stahl, email=markus.stahl@google.de, password=null, salt=null, lockExpirationTime=null, attempts=0], ]  
2021-07-20 14:11:08.874 DEBUG 10 --- [nio-8080-exec-9] org.camunda.bpm.extension.keycloak       : KEYCLOAK-01021 FIND user with query ?max=250                                                                                                                                                                           
2021-07-20 14:11:08.875 DEBUG 10 --- [nio-8080-exec-9] o.c.b.e.k.rest.KeycloakRestTemplate      : HTTP GET https://keycloak.postadress.io/auth/admin/realms/camunda/users?max=250                                                                                                                                     
2021-07-20 14:11:08.876 DEBUG 10 --- [nio-8080-exec-9] o.c.b.e.k.rest.KeycloakRestTemplate      : Accept=[text/plain, application/json, application/*+json, */*]                                                                                                                                                          
2021-07-20 14:11:34.581 DEBUG 10 --- [nio-8080-exec-9] o.c.b.e.k.rest.KeycloakRestTemplate      : Response 200 OK                                                                                                                                                                                                         
2021-07-20 14:11:34.581 DEBUG 10 --- [nio-8080-exec-9] o.c.b.e.k.rest.KeycloakRestTemplate      : Reading to [java.lang.String] as "application/json"     

We do have like 800 groups. Maybe that is an issue?

What still confuses me that on login everything is fast and fine. But only when Admin / List all users is called or Tasklist is opened in the UI, things somehow slow down. When I log in, I can see all my LDAP groups in my Camunda profile. But if everything is there after login, why does Camunda query ALL users / groups when I only want to see my tasks.

VonDerBeck commented 3 years ago

Hi @Noordsestern have you checked the basics of your Keycloak installation? Database, Memory, ... etc.? Sorry for asking, but sometimes we forget the very basic things. 25 seconds for querying ~ 100 users?!? No way that this is normal behaviour. Maybe count down 2 seconds if its really bad. The question I can't answer is why the Camunda Tasklist queries all users upon opening it.

Noordsestern commented 3 years ago

Alright. I can't figure out, why it is so slow. A few findings:

Suspicious

Big group-query

KeycloackPlugin or Camunda trigger a huge group-query when opening a task list, querying basically every group that is available in this log line:

2021-07-21 12:51:16.970 DEBUG 9 --- [nio-8080-exec-8] org.camunda.bpm.extension.keycloak       : KEYCLOAK-01050 Keycloak group query results:

The query is huge, because we have roughly 800 groups.

Thoughts about big query group

I cannot imagine, why all have to be queried, even though there are only 2 tasks available.

Tweaks

In case somebody runs in to a similar problem. those action showed improvements:

[Camunda] Enhance Hakari config 👈

Camunda seems to keep an open db transaction for the duration of its keycloak-queries. When queries are slow the db connections block the complete database. Adding the following to your default.yml or production.yml helps:

spring.datasource:
    # see https://github.com/brettwooldridge/HikariCP#gear-configuration-knobs-baby
    hikari:
      connectionTimeout: 30000
      idleTimeout: 600000
      maxLifetime: 900000
      #Hikari 4.0
      keepaliveTime: 60000

      # see https://github.com/brettwooldridge/HikariCP#gear-configuration-knobs-baby (minimumIdle)
      # and https://github.com/brettwooldridge/HikariCP/wiki/About-Pool-Sizing
      minimumIdle: 10
      maximumPoolSize: 20

[Camunda] Keycloak Plugin>=2.1.1 required 👈

In our experiments we always required the properties. Because caching is only available in version 2.1.1 onwards, you do require the current snapshot in order to set cacheEnabled:

# Camunda Keycloak Identity Provider Plugin
plugin.identity.keycloak:
  cacheEnabled: true
  authorizationCheckEnabled: false

[Keycloak] Limit amount of users in User Federation 👈

I left the company realm and created a realm specific for Camunda. I narrowed down the LDAP Tree which will be synced to only people who might possibly work with Cmaunda. Now I have only 125 users and 440 groups (groups have recurions). That improved query performance.

[Keycloak] More performance for Keycloak DB 🤷

We increased cluster resources for Keycloak Postgres DB to 1 cpu and 2 Gib . But we did not witness any effect.

Noordsestern commented 3 years ago

I will post this also to Camunda and close this issue then. I might have a few requests for the cashing feature, though ;)

VonDerBeck commented 3 years ago

Thanks for your investigations and especially for sharing your findings.

VonDerBeck commented 3 years ago

I'd like to add one thing to your findings: the Keycloak Identity Provider Plugin never queries all users or all groups configured in Keycloak. There is a maxResultSize parameter with a default value of 100 exactly for preventing side effects of such queries. I have seen a Keycloak installation with more than 10.000 users and still the user query returned in acceptable time. I suspect that the cause is to be found in Keycloak itself and its configuration.

However, this is a good motivation to make the new cache feature release-ready. The current SNAPSHOT build process is unfortunately still having problems, but that will hopefully be solved soon.