FusionAuth / fusionauth-issues

FusionAuth issue submission project
https://fusionauth.io
90 stars 12 forks source link

Fusionauth is failing after requests reach around 8 requests per second #2641

Closed LohithBlaze closed 4 months ago

LohithBlaze commented 5 months ago

Fusionauth scaling issue on a managed instance

Description

We have our own fusionauth instance running, and we have the fusionauth running on a kubernetes pod, we were recently load testing our services and as soon as we reach more than 5 requests per second, fusionauth is failing with the following error. This is consistently reproducible

2024-02-06 06:29:19.752 AM ERROR io.fusionauth.app.primeframework.error.ExceptionExceptionHandler - An unhandled exception was thrown org.apache.ibatis.exceptions.PersistenceException:

Error querying database. Cause: java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 2000ms.

The error may exist in io/fusionauth/api/domain/TenantMapper.xml

The error may involve io.fusionauth.api.domain.TenantMapper.retrieveById

The error occurred while executing a query

Cause: java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 2000ms.

at org.apache.ibatis.exceptions.ExceptionFactory.wrapException(ExceptionFactory.java:30)
at org.apache.ibatis.session.defaults.DefaultSqlSession.selectList(DefaultSqlSession.java:153)
at org.apache.ibatis.session.defaults.DefaultSqlSession.selectList(DefaultSqlSession.java:145)
at org.apache.ibatis.session.defaults.DefaultSqlSession.selectList(DefaultSqlSession.java:140)
at org.apache.ibatis.session.defaults.DefaultSqlSession.selectOne(DefaultSqlSession.java:76)
at jdk.internal.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at org.apache.ibatis.session.SqlSessionManager$SqlSessionInterceptor.invoke(SqlSessionManager.java:357)
at jdk.proxy2/jdk.proxy2.$Proxy51.selectOne(Unknown Source)
at org.apache.ibatis.session.SqlSessionManager.selectOne(SqlSessionManager.java:166)
at org.apache.ibatis.binding.MapperMethod.execute(MapperMethod.java:87)
at org.apache.ibatis.binding.MapperProxy$PlainMethodInvoker.invoke(MapperProxy.java:145)
at org.apache.ibatis.binding.MapperProxy.invoke(MapperProxy.java:86)
at jdk.proxy2/jdk.proxy2.$Proxy76.retrieveById(Unknown Source)
at io.fusionauth.api.service.system.DefaultTenantReaderService.retrieveById(DefaultTenantReaderService.java:83)
at io.fusionauth.app.action.api.BaseTenantAPIAction.resolveTenant(BaseTenantAPIAction.java:73)
at jdk.internal.reflect.GeneratedMethodAccessor29.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at org.primeframework.mvc.util.ReflectionUtils.invoke(ReflectionUtils.java:414)
at org.primeframework.mvc.parameter.DefaultParameterHandler.handle(DefaultParameterHandler.java:114)
at org.primeframework.mvc.parameter.DefaultParameterWorkflow.perform(DefaultParameterWorkflow.java:55)
at org.primeframework.mvc.workflow.SubWorkflowChain.continueWorkflow(SubWorkflowChain.java:50)
at org.primeframework.mvc.parameter.DefaultURIParameterWorkflow.perform(DefaultURIParameterWorkflow.java:92)
at org.primeframework.mvc.workflow.SubWorkflowChain.continueWorkflow(SubWorkflowChain.java:50)
at org.primeframework.mvc.scope.DefaultScopeRetrievalWorkflow.perform(DefaultScopeRetrievalWorkflow.java:50)
at org.primeframework.mvc.workflow.SubWorkflowChain.continueWorkflow(SubWorkflowChain.java:50)
at org.primeframework.mvc.action.DefaultActionMappingWorkflow.perform(DefaultActionMappingWorkflow.java:113)
at org.primeframework.mvc.workflow.SubWorkflowChain.continueWorkflow(SubWorkflowChain.java:50)
at org.primeframework.mvc.security.DefaultSavedRequestWorkflow.perform(DefaultSavedRequestWorkflow.java:65)
at org.primeframework.mvc.workflow.SubWorkflowChain.continueWorkflow(SubWorkflowChain.java:50)
at org.primeframework.mvc.cors.CORSRequestWorkflow.perform(CORSRequestWorkflow.java:65)
at org.primeframework.mvc.workflow.SubWorkflowChain.continueWorkflow(SubWorkflowChain.java:50)
at org.primeframework.mvc.workflow.DefaultMVCWorkflow.perform(DefaultMVCWorkflow.java:108)
at org.primeframework.mvc.PrimeMVCRequestHandler.handle(PrimeMVCRequestHandler.java:72)
at io.fusionauth.http.server.HTTPWorker.run(HTTPWorker.java:50)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)

Not sure what the issue is here, is this something related to configuration where we can increase the connection pool size?

Specifically We are using group apis to create lot of groups and testing the load of group apis.

Observed versions

1.45.2

Steps to reproduce

Steps to reproduce the behavior:

  1. Create groups at 8 requests per second, fusionauth starts failing

Expected behavior

Fusionauth shouldnt fail with the above exception

Platform

(Please complete the following information)

Additional context

Also we observed that if we start from 1rps and go slowly, the fusionauth is able to handle until 10rps, but if we start directly at 5 rps, then these errors start coming up.

LohithBlaze commented 4 months ago

We observed that fusionauth is going into a lock wait timeout state where multiple transactions are getting locked.

Here the results

---TRANSACTION 1768341, ACTIVE 1 sec 3 lock struct(s), heap size 1128, 1 row lock(s), undo log entries 1 MySQL thread id 658760, OS thread handle 140404620998400, query id 16129873 10.1.0.173 fusionauth ---TRANSACTION 1768340, ACTIVE 1 sec 3 lock struct(s), heap size 1128, 1 row lock(s), undo log entries 1 MySQL thread id 658754, OS thread handle 140405435631360, query id 16129872 10.1.0.173 fusionauth ---TRANSACTION 1768339, ACTIVE 1 sec 3 lock struct(s), heap size 1128, 1 row lock(s), undo log entries 1 MySQL thread id 658796, OS thread handle 140404625221376, query id 16129840 10.1.0.173 fusionauth ---TRANSACTION 1768307, ACTIVE 3 sec 3 lock struct(s), heap size 1128, 1 row lock(s), undo log entries 1 MySQL thread id 658718, OS thread handle 140404619941632, query id 16129244 10.1.0.173 fusionauth ---TRANSACTION 1768293, ACTIVE 3 sec 3 lock struct(s), heap size 1128, 1 row lock(s), undo log entries 1 MySQL thread id 658788, OS thread handle 140404767643392, query id 16129218 10.1.0.173 fusionauth ---TRANSACTION 1768271, ACTIVE 3 sec 3 lock struct(s), heap size 1128, 1 row lock(s), undo log entries 1 MySQL thread id 658735, OS thread handle 140404607330048, query id 16129112 10.1.0.173 fusionauth ---TRANSACTION 1768266, ACTIVE 3 sec 3 lock struct(s), heap size 1128, 1 row lock(s), undo log entries 1 MySQL thread id 658649, OS thread handle 140404794029824, query id 16129067 10.1.0.173 fusionauth ---TRANSACTION 1768247, ACTIVE 3 sec 3 lock struct(s), heap size 1128, 1 row lock(s), undo log entries 1 MySQL thread id 658781, OS thread handle 140404765624064, query id 16128922 10.1.0.173 fusionauth ---TRANSACTION 1768245, ACTIVE 3 sec 3 lock struct(s), heap size 1128, 1 row lock(s), undo log entries 1 MySQL thread id 658690, OS thread handle 140404786644736, query id 16128885 10.1.0.173 fusionauth ---TRANSACTION 1768240, ACTIVE 3 sec 3 lock struct(s), heap size 1128, 1 row lock(s), undo log entries 1 MySQL thread id 658608, OS thread handle 140404597831424, query id 16128841 10.1.0.173 fusionauth

mysql> SELECT t.thread_id, event_name, sql_text, rows_affected, rows_examined, processlist_id, processlist_time, processlist_state FROM events_statements_history h INNER JOIN threads t ON h.thread_id = t.thread_id WHERE processlist_id = 658760 ORDER BY event_id\G 1. row thread_id: 681818 event_name: statement/sql/set_option sql_text: SET autocommit=1 rows_affected: 0 rows_examined: 0 processlist_id: 658760 processlist_time: 1 processlist_state: NULL 2. row thread_id: 681818 event_name: statement/sql/set_option sql_text: SET autocommit=0 rows_affected: 0 rows_examined: 0 processlist_id: 658760 processlist_time: 1 processlist_state: NULL 3. row thread_id: 681818 event_name: statement/sql/select sql_text: SELECT t.id AS t_id, t.access_token_signing_keys_id AS t_access_token_signing_keys_id, t.data AS t_data, t.client_credentials_access_token_populate_lambdas_id AS t_client_credentials_access_token_populate_lambdas_id, t.confirm_child_email_templates_id AS t_confirm_child_email_templates_id, t.email_update_email_templates_id AS t_email_update_email_templates_id, t.email_verified_email_templates_id AS t_email_verified_email_templates_id, t.failed_authentication_user_actions_id AS t_failed_authentication_user_actions_id, t.family_request_email_templates_id AS t_family_request_email_templates_id, t.forgot_password_email_templates_id AS t_forgot_password_email_templates_id, t.id_token_signing_keys_id AS t_id_to... rows_affected: 0 rows_examined: 2 processlist_id: 658760 processlist_time: 1 processlist_state: NULL 4. row thread_id: 681818 event_name: statement/sql/set_option sql_text: SET autocommit=1 rows_affected: 0 rows_examined: 0 processlist_id: 658760 processlist_time: 1 processlist_state: NULL 5. row thread_id: 681818 event_name: statement/sql/set_option sql_text: SET autocommit=0 rows_affected: 0 rows_examined: 0 processlist_id: 658760 processlist_time: 1 processlist_state: NULL 6. row thread_id: 681818 event_name: statement/sql/select sql_text: SELECT u.id AS u_id, u.active AS u_active, i.breached_password_last_checked_instant AS i_breached_password_last_checked_instant, i.breached_password_status AS i_breached_password_status, u.birth_date AS u_birth_date, u.clean_speak_id AS u_clean_speak_id, i.connectors_id AS i_connectors_id, u.data AS u_data, i.email AS i_email, i.encryption_scheme AS i_encryption_scheme, u.expiry AS u_expiry, i.factor AS i_factor, u.first_name ... rows_affected: 0 rows_examined: 160 processlist_id: 658760 processlist_time: 1 processlist_state: NULL 7. row thread_id: 681818 event_name: statement/sql/set_option sql_text: SET autocommit=1 rows_affected: 0 rows_examined: 0 processlist_id: 658760 processlist_time: 1 processlist_state: NULL 8. row thread_id: 681818 event_name: statement/sql/set_option sql_text: SET autocommit=0 rows_affected: 0 rows_examined: 0 processlist_id: 658760 processlist_time: 1 processlist_state: NULL 9. row thread_id: 681818 event_name: statement/sql/select sql_text: SELECT @@session.transaction_read_only rows_affected: 0 rows_examined: 1 processlist_id: 658760 processlist_time: 1 processlist_state: NULL 10. row thread_id: 681818 event_name: statement/sql/insert sql_text: INSERT INTO groups (id, name, data, insert_instant, last_update_instant, tenants_id) VALUES (x'460626342d2b44aaa9b7a6126301d446', '61f28e03-3bd0-4942-a1ad-2293828fe47c', '{"anon_contacts":[],"group_name_generated":"Laksh","group_type":"meeting-chat","is_global":false}', 1707285855794, 1707285855794, x'0b8ab788e9584eb896a68a7e7e37746a') rows_affected: 1 rows_examined: 0 processlist_id: 658760 processlist_time: 1 processlist_state: NULL 10 rows in set (0.00 sec)

1. row thread_id: 681818 event_name: statement/sql/set_option sql_text: SET autocommit=1 rows_affected: 0 rows_examined: 0 processlist_id: 658760 processlist_time: 1 processlist_state: NULL 2. row thread_id: 681818 event_name: statement/sql/set_option sql_text: SET autocommit=0 rows_affected: 0 rows_examined: 0 processlist_id: 658760 processlist_time: 1 processlist_state: NULL 3. row thread_id: 681818 event_name: statement/sql/select sql_text: SELECT t.id AS t_id, t.access_token_signing_keys_id AS t_access_token_signing_keys_id, t.data AS t_data, t.client_credentials_access_token_populate_lambdas_id AS t_client_credentials_access_token_populate_lambdas_id, t.confirm_child_email_templates_id AS t_confirm_child_email_templates_id, t.email_update_email_templates_id AS t_email_update_email_templates_id, t.email_verified_email_templates_id AS t_email_verified_email_templates_id, t.failed_authentication_user_actions_id AS t_failed_authentication_user_actions_id, t.family_request_email_templates_id AS t_family_request_email_templates_id, t.forgot_password_email_templates_id AS t_forgot_password_email_templates_id, t.id_token_signing_keys_id AS t_id_to... rows_affected: 0 rows_examined: 2 processlist_id: 658760 processlist_time: 1 processlist_state: NULL 4. row thread_id: 681818 event_name: statement/sql/set_option sql_text: SET autocommit=1 rows_affected: 0 rows_examined: 0 processlist_id: 658760 processlist_time: 1 processlist_state: NULL 5. row thread_id: 681818 event_name: statement/sql/set_option sql_text: SET autocommit=0 rows_affected: 0 rows_examined: 0 processlist_id: 658760 processlist_time: 1 processlist_state: NULL 6. row thread_id: 681818 event_name: statement/sql/select sql_text: SELECT u.id AS u_id, u.active AS u_active, i.breached_password_last_checked_instant AS i_breached_password_last_checked_instant, i.breached_password_status AS i_breached_password_status, u.birth_date AS u_birth_date, u.clean_speak_id AS u_clean_speak_id, i.connectors_id AS i_connectors_id, u.data AS u_data, i.email AS i_email, i.encryption_scheme AS i_encryption_scheme, u.expiry AS u_expiry, i.factor AS i_factor, u.first_name ... rows_affected: 0 rows_examined: 160 processlist_id: 658760 processlist_time: 1 processlist_state: NULL 7. row thread_id: 681818 event_name: statement/sql/set_option sql_text: SET autocommit=1 rows_affected: 0 rows_examined: 0 processlist_id: 658760 processlist_time: 1 processlist_state: NULL 8. row thread_id: 681818 event_name: statement/sql/set_option sql_text: SET autocommit=0 rows_affected: 0 rows_examined: 0 processlist_id: 658760 processlist_time: 1 processlist_state: NULL 9. row thread_id: 681818 event_name: statement/sql/select sql_text: SELECT @@session.transaction_read_only rows_affected: 0 rows_examined: 1 processlist_id: 658760 processlist_time: 1 processlist_state: NULL 10. row thread_id: 681818 event_name: statement/sql/insert sql_text: INSERT INTO groups (id, name, data, insert_instant, last_update_instant, tenants_id) VALUES (x'460626342d2b44aaa9b7a6126301d446', '61f28e03-3bd0-4942-a1ad-2293828fe47c', '{"anon_contacts":[],"group_name_generated":"Laksh","group_type":"meeting-chat","is_global":false}', 1707285855794, 1707285855794, x'0b8ab788e9584eb896a68a7e7e37746a') rows_affected: 1 rows_examined: 0 processlist_id: 658760 processlist_time: 1 processlist_state: NULL 10 rows in set (0.00 sec)

mysql> SELECT t.thread_id, event_name, sql_text, rows_affected, rows_examined, processlist_id, processlist_time, processlist_state FROM events_statements_history h INNER JOIN threads t ON h.thread_id = t.thread_id WHERE processlist_id = 658754 ORDER BY event_id\G Empty set (0.00 sec)

mysql> SELECT t.thread_id, event_name, sql_text, rows_affected, rows_examined, processlist_id, processlist_time, processlist_state FROM events_statements_history h INNER JOIN threads t ON h.thread_id = t.thread_id WHERE processlist_id = 658796 ORDER BY event_id\G 1. row thread_id: 681854 event_name: statement/sql/set_option sql_text: SET autocommit=1 rows_affected: 0 rows_examined: 0 processlist_id: 658796 processlist_time: 2 processlist_state: NULL 2. row thread_id: 681854 event_name: statement/sql/set_option sql_text: SET autocommit=0 rows_affected: 0 rows_examined: 0 processlist_id: 658796 processlist_time: 2 processlist_state: NULL 3. row thread_id: 681854 event_name: statement/sql/select sql_text: SELECT t.id AS t_id, t.access_token_signing_keys_id AS t_access_token_signing_keys_id, t.data AS t_data, t.client_credentials_access_token_populate_lambdas_id AS t_client_credentials_access_token_populate_lambdas_id, t.confirm_child_email_templates_id AS t_confirm_child_email_templates_id, t.email_update_email_templates_id AS t_email_update_email_templates_id, t.email_verified_email_templates_id AS t_email_verified_email_templates_id, t.failed_authentication_user_actions_id AS t_failed_authentication_user_actions_id, t.family_request_email_templates_id AS t_family_request_email_templates_id, t.forgot_password_email_templates_id AS t_forgot_password_email_templates_id, t.id_token_signing_keys_id AS t_id_to... rows_affected: 0 rows_examined: 2 processlist_id: 658796 processlist_time: 2 processlist_state: NULL 4. row thread_id: 681854 event_name: statement/sql/set_option sql_text: SET autocommit=1 rows_affected: 0 rows_examined: 0 processlist_id: 658796 processlist_time: 2 processlist_state: NULL 5. row thread_id: 681854 event_name: statement/sql/set_option sql_text: SET autocommit=0 rows_affected: 0 rows_examined: 0 processlist_id: 658796 processlist_time: 2 processlist_state: NULL 6. row thread_id: 681854 event_name: statement/sql/select sql_text: SELECT g.id AS g_id, g.data AS g_data, g.name AS g_name, g.insert_instant AS g_insert_instant, g.last_update_instant AS g_last_update_instant, g.tenants_id AS g_tenants_id, ar.id AS ar_id, ar.applications_id AS ar_applications_id, ar.insert_instant AS ar_insert_instant, ar.last_update_instant AS ar_last_update_instant, ar.name AS ar_name, ar.description AS ar_description, ar.is_default AS ar_is_default, ar.is_super_role AS ar_is_super_role FROM groups AS g INNER JOIN tenants AS t ON t.id = g.tenants_id LEFT OUTER JOIN group_application_roles AS gar ON g.id = gar.groups_id LEFT OUTER JOIN application_roles AS ar ON gar.application_roles_id = ar.id WHERE g.name = '0a215a06-2005-4ee3-a663-b50523ead8a0' AND g.tenants_id = x'0b8ab788e... rows_affected: 0 rows_examined: 0 processlist_id: 658796 processlist_time: 2 processlist_state: NULL 7. row thread_id: 681854 event_name: statement/sql/set_option sql_text: SET autocommit=1 rows_affected: 0 rows_examined: 0 processlist_id: 658796 processlist_time: 2 processlist_state: NULL 8. row thread_id: 681854 event_name: statement/sql/set_option sql_text: SET autocommit=0 rows_affected: 0 rows_examined: 0 processlist_id: 658796 processlist_time: 2 processlist_state: NULL 9. row thread_id: 681854 event_name: statement/sql/select sql_text: SELECT @@session.transaction_read_only rows_affected: 0 rows_examined: 1 processlist_id: 658796 processlist_time: 2 processlist_state: NULL 10. row thread_id: 681854 event_name: statement/sql/insert sql_text: INSERT INTO groups (id, name, data, insert_instant, last_update_instant, tenants_id) VALUES (x'1e59715359274086ab4c98fb38676967', '0a215a06-2005-4ee3-a663-b50523ead8a0', '{"anon_contacts":[],"group_type":"meeting-chat","is_global":false}', 1707285896320, 1707285896320, x'0b8ab788e9584eb896a68a7e7e37746a') rows_affected: 1 rows_examined: 0 processlist_id: 658796 processlist_time: 2 processlist_state: NULL 10 rows in set (0.00 sec)

LohithBlaze commented 4 months ago

these are the transactions that are causing the locks when we hit the load at around 5requests per second.

LohithBlaze commented 4 months ago

Here is the python code that we tried which is directly calling group creations, we are able to reproduce the issue regularly using just by having multiple threads initiate create group apis.


import string
import random
import gevent
from gevent import monkey
monkey.patch_all(httplib=True)
from fusionauth.fusionauth_client import FusionAuthClient

client = FusionAuthClient('xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx', 'http://localhost:9011')
client.set_tenant_id("cb43443d-175b-4323-b516-b4b3458d4f27")

def call_create_group_api():
    random_group_name = "some_random_identifiable_name" + ''.join(random.choices(string.ascii_letters + string.digits, k=20))
    create_group_request = {
        "group": {
            "name": random_group_name
        },
    }

    return client.create_group(create_group_request)

jobs = [gevent.spawn(call_create_group_api) for i in range(100)]
_ = gevent.joinall(jobs)
results = [job.value for job in jobs]

success_count = 0

for result in results:
    print(f"Was successful: {result.was_successful()}")
    if result.was_successful():
        # print(f"Response: {result.success_response}")
        success_count += 1
    else:
        print(f"Response: {result.error_response}")

print(f"Total Successful; {success_count}")
print(f"Total Not Successful; {len(results) - success_count}")
mooreds commented 4 months ago

Hi @LohithBlaze , can you replicate this on the latest version of FusionAuth and share that version of mysql you are using?

LohithBlaze commented 4 months ago

We are able to replicate on the latest fusionauth 1.48.3 as well and the mysql version is 8.3.0

LucasPMorris commented 4 months ago

Hi @LohithBlaze,

I have set up FusionAuth and connected it to MySQL. I've installed the FusionAuth python client library. Initially I had no problem using your script to create groups. I had successfully created 1200 groups. My next test I increased the quantity to 3000 and I had a slew of errors. Going back to 100 calls only 11 were succeeding (at least according to the output of your script).

LohithBlaze commented 4 months ago

In our production code, We are creating groups and adding the first member to the group instantly. Later we add any other members of the group in subsequent calls. We are not assigning any roles as such but we use the data field provided for a member object to set our roles.

In the test script we are able to reproduce the locking of MySQL threads just by creating multiple groups at the same time.

When you ran the test, where you able to check the MySQL threads?

The below queries will give more information about fusionauth MySQL threads being locked and the transactions causing the same.

SELECT   t.trx_id,   t.trx_state,   t.trx_started,   COUNT(distinct w.requesting_trx_id) AS blocked_trxs FROM   information_schema.innodb_lock_waits w INNER JOIN information_schema.innodb_trx t    ON t.trx_id = w.blocking_trx_id GROUP BY t.trx_id,t.trx_state, t.trx_started ORDER BY t.trx_id;

SELECT  t.thread_id,  event_name,  sql_text,  rows_affected,  rows_examined,  processlist_id,  processlist_time,  processlist_state FROM events_statements_history h INNER JOIN threads t ON h.thread_id = t.thread_id WHERE processlist_id = ORDER BY event_id;

LucasPMorris commented 4 months ago

Hi @LohithBlaze,

I had not looked that deeply into it yet. We have ran the same steps via a curl call inside of a unix bash script and can do 100 threaded calls to the API directly just fine. My attention is divided so I am testing in chunks and will get deeper with the testing. It will take me some time to flesh out different reasons. Thank you for your patience as I read into this. I will note that the JOINs can get intensive in general. My typical suggestion is to include a simple wait timer if it turns out to be a resource problem (but still testing).

Thanks, Luke

LohithBlaze commented 4 months ago

Let me know if you got a chance to look at it

mooreds commented 4 months ago

@LucasPMorris any more details from your investigation?

LucasPMorris commented 4 months ago

So I have done more tests but have not got far.

1) I was able to cause failures using a bash script making the API calls directly vs a python script with the client library. So it isn't a python or python client library thing (I didn't suspect as much but ruling out the simple things). 2) In initial testing I was able to see consistent failures after 11 entries were written. At that point I intended to begin observing the mysql transactions. However, when I went back to testing it was not exhibiting the behavior anymore. I can get it to break if I attempt to do between 250 and 300 threaded entries at once (as fast as my computer will allow). If I scale it back down and attempt to create < 250 at a time it works as expected.

I am beginning to analyze the mysql threads when it does break with > 250 group creations. But, at this point, I can not consistently get the same results, let alone at a speed of ~5 per second. Still testing though and I will report when I have more information.

LucasPMorris commented 4 months ago

I was able to move a bit more on this today.

On my first pass through I was able to add 100 groups with the python script you provided without incident As soon as I upped the count 500 it failed (which makes sense, that is a lot of threaded tasks). Upon attempting to create 100 groups after that it would only add 11 at a time, this was very repeatable.

However, I reset my environment (recreated the database and rebuilt FusionAuth). I could force a failure with the python script and a bash script when doing a high number of threaded requests (>250) as mentioned. All of my failures were did not make it into the FusionAuth log however.

I modified your script to look something like this:

for x in range(100):
    jobs = [gevent.spawn(call_create_group_api) for i in range(100)]
    _ = gevent.joinall(jobs)
    results = [job.value for job in jobs]
    time.sleep(0.25)
    print(x)

I was able to routinely add 10k groups in roughly 45 seconds without a problem. I did catch that my default mysql connection pool was set to the default (151) so I also cut that back to 10 and could still routinely add 10k users. With 10 connections, I could not do ~150 threaded jobs at once though.

In summary, I was able to for a moment produce what you are seeing. It seemed that occured after I hit a performance limit that put the DB in an unexpected state. If I reset my environment and prevented that performance limit from being hit I could not replicate. My suggestion would be to limit the number of threaded calls you make to your database and if absolutely needed, add a sleep time in the mix to control the rate of those additions. Another note, there are other factors that I have not tested against (Kubernetes, ALB's, multi-node, etc) that may reduce that performance threshold to a lower amount than on my local development environment.

We can delve into some of the integration requirements that you have with a service contract but I don't have any data to suggest there is a bug in the FusionAuth software specifically.

LohithBlaze commented 4 months ago

You mentioned you increased the connection pool count, do you know the configuration to increase this in fusionauth?

LucasPMorris commented 4 months ago

So this can vary but it is not a limitation of FusionAuth, this is a part of MySQL. In my case I edited the my.cnf file. In my case it was located at /opt/homebrew/etc/ but I am sure it would be somewhere else on your integration. A restart of MySQL is required for the changes to take effect.

If you login to your MySQL database you can also run this command to see what the connection pool size is: SHOW VARIABLES LIKE 'max_connections';.

mooreds commented 4 months ago

Heya, at this point we've investigated this (Lucas spent significant time on it) and determined it is not a FusionAuth bug.

That means that it is support and therefore should either be discussed in the forum or in a support ticket (if a paid support plan is purchased).

I'm going to close this issue as not reproducible. Please feel free to post in either of the above two venues.

LohithBlaze commented 4 months ago

we did some more testing today, we had group.create webhook enabled which was hitting our internal api endpoint, we tested with async and sync group.create and we saw that if we have sync webhook that is group.create webhook enabled then the failures start happening.

mooreds commented 4 months ago

Ah, thanks for the update @LohithBlaze ! That would make sense.