FusionAuth / fusionauth-issues

FusionAuth issue submission project
https://fusionauth.io

Better support for resilience-oriented setups #2910

Open atrauzzi opened 1 week ago

atrauzzi commented 1 week ago

Better support for resilience-oriented setups

Problem

Not all orchestrators or environments support the notion of dependent services, which would ensure that things like the database come up before FusionAuth starts.

How this is handled varies by community, and one school of thought that competes with specifying dependencies is to rely on restarts and/or retries.

Unfortunately, when FusionAuth fails to get a lock on a database in silent, maintenance mode, it does not terminate or make any retry attempts.

This means that an environment that wishes to handle resilience through restarts or in-built retry mechanisms (or both!) has no way of guiding FusionAuth to a working state.
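To make the restart-based approach concrete, here is a minimal sketch of the kind of loop a restart-on-failure supervisor effectively runs. The function name `supervise` and its parameters are hypothetical and are not taken from any particular orchestrator. The point it illustrates is that the loop can only react when the child process exits, so a process that stays up in a broken state is never restarted.

```python
import subprocess
import sys
import time


def supervise(command, max_restarts=3, delay_seconds=0.0):
    """Run command, relaunching it each time it exits with a non-zero status.

    Returns the number of launches needed to reach a zero exit status.
    Raises RuntimeError if the command still fails after max_restarts restarts.
    Hypothetical helper: a sketch of restart-on-failure, not a real orchestrator.
    """
    for launch in range(1, max_restarts + 2):
        # subprocess.run blocks until the child exits. A child that keeps
        # running in a broken state never returns here, so the supervisor
        # never gets a chance to restart it.
        result = subprocess.run(command)
        if result.returncode == 0:
            return launch
        time.sleep(delay_seconds)
    raise RuntimeError(f"{command!r} still failing after {max_restarts} restarts")


if __name__ == "__main__":
    # A child that exits cleanly needs exactly one launch.
    print(supervise([sys.executable, "-c", "import sys; sys.exit(0)"]))
```

In this sketch, a non-zero exit is the only signal the supervisor understands, which is why a process that neither exits nor retries leaves it with nothing to act on.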

Solution

All or some combination of:

Optionally, one of these could also be the default when running with FUSIONAUTH_APP_SILENT_MODE set to true.
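As an illustration of the behaviour being requested, here is a minimal sketch of "retry for a while, then terminate". It is not FusionAuth code: `connect_with_retry`, its parameters, and the use of `ConnectionError` as a stand-in for a driver-specific exception are all assumptions made for the example.

```python
import time


def connect_with_retry(connect, attempts=5, delay_seconds=2.0, sleep=time.sleep):
    """Call connect() until it succeeds or the attempts are used up.

    On final failure, raise SystemExit(1) so the process ends with a non-zero
    status that a restart-based environment can react to.
    Hypothetical helper: names and defaults are illustrative only.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except ConnectionError as error:  # stand-in for a driver-specific error
            last_error = error
            if attempt < attempts:
                # Wait before the next attempt, but not after the last one.
                sleep(delay_seconds)
    print(f"database unreachable after {attempts} attempts: {last_error}")
    raise SystemExit(1)


if __name__ == "__main__":
    # A connect function that succeeds immediately, for demonstration.
    connection = connect_with_retry(lambda: "connected", attempts=3, delay_seconds=0)
    print(connection)
```

Either half is useful on its own: the retry loop covers environments that prefer in-process retries, and the non-zero exit covers environments that prefer restarts.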

Alternatives/workarounds

Unfortunately, there is no alternative or workaround. In some ways, the premise of resilience is itself a workaround-oriented approach.

Community guidelines

All issues filed in this repository must abide by the FusionAuth community guidelines.

How to vote

Please give us a thumbs up or thumbs down as a reaction to help us prioritize this feature. Feel free to comment if you have a particular need or comment on how this feature should work.

mooreds commented 1 week ago

> An explicit flag to instruct FusionAuth to retry its database connection for a certain period/interval

We currently support this via the DATABASE_CONNECTION_TIMEOUT setting, as documented here: https://fusionauth.io/docs/reference/configuration

> An explicit flag to instruct FusionAuth to terminate when it cannot connect to the database

Do you mean a configuration parameter that, when set, causes FusionAuth to fail hard and refuse requests when it can't reach a database?

atrauzzi commented 1 week ago

Hmm, a "timeout" typically means how long to wait before a connection attempt is treated as failed. Either way, I don't think the discrepancy here is about whether something triggers timeout behaviour. For posterity, though, I tried setting FUSIONAUTH_DATABASE_CONNECTION_TIMEOUT to 30000 to see what happens:

Log output

```
2024-11-01T06:36:40.3868140 ---------------------------------------------------------------------------------------------------------
2024-11-01T06:36:40.3868460 ---------------------------------- Entering Silent Configuration Mode -----------------------------------
2024-11-01T06:36:40.3868690 ---------------------------------------------------------------------------------------------------------
2024-11-01T06:36:40.3868950
2024-11-01T06:36:40.4545260 2024-11-01 11:36:40.451 AM ERROR com.inversoft.maintenance.db.DatabaseSilentModeWorkflowTask - Encountered an error while running silent mode
2024-11-01T06:36:40.4547090 java.lang.IllegalStateException: Unable to capture database lock. This indicates that the database either doesn't support locks or is misconfigured.
2024-11-01T06:36:40.4547660 at com.inversoft.maintenance.db.JDBCMaintenanceModeDatabaseService.lockDatabase(JDBCMaintenanceModeDatabaseService.java:322)
2024-11-01T06:36:40.4548170 at com.inversoft.maintenance.db.DatabaseSilentModeWorkflowTask.perform(DatabaseSilentModeWorkflowTask.java:43)
2024-11-01T06:36:40.4548530 at com.inversoft.maintenance.DefaultMaintenanceModeWorkflow.performSilentConfiguration(DefaultMaintenanceModeWorkflow.java:47)
2024-11-01T06:36:40.4548880 at com.inversoft.maintenance.BaseMaintenanceModePrimeMain.modules(BaseMaintenanceModePrimeMain.java:70)
2024-11-01T06:36:40.4549610 at org.primeframework.mvc.BasePrimeMain.hup(BasePrimeMain.java:69)
2024-11-01T06:36:40.4550010 at org.primeframework.mvc.BasePrimeMain.start(BasePrimeMain.java:100)
2024-11-01T06:36:40.4550370 at io.fusionauth.app.FusionAuthMain.main(FusionAuthMain.java:27)
2024-11-01T06:36:40.4550760 Caused by: org.postgresql.util.PSQLException: The connection attempt failed.
2024-11-01T06:36:40.4551060 at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:358)
2024-11-01T06:36:40.4552080 at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
2024-11-01T06:36:40.4552480 at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:273)
2024-11-01T06:36:40.4552860 at org.postgresql.Driver.makeConnection(Driver.java:446)
2024-11-01T06:36:40.4553230 at org.postgresql.Driver.connect(Driver.java:298)
2024-11-01T06:36:40.4553600 at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:683)
2024-11-01T06:36:40.4553970 at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:230)
2024-11-01T06:36:40.4554350 at com.inversoft.maintenance.db.JDBCMaintenanceModeDatabaseService.lockDatabase(JDBCMaintenanceModeDatabaseService.java:304)
2024-11-01T06:36:40.4555020 ... 6 common frames omitted
2024-11-01T06:36:40.4555340 Caused by: java.io.EOFException: null
2024-11-01T06:36:40.4555610 at org.postgresql.core.PGStream.receiveChar(PGStream.java:469)
2024-11-01T06:36:40.4555840 at org.postgresql.core.v3.ConnectionFactoryImpl.enableSSL(ConnectionFactoryImpl.java:594)
2024-11-01T06:36:40.4556060 at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:195)
2024-11-01T06:36:40.4556310 at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:262)
2024-11-01T06:36:40.4556550 ... 13 common frames omitted
2024-11-01T06:36:40.7556620 2024-11-01 11:36:40.755 AM INFO io.fusionauth.api.configuration.DefaultFusionAuthConfiguration - Loading FusionAuth configuration file [/usr/local/fusionauth/config/fusionauth.properties]
2024-11-01T06:36:40.7567210 2024-11-01 11:36:40.756 AM INFO io.fusionauth.api.configuration.DefaultFusionAuthConfiguration - Dynamically set property [fusionauth-app.url] set to [http://192.168.1.116:9011/]
2024-11-01T06:36:40.7568350 2024-11-01 11:36:40.756 AM INFO com.inversoft.configuration.BasePropertiesFileInversoftConfiguration -
2024-11-01T06:36:40.7568950 - Overriding default value of property [database.mysql.enforce-utf8mb4] with value [true]
2024-11-01T06:36:40.7569460 - Overriding default value of property [fusionauth-app.runtime-mode] with value [development]
2024-11-01T06:36:40.7569740 - Overriding default value of property [search.type] with value [database]
2024-11-01T06:36:40.7570100
2024-11-01T06:36:40.8865620 2024-11-01 11:36:40.886 AM INFO com.inversoft.maintenance.MaintenanceModePoller - Poller started to Wait for configuration to be completed.
2024-11-01T06:36:40.8887740 2024-11-01 11:36:40.888 AM INFO io.fusionauth.app.primeframework.FusionHTTPContextAuthSetup - Initializing the FusionAuth HTTP Context.
2024-11-01T06:36:40.8987550 2024-11-01 11:36:40.898 AM INFO org.primeframework.mvc.PrimeMVCRequestHandler - Initializing Prime
2024-11-01T06:36:40.9000280 2024-11-01 11:36:40.899 AM INFO org.primeframework.mvc.PrimeMVCRequestHandler - Initializing Prime
2024-11-01T06:36:40.9008700 2024-11-01 11:36:40.900 AM INFO io.fusionauth.http.server.HTTPServer - Starting the HTTP server. Buckle up!
2024-11-01T06:36:40.9087310 2024-11-01 11:36:40.908 AM INFO io.fusionauth.http.server.HTTPServer - HTTP server listening on port [9011]
2024-11-01T06:36:40.9093130 2024-11-01 11:36:40.908 AM INFO io.fusionauth.http.server.HTTPServer - HTTP server started successfully
2024-11-01T06:36:40.9094320 2024-11-01 11:36:40.908 AM INFO io.fusionauth.http.server.HTTPServer - Starting the HTTP server. Buckle up!
2024-11-01T06:36:40.9094930 2024-11-01 11:36:40.909 AM INFO io.fusionauth.http.server.HTTPServer - HTTP server listening on port [9012]
2024-11-01T06:36:40.9095420 2024-11-01 11:36:40.909 AM INFO io.fusionauth.http.server.HTTPServer - HTTP server started successfully
```

The server goes into silent configuration mode and never recovers, and then never runs my kickstart.


Now, if I do the following temporary workaround:

After thirty seconds, FusionAuth starts and is able to connect to the database. So I know I have the potential for a working FusionAuth configuration and that there's nothing wrong with my setup. It's merely a matter of convincing FusionAuth to actually retry.


So, coming back to what you mention above - and provided I have the config value name correct - I'm not sure the timeout is either working or necessarily the right solution.

The central question here may not even be around whether there is a timeout configured, but more around what happens after a timeout.

Here, FusionAuth could offer a number of behaviours, and ideally people could pick the one that best fits their environment. That matters especially given the first-time, out-of-the-box setup experience that FusionAuth offers, which in my scenario is actually not helpful because I'm using kickstart and API calls to complete configuration non-interactively.

mooreds commented 3 days ago

Thanks @atrauzzi.

I did some digging, and it looks like the FUSIONAUTH_DATABASE_CONNECTION_TIMEOUT variable doesn't apply at startup, when we're in maintenance mode and trying to find a database to connect to. It only applies to connections after startup, when the database connection is managed by our connection pool.

We do try to reconnect multiple times when starting up, but I'll take a closer look.