`SqlRuntime.DataContext.addCache` timeout on first query when application starts

Choc13 commented 1 year ago

Describe the bug I'm not really sure this is a bug, I'm more looking to better understand the startup / caching behaviour of the provider and looking for some advice / knowledge to help resolve a timeout issue at startup.

Recently, I've noticed that some of our background jobs that query the database immediately at startup are failing due to query timeouts. The actual SQL queries they are trying to execute are very simple and have small results sets. I've tested the SQL that is generated and it's perfectly fine (taking less than 1s to complete when I run it manually). However, the stack trace from the SQL timeout seems to show that it's due to calling the SqlRuntime.DataContext.addCache function.

The top of the stack trace is:

MySql.Data.MySqlClient.MySqlException (0x80004005): Unable to connect to any of the specified MySQL hosts.
 ---> MySql.Data.MySqlClient.MySqlException (0x80004005): Timeout expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
 ---> System.TimeoutException: The operation has timed out.
   at MySql.Data.Common.StreamCreator.GetTcpStream(MySqlConnectionStringBuilder settings, MyNetworkStream& networkStream)
   at MySql.Data.Common.StreamCreator.GetStream(MySqlConnectionStringBuilder settings, MyNetworkStream& networkStream)
   at MySql.Data.MySqlClient.NativeDriver.Open()
   at MySql.Data.MySqlClient.NativeDriver.Open()
   at MySql.Data.MySqlClient.Driver.Open()
   at MySql.Data.MySqlClient.Driver.Create(MySqlConnectionStringBuilder settings)
   at MySql.Data.MySqlClient.MySqlPool.CreateNewPooledConnection()
   at MySql.Data.MySqlClient.MySqlPool.GetPooledConnection()
   at MySql.Data.MySqlClient.MySqlPool.TryToGetDriver()
   at MySql.Data.MySqlClient.MySqlPool.GetConnection()
   at MySql.Data.MySqlClient.MySqlConnection.Open()
   at <StartupCode$FSharp-Data-SqlProvider>.$SqlRuntime.DataContext.addCache@38-1.Invoke(Unit unitVar)
   at System.Lazy`1.ViaFactory(LazyThreadSafetyMode mode)
--- End of stack trace from previous location ---
   at System.Lazy`1.CreateValue()
   at FSharp.Data.Sql.Runtime.SqlDataContext..ctor(String typeName, String connectionString, DatabaseProviderTypes providerType, String resolutionPath, String[] referencedAssemblies, String runtimeAssembly, String owner, CaseSensitivityChange caseSensitivity, String tableNames, String contextSchemaPath, OdbcQuoteCharacter odbcquote, SQLiteLibrary sqliteLibrary, TransactionOptions transactionOptions, FSharpOption`1 commandTimeout, SelectOperations sqlOperationsInSelect, String ssdtPath)

What I'd like to know is what the purpose of this addCache function is and whether there is anything I can do to speed it up / disable it if possible. Judging by the name and looking at the source code it seems to me like it's trying to build an in memory version of the current DB schema. This is something that I thought the provider only did at design time, but I'm guessing it's also doing it at runtime to figure out the column names etc that it can fetch from each table?

Another thing to point out is that I have several jobs all making similar queries concurrently at startup, so wondered if this was perhaps being made worse due to a race condition and whether or not the providerCache is thread-safe?

Any insights into this function would be much appreciated so that I can try and figure out a strategy to avoid this timeout issue.

One thought was could we perhaps run it at compile time (maybe with an opt-in switch) so that the schema is already "built" to avoid the startup overhead of having to figure it out dynamically again?

Finally, does the IndividualsCount affect the performance of this at all? We don't use individuals and I wondered if maybe it was scanning the DB to build all that metadata at startup, which could be quite time consuming.

Version Information SQLProvider = 1.3.8

Thorium commented 1 year ago

Yes, as far as I know, this should be erasing type provider and thus the schema shouldn't be needed on runtime, only on compile time.

Choc13 commented 1 year ago

OK interesting. So does that mean this is a bug and that this bit of code shouldn't be being triggered at runtime? It is in a module called SqlRuntime though which made me think it was perhaps required at runtime somehow? I'd be happy to open a PR if this should be disabled though.

Thorium commented 1 year ago

The cache seems to be there already in pezipink code from 2014, so definitely it must be used for something...

Thorium commented 1 year ago

Do you have any idea why it would timeout:

Firewall?
Permissions to DB?
I wouldn't expect your schema being so big?
Is there something locking something? This should be read-only query.

Choc13 commented 1 year ago

Yes having a look through the code more closely it does seem like the cache is needed as it uses the metadata for things like constructing the query parameters.

The timeout I see is a strange one though. I'll give you some more context. When our application starts it registers a bunch of BackgroundService jobs. Most of which will do some fairly trivial queries to get some job metadata before starting their main body of work. All of these jobs are designed so that if they fail they will attempt to restart again between 1 and 2 minutes from now (the jitter is added to avoid any subsequent dog piling).

I only ever observe the timeout on the very first invocation of these jobs. Once they recover and restart a minute or so later they always boot up fine without incurring the timeout exception.

My first thought was that it must be because this addCache method is therefore being called by lots of threads concurrently at startup and therefore it's exhausting connections or something. Looking at the source code for addCache more closely I see that it uses the "lazy factory" approach inside the GetOrAdd call for the ConcurrentDictionary which means it should be guaranteed that all concurrent invocations are actually awaiting the result of a single invocation of the factory function. This would also explain why all of the jobs initially fail at exactly the same time when I start the application because the result of that first invocation of addCache fails and they all share this result.

That lead me to my current hypothesis which is that because addCache just calls con.Open() and not OpenAsync() that it is possibly to do with the fact that this is a blocking call? But I can't quite construct a logical argument for why that would lead to a timeout.

Thorium commented 1 year ago

The addCache should definitely called only once within process and yes the con.Open() is synchronous within that locked context.

But are "All of these jobs" running within same process (executable) or are they separate instances? Because the ConcurrentDictionary is shared only within that one process.

Choc13 commented 1 year ago

Yes they're all running the in same process. Just a collection of BackgroundService for which the runtime calls ExecuteAsync on to start them. It's just strange to me that it's the opening of the connection that actually times out. It's not even got on to the part of querying the schema yet.

Choc13 commented 1 year ago

I think I've possibly gotten to the bottom of this. It's related to a gotcha with how BackgroundService instances are started. Also BackgroundService instances do not wait for the application to be fully started before they can start processing their work. We are also using Quartz.NET which also uses the BackgroundService framework but does in fact wait for the application to be full initialised. I think a combination of these things together meant that our jobs were trying to open connections to the DB only for the runtime to then suspend execution of their threads whilst it waited for all other jobs to begin and I think this then meant that MySQL thought our application was dead and sent some timeout response.

I've fixed our jobs now so that they also wait for the IHostApplicationLifetime.ApplicationStarted event to fire before trying to do their processing and I believe things are fine now.

fsprojects / SQLProvider

`SqlRuntime.DataContext.addCache` timeout on first query when application starts #798