fiskaltrust / product-de-bring-your-own-datacenter

Information about the fiskaltrust "Bring your own Data Center" product, which enables hosting the fiskaltrust.Middleware in data centers.

MySQL connection timeouts and pod restarts #74

Open gregzip opened 1 year ago

gregzip commented 1 year ago

Describe the bug: Some time after the pods are started, requests to them begin failing and MySQL connection errors show up in the logs. The affected pods repeatedly fail their health checks and are eventually restarted by Kubernetes.

To Reproduce: Let the pods process requests for some time.

Expected behavior: Requests are processed correctly.

Screenshots: N/A

STDOUT/STDERR:

fail: fiskaltrust.Middleware.Queue.JournalProcessor[0]
      An error occured while processing the Journal request.
      System.AggregateException: One or more errors occurred. (Connect Timeout expired.)
       ---> MySqlConnector.MySqlException (0x80004005): Connect Timeout expired.
         at MySqlConnector.MySqlConnection.CreateSessionAsync(ConnectionPool pool, Int32 startTickCount, Nullable`1 ioBehavior, CancellationToken cancellationToken) in /_/src/MySqlConnector/MySqlConnection.cs:line 948
         at MySqlConnector.MySqlConnection.OpenAsync(Nullable`1 ioBehavior, CancellationToken cancellationToken) in /_/src/MySqlConnector/MySqlConnection.cs:line 455
         at fiskaltrust.Middleware.Storage.MySQL.Repositories.Configuration.MySQLCashBoxRepository.GetAsync()
         at fiskaltrust.Middleware.Storage.MySQL.Repositories.MySQLConfigurationRepository.GetCashBoxListAsync()
         at fiskaltrust.Middleware.Queue.JournalProcessor.GetConfiguration()
         --- End of inner exception stack trace ---
         at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
         at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
         at fiskaltrust.Middleware.Queue.JournalProcessor.ProcessAsync(JournalRequest request)
fail: fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader[0]
      Error while pushing configuration data for queue de44da25-f040-47d7-81c9-b573a85e0318 (UrlType: rest; Url: rest://localhost:1500/de44da25-f040-47d7-81c9-b573a85e0318)
      System.AggregateException: One or more errors occurred. (Connect Timeout expired.)
       ---> MySqlConnector.MySqlException (0x80004005): Connect Timeout expired.
         at MySqlConnector.MySqlConnection.CreateSessionAsync(ConnectionPool pool, Int32 startTickCount, Nullable`1 ioBehavior, CancellationToken cancellationToken) in /_/src/MySqlConnector/MySqlConnection.cs:line 948
         at MySqlConnector.MySqlConnection.OpenAsync(Nullable`1 ioBehavior, CancellationToken cancellationToken) in /_/src/MySqlConnector/MySqlConnection.cs:line 455
         at fiskaltrust.Middleware.Storage.MySQL.Repositories.Configuration.MySQLCashBoxRepository.GetAsync()
         at fiskaltrust.Middleware.Storage.MySQL.Repositories.MySQLConfigurationRepository.GetCashBoxListAsync()
         at fiskaltrust.Middleware.Queue.JournalProcessor.GetConfiguration()
         --- End of inner exception stack trace ---
         at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
         at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
         at fiskaltrust.Middleware.Queue.JournalProcessor.ProcessAsync(JournalRequest request)
         at fiskaltrust.Middleware.Queue.Queue.Journal(Int64 ftJournalType, Int64 from, Int64 to)
         at fiskaltrust.Middleware.Helper.Helipad.AsyncEnumerablePOSWrapper.JournalAsync(JournalRequest request)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.UploadConfigurationEntities(IPOS worker)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.ProcessUploadWith(ClientConfiguration clientConfiguration, Guid queueId)


Cashbox Information: The problem doesn't seem to be cashbox-dependent at all. Some examples:

Additional context: We have 1k+ active cashboxes.

The fewer instances we start, the sooner these errors appear. With 3 instances, errors started showing up within minutes of starting them; with 20 instances it took a few hours.

After adding new instances, the old ones keep experiencing the errors until they are restarted.

The database connection limit is 2000 and the number of active connections is always below 700. The errors were also showing up when we temporarily spun up a 96-CPU database instance, so a database issue is very unlikely.
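
For reference, a minimal sketch of how this headroom could be double-checked against the server (assuming direct access with the MySqlConnector driver used by the middleware; server address and credentials are placeholders):

```csharp
// Sketch only: compares the server's connection limit with the currently connected threads.
using System;
using MySqlConnector;

class ConnectionHeadroomCheck
{
    static void Main()
    {
        using var connection = new MySqlConnection(
            "Server=mysql.example.internal;User ID=monitor;Password=changeme");
        connection.Open();

        // Server-side connection limit (2000 in the setup described above).
        using var limitCmd = new MySqlCommand("SELECT @@max_connections", connection);
        var maxConnections = Convert.ToInt64(limitCmd.ExecuteScalar());

        // Currently connected client threads (reported here as staying below 700).
        using var activeCmd = new MySqlCommand("SHOW STATUS LIKE 'Threads_connected'", connection);
        using var reader = activeCmd.ExecuteReader();
        reader.Read();
        var active = Convert.ToInt64(reader.GetValue(1));

        Console.WriteLine($"Active connections: {active} / {maxConnections}");
    }
}
```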

gregzip commented 1 year ago

We just tested a hypothesis that some connections are leaked and are perhaps blocking the connection pool. We set our MySQL instance to auto-terminate idle connections after 30 s, and while the number of concurrent connections dropped, the problem is still there.

volllly commented 1 year ago

Hey, we'll look into these three issues (#74, #75 and #76) this week.

Can you confirm that you are running into these with version 1.3.45-rc1-bullseye?

gregzip commented 1 year ago

Yes, exactly


volllly commented 1 year ago

Hey, we were not yet able to reproduce this issue.

Do you also experience this issue when using only one instance? And could you get us an SQL dump of the database of an affected queue (the database name is the queue ID without dashes)? You can upload the dump here

forsthug commented 1 year ago

Hi gregzip, please set the timeouts in the connection string according to your requirements. Both the connection and the command timeout need to be set, e.g.: .....;Connect Timeout=120;Default Command Timeout=120;
Best regards
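
For illustration, a minimal sketch of building such a connection string with MySqlConnector's MySqlConnectionStringBuilder (server, credentials, and database name are placeholders; the values mirror the 120-second suggestion above):

```csharp
// Sketch only: raises both the connect timeout and the default command timeout to 120 seconds.
using System;
using MySqlConnector;

class TimeoutSettings
{
    static void Main()
    {
        var builder = new MySqlConnectionStringBuilder(
            "Server=mysql.example.internal;User ID=ft;Password=changeme;Database=queuedb")
        {
            ConnectionTimeout = 120,      // "Connect Timeout" in the connection string
            DefaultCommandTimeout = 120   // "Default Command Timeout" in the connection string
        };

        Console.WriteLine(builder.ConnectionString);
    }
}
```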

volllly commented 1 year ago

Hey @gregzip, if this problem still persists we've just released 1.3.46-rc1-bullseye which should improve stability and performance.

Can you please update and check whether the problems persist?

gregzip commented 1 year ago

Hi, it still persists. We were planning to switch from Aurora to regular MySQL to make sure these are not some exotic driver issues again. We'll have a go with 1.3.46-rc1-bullseye then.

Btw. What happened to 1.3.45-rc2-bullseye? We were recommended to use it as a fix for #75 but it now seems to be gone (and it caused an outage on our side as K8s was unable to pull the image from the registry anymore)

cc @fpenz @derekmcauley7

volllly commented 1 year ago

I apologize for that. 1.3.45-rc2-bullseye is available again.

There was an issue in our CI/CD pipeline yesterday. We've identified and fixed the problem, so it will not occur again :)

gregzip commented 1 year ago

Hey! There was a slight delay in our verification due to limited load on Monday and Tuesday. After some further checks on 1.3.46-rc1-bullseye, the problem still occurs but is less frequent (it actually takes a bit more time to show up, then it's consistently there until a pod restarts).

We're continuing with our attempt to migrate the DB, but it has its own challenges (it's 600+ GB at the moment). Is there a way to make it smaller somehow? Rebuilding some cashboxes or removing some unnecessary historical data perhaps? 😅

volllly commented 1 year ago

Hey, thanks for the update 😁

Can you check how often you see the message Worker created. cashboxid: ... in your logs on a typical day?

Also, can you do a memory dump of a pod that is currently experiencing the problem (we have a guide here) and upload the dump, along with logs from multiple pods covering a whole day, here?

Regarding the DB migration: if you provide us with a list of all queues in the database, we can check whether all data has already been uploaded (or up to which entry it has been uploaded), and you can then do a backup and remove the already uploaded entries.

Also a size analysis of the database would be interesting (size of each table for a representative queue). I suspect that the largest table is the ftJournalDE table. This could be a symptom of open transactions on the TSE. See this article on how to close open transactions before each daily closing or manual export.
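
A minimal sketch of how such a size analysis could be produced (standard information_schema columns, sizes in MB; connection details are placeholders), giving output similar in shape to the table posted further down:

```csharp
// Sketch only: lists the ten largest tables across all databases on the server.
using System;
using MySqlConnector;

class TableSizeReport
{
    const string Query = @"
        SELECT table_schema AS database_name,
               table_name,
               ROUND((data_length + index_length) / 1024 / 1024, 2) AS total_size,
               ROUND(data_length  / 1024 / 1024, 2) AS data_size,
               ROUND(index_length / 1024 / 1024, 2) AS index_size
        FROM information_schema.tables
        WHERE table_schema NOT IN ('mysql', 'sys', 'information_schema', 'performance_schema')
        ORDER BY data_length + index_length DESC
        LIMIT 10";

    static void Main()
    {
        using var connection = new MySqlConnection(
            "Server=mysql.example.internal;User ID=monitor;Password=changeme");
        connection.Open();
        using var command = new MySqlCommand(Query, connection);
        using var reader = command.ExecuteReader();
        while (reader.Read())
            Console.WriteLine($"{reader.GetString(0)} | {reader.GetString(1)} | {reader.GetDecimal(2)} MB");
    }
}
```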

gregzip commented 1 year ago

Hi!

Regarding the worker created message, I think it might not be representative because we were restarting all the pods every 30 mins to avoid the DB connection issues.

That said, those issues seem to be gone after we moved to MySQL Community and increased the number of pods to 15. I think it's the latter that helped. My current hypothesis is that we were running into MySqlConnector's default pool size limit of 100 - we had 10 pods running for ~1300 active cashboxes. Would that make sense from your perspective?
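
A rough sketch of this hypothesis, and of how the limit can be raised via the MySqlConnector connection string (the cashbox and pod numbers are the ones from this thread; server and credentials are placeholders):

```csharp
// Sketch only: ~1300 active cashboxes over 10 pods is ~130 queues per pod, which would
// exceed MySqlConnector's default Maximum Pool Size of 100 if they compete for one pool.
using System;
using MySqlConnector;

class PoolSizeHypothesis
{
    static void Main()
    {
        const int activeCashboxes = 1300;
        const int pods = 10;
        Console.WriteLine($"Queues per pod: {activeCashboxes / pods}");  // 130

        var builder = new MySqlConnectionStringBuilder(
            "Server=mysql.example.internal;User ID=ft;Password=changeme")
        {
            MaximumPoolSize = 500  // overrides the default of 100
        };
        Console.WriteLine(builder.ConnectionString);
    }
}
```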

As I said, we managed to migrate to a regular MySQL server with all of that data in place. I had a quick look at the biggest tables, and indeed these are ftJournalDE:

+----------------------------------+-------------+------------+-----------+------------+
| database_name                    | TABLE_NAME  | total_size | data_size | index_size |
+----------------------------------+-------------+------------+-----------+------------+
| d11f0f217e7c42359860ca52a786b353 | ftJournalDE |    3128.56 |   3128.52 |       0.05 |
| 2c87fccb64314d61923f113404fc528a | ftJournalDE |    3058.56 |   3058.52 |       0.05 |
| 4c040bc573b240b8b382192de1e70b01 | ftJournalDE |    2788.56 |   2788.52 |       0.05 |
| 8567cc8b02f548d482612a5177c05c0d | ftJournalDE |    2761.56 |   2761.52 |       0.05 |
| 229d32f98dbb4c7aa15d2242225f2892 | ftJournalDE |    2551.56 |   2551.52 |       0.05 |
| 79bec6a01fe84f3ead13dd402b43e45f | ftJournalDE |    2505.56 |   2505.52 |       0.05 |
| 3b2f12c90fb740e084fabdffe0eba9d4 | ftJournalDE |    2499.53 |   2499.52 |       0.02 |
| cfd306fd24c140a5be5e025815709a2d | ftJournalDE |    2494.56 |   2494.52 |       0.05 |
| 5c031a79d4d7470ab1b485713c3623db | ftJournalDE |    2442.56 |   2442.52 |       0.05 |
| 236b77b825a64a439115f9b1a76627aa | ftJournalDE |    2388.56 |   2388.52 |       0.05 |
+----------------------------------+-------------+------------+-----------+------------+

I checked that there are indeed open transactions for them, and I tried to close them for a couple of cashboxes (and waited for the daily closing and even re-created the cashboxes today), but the size is still the same. 🤔

Some questions about this:

volllly commented 1 year ago

Hey, yes, I think the default pool size could have been a factor.

We could try to add a config parameter for the Maximum Pool Size and see if increasing the value for that and reducing the replicas to 10 again also solves the issue.


how long after closing open transactions should we expect the size of these tables to drop?

Sorry, I was not clear on that. Closing the open transactions prevents rapid growth of the database in the future but it will not shrink the database.

At the moment we don't have an automated way to delete historical data. What we can do, though, is check on our side which data was already uploaded (please provide us with a list of queues for that), and you can then remove the old ftJournalDE entries which were already uploaded.

is it okay if we query for open transactions on every closing for every cashbox? (we have to automate this process somehow with the number of cashboxes we handle)

Yes, that is perfectly fine.

gregzip commented 1 year ago

We could try to add a config parameter for the Maximum Pool Size and see if increasing the value for that and reducing the replicas to 10 again also solves the issue.

That sounds great 🙏

Yes, that is perfectly fine.

Okay, we'll do that soon-ish. I already closed all of the currently open ones via an ad-hoc script so I hope this gives us some time

gregzip commented 1 year ago

Hi! Any news on this? We'd be excited to check if the max pool size fixes our issues, as this could potentially allow us to use less compute resources and save quite a bit on infra costs 💸

volllly commented 1 year ago

Hey, we have not forgotten about you 😅 I hope we'll be able to tackle that in one of the next sprints

volllly commented 1 year ago

Hey, we've just released 1.3.47-rc1-bullseye which introduces the MaxPoolSize configuration parameter to the MySQL queue and uses 500 as the default value 🚀

gregzip commented 1 year ago

We tried the new version and it seems the issue still persists 😬 It was mostly fine for a couple of days with fewer instances, so I disabled our restarting cron, and today things exploded again with the same errors as before.

For now we're back to lots of instances and a cron that restarts them every night.

gregzip commented 1 year ago

I did some investigation today and it seems that the pod started failing around 200-220 connections.

I'm going to re-create the scenario tomorrow and watch the CPU/memory usage more closely to be absolutely sure it's not about that. I should've done it before, but I somehow assumed that if the entire node CPU is very low, then it's not an issue. But it seems that it's always an isolated pod that's failing, so that one pod might not be doing great there. I'll post my findings.
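
A minimal sketch of how the open connections could be attributed to individual pods (assuming access to information_schema.processlist; connection details are placeholders):

```csharp
// Sketch only: counts open server connections per client host, e.g. per pod IP.
using System;
using MySqlConnector;

class ConnectionsPerClient
{
    const string Query = @"
        SELECT SUBSTRING_INDEX(host, ':', 1) AS client, COUNT(*) AS connections
        FROM information_schema.processlist
        WHERE host IS NOT NULL
        GROUP BY client
        ORDER BY connections DESC";

    static void Main()
    {
        using var connection = new MySqlConnection(
            "Server=mysql.example.internal;User ID=monitor;Password=changeme");
        connection.Open();
        using var command = new MySqlCommand(Query, connection);
        using var reader = command.ExecuteReader();
        while (reader.Read())
            Console.WriteLine($"{reader.GetString(0)}: {reader.GetInt64(1)} connections");
    }
}
```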

gregzip commented 1 year ago

Quick update:

It seems that the issue is not about connection pooling (anymore?) and it's not about computing resources either 😬

Edit: One more data point: I left 14 pods running for the night but forgot to turn on our nightly auto-restart, and some of the pods started failing in the morning with very high CPU usage. I checked, and the usage kept growing throughout the night even though there were absolutely no transactions happening. This might point to something like a thread leak or some other uncleaned-up processes going on? 🤔

volllly commented 1 year ago

Hey, thanks for the detailed investigation. We'll investigate if we have a thread leak.

Could you also please get us one day's worth of logs from all pods (with log level debug)? This would help us a lot in analyzing the situation 🙏

gregzip commented 1 year ago

Hey! Sorry for the late reply, we were swamped with other topics.

Here are the logs. I think the format is a bit weird, but that's what CloudWatch gave me 😕

pod1.txt pod2.txt pod3.txt pod4.txt pod5.txt pod6.txt pod7.txt pod8.txt pod9.txt pod10.txt pod11.txt pod12.txt pod13.txt pod14.txt pod15.txt

gregzip commented 1 year ago

Hey! Any news on this?

volllly commented 1 year ago

Hey, no news yet. Could you also provide us with a memory dump of a pod while it is experiencing the problems? (You can upload it here.)

FYI Here is a KBA on how to get a memory dump.

gregzip commented 1 year ago

Hey there! I know we owe you a memory dump, but this is not fixed yet, even with a larger pool size.

volllly commented 1 year ago

Sorry, this was autoclosed by mistake