Open: gregzip opened this issue 1 year ago
We just tested a hypothesis that some connections are leaked and perhaps block the connection pool. We set our MySQL instance to auto-terminate idle connections after 30s, and while the number of concurrent connections dropped, the problem is still there.
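(For context, an idle cutoff like this usually just means lowering the server-side timeouts, roughly as sketched below; on Aurora the same variables are set through the DB parameter group rather than SET GLOBAL.)
-- Rough sketch of a 30 s idle-connection cutoff (values in seconds).
SET GLOBAL wait_timeout = 30;
SET GLOBAL interactive_timeout = 30;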
Hey, we'll look into these three issues (#74, #75 and #76) this week.
Can you confirm that you are running into these with version 1.3.45-rc1-bullseye?
Yes, exactly
Hey, we were not yet able to reproduce this issue.
Do you also experience this issue when using only one instance?
And could you get us a SQL dump of the database of an affected queue (the database name is the queue ID without dashes)? You can upload the dump here.
Hi Gregzip, please set the timeouts in the connection string according to your requirements. Both the connection and the command timeout need to be set: .....;Connect Timeout=120;default command timeout=120;
Hey @gregzip, if this problem still persists: we've just released 1.3.46-rc1-bullseye, which should improve stability and performance. Can you please update and check whether the problem still occurs?
Hi, it still persists. We were planning to switch from Aurora to regular MySQL to make sure these are not some exotic driver issues again. We'll have a go with 1.3.46-rc1-bullseye then.
Btw, what happened to 1.3.45-rc2-bullseye? We were recommended to use it as a fix for #75, but it now seems to be gone (and it caused an outage on our side, as K8s was unable to pull the image from the registry anymore).
cc @fpenz @derekmcauley7
I apologize for that. 1.3.45-rc2-bullseye is available again.
There was an issue in our CI/CD pipeline yesterday. We've identified and fixed the problem, so it will not occur again :)
Hey! There was a slight delay in our verification due to limited load on Monday and Tuesday. After some further checks on 1.3.46-rc1-bullseye, the problem still occurs, but less frequently (it actually takes a bit more time to show up, but then it's consistently there until a pod restarts).
We're continuing with our attempt to migrate the DB, but it has its own challenges (it's 600+ GB at the moment). Is there a way to make it smaller somehow? Rebuilding some cashboxes or removing some unnecessary historical data, perhaps? 😅
Hey, thanks for the update 😁
Can you check how often you see the message Worker created. cashboxid: ... in your logs on a typical day?
Also, can you do a memory dump on a pod that is currently experiencing the problem (we have a guide here) and upload the dump and logs over a whole day from multiple pods here?
Regarding the DB migration: if you provide us with a list of all queues in the database, we can check whether all data has already been uploaded (or up to which entry it has been uploaded), and you can then do a backup and remove the already uploaded entries.
Also, a size analysis of the database would be interesting (the size of each table for a representative queue). I suspect that the largest table is the ftJournalDE table. This could be a symptom of open transactions on the TSE. See this article on how to close open transactions before each daily closing or manual export.
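For the size analysis, something along these lines against information_schema should be enough (just a sketch; the numbers are approximate and reported in MB):
-- Approximate per-table sizes in MB across all queue schemas, largest first.
SELECT table_schema AS database_name,
       table_name   AS TABLE_NAME,
       ROUND((data_length + index_length) / 1024 / 1024, 2) AS total_size,
       ROUND(data_length / 1024 / 1024, 2)  AS data_size,
       ROUND(index_length / 1024 / 1024, 2) AS index_size
FROM information_schema.tables
WHERE table_schema NOT IN ('mysql', 'sys', 'information_schema', 'performance_schema')
ORDER BY data_length + index_length DESC
LIMIT 10;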
Hi!
Regarding the Worker created message: I think it might not be representative, because we were restarting all the pods every 30 minutes to avoid the DB connection issues.
That said, those issues seem to be gone after we moved to MySQL Community and increased the number of pods to 15. I think it's the latter that helped. My current hypothesis is that we were running into MySqlConnector's default pool size limit of 100: we had 10 pods running for ~1300 active cashboxes, so roughly 130 cashboxes per pod, which could exhaust a 100-connection pool and leave workers blocked waiting for a free connection. Would that make sense from your perspective?
As I said, we managed to migrate to a regular MySQL server with all of that data in place. I had a quick look at the biggest tables, and indeed these are the ftJournalDE tables (sizes in MB):
+----------------------------------+-------------+------------+-----------+------------+
| database_name                    | TABLE_NAME  | total_size | data_size | index_size |
+----------------------------------+-------------+------------+-----------+------------+
| d11f0f217e7c42359860ca52a786b353 | ftJournalDE |    3128.56 |   3128.52 |       0.05 |
| 2c87fccb64314d61923f113404fc528a | ftJournalDE |    3058.56 |   3058.52 |       0.05 |
| 4c040bc573b240b8b382192de1e70b01 | ftJournalDE |    2788.56 |   2788.52 |       0.05 |
| 8567cc8b02f548d482612a5177c05c0d | ftJournalDE |    2761.56 |   2761.52 |       0.05 |
| 229d32f98dbb4c7aa15d2242225f2892 | ftJournalDE |    2551.56 |   2551.52 |       0.05 |
| 79bec6a01fe84f3ead13dd402b43e45f | ftJournalDE |    2505.56 |   2505.52 |       0.05 |
| 3b2f12c90fb740e084fabdffe0eba9d4 | ftJournalDE |    2499.53 |   2499.52 |       0.02 |
| cfd306fd24c140a5be5e025815709a2d | ftJournalDE |    2494.56 |   2494.52 |       0.05 |
| 5c031a79d4d7470ab1b485713c3623db | ftJournalDE |    2442.56 |   2442.52 |       0.05 |
| 236b77b825a64a439115f9b1a76627aa | ftJournalDE |    2388.56 |   2388.52 |       0.05 |
+----------------------------------+-------------+------------+-----------+------------+
I checked that there are indeed open transactions for them, and I tried to close them for a couple of cashboxes (and waited for the daily closing and even re-created the cashboxes today), but the size is still the same. 🤔
Some questions about this:
Hey, yes, I think the default pool size could have been a factor.
We could try to add a config parameter for the Maximum Pool Size and see if increasing the value for that and reducing the replicas to 10 again also solves the issue.
how long after closing open transactions should we expect the size of these tables to drop?
Sorry, I was not clear on that. Closing the open transactions prevents rapid growth of the database in the future but it will not shrink the database.
At the moment we don't have an automated way to delete historical data. What we can do, though, is check on our side which data has already been uploaded (please provide us with a list of queues for that), and you can then remove the old ftJournalDE entries that were already uploaded.
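Once we've confirmed the cut-off for a queue, the cleanup could look roughly like this (just a sketch; the cut-off column and value are assumptions until we've checked what was uploaded, so please take a backup first):
-- Sketch for a single queue schema; @cutoff stands for whatever entry we confirm as uploaded
-- (the column used here is an assumption). Back up the schema before deleting anything.
DELETE FROM ftJournalDE WHERE TimeStamp <= @cutoff;
-- InnoDB does not shrink the data file after a DELETE on its own; rebuilding the table
-- reclaims the space (assuming file-per-table tablespaces; needs temporary disk space).
OPTIMIZE TABLE ftJournalDE;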
is it okay if we query for open transactions on every closing for every cashbox? (we have to automate this process somehow with the number of cashboxes we handle)
Yes, that is perfectly fine.
We could try to add a config parameter for the Maximum Pool Size and see if increasing the value for that and reducing the replicas to 10 again also solves the issue.
That sounds great 🙏
Yes, that is perfectly fine.
Okay, we'll do that soon-ish. I already closed all of the currently open ones via an ad-hoc script so I hope this gives us some time
Hi! Any news on this? We'd be excited to check whether the max pool size fixes our issues, as this could potentially allow us to use fewer compute resources and save quite a bit on infra costs 💸
Hey, we have not forgotten about you 😅 I hope we'll be able to tackle that in one of the next sprints
Hey, we've just released 1.3.47-rc1-bullseye, which introduces the MaxPoolSize configuration parameter for the MySQL queue and uses 500 as the default value 🚀
We tried the new version and it seems the issue still persists 😬 It was mostly fine for a couple of days with fewer instances, so I disabled our restarting cron, and today things exploded again with the same errors as before.
For now we're back to lots of instances and a cron that restarts them every night.
I did some investigation today and it seems that the pod started failing around 200-220 connections.
I'm going to re-create the scenario tomorrow and watch the CPU/memory usage more closely to be absolutely sure it's not about that. I should've done it before, but I somehow assumed that if the entire node's CPU usage is very low, then it's not an issue. But it seems that it's always an isolated pod that's failing, so that one pod might not be doing great there. I'll post my findings.
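For reference, I've been counting connections per pod with roughly this (grouping the server's process list by client host):
-- Rough count of open connections per client host (i.e. per pod).
SELECT SUBSTRING_INDEX(host, ':', 1) AS client_host,
       COUNT(*) AS connections
FROM information_schema.processlist
GROUP BY client_host
ORDER BY connections DESC;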
Quick update:
It seems that the issue is not about connection pooling (anymore?) and it's not about computing resources either 😬
Edit: One more data point: I left 14 pods running for the night but forgot to turn on our nightly auto-restart, and some of the pods started failing in the morning with very high CPU usage. I checked, and the usage kept growing throughout the night even though there were absolutely no transactions happening. This might point to something like a thread leak or some other uncleaned processes going on? 🤔
Hey, thanks for the detailed investigation. We'll investigate if we have a thread leak.
Could you also please get us one day's worth of logs of all pods (with log level debug)? This would help us a lot in analyzing the situation 🙏
Hey! Any news on this?
Hey there! I know we owe you a memory dump, but this is not fixed yet, even with a larger pool size.
Sorry, this was autoclosed by mistake
Describe the bug: Some time after starting them, requests to pods start failing and MySQL connection issues show up in the logs. These pods frequently fail their health checks and are eventually restarted by K8s.
To Reproduce: Let pods process requests for some time.
Expected behavior: Requests are processed correctly.
Screenshots: N/A
STDOUT/STDERR:
POSSystem (please complete the following information):
Cashbox Information (please complete the following information): The problem doesn't seem cashbox-dependent at all. Some examples:
Additional context: We have 1k+ active cashboxes.
The fewer instances we start, the sooner we start seeing these errors. With 3 instances, errors started showing up within minutes of starting them. With 20 instances it took a few hours.
After adding new instances, old ones are still experiencing the errors until they are restarted.
The database connection limit is 2000 and the number of active connections is always below 700. The errors were also showing up when we temporarily spun up a 96-CPU database instance, so a database issue is very unlikely.
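(For reference, we've been checking the configured limit and the live connection count roughly like this:)
-- Configured connection limit vs. currently open connections.
SHOW VARIABLES LIKE 'max_connections';
SHOW GLOBAL STATUS LIKE 'Threads_connected';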