Open gregzip opened 1 year ago
I was able to isolate this problem on a Sunday, when there's almost no other load, and it seemed to be caused by one of the queues. I changed its configuration to use the newest package version and sent a null echo, and the issue is now gone.
As far as the issue is concerned, I think we're happy to close this, but I'd be interested to know why this worked. Are those configuration updates something we should do for other queues as well? Or is it about updating the config in general, not necessarily related to the configured package version?
I got the fix idea from this follow-up line in the logs:
fail: fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader[0]
Error while pushing configuration data for queue bab0b7ef-9595-436a-bc94-f1e580da6a09
Unfortunately this wasn't a permanent fix - after some time both errors started showing up again for the same queue / cashbox.
One more little update: these two kinds of errors, `Failed to upload to api/service/queue.` and `Error while pushing configuration data for queue`, seem to be related (you probably know that better than I do), but they don't come in pairs. We sometimes get quite a few occurrences of `Failed to upload to api/service/queue.` with zero or only one instance of the latter. This suggests that other queues / cashboxes are affected as well, but the log message doesn't say which ones.
Hey, thanks for the update. With the added information we were able to find the problem.
We'll release a fix this week or at the beginning of next week. I'll notify you here once it's out.
Hey, my estimation was off by a bit.
We've just released `1.3.46-bullseye` and `1.3.46-jammy`, including a bugfix for this issue.
Hey! Just released `1.3.46-bullseye`, and so far it seems there are fewer of these errors, but they're not 100% gone:
fail: fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader[0]
Error while pushing configuration data for queue bab0b7ef-9595-436a-bc94-f1e580da6a09 (UrlType: rest; Url: rest://localhost:1500/bab0b7ef-9595-436a-bc94-f1e580da6a09)
fiskaltrust.Middleware.Helper.Helipad.Exceptions.HelipadException
at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadService.UploadDataRequestAsync(String url, Object item)
at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadService.UploadQueueListAsync(List`1 data)
at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.UploadFromStreamAsync(Stream stream)
at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.UploadConfigurationEntities(IPOS worker)
at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.ProcessUploadWith(ClientConfiguration clientConfiguration, Guid queueId)
--
Hey! Any updates on this? I was doing some cleanup work and ran a couple of queries to see how common the issue is.
This is the graph with messages matching /Error while pushing configuration data/ and /Failed to upload/ excluded:
This is the graph with those 2 included:
Not only are these errors still there, they also make up most of the errors we're seeing in the logs at the moment.
Hey, we also haven't forgotten about this one; we just need to find the time to tackle it.
Hey! Any news on this?
We've started to observe the same issue after upgrading from 1.3.44 (`1.3-buster`) to 1.3.46 (`1.3-bullseye`). Apart from the errors mentioned above, we're observing the following error:
fail: fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader[0]
Error while pushing entity data (341, journalde)
System.Net.Http.HttpRequestException: Response status code does not indicate success: 503 (Service Unavailable).
at System.Net.Http.HttpResponseMessage.EnsureSuccessStatusCode()
at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadService.GetPointerAsync(Guid queueId, String entityValue)
at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.UploadEntities(IPOS worker, Guid queueId, KeyValuePair`2 entity, Int32 uploadReadCount)
at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.UploadEntityAsync(ClientConfiguration clientConfiguration, Guid queueId, KeyValuePair`2 entity, Int32 startRecordCount, Action`1 setRecordCount)
After digging into this issue further, we noticed that the requests against Helipad (https://helipad.fiskaltrust.cloud/) are failing (e.g. while performing `HelipadService.GetPointerAsync` or `HelipadService.UploadDataRequestAsync`, both of which use the `HttpClient` for communicating with Helipad).
When this issue is observed, our requests against the ByoDC middleware are timing out (our HTTP client timeout is set to 10 seconds due to operational reasons); can we please ask for your input/help on this?
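To make the timeout behavior we describe concrete, here is a minimal, generic Python sketch (not our actual client code; the stalling local server below just simulates a middleware that accepts a connection and then stops responding, and the 2-second timeout stands in for the 10 seconds we use in production):

```python
import socket
import threading
import urllib.request
import urllib.error

TIMEOUT_SECONDS = 2  # scaled down from the 10 s mentioned above

def fetch_with_timeout(url: str) -> str:
    """Return the response body, or 'timeout' if the server stalls too long."""
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as resp:
            return resp.read().decode()
    except socket.timeout:
        # Timed out while waiting for the response (the symptom we observe).
        return "timeout"
    except urllib.error.URLError as err:
        # Timeouts during connection setup arrive wrapped in URLError.
        if isinstance(err.reason, socket.timeout):
            return "timeout"
        raise

def _stalling_server() -> int:
    """A local server that accepts connections but never replies."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(5)
    conns = []
    def accept_loop():
        while True:
            conn, _ = srv.accept()
            conns.append(conn)  # keep the connection open, never answer
    threading.Thread(target=accept_loop, daemon=True).start()
    return srv.getsockname()[1]

port = _stalling_server()
print(fetch_with_timeout(f"http://127.0.0.1:{port}/"))  # prints "timeout"
```

The point of the sketch is only that once the middleware stalls past the client timeout, every such request surfaces as a hard timeout error on our side, regardless of what the middleware eventually does.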
Hey, can you check if these problems are fixed in the `1.3.49-bullseye`/`1.3.49-jammy` releases?
Hey @volllly, with those releases we're still seeing:

- Timeouts
- 5xx responses in the logs:
fail: fiskaltrust.Middleware.Helper.Helipad.Services.HelipadService[0]
Failed to upload to api/service/queuede. Statuscode: InternalServerError; Response message:
fail: fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader[0]
Error while pushing configuration data for queue 73257a4a-70a6-4f8d-89cd-581e82eae30d (UrlType: rest; Url: rest://localhost:1500/73257a4a-70a6-4f8d-89cd-581e82eae30d)
fiskaltrust.Middleware.Helper.Helipad.Exceptions.HelipadException
at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadService.UploadDataRequestAsync(String url, Object item)
at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadService.UploadeQueueDEListAsync(List`1 data)
at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.UploadFromStreamAsync(Stream stream)
at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.UploadConfigurationEntities(IPOS worker)
at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.ProcessUploadWith(ClientConfiguration clientConfiguration, Guid queueId)
fail: fiskaltrust.Middleware.Helper.Helipad.Services.HelipadService[0]
Failed to upload to api/service/signaturcreationunitde. Statuscode: ServiceUnavailable; Response message:
fail: fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader[0]
Error while pushing configuration data for queue 83b65314-d793-49e4-a56f-634f7c650c8b (UrlType: rest; Url: rest://localhost:1500/83b65314-d793-49e4-a56f-634f7c650c8b)
fiskaltrust.Middleware.Helper.Helipad.Exceptions.HelipadException
at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadService.UploadDataRequestAsync(String url, Object item)
at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadService.UploadSignatureCreationUnitDEListAsync(List`1 data)
at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.UploadFromStreamAsync(Stream stream)
at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.UploadConfigurationEntities(IPOS worker)
at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.ProcessUploadWith(ClientConfiguration clientConfiguration, Guid queueId)
fail: fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader[0]
Error while pushing entity data (3, queueitem)
System.Net.Http.HttpRequestException: Response status code does not indicate success: 503 (Service Unavailable).
at System.Net.Http.HttpResponseMessage.EnsureSuccessStatusCode()
at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadService.GetPointerAsync(Guid queueId, String entityValue)
at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.UploadEntities(IPOS worker, Guid queueId, KeyValuePair`2 entity, Int32 uploadReadCount)
at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.UploadEntityAsync(ClientConfiguration clientConfiguration, Guid queueId, KeyValuePair`2 entity, Int32 startRecordCount, Action`1 setRecordCount)
Hey @volllly, we tried to find a correlation between the ByoDC request timeouts and the errors, and noticed the following:
Failed to upload to api/service/queue. Statuscode: InternalServerError
At this point, we're not sure if the timeouts are caused by the Helipad uploads, but just wanted to provide input. We'll change the log verbosity to see if we can catch any further logs.
Hey @volllly, we observed a similar behavior; once we restarted the pods, we did not have any request timeouts against ByoDC for ~3 days. After 3 days, we started to observe timeouts, and one of the pods restarted after it became unhealthy.
Before the pod restarted, there was a small spike in the memory usage (the chart given below shows the memory usage percentage where 100% corresponds to 1Gi):
After the pod restarted, the spikes in CPU usage dropped, even though the pod was receiving the exact same traffic for 13 out of 15 cashboxes:
The CPU usage seems to show an increasing number of spikes over time for the affected pod:
And there is an increase in CPU usage spikes for all pods, but it depends on the # of transactions processed by the cashboxes they currently hold:
It looks like the increase in average CPU usage can also be grouped in the following clusters:
We also observed an increase in memory usage over time, which may be an indicator of a leak:
There are spikes in the increases that are caused by zero receipt requests (`0x4445000000000002`) with the TSE-device TAR file download request flag (`0x0000000002000000`):
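As a side note on those two values: combining the zero-receipt case with the TAR-export flag is a bitwise OR of the constants quoted above (a quick sketch; the constant names are ours, not from the middleware):

```python
# The two values mentioned above: a zero receipt request plus the
# TSE-device TAR file download request flag, combined by bitwise OR.
ZERO_RECEIPT = 0x4445000000000002
TAR_EXPORT_FLAG = 0x0000000002000000

combined = ZERO_RECEIPT | TAR_EXPORT_FLAG
print(hex(combined))  # 0x4445000002000002
```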
Attached logs for the pod that was restarted below:
@gregzip @mertcanyalhi we've released `1.3.50-rc1-jammy`/`1.3.50-rc1-bullseye`, which should fix some of the issues with timeouts and memory usage.
Also, we've just deployed an update to the sandbox Helipad that should fix the 500/503 upload errors.
@volllly Unfortunately, `1.3.50-rc1-bullseye` did not solve the issue for us. We're still observing timeouts, increasing memory usage, and `InternalServerError`/`ServiceUnavailable` responses.
We tried to get a memory dump multiple times, but unfortunately we hit the memory limit when capturing a dump after the memory usage has increased. We were able to get a dump from the pod with the lowest memory usage and uploaded it, but I'm not sure if it'll help. We'll increase the memory limit once again and try to get another dump once we observe increasing memory usage.
@volllly Uploaded more memory dumps for multiple pods that were captured in the past 3 days.
We're observing `InternalServerError` responses from Helipad in the logs, and also increased latency for requests against ByoDC when Helipad returns `InternalServerError` responses.
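As a general aside, transient 500/503 responses like these are commonly guarded on the client side with a bounded retry and exponential backoff. A minimal, generic sketch of that pattern (illustrative only, not the middleware's actual upload code; `flaky_upload` is a made-up stand-in for an upload call):

```python
import time
import random

class TransientServerError(Exception):
    """Stand-in for a 500/503 response from the upstream service."""

def retry_with_backoff(operation, max_attempts=5, base_delay=0.5):
    """Run `operation`, retrying transient failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientServerError:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            # Full-jitter exponential backoff: window doubles each attempt.
            delay = random.uniform(0, base_delay * 2 ** (attempt - 1))
            time.sleep(delay)

# Example: an upload that fails twice with a 503, then succeeds.
calls = {"n": 0}
def flaky_upload():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientServerError("503 Service Unavailable")
    return "uploaded"

print(retry_with_backoff(flaky_upload, base_delay=0.01))  # prints "uploaded"
```

This only papers over transient failures, of course; it wouldn't help with the sustained error bursts described in this thread.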
@volllly The following graph shows the memory usage of a ByoDC pod; once the memory usage started to increase, we started to observe the following more frequently:

- increased latency for requests against ByoDC (even for `GET /api/version`)
- `InternalServerError` responses from Helipad

@volllly We noticed that the # of `SELECT` queries grows with signature requests over time, and that there is a huge amount of `SELECT` queries even when idle. When the pods are restarted, the # of `SELECT` queries drops to 0 and starts to increase again when there is traffic.
The growing # of `SELECT` queries and the drops after pod restarts can be observed in the chart given below:
Yesterday, while there was low traffic (85 signature requests in total from 3 cashboxes), we captured the `SELECT` queries (orange line on the chart above).
The following breakdown shows the # of `SELECT` queries in a 2-minute window when there was only a single signature request:
SELECT Queries: 3903
|_ `select * from ftCashBox`: 279
|_ `select * from ftQueue`: 280
|_ `select * from ftQueueAT`: 279
|_ `select * from ftQueueDE`: 280
|_ `select * from ftQueueFR`: 279
|_ `select * from ftQueueME`: 279
|_ `select * from ftQueueItem WHERE TimeStamp >= {0} ORDER BY TimeStamp LIMIT {1}`: 278
|_ `select * from ftActionJournal WHERE TimeStamp >= {0} ORDER BY TimeStamp LIMIT {1}`: 278
|_ `select * from ftReceiptJournal WHERE TimeStamp >= {0} ORDER BY TimeStamp LIMIT {1}`: 279
|_ `select * from ftJournalDE WHERE TimeStamp >= {0} ORDER BY TimeStamp LIMIT {1}`: 278
|_ `select * from ftSignaturCreationUnitAT`: 278
|_ `select * from ftSignaturCreationUnitDE`: 278
|_ `select * from ftSignaturCreationUnitFR`: 278
|_ `select * from ftSignaturCreationUnitME`: 278
|_ `Select * from OpenTransaction where cbReceiptReference = {0};`: 2
The following breakdown shows the # of `SELECT` queries in a 2-minute window after the pods were restarted (6 workers were initialized after the pods were restarted):
SELECT Queries:
|_ `select * from ftCashBox`: 4
|_ `select * from ftQueue`: 4
|_ `select * from ftQueueAT`: 4
|_ `select * from ftQueueDE`: 4
|_ `select * from ftQueueFR`: 4
|_ `select * from ftQueueME`: 4
|_ `select * from ftQueueItem WHERE TimeStamp >= {0} ORDER BY TimeStamp LIMIT {1}`: 4
|_ `select * from ftActionJournal WHERE TimeStamp >= {0} ORDER BY TimeStamp LIMIT {1}`: 4
|_ `select * from ftReceiptJournal WHERE TimeStamp >= {0} ORDER BY TimeStamp LIMIT {1}`: 4
|_ `select * from ftJournalDE WHERE TimeStamp >= {0} ORDER BY TimeStamp LIMIT {1}`: 4
|_ `select * from ftSignaturCreationUnitAT`: 4
|_ `select * from ftSignaturCreationUnitDE`: 4
|_ `select * from ftSignaturCreationUnitFR`: 4
|_ `select * from ftSignaturCreationUnitME`: 4
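One hypothetical explanation consistent with these numbers (purely our speculation, sketched below): if each re-initialization of an uploader spawns a new periodic polling loop without the previous one ever being cancelled, the per-interval query count grows with every re-init. A tiny simulation of that effect (all names and numbers are made up for the sketch, except the 14 periodically-polled tables, which matches the breakdown above):

```python
# Speculative illustration only: how per-interval SELECT counts would grow
# if every (re)initialization added a polling loop that is never stopped.

class PollingSimulator:
    def __init__(self):
        self.active_loops = 0
        self.queries_per_tick = []

    def initialize_worker(self):
        # Bug pattern: the old loop keeps running; a new one is added on top.
        self.active_loops += 1

    def tick(self):
        # Each live loop issues one SELECT per polled table per tick.
        tables_polled = 14  # 14 distinct periodic SELECTs in the breakdown above
        self.queries_per_tick.append(self.active_loops * tables_polled)

sim = PollingSimulator()
for restart in range(3):
    sim.initialize_worker()  # e.g. triggered by reconnects/re-inits over time
    sim.tick()

print(sim.queries_per_tick)  # [14, 28, 42] -- grows with every re-initialization
```

That would match both observations: counts near zero right after a pod restart (few loops alive), and counts in the hundreds per table after the process has been running for a while.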
Hey, we've pushed a fix to Helipad; you should not see 500 and 503 errors anymore.
Hey @mertcanyalhi, are you still experiencing this issue, or did the fix deployed in November solve the problem?
Hey @TSchmiedlechner,

`SELECT` queries / growing memory usage: we are now restarting ByoDC pods daily, but we can still see a daily increase pattern in the # of `SELECT` queries & memory usage.

We're currently using `1.3.54-bullseye`.
We're still seeing heaps of errors daily like:
fail: fiskaltrust.SignatureCloud.DE.Services.HelipadUploader[0]
Error while pushing entity data (2, receiptjournal)
System.Net.WebException: The remote server returned an error: (500) Internal Server Error.
at System.Net.HttpWebRequest.EndGetResponse(IAsyncResult asyncResult)
at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
At some point I just stopped looking at them and decided to only act when clients are seriously affected, but it's something that would be good to resolve; the very reason we monitor this is to avoid situations where a client has to call us.
(we're at `1.3.55-rc1`, also restarting pods daily so that they don't die one by one during the day)
**Describe the bug**
We're seeing multiple errors with almost no description, pretty much just `Failed to upload to api/service/queue. Statuscode: InternalServerError`.

**To Reproduce**
Let pods process requests for some time.

**Expected behavior**
Requests are processed correctly.

**Screenshots**
N/A

**STDOUT/STDERR**
(yes, the response message is empty)

**POSSystem (please complete the following information):**

**Cashbox Information (please complete the following information):**
Hard to say: we send hundreds of requests, some of them fail because of #74 and #75, and the error message doesn't contain it.

**Additional context**
It's less important than #74 but still annoying, as it triggers alerts on our side.