fiskaltrust / product-de-bring-your-own-datacenter

Information about the fiskaltrust "Bring your own Data Center" product, which enables hosting the fiskaltrust.Middleware in data centers.

Failed to upload to api/service/queue. Statuscode: InternalServerError #76

Open Β· gregzip opened this issue 1 year ago

gregzip commented 1 year ago

Describe the bug We're seeing multiple errors with almost no detail, essentially just Failed to upload to api/service/queue. Statuscode: InternalServerError.

To Reproduce Let pods process requests for some time.

Expected behavior Requests are processed correctly.

Screenshots N/A

STDOUT/STDERR

fail: fiskaltrust.Middleware.Helper.Helipad.Services.HelipadService[0]
Failed to upload to api/service/queue. Statuscode: InternalServerError; Response message:

(yes, the response message is empty)

POSSystem (please complete the following information):

Cashbox Information (please complete the following information): Hard to say; we send hundreds of requests, some of which fail because of #74 and #75, and the error message doesn't include the cashbox information.

Additional context It's less important than #74 but still annoying as it triggers alerts on our side.

gregzip commented 1 year ago

I was able to isolate this problem on a Sunday, when there was almost no other load, and it seemed to be caused by one of the queues. I changed its configuration to use the newest package version and sent a null echo, and the issue is now gone.

As far as the issue is concerned, I think we're happy to close this, but I'd be interested to know why this worked. Are those configuration updates something we should do with other queues as well? Or is it about updating the config and not necessarily related to the configured package version? πŸ€”

I got the fix idea from this follow-up line in the logs:

fail: fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader[0]
Error while pushing configuration data for queue bab0b7ef-9595-436a-bc94-f1e580da6a09
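
In case it helps others reproduce the workaround, here's a minimal sketch of sending a DE zero receipt to the queue from the log line above (adjust accordingly if your "null echo" means the plain Echo call instead). The /v1/sign route, the cashbox GUID and the terminal/reference values are placeholders and assumptions, not taken from this thread.

```csharp
// Minimal sketch, not production code: send a DE zero receipt to the queue from the
// log line above. The /v1/sign route, the cashbox GUID and the terminal ID are
// placeholders/assumptions; adapt them to your own cashbox and REST hosting.
using System;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

class ZeroReceiptSender
{
    static async Task Main()
    {
        var request = new
        {
            ftCashBoxID = "00000000-0000-0000-0000-000000000000",  // placeholder cashbox ID
            ftQueueID = "bab0b7ef-9595-436a-bc94-f1e580da6a09",    // queue from the log above
            cbTerminalID = "ops-terminal",                         // placeholder
            cbReceiptReference = $"zero-receipt-{Guid.NewGuid()}",
            cbReceiptMoment = DateTime.UtcNow,
            cbChargeItems = Array.Empty<object>(),
            cbPayItems = Array.Empty<object>(),
            ftReceiptCase = 0x4445000000000002                     // DE zero receipt
        };

        using var http = new HttpClient { BaseAddress = new Uri("http://localhost:1500/") };
        var content = new StringContent(JsonSerializer.Serialize(request), Encoding.UTF8, "application/json");

        // Assumed route layout: {queueId}/v1/sign on the local REST host from the log.
        var response = await http.PostAsync($"{request.ftQueueID}/v1/sign", content);
        Console.WriteLine($"{(int)response.StatusCode}: {await response.Content.ReadAsStringAsync()}");
    }
}
```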
gregzip commented 1 year ago

Unfortunately this wasn't a permanent fix: after some time, both errors started showing up again for the same queue/cashbox.

gregzip commented 1 year ago

One more small update: the two kinds of errors, Failed to upload to api/service/queue. and Error while pushing configuration data for queue, seem to be related (you probably know that better than me 😬), but they don't come in pairs. We sometimes get quite a few Failed to upload to api/service/queue. errors for no or only a single instance of the latter. This suggests other queues/cashboxes are affected as well, but the log message doesn't say which ones.

volllly commented 1 year ago

Hey, thanks for the update. With the added information we were able to find the problem πŸŽ‰

We'll release a fix this week or at the beginning of next week. I'll notify you here once it's out.

volllly commented 1 year ago

Hey, my estimation was off by a bit πŸ˜…

We've just released 1.3.46-bullseye and 1.3.46-jammy including a bugfix of this issue πŸš€

gregzip commented 1 year ago

Hey! We've just deployed 1.3.46-bullseye, and so far it seems there are fewer of these errors, but they're not 100% gone:

fail: fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader[0]
      Error while pushing configuration data for queue bab0b7ef-9595-436a-bc94-f1e580da6a09 (UrlType: rest; Url: rest://localhost:1500/bab0b7ef-9595-436a-bc94-f1e580da6a09)
      fiskaltrust.Middleware.Helper.Helipad.Exceptions.HelipadException
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadService.UploadDataRequestAsync(String url, Object item)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadService.UploadQueueListAsync(List`1 data)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.UploadFromStreamAsync(Stream stream)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.UploadConfigurationEntities(IPOS worker)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.ProcessUploadWith(ClientConfiguration clientConfiguration, Guid queueId)
--
gregzip commented 1 year ago

Hey! Any updates on this? I was doing some cleanup work and ran a couple of queries to see how common the issue is.

This is the graph with messages matching /Error while pushing configuration data/ and /Failed to upload/ excluded: (image attached)

This is the graph with those 2 included: (image attached)

Not only are these errors still there, they also make up most of the errors we're seeing in the logs at the moment.

volllly commented 1 year ago

Hey, we also haven't forgotten about this one 😁 we just need to find the time to tackle it.

gregzip commented 1 year ago

Hey! Any news on this?

mertcanyalhi commented 1 year ago

We've started to observe the same issue after upgrading from 1.3.44 (1.3-buster) to 1.3.46 (1.3-bullseye). Apart from the errors mentioned above, we're observing the following error:

fail: fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader[0]
      Error while pushing entity data (341, journalde)
      System.Net.Http.HttpRequestException: Response status code does not indicate success: 503 (Service Unavailable).
         at System.Net.Http.HttpResponseMessage.EnsureSuccessStatusCode()
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadService.GetPointerAsync(Guid queueId, String entityValue)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.UploadEntities(IPOS worker, Guid queueId, KeyValuePair`2 entity, Int32 uploadReadCount)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.UploadEntityAsync(ClientConfiguration clientConfiguration, Guid queueId, KeyValuePair`2 entity, Int32 startRecordCount, Action`1 setRecordCount)

After digging into this issue further, we noticed that the requests against Helipad (https://helipad.fiskaltrust.cloud/) are failing (e.g. while performing HelipadService.GetPointerAsync or HelipadService.UploadDataRequestAsync, which use the HttpClient to communicate with Helipad).

When this issue is observed, our requests against the ByoDC middleware are timing out (our HTTP client timeout is set to 10 seconds for operational reasons); can we please ask for your input/help on this?
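
For reference, a stripped-down sketch of how the 10-second limit behaves on our side (placeholder host, queue ID and route, not our actual client): the timeout surfaces as a TaskCanceledException, which is what we count as a ByoDC timeout, while the Helipad failures only show up as 5xx entries in the middleware logs.

```csharp
// Stripped-down sketch of the client side (placeholder host, queue ID and route,
// not our actual code): the 10 s HttpClient.Timeout turns a slow ByoDC answer
// into a TaskCanceledException, which is what we count as a "timeout".
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class ByodcClientSketch
{
    static readonly HttpClient Http = new HttpClient
    {
        BaseAddress = new Uri("http://byodc.internal:1500/"),  // placeholder address
        Timeout = TimeSpan.FromSeconds(10)                      // the operational limit mentioned above
    };

    static async Task Main()
    {
        var queueId = "00000000-0000-0000-0000-000000000000";  // placeholder queue ID
        var body = new StringContent("{}", Encoding.UTF8, "application/json");  // placeholder payload

        try
        {
            // Assumed sign route on the queue endpoint; adapt to your hosting.
            var response = await Http.PostAsync($"{queueId}/v1/sign", body);
            Console.WriteLine($"ByoDC answered with {(int)response.StatusCode}");
        }
        catch (TaskCanceledException)
        {
            // ByoDC did not answer within 10 seconds.
            Console.WriteLine("Request against ByoDC timed out");
        }
        catch (HttpRequestException ex)
        {
            // Connection-level failure (pod unreachable, connection reset, ...).
            Console.WriteLine($"Request against ByoDC failed: {ex.Message}");
        }
    }
}
```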

volllly commented 11 months ago

Hey, can you check if these problems are fixed in the 1.3.49-bullseye/1.3.49-jammy releases?

mertcanyalhi commented 11 months ago

Hey @volllly,

Timeouts

5xx responses in logs

fail: fiskaltrust.Middleware.Helper.Helipad.Services.HelipadService[0]
      Failed to upload to api/service/queuede. Statuscode: InternalServerError; Response message: 
fail: fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader[0]
      Error while pushing configuration data for queue 73257a4a-70a6-4f8d-89cd-581e82eae30d (UrlType: rest; Url: rest://localhost:1500/73257a4a-70a6-4f8d-89cd-581e82eae30d)
      fiskaltrust.Middleware.Helper.Helipad.Exceptions.HelipadException
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadService.UploadDataRequestAsync(String url, Object item)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadService.UploadeQueueDEListAsync(List`1 data)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.UploadFromStreamAsync(Stream stream)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.UploadConfigurationEntities(IPOS worker)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.ProcessUploadWith(ClientConfiguration clientConfiguration, Guid queueId)
fail: fiskaltrust.Middleware.Helper.Helipad.Services.HelipadService[0]
      Failed to upload to api/service/signaturcreationunitde. Statuscode: ServiceUnavailable; Response message: 
fail: fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader[0]
      Error while pushing configuration data for queue 83b65314-d793-49e4-a56f-634f7c650c8b (UrlType: rest; Url: rest://localhost:1500/83b65314-d793-49e4-a56f-634f7c650c8b)
      fiskaltrust.Middleware.Helper.Helipad.Exceptions.HelipadException
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadService.UploadDataRequestAsync(String url, Object item)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadService.UploadSignatureCreationUnitDEListAsync(List`1 data)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.UploadFromStreamAsync(Stream stream)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.UploadConfigurationEntities(IPOS worker)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.ProcessUploadWith(ClientConfiguration clientConfiguration, Guid queueId)
fail: fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader[0]
      Error while pushing entity data (3, queueitem)
      System.Net.Http.HttpRequestException: Response status code does not indicate success: 503 (Service Unavailable).
         at System.Net.Http.HttpResponseMessage.EnsureSuccessStatusCode()
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadService.GetPointerAsync(Guid queueId, String entityValue)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.UploadEntities(IPOS worker, Guid queueId, KeyValuePair`2 entity, Int32 uploadReadCount)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.UploadEntityAsync(ClientConfiguration clientConfiguration, Guid queueId, KeyValuePair`2 entity, Int32 startRecordCount, Action`1 setRecordCount)
mertcanyalhi commented 11 months ago

Hey @volllly, we tried to find a correlation between the ByoDC request timeouts and the errors, and noticed the following:

At this point, we're not sure if the timeouts are caused by the Helipad uploads, but just wanted to provide input. We'll change the log verbosity to see if we can catch any further logs.

mertcanyalhi commented 11 months ago

Hey @volllly, we observed a similar behavior: after we restarted the pods, we did not see any request timeouts against ByoDC for ~3 days. Then the timeouts started again, and one of the pods restarted after it became unhealthy.

Before the pod restarted, there was a small spike in memory usage (the chart below shows the memory usage percentage, where 100% corresponds to 1Gi):

(image attached)

After the pod restarted, the spikes in CPU usage dropped, even though the pod was receiving the exact same traffic for 13 out of 15 cashboxes:

(image attached)

The CPU usage of the affected pod shows an increasing number of spikes over time:

(image attached)

There is also an increase in CPU usage spikes for all pods, though it depends on the number of transactions processed by the cashboxes they currently hold:

(image attached)

It looks like the increase in average CPU usage can also be grouped into the following clusters:

(image attached)

We also observed an increase in memory usage over time, which may be an indicator of a leak πŸ€”:

(image attached)

There are spikes in the increases that are caused by zero receipt requests (0x4445000000000002) with the TSE-device TAR-file download request flag (0x0000000002000000) set:

(image attached)
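
Just to spell out what these requests carry: the receipt case is the two values above OR-ed together. A small sketch (the constant names are ours, not from the fiskaltrust interface):

```csharp
// Sketch based on the flag values quoted above (constant names are ours).
using System;

const long ZeroReceiptDE      = 0x4445000000000002;  // DE zero receipt
const long TseTarDownloadFlag = 0x0000000002000000;  // TSE-device TAR-file download request

long ftReceiptCase = ZeroReceiptDE | TseTarDownloadFlag;
Console.WriteLine($"ftReceiptCase = 0x{ftReceiptCase:X16}");  // 0x4445000002000002
```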

Logs for the pod that was restarted are attached below:

byodc-7f84d56544-6k8fl.log

volllly commented 11 months ago

Hey, sorry for not getting back to you earlier. Could you possibly provide us with a memory dump of a pod while it is experiencing the problems? (You can upload it here.)

FYI Here is a KBA on how to get a memory dump.
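
If running the tooling from the KBA inside the pods is awkward, the dump can also be triggered programmatically. A rough sketch, assuming the Microsoft.Diagnostics.NETCore.Client NuGet package is available and that it runs inside the same container as the middleware process:

```csharp
// Rough sketch (assumptions: runs inside the same container as the middleware,
// Microsoft.Diagnostics.NETCore.Client is referenced): writes a heap dump of the
// first .NET process it finds that isn't itself.
using System;
using System.Diagnostics;
using System.IO;
using System.Linq;
using Microsoft.Diagnostics.NETCore.Client;

class DumpCollector
{
    static void Main()
    {
        int self = Process.GetCurrentProcess().Id;

        // .NET processes expose a diagnostics IPC channel we can attach to.
        int pid = DiagnosticsClient.GetPublishedProcesses().First(p => p != self);

        var client = new DiagnosticsClient(pid);
        string path = Path.Combine(Path.GetTempPath(),
            $"byodc-{pid}-{DateTime.UtcNow:yyyyMMddHHmmss}.dmp");

        // WithHeap includes the managed heap, which is what's needed to chase a leak.
        client.WriteDump(DumpType.WithHeap, path);
        Console.WriteLine($"Dump written to {path}");
    }
}
```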

volllly commented 11 months ago

@gregzip @mertcanyalhi we've released 1.3.50-rc1-jammy/1.3.50-rc1-bullseye which should fix some of the issues with timeouts and memory usage.

volllly commented 11 months ago

Also we just deployed an update to the sandbox helipad that should fix the 500/503 upload errors.

mertcanyalhi commented 10 months ago

@volllly Unfortunately, 1.3.50-rc1-bullseye did not solve the issue for us. We're still observing timeouts, increasing memory usage, and InternalServerError/ServiceUnavailable responses.

(image attached)

We tried to capture a memory dump multiple times, but unfortunately we hit the pod's memory limit when taking a dump after the memory usage has already increased. We managed to capture and upload a dump from the pod with the lowest memory usage, but I'm not sure how helpful it will be. We'll increase the memory limit once again and try to capture another dump once we observe the memory usage increasing.

mertcanyalhi commented 10 months ago

@volllly We've uploaded more memory dumps, captured from multiple pods over the past 3 days. We're observing InternalServerError responses from Helipad in the logs, and also increased latency for requests against ByoDC whenever Helipad returns those responses.

mertcanyalhi commented 10 months ago

@volllly The following graph shows the memory usage of a ByoDC pod; once the memory usage started to increase, we started to observe the following more frequently:

(image attached)

mertcanyalhi commented 10 months ago

@volllly We noticed that the number of SELECT queries grows with signature requests over time, and there is even a huge number of SELECT queries while idle. When the pods are restarted, the number of SELECT queries drops to zero and starts to increase again as traffic comes in.

The growing number of SELECT queries and the drops after pod restarts can be seen in the chart below:

(image attached)

Yesterday, while traffic was low (85 signature requests in total from 3 cashboxes), we captured the SELECT queries (the orange line on the chart above).

The following breakdown shows the number of SELECT queries in a 2-minute window during which there was only a single signature request:

SELECT Queries: 3903
 |_ `select * from ftCashBox`: 279
 |_ `select * from ftQueue`: 280
 |_ `select * from ftQueueAT`: 279
 |_ `select * from ftQueueDE`: 280
 |_ `select * from ftQueueFR`: 279
 |_ `select * from ftQueueME`: 279
 |_ `select * from ftQueueItem WHERE TimeStamp >= {0} ORDER BY TimeStamp LIMIT {1}`: 278
 |_ `select * from ftActionJournal WHERE TimeStamp >= {0} ORDER BY TimeStamp LIMIT {1}`: 278
 |_ `select * from ftReceiptJournal WHERE TimeStamp >= {0} ORDER BY TimeStamp LIMIT {1}`: 279
 |_ `select * from ftJournalDE WHERE TimeStamp >= {0} ORDER BY TimeStamp LIMIT {1}`: 278
 |_ `select * from ftSignaturCreationUnitAT`: 278
 |_ `select * from ftSignaturCreationUnitDE`: 278
 |_ `select * from ftSignaturCreationUnitFR`: 278
 |_ `select * from ftSignaturCreationUnitME`: 278
 |_ `Select * from OpenTransaction where cbReceiptReference = {0};`: 2

The following breakdown shows the number of SELECT queries in a 2-minute window right after the pods were restarted (6 workers were initialized on restart):

SELECT Queries:
|_ `select * from ftCashBox`: 4
|_ `select * from ftQueue`: 4
|_ `select * from ftQueueAT`: 4
|_ `select * from ftQueueDE`: 4
|_ `select * from ftQueueFR`: 4
|_ `select * from ftQueueME`: 4
|_ `select * from ftQueueItem WHERE TimeStamp >= {0} ORDER BY TimeStamp LIMIT {1}`: 4
|_ `select * from ftActionJournal WHERE TimeStamp >= {0} ORDER BY TimeStamp LIMIT {1}`: 4
|_ `select * from ftReceiptJournal WHERE TimeStamp >= {0} ORDER BY TimeStamp LIMIT {1}`: 4
|_ `select * from ftJournalDE WHERE TimeStamp >= {0} ORDER BY TimeStamp LIMIT {1}`: 4
|_ `select * from ftSignaturCreationUnitAT`: 4
|_ `select * from ftSignaturCreationUnitDE`: 4
|_ `select * from ftSignaturCreationUnitFR`: 4
|_ `select * from ftSignaturCreationUnitME`: 4
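
For a rough sense of scale (assuming the two windows are comparable): the first window contains 3903 SELECTs in 2 minutes, i.e. roughly 32 queries per second around a single signature request, while the second window adds up to only 14 Γ— 4 = 56 queries. That is roughly a 70x difference between a long-running pod and a freshly restarted one, which to us looks like the polling/upload work accumulates over time instead of staying constant.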
volllly commented 10 months ago

Hey, we've pushed a fix to Helipad; you should not see 500 and 503 errors anymore.

TSchmiedlechner commented 8 months ago

Hey @mertcanyalhi, are you still experiencing this issue, or did the fix deployed in November solve the problem?

mertcanyalhi commented 8 months ago

Hey @TSchmiedlechner

We're currently using 1.3.54-bullseye.

gregzip commented 7 months ago

We're still seeing heaps of errors like this daily:

fail: fiskaltrust.SignatureCloud.DE.Services.HelipadUploader[0]
      Error while pushing entity data (2, receiptjournal)
      System.Net.WebException: The remote server returned an error: (500) Internal Server Error.
         at System.Net.HttpWebRequest.EndGetResponse(IAsyncResult asyncResult)
         at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)

At some point I just stopped looking at them and decided to act only when clients are seriously affected, but it's something that would be good to resolve: the very reason we monitor this is to avoid situations where a client has to call us πŸ™ƒ

(we're at 1.3.55-rc1, also restarting pods daily so that they don't die one by one during the day)