bcgov / common-object-management-service

A microservice for managing access control to S3 Objects
https://bcgov.github.io/common-object-management-service/
Apache License 2.0

Objection query promises do not handle errors #151

Closed pbolduc closed 1 year ago

pbolduc commented 1 year ago

Describe the bug

We are seeing client timeouts, the same situation as in closed issue #134. During high query activity while searching for files, requests time out on the client, whose default timeout is 100 seconds. In various cases, calls to the COMS service never return, and because they never return, nothing is logged in the console of the COMS pods. Tracking it down, the most likely cause is unresolved promises, which led me to look at where promises are being resolved. I am not very proficient in JavaScript, so I may not be correct.

The problem definitely occurs when the system is under load. For example, when COMS receives about 20-30 requests/second, we start to get timeouts.

In the various services that query the database, the then handler is not given an error handler parameter. If the errors are not handled, the service could return SQL and other information about the application. See Error handling in the Objection documentation.
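To illustrate the concern, here is a minimal sketch in plain JavaScript, not taken from the COMS source. `fakeQuery`, `searchUnhandled`, and `searchHandled` are hypothetical names, and `fakeQuery` only stands in for an Objection query, which behaves like a promise that rejects when the database call fails.

```javascript
// Hypothetical stand-in for an Objection query (assumption for illustration).
// It rejects with an error whose message contains SQL-like detail.
function fakeQuery(shouldFail) {
  return shouldFail
    ? Promise.reject(new Error('select * from "object" - connection terminated'))
    : Promise.resolve([{ id: 'abc' }]);
}

// Only a fulfilment handler is passed to then(), so a rejection
// propagates to the caller unchanged, raw message included.
function searchUnhandled(shouldFail) {
  return fakeQuery(shouldFail).then(rows => rows.map(row => row.id));
}

// Same query with a rejection handler: log the detail on the server,
// then rethrow a sanitized error so callers never see the raw message.
function searchHandled(shouldFail) {
  return fakeQuery(shouldFail)
    .then(rows => rows.map(row => row.id))
    .catch(err => {
      console.error('query failed:', err.message);
      throw new Error('Database query failed');
    });
}
```

In this sketch, whoever awaits `searchUnhandled(true)` receives the original error, while `searchHandled(true)` rejects with a generic message and leaves the detail in the server log.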

Additionally, it seems the searchObjects controller function does not map the error using errorToProblem as other methods do.
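As a rough sketch of what mapping the error in the controller could look like, assuming an Express-style `(req, res, next)` handler: the `errorToProblem` below is a hypothetical stub (the real COMS helper may have a different signature and output), and `objectService` is a made-up service used only for illustration.

```javascript
// Hypothetical stub; the real errorToProblem in COMS may differ.
// It maps any error to a generic problem object without leaking detail.
function errorToProblem(service, err) {
  console.error(`${service} error:`, err.message);
  return { status: 500, title: 'Internal Server Error', detail: `${service} request failed` };
}

// Hypothetical service whose query may reject (assumption for illustration).
const objectService = {
  searchObjects: async (params) => {
    if (params.fail) throw new Error('raw SQL error text');
    return [{ id: 'abc' }];
  }
};

// Controller sketch: every failure is caught and passed to next(),
// so the request always completes, either with data or with a problem.
async function searchObjects(req, res, next) {
  try {
    const result = await objectService.searchObjects(req.query);
    res.status(200).json(result);
  } catch (err) {
    next(errorToProblem('ObjectService', err));
  }
}
```

The point of the try/catch is that a rejected query ends in a call to `next` rather than leaving the request without a response.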

pbolduc commented 1 year ago

Edit: Looking back at the history, this appears to be the only recent occurrence.

We see the COMS pod reports unhealthy during this time.

TimCsaky commented 1 year ago

Hey Phil. I'm looking into these timeouts this week. I suspect it may be a combination of SQL queries and the way we handle unresolved JavaScript promises. Are you able to let me know the resource allocation for your Patroni cluster (e.g. CPU and memory, number of pods), or whether you use a single PostgreSQL pod?

Is it mainly the searchObjects API call that times out under high load?

Thanks

jujaga commented 1 year ago

Closing this issue as the error conditions described are not repeatable. That being said, we did an internal performance review pass and added a few indexes to the permission tables to improve lookup speed in #162. Should a similar issue appear again in the upcoming COMS v0.4.1 release, please feel free to request that this issue be reopened or to file a new issue.