allegro / turnilo

Business intelligence, data exploration and visualization web application for Druid, formerly known as Swiv and Pivot
https://allegro.github.io/turnilo/
Apache License 2.0

Druid requests go from <100ms latency to 5, 10, or 15 second latency a few minutes after startup. Possible connection limit issue? #1098

Closed by tshallenberger 7 months ago

tshallenberger commented 8 months ago

We run Turnilo v1.40.2 in a Docker container, and it uses Plywood to talk to a Druid cluster. With verbose logging enabled in Plywood, we see request latency go from

Requester rq06461 got result from query 269: (in 56ms)
[
  {
    "maxTime": "2024-03-13T23:55:00.000Z"
  }
]

to

Requester rq06461 got result from query 352: (in 10019ms)
[
  {
    "maxTime": "2024-03-14T15:25:00.000Z"
  }
]
TimeMonitor Got the latest time for 'REDACTED' (2024-03-14T15:25:00.000Z)
vvvvvvvvvvvvvvvvvvvvvvvvvv
Requester rq06461 got result from query 350: (in 15283ms)
[
  {
    "maxTime": "2024-03-14T15:25:00.000Z",
    "minTime": "2024-03-12T15:00:00.000Z",
    "timestamp": "2024-03-12T15:00:00.000Z"
  }
]

approximately 2-4 minutes after the container starts. I'm trying to track down the source of the issue and figured I'd raise it here to see if anyone has any input. I'm not sure whether this is a Docker (podman), Turnilo, or Plywood issue.
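
To pin down when the slowdown starts, one option is to pull the latencies out of the verbose log and flag the slow requests. The following is a minimal sketch, not part of Turnilo or Plywood. It assumes the log lines look exactly like the "got result from query N: (in Xms)" lines quoted above. The function names and the 1000 ms threshold are hypothetical choices for illustration.

```python
import re

# Matches verbose log lines of the form quoted in this issue, e.g.
#   Requester rq06461 got result from query 352: (in 10019ms)
RESULT_LINE = re.compile(
    r"Requester (?P<requester>\S+) got result from query (?P<query>\d+): "
    r"\(in (?P<ms>\d+)ms\)"
)


def parse_latencies(log_text):
    """Return (query_id, latency_ms) pairs in the order they appear in the log."""
    return [
        (int(m.group("query")), int(m.group("ms")))
        for m in RESULT_LINE.finditer(log_text)
    ]


def slow_queries(log_text, threshold_ms=1000):
    """Return only the queries whose latency is at or above threshold_ms.

    The 1000 ms default is an arbitrary illustrative cutoff.
    """
    return [(q, ms) for q, ms in parse_latencies(log_text) if ms >= threshold_ms]


# Sample built from the three result lines quoted in this issue.
SAMPLE = """\
Requester rq06461 got result from query 269: (in 56ms)
Requester rq06461 got result from query 352: (in 10019ms)
Requester rq06461 got result from query 350: (in 15283ms)
"""

if __name__ == "__main__":
    for query_id, ms in slow_queries(SAMPLE):
        print(f"query {query_id}: {ms} ms")
```

Running the same functions over the full container log, alongside the container start time, would show whether the jump happens at a consistent offset after startup.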

tshallenberger commented 7 months ago

This turned out to be a problem with how the containers were deployed: they were started from podman kube play files and run under systemd on RHEL8. Upgrading the generated kube file seemed to fix the issue. The only difference between the two config files appeared to be that the new one dropped some security context capabilities; the containers were running in rootless mode on an SELinux-enabled machine.