falcosecurity / charts

Community managed Helm charts for running Falco with Kubernetes

Falcosidekick-ui Redis Error: interface conversion: interface {} is nil, not string #463

Closed j771 closed 1 year ago

j771 commented 1 year ago

Falco UI runs well on initial deployment, but after running for a day the UI pods get the following error when trying to load the UI. The pods do not recover unless the Redis storage is wiped out. Redis is currently over-provisioned and is not running out of disk space or memory.

Error log from falcosidekick-ui pod:

echo: http: panic serving <IP>:<PORT>: interface conversion: interface {} is nil, not string
goroutine 399025 [running]:
net/http.(*conn).serve.func1()
    go/src/net/http/server.go:1825 +0xbf
panic({0x94e1c0, 0xc001e65da0})
    go/src/runtime/panic.go:844 +0x258
github.com/falcosecurity/falcosidekick-ui/internal/database/redis.CountKeyBy(0x7fdff87806f8?, 0xc0002b7540)
    /home/circleci/project/internal/database/redis/count.go:43 +0x7c5
github.com/falcosecurity/falcosidekick-ui/internal/events.CountBy(0x9d75e0?)
    /home/circleci/project/internal/events/count.go:24 +0x4a
github.com/falcosecurity/falcosidekick-ui/internal/api.CountByEvent({0xac1300, 0xc0002d2e60})
    /home/circleci/project/internal/api/api.go:98 +0x1f6
github.com/labstack/echo/v4/middleware.BasicAuthWithConfig.func1.1({0xac1300, 0xc0002d2e60})
    pkg/mod/github.com/labstack/echo/v4@v4.9.0/middleware/basic_auth.go:93 +0x4a5
github.com/labstack/echo/v4.(*Echo).add.func1({0xac1300, 0xc0002d2e60})
    pkg/mod/github.com/labstack/echo/v4@v4.9.0/echo.go:536 +0x51
github.com/labstack/echo/v4.(*Echo).ServeHTTP(0xc0002aad80, {0xabdd48?, 0xc0018cb880}, 0xc0019c7500)
    pkg/mod/github.com/labstack/echo/v4@v4.9.0/echo.go:646 +0x3bc
net/http.serverHandler.ServeHTTP({0xc001f5b4d0?}, {0xabdd48, 0xc0018cb880}, 0xc0019c7500)
    go/src/net/http/server.go:2916 +0x43b
net/http.(*conn).serve(0xc001f86000, {0xabe120, 0xc00014f530})
    go/src/net/http/server.go:1966 +0x5d7
created by net/http.(*Server).Serve
    go/src/net/http/server.go:3071 +0x4db
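
For reference, the panic corresponds to an unchecked type assertion on a nil interface value in `count.go`; a minimal Go sketch of the pattern (hypothetical, not the actual falcosidekick-ui code), including the comma-ok form that avoids the panic:

```go
package main

import "fmt"

func main() {
	// Empty aggregate row, standing in for an FT.AGGREGATE reply with no fields.
	row := map[string]interface{}{}

	// Unsafe: row["priority"] is nil, so this assertion panics with
	// "interface conversion: interface {} is nil, not string".
	// priority := row["priority"].(string)

	// Safe: the comma-ok form reports failure instead of panicking.
	priority, ok := row["priority"].(string)
	if !ok {
		fmt.Println("priority missing from aggregate result; skipping")
		return
	}
	fmt.Println("priority:", priority)
}
```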

Kubernetes: v1.22
Deployed with Helm chart: falcosecurity/falco --version 2.5.4
Helm values file:

falcosidekick:
  enabled: true
  webui:
    enabled: true
    ttl: 604800
    redis:
      storageSize: 30Gi
      resources:
        limits:
          cpu: 800m
          memory: 12Gi
        requests:
          cpu: 300m
          memory: 10Gi
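
For completeness, the chart can be deployed with that file using something like the following (release name, namespace, and file name are assumptions):

```bash
helm upgrade --install falco falcosecurity/falco \
  --version 2.5.4 \
  --namespace falco --create-namespace \
  -f values.yaml
```
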
Issif commented 1 year ago

This is strange, I've had a pod running for months without facing that issue. Do you have any idea of the count of events that were stored before the crash? It might be related to a full disk for the PVC.

j771 commented 1 year ago

As stated in the initial post, storage is over-provisioned: the PVC has 30 GB available and is only using a few GB. There are many events though, 10,000+.

Issif commented 1 year ago

I injected 200k+ events without a glitch. The logs seem to link this to echo, the web framework I use, but I don't see the link with Redis. Have you tried just deleting the UI pods and not Redis?
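
For reference, deleting only the UI pods would look like this (namespace and label selector are assumptions; adjust to your release):

```bash
kubectl -n falco delete pods -l app.kubernetes.io/name=falcosidekick-ui
```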

j771 commented 1 year ago

Yes, first I deleted the UI pods and then loaded the UI again when the pods came up. As soon as I try to load the UI, those errors come up and the UI does not load any data; it just hangs.

j771 commented 1 year ago

Redis looks fine: no error logs, no space issues. So it seems to be just an error with the UI trying to load events.

Issif commented 1 year ago

Really strange. Tell me if it happens again, you're the first one to notice that issue. Thanks for the report.

j771 commented 1 year ago

I am just evaluating this service in my dev environment when I have free time and can play around with settings. Are there any recommendations for debugging or testing this further? When I deploy this again I will set a TTL of 24 hours or something shorter than I set before, watch the event counts and the size of Redis, and check whether anything specific triggers the errors.
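
A 24-hour retention in the same values file would look like this (assuming `ttl` is expressed in seconds, as with the 604800 value above):

```yaml
falcosidekick:
  webui:
    enabled: true
    ttl: 86400  # 24 hours
```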

Issif commented 1 year ago

Except for trying to set a TTL, I don't have any advice for now, sorry.

j771 commented 1 year ago

Same issue when setting a 24-hour TTL. Redis keyspace at the time I tried to access the UI and received the error: db0:keys=661512,expires=661512,avg_ttl=57529256, using 3.2 GB of memory.
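
For anyone reproducing this, the keyspace and memory figures above can be pulled from the Redis pod with standard redis-cli commands (pod name and namespace are placeholders):

```bash
kubectl -n falco exec -it <falcosidekick-ui-redis-pod> -- redis-cli INFO keyspace
kubectl -n falco exec -it <falcosidekick-ui-redis-pod> -- redis-cli INFO memory
```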

Issif commented 1 year ago

660k+ keys, I never tried that many. I will replicate and see. Thanks.

Btw, if you have that many events in so little time, you should tweak your rules; you're not fishing with a fishing rod but with a net.

j771 commented 1 year ago

Yes, I just have it on the default ruleset to start; customizing the rules is going to take a while. Is there a way of removing an entire set of rules that are considered Notices and/or other categories? I am really only concerned with Warning and above. I know I can do that for alerting, but I'm not sure about the actual writing of events; I do not need to know about the thousands of Notice-type events it writes out for a large k8s cluster (and this is one of my smaller testing clusters).

Issif commented 1 year ago

You can set the minimum priority you want to trigger with https://github.com/falcosecurity/charts/blob/master/falco/values.yaml#L446
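
If that key is the chart's `falco.priority` setting (check the linked values.yaml for the exact field name in your chart version), limiting logged events to Warning and above would look roughly like:

```yaml
falco:
  # minimum rule priority to log:
  # emergency, alert, critical, error, warning, notice, informational, debug
  priority: warning
```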

j771 commented 1 year ago

Cool thanks I will try that out.

Issif commented 1 year ago

I was able to replicate the issue with ~600k events (I don't have the exact count). The good news is that the issue is not in falcosidekick-ui; the bad news is that it seems to be directly related to the RediSearch module.

Before 600k:

127.0.0.1:6379> DBSIZE
(integer) 569770
127.0.0.1:6379> FT.AGGREGATE "eventIndex" "*" GROUPBY 1 @priority REDUCE COUNT 0
1) (integer) 6
2) 1) "priority"
   2) "Warning"
   3) "__generated_aliascount"
   4) "194060"
3) 1) "priority"
   2) "Notice"
   3) "__generated_aliascount"
   4) "114170"
4) 1) "priority"
   2) "Debug"
   3) "__generated_aliascount"
   4) "22793"
5) 1) "priority"
   2) "Informational"
   3) "__generated_aliascount"
   4) "128482"
6) 1) "priority"
   2) "Error"
   3) "__generated_aliascount"
   4) "53467"
7) 1) "priority"
   2) "Critical"
   3) "__generated_aliascount"
   4) "57068"

After 600k:

127.0.0.1:6379> DBSIZE
(integer) 693032
127.0.0.1:6379> FT.AGGREGATE "eventIndex" "*" GROUPBY 1 @priority REDUCE COUNT 0
1) (integer) 473299
2) (empty array)
(2.43s)

As you can see, FT.AGGREGATE doesn't return the expected counts; this is what Falcosidekick-UI receives and can't handle. The aggregation is essential for building the filter lists and the counts, and without it the UI is "frozen". It's really strange, and debugging the Redis module is beyond my skills. Someone told me the image I used in the chart is deprecated; I'll try to dig into that, maybe it will help.
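
One additional check worth doing when the aggregation comes back empty (a suggestion, not something from the thread) is to inspect the index itself; FT.INFO reports document counts and indexing-failure counters, though the exact field names vary across RediSearch versions:

```
127.0.0.1:6379> FT.INFO "eventIndex"
```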

Issif commented 1 year ago

I tested other images from https://hub.docker.com/r/redis/redis-stack/tags, including the latest release of 6.x and the latest RC of 7.x. I face the same issue, but the situation seems better anyway: the issue still occurs, but after a refresh the UI comes back. I'll add a condition in the code to avoid flooding the log and to wait a little before retrying.
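
If anyone wants to test an alternative redis-stack image through the chart, an override along these lines might work (the exact image keys are an assumption; verify them against the falcosidekick chart's values.yaml):

```yaml
falcosidekick:
  webui:
    redis:
      image:
        repository: redis/redis-stack
        tag: "<desired redis-stack tag>"
```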

poiana commented 1 year ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

poiana commented 1 year ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh with /remove-lifecycle rotten.

Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle rotten

poiana commented 1 year ago

Rotten issues close after 30d of inactivity.

Reopen the issue with /reopen.

Mark the issue as fresh with /remove-lifecycle rotten.

Provide feedback via https://github.com/falcosecurity/community.

/close

poiana commented 1 year ago

@poiana: Closing this issue.

In response to [this](https://github.com/falcosecurity/charts/issues/463#issuecomment-1624207553):

> Rotten issues close after 30d of inactivity.
>
> Reopen the issue with `/reopen`.
>
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> Provide feedback via https://github.com/falcosecurity/community.
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.