Closed: j771 closed this issue 1 year ago
This is strange, I've had a pod running for months without hitting that issue. Do you have any idea how many events were stored before the crash? It might be related to a full disk on the PVC.
As stated in the initial post, storage is over-provisioned: the PVC has 30 GB available and is only using a few GB. There are many events though, 10,000+.
I injected 200k+ more events without a glitch. The logs seem to point to echo, the web framework I used, but I don't see the link with Redis. Have you tried deleting just the UI pods and not Redis?
Yes, first I deleted the UI pods and then loaded the UI again when the pods came up. As soon as I try to load the UI, those errors come up and the UI just hangs without loading any data.
Redis looks fine: no error logs, no space issues. So it seems to be an error with the UI trying to load events.
Really strange. Tell me if it happens again, you're the first one to notice that issue. Thanks for the report.
I am just evaluating this service in my dev environment when I have free time and can play around with settings. Are there any recommendations for debugging or testing this further? When I deploy this again I will set a TTL of 24 hours or something less than I set before, watch the event counts and the size of the Redis, and check whether anything specific triggers the errors.
Apart from trying to set a TTL, I don't have any advice for now, sorry.
Same issue when setting a 24-hour TTL. The Redis keyspace at the time I tried to access the UI and received the error: db0:keys=661512,expires=661512,avg_ttl=57529256, using 3.2 GB of memory.
660k+ keys, I never tried that many. I'll try to replicate and see. Thanks.
Btw, if you have that many events in so little time, you should tweak your rules: you're not fishing with a rod but with a net.
Yes, I just have it on the default ruleset to start; customizing the rules is going to take a while.
Is there a way of removing an entire set of rules that are considered notices and/or other categories? I really am only concerned with warning and above.
I know I can do that for alerting, but I'm not sure about the actual writing of events. I do not need to know about the thousands of notice-type events it writes out for a large k8s cluster (and this is one of my smaller testing clusters).
You can set the minimum priority you want to trigger with https://github.com/falcosecurity/charts/blob/master/falco/values.yaml#L446
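For reference, that values.yaml line maps to Falco's minimum rule priority. A hedged sketch of the override (the key path follows the linked chart; double-check it against your chart version before applying):

```yaml
falco:
  # Minimum rule priority to load and emit.
  # Rules below this level (notice, informational, debug) are skipped.
  priority: warning
```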
Cool thanks I will try that out.
I was able to replicate the issue with ~600k events (I don't have the exact count). The good news is that the issue is not in falcosidekick-ui; the bad news is that it seems to be directly related to the redisearch module.
Before 600k:
127.0.0.1:6379> DBSIZE
(integer) 569770
127.0.0.1:6379> FT.AGGREGATE "eventIndex" "*" GROUPBY 1 @priority REDUCE COUNT 0
1) (integer) 6
2) 1) "priority"
2) "Warning"
3) "__generated_aliascount"
4) "194060"
3) 1) "priority"
2) "Notice"
3) "__generated_aliascount"
4) "114170"
4) 1) "priority"
2) "Debug"
3) "__generated_aliascount"
4) "22793"
5) 1) "priority"
2) "Informational"
3) "__generated_aliascount"
4) "128482"
6) 1) "priority"
2) "Error"
3) "__generated_aliascount"
4) "53467"
7) 1) "priority"
2) "Critical"
3) "__generated_aliascount"
4) "57068"
After 600k:
127.0.0.1:6379> DBSIZE
(integer) 693032
127.0.0.1:6379> FT.AGGREGATE "eventIndex" "*" GROUPBY 1 @priority REDUCE COUNT 0
1) (integer) 473299
2) (empty array)
(2.43s)
As you can see, FT.AGGREGATE doesn't return the expected counts; this is what Falcosidekick-UI receives and can't handle. Those counts are essential to build the filter lists, and without them the UI is "frozen". It's really strange, and debugging the Redis module is beyond my skills. Someone told me the image I used in the chart is deprecated, so I'll try to dig into that, maybe it will help.
I tested other images from https://hub.docker.com/r/redis/redis-stack/tags, including the last release of 6.x and the last RC of 7.x. I face the same issue, but the situation seems better anyway: we still hit the error, but after a refresh the UI comes back. I'll add a condition in the code to avoid flooding the logs and to wait a little before retrying.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Provide feedback via https://github.com/falcosecurity/community.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Provide feedback via https://github.com/falcosecurity/community.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Provide feedback via https://github.com/falcosecurity/community.
/close
@poiana: Closing this issue.
Falco UI runs well on initial deployment, but after running for a day the UI pods get the following error when trying to load the UI. The pods do not recover unless the Redis storage is wiped. Redis is currently over-provisioned and is not running out of disk space or memory.
Error log from falcosidekick-ui pod:
Kubernetes v1.22. Deployed with Helm chart falcosecurity/falco --version 2.5.4. Helm values file: