[nftables] remediation component shuts down after a failed response

LaurenceJJones commented 2 months ago

What happened?

When the remediation component fails to connect to LAPI while using nftables, the whole service currently comes down and flushes the nftables sets:

time="10-05-2024 11:06:07" level=info msg="Processing new and deleted decisions . . ."
time="10-05-2024 11:07:07" level=error msg="http code 504, invalid body: invalid character '<' looking for beginning of value"
time="10-05-2024 11:07:07" level=info msg="Shutting down backend"
time="10-05-2024 11:07:07" level=info msg="flushing 'crowdsec-blacklists' set in 'crowdsec' table"
time="10-05-2024 11:07:07" level=info msg="flushing 'crowdsec6-blacklists' set in 'crowdsec6' table"
time="10-05-2024 11:07:07" level=fatal msg="process terminated with error: bouncer stream halted"
time="10-05-2024 11:07:17" level=info msg="Starting crowdsec-firewall-bouncer v0.0.28-debian-pragmatic-af6e7e25822c2b1a02168b99ebbf8458bc6728e5"
time="10-05-2024 11:07:17" level=info msg="backend type : nftables"
time="10-05-2024 11:07:17" level=info msg="nftables initiated"

This is not what we want, as the IPs currently in the set are still useful to the service.

What did you expect to happen?

The remediation component should tolerate failures to connect to LAPI after the service has started. For example: if the first connection at startup fails, then yes, restart; but after that it should be resilient.
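To make the expected behaviour concrete, here is a minimal Go sketch. It is not the bouncer's actual code: `pollFunc`, `runPoller` and `errTimeout` are hypothetical names, and the poll loop is reduced to a fixed number of rounds so it can run on its own. A failure before the first successful poll is treated as fatal, while later failures are only counted and the existing set would be left alone.

```go
package main

import (
	"errors"
	"fmt"
)

// errTimeout is a hypothetical error standing in for a failed LAPI response.
var errTimeout = errors.New("http code 504")

// pollFunc is a hypothetical stand-in for one request to LAPI.
type pollFunc func() error

// runPoller calls poll `rounds` times. A failure before the first success is
// returned immediately (it may indicate a bad configuration). Failures after
// the first success are only counted, so the caller can keep its current set.
func runPoller(poll pollFunc, rounds int) (failures int, err error) {
	connected := false
	for i := 0; i < rounds; i++ {
		pollErr := poll()
		switch {
		case pollErr == nil:
			connected = true
		case !connected:
			return failures, fmt.Errorf("initial connection failed: %w", pollErr)
		default:
			failures++ // keep the current decisions and retry on the next round
		}
	}
	return failures, nil
}

// scripted returns a pollFunc that replays the given results in order.
func scripted(results ...error) pollFunc {
	i := 0
	return func() error {
		r := results[i]
		i++
		return r
	}
}

func main() {
	// First poll succeeds, second fails, third succeeds again.
	failures, err := runPoller(scripted(nil, errTimeout, nil), 3)
	fmt.Println(failures, err)
}
```

With the scripted sequence above, the middle failure is tolerated; if the very first poll fails instead, `runPoller` returns an error and the caller can exit and let the service manager restart it.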

How can we reproduce it (as minimally and precisely as possible)?

Bring up a LAPI and the firewall remediation component. A user has reported that the service comes down when the response code is > 500.
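For anyone who wants to see where the error message in the log comes from, the following Go sketch may help. It does not use the real LAPI or the bouncer: `newFailingLAPI` and `fetchDecisions` are hypothetical helpers, and the local test server only assumes that something in front of LAPI (for example a proxy) answers 504 with an HTML page. Decoding that HTML body as JSON fails on the leading `<`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// newFailingLAPI starts a local test server that answers every request with
// 504 and an HTML error page instead of JSON. It is a stand-in, not LAPI.
func newFailingLAPI() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/html")
		w.WriteHeader(http.StatusGatewayTimeout)
		io.WriteString(w, "<html><body>504 Gateway Time-out</body></html>")
	}))
}

// fetchDecisions performs one request and tries to decode the body as JSON.
// It returns the HTTP status code together with any read or decode error.
func fetchDecisions(url string) (int, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return resp.StatusCode, err
	}
	var decoded map[string]any
	if err := json.Unmarshal(body, &decoded); err != nil {
		return resp.StatusCode, err
	}
	return resp.StatusCode, nil
}

func main() {
	srv := newFailingLAPI()
	defer srv.Close()

	code, err := fetchDecisions(srv.URL)
	fmt.Printf("http code %d, invalid body: %v\n", code, err)
}
```

Pointing a test bouncer at a server like this, after it has connected successfully once, would be one way to check whether a 5xx response still brings the service down.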

Anything else we need to know?

No response

version

remediation component version:

```console
$ crowdsec-firewall-bouncer --version
# paste output here
```

crowdsec version

crowdsec version:

```console
$ crowdsec --version
# paste output here
```

OS version

```console
# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here
```

github-actions[bot] commented 2 months ago

@LaurenceJJones: Thanks for opening an issue, it is currently awaiting triage.

In the meantime, you can:

- Check Documentation to see if your issue can be self resolved.
- You can also join our Discord

Details

I am a bot created to help the [crowdsecurity](https://github.com/crowdsecurity) developers manage community feedback and contributions. You can check out my [manifest file](https://github.com/crowdsecurity/cs-firewall-bouncer/blob/main/.github/governance.yml) to understand my behavior and what I can do. If you want to use this for your project, you can check out the [BirthdayResearch/oss-governance-bot](https://github.com/BirthdayResearch/oss-governance-bot) repository.

github-actions[bot] commented 2 months ago

@LaurenceJJones: There is no 'kind' label on this issue. You need a 'kind' label to start the triage process.

/kind feature
/kind enhancement
/kind bug
/kind packaging


dolgovas commented 2 months ago

Hello! We have run into this problem too. Is there any update on it?

mr1jingles commented 2 months ago

UPDATE: This only happens if the bouncer is restarted. If the API does not respond while the bouncer is running, the bouncer keeps trying to fetch new decisions and continues to work.

One more question: why does the bouncer reset the nftables set on restart?

LaurenceJJones commented 2 months ago

> UPDATE: This only happens if the bouncer is restarted. If the API does not respond while the bouncer is running, the bouncer keeps trying to fetch new decisions and continues to work.

Yes, this is the current design: if the remediation component doesn't get an initial connection, the cause could be a bad configuration.

> One more question: why does the bouncer reset the nftables set on restart?

We remove the set because the initial load takes ten times longer if we have to check whether each element already exists. So, to be more efficient, we remove the set and then reinstate it upon restart.

mr1jingles commented 2 months ago

But if the host is under attack, clearing the nftables set can negatively affect the server.

It is also not entirely clear: if the bouncer clears the nftables set, why does it pull all decisions (including outdated ones)?

LaurenceJJones commented 2 months ago

> But if the host is under attack, clearing the nftables set can negatively affect the server.

Yes, but this should only happen if you restart the service while under attack. The service should be running for a long time unless there is a reason not to run it.

Most likely it comes from the way crowdsec sends decisions: bouncers don't have a direct influence on what they get sent unless it's filtered. There is no impact on performance. You just see an unnecessary log line, that's all.

mr1jingles commented 2 months ago

If the host is under attack, free memory may run out and the OOM killer can kill the bouncer. On restart the bouncer then clears the table, putting even more load on the server.

I think it's reasonable to add an option that compares the existing set with the data received from the API instead of clearing the table on restart.
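As a rough illustration of what such a compare option could compute, here is a minimal Go sketch. It is only an assumption about the shape of the idea: `diffSets` is a hypothetical helper that works on plain string slices, it does not read or write nftables, and it says nothing about the cost of listing a large set, which is the overhead mentioned earlier in this thread.

```go
package main

import (
	"fmt"
	"sort"
)

// diffSets compares the IPs already present in the firewall set (current)
// with the IPs received from the API (desired). It returns the IPs that would
// have to be added and the IPs that would have to be removed, both sorted.
func diffSets(current, desired []string) (toAdd, toRemove []string) {
	have := make(map[string]struct{}, len(current))
	for _, ip := range current {
		have[ip] = struct{}{}
	}

	want := make(map[string]struct{}, len(desired))
	for _, ip := range desired {
		if _, seen := want[ip]; seen {
			continue // ignore duplicates in the API response
		}
		want[ip] = struct{}{}
		if _, ok := have[ip]; !ok {
			toAdd = append(toAdd, ip)
		}
	}

	for ip := range have {
		if _, ok := want[ip]; !ok {
			toRemove = append(toRemove, ip)
		}
	}

	sort.Strings(toAdd)
	sort.Strings(toRemove)
	return toAdd, toRemove
}

func main() {
	current := []string{"192.0.2.1", "192.0.2.2"}
	desired := []string{"192.0.2.2", "198.51.100.7"}

	toAdd, toRemove := diffSets(current, desired)
	fmt.Println("add:", toAdd)
	fmt.Println("remove:", toRemove)
}
```

With this approach, entries that are still valid (here `192.0.2.2`) would stay in place across a restart, and only the difference would be applied.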

mr1jingles commented 2 months ago

About decisions: when I restarted a large number of bouncers, I saw a heavy load on the database on the API server. This led to a memory leak and complete unavailability of the API.

[Screenshots: May 23 13-45-34, May 23 13-45-48, May 23 13-46-02]

LaurenceJJones commented 1 month ago

> This led to a memory leak and complete unavailability of the API

A memory spike does not equal a memory leak. It just means the API is handling the requests, and because it holds decisions in memory while it queries, memory usage will spike.

We have a feature flag for streamed decisions; it may help: https://docs.crowdsec.net/docs/next/configuration/feature_flags#list-of-available-feature-flags

If you can capture the memory leak via pprof, we can look into it.

https://docs.crowdsec.net/docs/next/observability/pprof

I understand the OOM part, and we can improve this in the future, but currently, we have no resources to look at this, so contributions are welcome.

LaurenceJJones commented 1 month ago

/kind enhancement
/accepted

mr1jingles commented 1 month ago

> We have a feature flag for streamed decisions; it may help: https://docs.crowdsec.net/docs/next/configuration/feature_flags#list-of-available-feature-flags

Should I enable this flag on the API server?

Correct me if I'm wrong. Does this feature allow you to send decisions in a batch?

LaurenceJJones commented 1 month ago

> We have a feature flag for streamed decisions; it may help: https://docs.crowdsec.net/docs/next/configuration/feature_flags#list-of-available-feature-flags

> Should I enable this flag on the API server?

> Correct me if I'm wrong. Does this feature allow you to send decisions in a batch?

Exactly. Instead of holding all decisions in memory, it fetches X decisions, writes them to the stream, then fetches the next batch and writes it to the stream, and so on. It may become standard in upcoming releases. Currently it is behind a feature flag because we wanted to ensure stability, but a large enterprise has been using it in production for over 2 minor releases with no issues reported from their side.
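To picture the fetch-a-batch-then-write pattern, here is a minimal Go sketch. It is not CrowdSec's implementation: the `Decision` type, `fetchBatch`, `streamDecisions` and `streamToString` are hypothetical, and an in-memory slice stands in for the paginated database query. The point is only that each batch is written to the output before the next one is fetched, so the whole result never has to be built up front.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
)

// Decision is a simplified, hypothetical record for illustration only.
type Decision struct {
	ID    int    `json:"id"`
	Value string `json:"value"`
}

// fetchBatch stands in for a paginated database query: it returns at most
// size records starting at offset, or nil when nothing is left.
func fetchBatch(all []Decision, offset, size int) []Decision {
	if offset >= len(all) {
		return nil
	}
	end := offset + size
	if end > len(all) {
		end = len(all)
	}
	return all[offset:end]
}

// streamDecisions writes a JSON array to w one batch at a time.
func streamDecisions(w io.Writer, all []Decision, batchSize int) error {
	if batchSize <= 0 {
		return fmt.Errorf("batch size must be positive, got %d", batchSize)
	}
	if _, err := io.WriteString(w, "["); err != nil {
		return err
	}
	first := true
	for offset := 0; ; offset += batchSize {
		batch := fetchBatch(all, offset, batchSize)
		if len(batch) == 0 {
			break
		}
		for _, d := range batch {
			if !first {
				if _, err := io.WriteString(w, ","); err != nil {
					return err
				}
			}
			first = false
			encoded, err := json.Marshal(d)
			if err != nil {
				return err
			}
			if _, err := w.Write(encoded); err != nil {
				return err
			}
		}
	}
	_, err := io.WriteString(w, "]")
	return err
}

// streamToString is a small helper that collects the stream in a buffer.
func streamToString(all []Decision, batchSize int) (string, error) {
	var buf bytes.Buffer
	err := streamDecisions(&buf, all, batchSize)
	return buf.String(), err
}

func main() {
	all := []Decision{
		{ID: 1, Value: "192.0.2.1"},
		{ID: 2, Value: "192.0.2.2"},
		{ID: 3, Value: "192.0.2.3"},
	}
	out, err := streamToString(all, 2)
	fmt.Println(out, err)
}
```

In a real server the writer would be the HTTP response rather than a buffer, so memory use is bounded by the batch size instead of the total number of decisions.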

mr1jingles commented 1 month ago

And if I use MySQL as a database server, will it work for it too?

LaurenceJJones commented 1 month ago

> And if I use MySQL as a database server, will it work for it too?

Yes, it works for all databases.
