falcosecurity / falco

Cloud Native Runtime Security
https://falco.org
Apache License 2.0

I upgraded my AKS cluster to version 1.26.1 and Falco stopped working properly #2982

Closed hkinjo2 closed 1 month ago

hkinjo2 commented 6 months ago

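After I upgraded my AKS cluster to version 1.26.1, Falco went into CrashLoopBackOff and stopped working properly. I tried restarting, redeploying, deploying a newer version of Falco, and changing the allocated CPU, but the problem was not resolved. I would appreciate any guidance on this issue. Note: the current Falco version is 1.29.1.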

Image (4)

incertum commented 6 months ago

My best bet is that there are issues with downloading and/or loading the kernel driver.

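Incidentally, we just added more information to the following two repos, which may be useful as well:

https://github.com/falcosecurity/deploy-kubernetes/tree/main/kubernetes
https://github.com/falcosecurity/cncf-green-review-testing

Could you override your Falco container entrypoint with something like the following?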

command: ["/bin/sh"]
args:
  - -c
  - >-
    sleep 10000000

Then, after exec-ing into the pod, launch Falco manually and tell us the error message. Which kernel driver are you using? If you can, try --modern-bpf.
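
The manual launch described above could look roughly like the sketch below. It assumes the pod's main container is named falco and that the falco binary is on the PATH inside the image; the function name run_falco_in_pod is made up for illustration, and --modern-bpf is only accepted by Falco versions that ship the modern eBPF probe.

```shell
# Run Falco by hand inside an already running pod (one whose entrypoint was
# replaced with a long sleep) so that its startup error is printed directly.
# Usage: run_falco_in_pod <namespace> <pod> [falco options...]
run_falco_in_pod() {
  ns=$1
  pod=$2
  shift 2
  # Assumption: the container is called "falco" and the binary is on the PATH.
  kubectl exec -n "$ns" "$pod" -c falco -- falco "$@"
}
```

For example: run_falco_in_pod falco <falco_pod_name> --modern-bpf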

Andreagit97 commented 6 months ago

thank you for reporting, could you provide the Falco logs of one of the pods? Could you rewrite the issue in English, please?

hkinjo2 commented 6 months ago

English translation.

I upgraded my AKS cluster to version 1.26.1 and Falco stopped working properly, going into CrashLoopBackOff. I have rebooted, redeployed, deployed a new version of Falco, changed the assigned CPU, and so on, but the problem has not been resolved. I would appreciate any information you can give me on this issue. The current version of Falco is 1.29.1.

To confirm: do I connect with the following command?

kubectl exec -it -n falco

Should I then run service falco start or systemctl start falco and follow the output with journalctl -fu falco, or is the idea to provide the logs written to /var/messages?

hkinjo2 commented 6 months ago

I tried to connect to Falco as I understood it from the previous comment, but could not, because the pod is in CrashLoopBackOff, as shown in the operation log.

Am I correct in understanding that the repository should be added before collecting logs?

If so, I apologize: I do not know how to add a repository. I would appreciate it if you could explain how to do this as well.

sankou.txt Image (5)

hkinjo2 commented 6 months ago

https://github.com/falcosecurity/falco/issues/2982#issuecomment-1863860540

My best guess is that there is a problem downloading and/or loading the kernel drivers.

BTW, I have added more information to the following two repositories. Perhaps they will be equally helpful.

https://github.com/falcosecurity/deploy-kubernetes/tree/main/kubernetes
https://github.com/falcosecurity/cncf-green-review-testing

Could you please overwrite the Falco container entry point with something like

@incertum I am sorry, but could you explain the above procedure again?

hkinjo2 commented 6 months ago

@incertum @Andreagit97

Hello.

I apologize for my lack of knowledge. Please let me know what information you need to solve the problem, and we will try to get it.

However, for many of these operations I do not know how to collect the information, and that is my difficulty.

Please help me by explaining how to collect it along with what is needed.

Andreagit97 commented 6 months ago

As a first attempt, I would try to install Falco with the modern-bpf probe, which is the easiest method we have: https://github.com/falcosecurity/charts/tree/master/charts/falco#daemonset. You would run:

# to update the helm chart to the latest version
helm repo update falcosecurity --fail-on-repo-update-fail
# to run Falco with modern ebpf
helm install falco falcosecurity/falco \
    --set driver.kind=modern-bpf

hkinjo2 commented 5 months ago

@Andreagit97

Thank you for letting me know. I changed the driver parameter as instructed and redeployed, but the result was the same. I am sharing the work record and logs, so please check them.

I look forward to hearing your thoughts.

20240115_Falco作業証跡.xlsx logs_69.zip

Andreagit97 commented 5 months ago

Looking at your logs:

2024-01-15T10:20:47.7261752Z REVISION   UPDATED                     STATUS      CHART           APP VERSION DESCRIPTION     
2024-01-15T10:20:47.7262638Z 13         Fri Jul 22 14:35:26 2022    superseded  falco-1.15.2    0.29.0      Upgrade complete
2024-01-15T10:20:47.7266217Z 14         Thu Jul 27 13:26:31 2023    superseded  falco-1.15.2    0.29.0      Upgrade complete
2024-01-15T10:20:47.7267097Z 15         Fri Aug  4 15:50:11 2023    superseded  falco-1.15.2    0.29.0      Upgrade complete
2024-01-15T10:20:47.7268266Z 16         Fri Aug  4 15:57:54 2023    superseded  falco-1.15.2    0.29.0      Upgrade complete
2024-01-15T10:20:47.7269001Z 17         Fri Aug  4 17:46:56 2023    superseded  falco-3.4.1     0.35.1      Upgrade complete
2024-01-15T10:20:47.7270362Z 18         Tue Dec 12 18:27:52 2023    superseded  falco-3.4.1     0.35.1      Upgrade complete
2024-01-15T10:20:47.7271007Z 19         Tue Dec 12 18:42:48 2023    superseded  falco-3.4.1     0.35.1      Upgrade complete
2024-01-15T10:20:47.7271644Z 20         Thu Dec 14 16:40:31 2023    superseded  falco-1.15.2    0.29.0      Upgrade complete
2024-01-15T10:20:47.7272280Z 21         Wed Dec 20 11:29:21 2023    superseded  falco-1.15.2    0.29.0      Upgrade complete
2024-01-15T10:20:47.7272964Z 22         Mon Jan 15 19:11:44 2024    deployed    falco-1.15.2    0.29.0      Upgrade complete
2024-01-15T10:20:51.7628288Z Release "falco" has been upgraded. Happy Helming!
2024-01-15T10:20:51.7629087Z NAME: falco
2024-01-15T10:20:51.7630661Z LAST DEPLOYED: Mon Jan 15 19:20:49 2024
2024-01-15T10:20:51.7631105Z NAMESPACE: falco
2024-01-15T10:20:51.7631467Z STATUS: deployed
2024-01-15T10:20:51.7631804Z REVISION: 23
2024-01-15T10:20:51.7632196Z TEST SUITE: None
2024-01-15T10:20:51.7632611Z NOTES:
2024-01-15T10:20:51.7633104Z Falco agents are spinning up on each node in your cluster. After a few
2024-01-15T10:20:51.7633820Z seconds, they are going to start monitoring your containers looking for
2024-01-15T10:20:51.7634370Z security issues.
2024-01-15T10:20:51.7634619Z 
2024-01-15T10:20:51.7634633Z 
2024-01-15T10:20:51.7634867Z No further action should be required.
2024-01-15T10:20:51.7635178Z 
2024-01-15T10:20:51.7635189Z 
2024-01-15T10:20:51.7635330Z Tip: 
2024-01-15T10:20:51.7636551Z You can easily forward Falco events to Slack, Kafka, AWS Lambda and more with falcosidekick. 
2024-01-15T10:20:51.7637337Z Full list of outputs: https://github.com/falcosecurity/charts/tree/master/falcosidekick.
2024-01-15T10:20:51.7638085Z You can enable its deployment with `--set falcosidekick.enabled=true` or in your values.yaml. 
2024-01-15T10:20:51.7638735Z See: https://github.com/falcosecurity/charts/blob/master/falcosidekick/values.yaml for configuration values.

It seems you are running an old Falco version: the currently deployed revision is still chart falco-1.15.2 with app version 0.29.0:

2024-01-15T10:20:47.7272964Z 22         Mon Jan 15 19:11:44 2024    deployed    falco-1.15.2    0.29.0      Upgrade complete
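
To check which chart and Falco version a release is actually running, the row marked deployed can be picked out of the helm history output mechanically. This is a minimal sketch, assuming the column order shown above (STATUS, then CHART, then APP VERSION); the function name deployed_chart is hypothetical.

```shell
# Print "<chart> <app version>" for the row whose STATUS column is "deployed".
# Reads `helm history <release>` output on stdin; a leading CI timestamp
# column, as in the log above, is tolerated.
deployed_chart() {
  awk '{
    for (i = 1; i + 2 <= NF; i++) {
      if ($i == "deployed") { print $(i + 1), $(i + 2); exit }
    }
  }'
}
```

For example: helm history falco -n falco | deployed_chart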

Maybe you could try deleting the current Helm release with

helm uninstall falco

and then try again

# to update the helm chart to the latest version
helm repo update falcosecurity --fail-on-repo-update-fail
helm show chart falcosecurity/falco

The output should be something like

apiVersion: v2
appVersion: 0.36.2
dependencies:
- condition: falcosidekick.enabled
  name: falcosidekick
  repository: https://falcosecurity.github.io/charts
  version: 0.7.11
...
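
If only the Falco version matters, the appVersion field can be extracted from that output. A small sketch, assuming the output is the plain YAML shown above; chart_app_version is a hypothetical helper name.

```shell
# Print the appVersion field from `helm show chart` output read on stdin.
chart_app_version() {
  awk -F': *' '$1 == "appVersion" {
    gsub(/"/, "", $2)   # drop quotes in case the value is quoted
    print $2
    exit
  }'
}
```

For example: helm show chart falcosecurity/falco | chart_app_version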

and then:

# to run Falco with modern ebpf
helm install falco falcosecurity/falco \
    --set driver.kind=modern-bpf

hkinjo2 commented 5 months ago

@Andreagit97 Sorry for the delay. I tried to remove Falco using the command you provided, but the output was as follows.

Is it safe to delete the DaemonSet from the Azure portal, then delete and redeploy the pods?

image

Andreagit97 commented 5 months ago

What is the output after these 2 commands?

# to update the helm chart to the latest version
helm repo update falcosecurity --fail-on-repo-update-fail
helm show chart falcosecurity/falco

hkinjo2 commented 5 months ago

@Andreagit97

The result is shown below. image

By the way, the way I would delete the DaemonSet is to remove the Falco DaemonSet from the portal screen shown below. image

Andreagit97 commented 5 months ago

OK, I don't know why you have Falco deployed through Azure, but I think you can remove it. After the cleanup, check that there are no Falco instances left in the cluster:

kubectl get pods -A | grep falco

This should return nothing...
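
The same check can be wrapped so that it fails loudly while anything is left. This is only a sketch; no_falco_left is a hypothetical name, and it matches any line containing "falco", exactly like the grep above.

```shell
# Succeeds (exit 0) only when no line of `kubectl get pods -A` output,
# read on stdin, mentions falco.
no_falco_left() {
  if grep -qi 'falco'; then
    echo "Falco pods still present" >&2
    return 1
  fi
  echo "no Falco pods found"
}
```

For example: kubectl get pods -A | no_falco_left && helm install falco falcosecurity/falco --set driver.kind=modern-bpf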

Once that is the case, you can simply deploy Falco with:

helm install falco falcosecurity/falco \
    --set driver.kind=modern-bpf

hkinjo2 commented 5 months ago

@Andreagit97 Thank you for the guidance. I immediately installed the latest version of Falco. The results are as follows. The problem persists even with the latest version. 20240126_開発環境_Falco最新Vserデプロイ.txt

I will also send you the entire log, so please check it. image image

hkinjo2 commented 5 months ago

P.S. The command you provided didn't work, so I used the upgrade command.

hkinjo2 commented 4 months ago

@Andreagit97 Do you have any idea what is causing this problem? We have been investigating it for the past two weeks, but we cannot find the cause. Please let me know if you have any advice.

Andreagit97 commented 4 months ago

Hi @hkinjo2, to help you we need the Falco startup logs; otherwise we cannot understand what is going on. Falco startup logs look like this:

Sep 11 08:46:08 localhost.localdomain falco[789]: Falco version: 0.37.1 (x86_64)
Sep 11 08:46:08 localhost.localdomain falco[789]: Falco initialized with configuration file: /etc/falco/falco.yaml
Sep 11 08:46:08 localhost.localdomain falco[789]: Loading rules from file /etc/falco/falco_rules.yaml
Sep 11 08:46:08 localhost.localdomain falco[789]: Loading rules from file /etc/falco/falco_rules.local.yaml
Sep 11 08:46:08 localhost.localdomain falco[789]: The chosen syscall buffer dimension is: 8388608 bytes (8 MBs)
Sep 11 08:46:08 localhost.localdomain falco[789]: Starting health webserver with threadiness 4, listening on port 8765
Sep 11 08:46:08 localhost.localdomain falco[789]: Loaded event sources: syscall
Sep 11 08:46:08 localhost.localdomain falco[789]: Enabled event sources: syscall
Sep 11 08:46:08 localhost.localdomain falco[789]: Opening 'syscall' source with modern BPF probe.
Sep 11 08:46:08 localhost.localdomain falco[789]: One ring buffer every '2' CPUs.
...

You can obtain them by running kubectl logs <falco_pod_name>

hkinjo2 commented 4 months ago

@Andreagit97

As I mentioned last time, Falco is not running, as you can see, so no logs can be collected. What should we do?

image

alacuku commented 4 months ago

Hi @hkinjo2, it seems that one of the initContainers is failing. Please provide the logs for those containers. To get the logs of a previous run for a given container use the --previous flag:

kubectl logs --previous -n falco falco-7f5qx name-of-init-container-here
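
When the init container names are not known in advance, they can be read from the pod spec first. The following is a sketch under that assumption; init_logs is a hypothetical helper, and it only relies on kubectl get with a jsonpath expression and kubectl logs with --previous and -c.

```shell
# Print the logs of the previous run of every init container of a pod.
# Usage: init_logs <namespace> <pod>
init_logs() {
  ns=$1
  pod=$2
  # jsonpath prints the init container names separated by spaces.
  names=$(kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.spec.initContainers[*].name}')
  for c in $names; do
    echo "=== $c ==="
    kubectl logs --previous -n "$ns" "$pod" -c "$c" \
      || echo "(no previous run recorded for $c)"
  done
}
```

For example: init_logs falco falco-7f5qx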

hkinjo2 commented 4 months ago

@alacuku Thank you for your continued support, and sorry for my late reply. I tried the command you provided, but it still didn't work. Please see the images. image image

hkinjo2 commented 3 months ago

@alacuku

Hello. Were you able to investigate using the information I provided previously? Please let us know the current status. We are stuck on this at my workplace. We apologize for any inconvenience. Thank you very much.

hkinjo2 commented 3 months ago

@Andreagit97

Yesterday, April 4, we upgraded the AKS cluster to a supported version (1.27.9), partly to isolate whether the problem is caused by AKS or by Falco. The upgrade completed successfully, but the Falco problem is still not resolved. We checked the pod events and found the following message: "Preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod...". Please see the attached image for the actual screen showing this message. We would appreciate Falco's perspective on possible causes and solutions. Thank you in advance.

image

akirtanabe commented 2 months ago

@Andreagit97 Thank you for your continued support. What is the status of the investigation?

We compared the AKS cluster where Falco stopped working after the version upgrade with other environments where Falco runs normally, and found a difference in the containers, so we are sharing it. Compared with an environment that is running normally, there is an extra container, "falcoctl-artifact-follow". We expected each pod to have only a container called "falco".

=== Excerpt ===
NameSpace  NAME         CONTAINERS
falco      falco-4t57h  falco,falcoctl-artifact-follow
falco      falco-f4h4s  falco,falcoctl-artifact-follow
falco      falco-xd9cz  falco,falcoctl-artifact-follow
falco      falco-z2s8l  falco,falcoctl-artifact-follow
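
A comparison like this can be automated by filtering such a table for pods whose container list differs from the expected one. This is a sketch only: extra_containers is a hypothetical helper, and the kubectl command in the comment is one way to produce a table in this shape, not necessarily how the excerpt above was generated.

```shell
# List pods whose container set differs from the expected one.
# Reads rows of "NAMESPACE NAME CONTAINERS" on stdin; the header row is skipped.
# One way to produce such a table (assumption):
#   kubectl get pods -n falco -o \
#     'custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,CONTAINERS:.spec.containers[*].name'
# Usage: extra_containers <expected-containers>
extra_containers() {
  awk -v want="$1" 'NR > 1 && $3 != want { print $2 ": " $3 }'
}
```

For example, piping the table into extra_containers falco prints every pod that carries more than the single falco container.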

Andreagit97 commented 2 months ago

Hi folks, as we said, we can help you if Falco is not working, but we need logs. Please clean up all your Falco instances and try this:

helm repo update falcosecurity

Then

helm show chart falcosecurity/falco

the output should be something like

apiVersion: v2
appVersion: 0.37.1
dependencies:
...

Then install Falco

helm install falco \
  --set driver.kind=modern_ebpf \
  --set falcoctl.artifact.install.enabled=false \
  --set falcoctl.artifact.follow.enabled=false \
  falcosecurity/falco

If Falco doesn't work and is in CrashLoopBackOff, please provide us with the logs:

kubectl logs <your-falco-instance-name>
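
To find which instances to collect logs from, the pods in CrashLoopBackOff can be filtered out of the pod listing. A minimal sketch, assuming the default kubectl get pods column order (NAME, READY, STATUS, ...); crashing_pods is a hypothetical name.

```shell
# Print the names of pods whose STATUS column contains CrashLoopBackOff
# (this also matches Init:CrashLoopBackOff).
# Reads `kubectl get pods -n <namespace>` output on stdin.
crashing_pods() {
  awk 'NR > 1 && $3 ~ /CrashLoopBackOff/ { print $1 }'
}
```

For example: kubectl get pods -n falco | crashing_pods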

akirtanabe commented 2 months ago

@Andreagit97 Hello, and thank you for the instructions. However, the result was not what I expected. I am sending the execution log as well, so please check it.

Thank you for your continued help. falco.zip

Andreagit97 commented 2 months ago

Hmm, it seems you already have a running Falco instance. You can try deleting it with

helm uninstall falco -n falco

and then retry

helm install falco \
  --set driver.kind=modern_ebpf \
  --set falcoctl.artifact.install.enabled=false \
  --set falcoctl.artifact.follow.enabled=false \
  falcosecurity/falco

akirtanabe commented 1 month ago

@Andreagit97 Hello,

Thank you for your guidance. We recently conducted a reinstall of Falco. Out of the four redeployed pods, three have successfully started up. However, one remains in a pending state without starting up. Attached is the event log for your review to assess if there are any functional implications. Additionally, if there are any impacts, could you please advise on the appropriate course of action? Thank you. falco-rg6n9.17d171d27601c9ec_Event.yaml.zip

Andreagit97 commented 1 month ago

Hi @akirtanabe, it seems you have reached the pod limit on a node.

reason: FailedScheduling
message: >-
  0/4 nodes are available: 1 Too many pods. preemption: 0/4 nodes are available:
  4 No preemption victims found for incoming pod..

This is not related to Falco; your nodes are probably reaching their pod limits. I cannot help you here, but you can probably find this issue discussed online, for example: https://learn.microsoft.com/en-us/answers/questions/761871/unable-to-schedule-pods-on-nodes-it-says-too-many
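
A scheduling message like the one above can be sorted into a rough cause before deciding where to look. This is a sketch, not an exhaustive classifier; scheduling_cause is a hypothetical name, and the kubectl commands in the comments are suggestions for gathering the inputs.

```shell
# Classify a FailedScheduling event message read on stdin.
# The message can be collected with, for example:
#   kubectl get events -n falco --field-selector reason=FailedScheduling
# Per-node pod capacity can be listed with, for example:
#   kubectl get nodes -o custom-columns=NAME:.metadata.name,PODS:.status.allocatable.pods
scheduling_cause() {
  msg=$(cat)
  case "$msg" in
    *"Too many pods"*) echo "pod-limit" ;;
    *"Insufficient cpu"*|*"Insufficient memory"*) echo "resources" ;;
    *) echo "other" ;;
  esac
}
```

A result of pod-limit points at node capacity rather than at Falco itself.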

akirtanabe commented 1 month ago

@Andreagit97 Thank you for your help.

I understand that this was likely caused by the AKS environment. The pending pods recovered over time.

So I think reinstalling Falco resolved the issue.

Thank you for your support so far. I'm going to close this issue now.

Thank you.

Andreagit97 commented 1 month ago

great, I'm happy to hear that! I will close the issue

akirtanabe commented 4 weeks ago

@Andreagit97 Hello.

The same problem we asked about has also occurred in a different environment. I reinstalled Falco, but this also failed.

Could you please take a look at the logs for details? We need to resolve this urgently.

20240612_falco.txt 20240613_falco.txt