Vulnerability processing doesn't seem to work in Kubernetes deployment

everping commented 2 years ago

Fleet version: 4.17.0

Operating system: Kubernetes 1.21.5

💥 Actual behavior

I deployed Fleet on Kubernetes with the following Deployment manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: fleet-app
  name: fleet-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: fleet-app
  template:
    metadata:
      labels:
        app: fleet-app
    spec:
      containers:
      - env:
        - name: FLEET_MYSQL_ADDRESS
          value: xxxxx
        - name: FLEET_MYSQL_DATABASE
          value: xxxxx
        - name: FLEET_MYSQL_PASSWORD
          value: xxxxx
        - name: FLEET_MYSQL_USERNAME
          value: xxxxx
        - name: FLEET_REDIS_ADDRESS
          value: localhost:6379
        - name: FLEET_SERVER_TLS
          value: 'false'
        - name: FLEET_LICENSE_KEY
          value: xxxxx
        - name: FLEET_VULNERABILITIES_DATABASES_PATH
          value: /home/fleet/vulndb/
        image: fleetdm/fleet:4.17.0
        name: fleet-app
        ports:
        - containerPort: 8080
          protocol: TCP
        resources:
          limits:
            cpu: 500m
            memory: 128Mi
          requests:
            cpu: 250m
            memory: 64Mi
      - env: []
        image: redis:latest
        name: redis
        ports:
        - containerPort: 6379
          protocol: TCP
        resources: {}

In theory, vulnerability processing should work and I should be able to see vulnerabilities in my software. In practice, the software tab was empty, and the vulnerability DB folder (which should hold the CVE/CPE databases) was empty as well, as shown below.

So my question is: am I missing something in the Fleet deployment that vulnerability processing needs in order to work?

RachelElysia commented 2 years ago

Hi @everping, is the software table still not showing up?

everping commented 2 years ago

@RachelElysia the software tab shows up now, but vulnerable software still doesn't appear, so I think vulnerability processing is not working.

RachelElysia commented 2 years ago

@everping So sorry to hear that. Can you try running fleetctl get config --include-server-config to view the vulnerabilities settings?

michalnicp commented 2 years ago

By default, vulnerability processing runs every hour and requires the software inventory to be populated from hosts. You can shorten the interval by setting FLEET_VULNERABILITIES_PERIODICITY=5m. Also, there may be an issue with a lock not being released that is preventing vulnerability processing from running. You can check this by connecting to the db and running select * from locks where name = 'vulnerabilities'. The lock will eventually expire after expires_at.
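
A minimal sketch of how that setting could be added to the deployment above, assuming it goes in the same env list as the other FLEET_* variables. Only the last entry is new, and 5m is simply the example value mentioned above, not a recommended production setting.

```yaml
# Fragment: entries under the fleet-app container's env list.
env:
  - name: FLEET_VULNERABILITIES_DATABASES_PATH
    value: /home/fleet/vulndb/
  # New entry: run vulnerability processing every 5 minutes instead of hourly.
  - name: FLEET_VULNERABILITIES_PERIODICITY
    value: "5m"
```

After applying the change, the new value should show up under vulnerabilities.periodicity in the output of fleetctl get config --include-server-config.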

everping commented 2 years ago

@RachelElysia I had to set the config address and log in to the Fleet API to get fleetctl get config working. The result is below:

---
apiVersion: v1
kind: config
spec:
  agent_options:
    config:
      decorators:
        load:
        - SELECT uuid AS host_uuid FROM system_info;
        - SELECT hostname AS hostname FROM system_info;
      options:
        disable_distributed: false
        distributed_interval: 10
        distributed_plugin: tls
        distributed_tls_max_attempts: 3
        logger_plugin: tls
        logger_tls_endpoint: /api/osquery/log
        logger_tls_period: 10
        pack_delimiter: /
    overrides: {}
  fleet_desktop:
    transparency_url: https://fleetdm.com/transparency
  host_expiry_settings:
    host_expiry_enabled: false
    host_expiry_window: 0
  host_settings:
    enable_host_users: true
    enable_software_inventory: true
  integrations:
    jira: null
    zendesk: null
  license:
    device_count: 1
    expiration: "2023-07-12T05:06:52Z"
    note: Created with Fleet License key dispenser
    organization: xxxxx
    tier: premium
  logging:
    debug: false
    json: false
    result:
      config:
        enable_log_compression: false
        enable_log_rotation: false
        result_log_file: /tmp/osquery_result
        status_log_file: /tmp/osquery_status
      plugin: filesystem
    status:
      config:
        enable_log_compression: false
        enable_log_rotation: false
        result_log_file: /tmp/osquery_result
        status_log_file: /tmp/osquery_status
      plugin: filesystem
  org_info:
    org_logo_url: xxxxx
    org_name: xxxxx
  server_settings:
    deferred_save_host: false
    enable_analytics: true
    live_query_disabled: false
    server_url: xxxxx
  smtp_settings:
    authentication_method: authmethod_plain
    authentication_type: authtype_username_password
    configured: false
    domain: ""
    enable_smtp: false
    enable_ssl_tls: true
    enable_start_tls: true
    password: ""
    port: 587
    sender_address: ""
    server: ""
    user_name: ""
    verify_ssl_certs: true
  sso_settings:
    enable_sso: false
    enable_sso_idp_login: false
    entity_id: ""
    idp_image_url: ""
    idp_name: ""
    issuer_uri: ""
    metadata: ""
    metadata_url: ""
  update_interval:
    osquery_detail: 1h0m0s
    osquery_policy: 1h0m0s
  vulnerabilities:
    cpe_database_url: ""
    current_instance_checks: auto
    cve_feed_prefix_url: ""
    databases_path: /home/fleet/vulndb/
    disable_data_sync: false
    periodicity: 1h0m0s
    recent_vulnerability_max_age: 720h0m0s
  vulnerability_settings:
    databases_path: ""
  webhook_settings:
    failing_policies_webhook:
      destination_url: ""
      enable_failing_policies_webhook: false
      host_batch_size: 0
      policy_ids: null
    host_status_webhook:
      days_count: 0
      destination_url: ""
      enable_host_status_webhook: false
      host_percentage: 0
    interval: 24h0m0s
    vulnerabilities_webhook:
      destination_url: ""
      enable_vulnerabilities_webhook: false
      host_batch_size: 0

everping commented 2 years ago

@michalnicp There actually is a record in the locks table. How should we deal with it?

michalnicp commented 2 years ago

You shouldn't need to do anything with the lock. After it expires, Fleet should resume vulnerability processing. There can be some issues releasing the lock if the pod dies, but that should resolve itself once the lock expires on its own.

Do you see any errors in the logs or unusual OOMKilled events in the output from

kubectl describe pod [fleet-pod]

everping commented 2 years ago

@michalnicp I waited until the lock expired, but the owner and expiry time were automatically updated again, as you can see below. That looks like a deadlock.

I also checked the pod and found no error events. But in the application logs, I got:

level=error ts=2022-07-15T18:58:03.868518704Z component=http method=POST uri=/api/v1/osquery/config took=3.383427ms ip_addr=zzzzz x_for_ip_addr=xxxx err="internal error: fetch base config: load team agent options for host: select team: Error 1064: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '\"$.agent_options\" FROM teams WHERE id = ?' at line 1"
level=error ts=2022-07-15T18:58:07.951171239Z component=http method=POST uri=/api/v1/osquery/config took=2.790603ms ip_addr=zzzzz x_for_ip_addr=zzzzz err="internal error: fetch base config: load team agent options for host: select team: Error 1064: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '\"$.agent_options\" FROM teams WHERE id = ?' at line 1"

Does this cause the locking?

zwass commented 2 years ago

@everping what's your MySQL version? That syntax error about the JSON syntax might mean that your version is incompatible, and that could be causing the issues with vulnerability processing.

The "lock" that you see is indicating that one of the Fleet servers has declared its intent to do the vulnerability processing. That's normal and I don't see any indication of a deadlock.

michalnicp commented 2 years ago

I suspect the issue with the sql query may be that ANSI_QUOTES is enabled in MySQL. Can you confirm by running

SELECT @@sql_mode;

michalnicp commented 2 years ago

I suspect the issue may be caused by not enough memory for vulnerability processing. According to https://fleetdm.com/docs/deploying/reference-architectures, we recommend 4 GB of memory. As noted in your deployment yaml above, 128 MB is not enough. I suspect that the pod is getting OOMKilled by k8s.
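
As a rough illustration of that recommendation, the resources section of the fleet-app container could be raised along these lines. The 4Gi limit mirrors the 4 GB figure above; the memory request is only an assumption for illustration and should be sized for your cluster.

```yaml
# Fragment: resources for the fleet-app container.
resources:
  limits:
    cpu: 500m      # unchanged from the original deployment
    memory: 4Gi    # roughly the 4 GB recommended above
  requests:
    cpu: 250m      # unchanged from the original deployment
    memory: 1Gi    # assumption for illustration; tune for your cluster
```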

everping commented 2 years ago

@michalnicp

> I suspect the issue with the sql query may be that ANSI_QUOTES is enabled in MySQL. Can you confirm by running
> SELECT @@sql_mode;

Yes, ANSI_QUOTES is enabled and I'm using MySQL 8. Should I disable it for fleetdm?

> I suspect the issue may be caused by not enough memory for vulnerability processing. According to https://fleetdm.com/docs/deploying/reference-architectures, we recommend 4 GB of memory. As noted in your deployment yaml above, 128 MB is not enough. I suspect that the pod is getting OOMKilled by k8s.

Yes, I have checked the pod status and got

    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137

michalnicp commented 2 years ago

> Yes, ANSI_QUOTES is enabled and I'm using MySQL 8. Should I disable it for fleetdm?

This issue has come up a few times. We have generally tried to make Fleet work with ANSI_QUOTES enabled, but sometimes we miss things. I have already opened a PR to fix this one: #6707. You should disable ANSI_QUOTES for now as a workaround.

Can you increase the pod memory limit and see if vulnerability processing starts working?

everping commented 2 years ago

@michalnicp I updated the deployment as below

        resources:
          limits:
            cpu: 1024m
            memory: 4096Mi
          requests:
            cpu: 500m
            memory: 128Mi

No OOMKilled events appear now, but the vulnerable software list is still empty.

michalnicp commented 2 years ago

Do you see any errors in the logs from fleet? Have you waited at least 1 hour? Now that the pod is not getting killed by k8s, we should have a better chance of tracking down the issue.

everping commented 2 years ago

@michalnicp So far, vulnerability processing still does not work. No vulnerable software appears, even though the pod is no longer being killed by k8s (its age is over 3 days). The only error I'm getting is the SQL syntax error caused by ANSI_QUOTES.

juan-fdz-hawa commented 2 years ago

Hi @everping - Do you mind running this and posting back the results?

SELECT json_value FROM aggregated_stats WHERE type = 'os_versions';

Thanks

everping commented 2 years ago

@juan-fdz-hawa I'm attaching the screenshot here

juan-fdz-hawa commented 2 years ago

Thanks @everping we found a bug with vulnerability processing for LTS versions of Ubuntu - this should be fixed in the next patch release.

noahtalerman commented 2 years ago

Thanks @juan-fdz-hawa! Opening this issue because the patch has not been released yet.

I'm also adding this issue to the release board so that it's tracked.

fleetdm / fleet

Vulnerability processing doesn't seem to work in Kubernetes deployment #6654
