Couldn't resolve HEALTH CHECK on GCP Ingress in GKE

Vikram-Raghu commented 9 months ago

Description

Hi, I've been working with the ClamAV Docker image (tag 1.2) in GKE for a while. Please bear with me: GKE is new to me. I've done some basic deployment and maintenance, but I may be missing some details.

This is also my first issue on GitHub, so please excuse any formatting mistakes.

I recently deployed the Docker image to GKE using YAML manifests (config below). I created a simple Deployment and Service.

Deployment file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: clam-av
spec:
  replicas: 1
  selector:
    matchLabels:
      run: clam-av
  template:
    metadata:
      labels:
        run: clam-av
    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: XXX-XXX-pool
      terminationGracePeriodSeconds: 60
      containers:
      - name: clamav-container
        image: clamav/clamav:1.2
        resources:
          requests:
            cpu: 200m
            memory: 1Gi
        imagePullPolicy: Always
        ports:
        - containerPort: 3310
        # - containerPort: 7357

Service file:

apiVersion: v1
kind: Service
metadata:
  name: clam-av-service
  annotations:
    cloud.google.com/backend-config: '{"default": "backend-for-clamAV"}'
spec:
  selector:
    run: clam-av
  ports:
  - name: http3310
    protocol: TCP
    port: 80
    targetPort: 3310
  # - name: http7357
  #   protocol: TCP
  #   port: 80
  #   targetPort: 7357
  type: ClusterIP

I created a route on the GCP Ingress that maps to this Service, and a simple BackendConfig for it:

apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: backend-for-clamAV
spec:
  timeoutSec: 150
  connectionDraining:
    drainingTimeoutSec: 150
  healthCheck:
    checkIntervalSec: 15
    port: 80
    type: HTTP
    requestPath: /
    healthyThreshold: 1
    unhealthyThreshold: 3
    timeoutSec: 15

Whenever I try to access the page via the domain configured in the Ingress, I get the following page (screenshot below):

I also see this warning message related to the health check on the GKE Ingress page.

I've tested this Docker image locally and it works fine. I also created a simple C# web app that acts as a client and scans files for viruses on upload. IT WORKS FINE!

Note: When I expose the Service as type LoadBalancer, I can access the page and connect from my C# application. But when I map it to the Ingress with a route, I get neither the working page I see with local Docker nor a connection from my application.

I think the issue is with the health check in the GKE Ingress.

If the cause is something else that I've missed, please let me know.

Any suggestions on how to resolve this would be appreciated. Thank you!

rsundriyal commented 9 months ago

@Vikram-Raghu As your clamav application works fine locally and through load balancers, I can only suggest two things here.

First, check the clamav-app's connectivity to the internet, which might not be available with your Ingress settings.

Second, if the clamav-app service takes time to download databases at startup, you need a readiness/health check on it.

I'm not sure what else could be the issue from the ClamAV side. This seems to be more about how you built the C# wrapper around clamav and how the Kubernetes Service is configured.

Vikram-Raghu commented 9 months ago

@rsundriyal Thank you for your time and suggestions.

For your first suggestion: I've created a route in the Ingress with the domain added. Is that what you mean by Ingress settings?

For the second: I did try adding readiness and liveness probes to the Deployment file:

        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: 3310
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 15
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: 3310
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 15

But after adding these probes, the pod keeps failing and cannot start. It keeps crashing and had restarted more than 16 times when I last checked, so I removed the probes from the Deployment file. Should I increase the thresholds so the pod can fully download the databases before the health check runs?

Regarding the C# wrapper around ClamAV, I think the problem is on the GKE health check or ClamAV hosting side. I haven't hosted that application in GKE at all. I only run it on my local machine for testing, to connect to the ClamAV server for file scanning.

This is the page that comes up when running ClamAV in local Docker:

At first I thought it was not working. After multiple attempts to connect, I came across this wrapper method for connecting to the clamd daemon from my application, and with it, it WORKS FINE!

The same page appears when I set clam-av-service to type LoadBalancer, and I can connect from my local test application, which also works fine in that case.

But when I add it to the Ingress with the domain mapped to the route, I get the "no healthy upstream" page mentioned earlier in this post.
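
One possible explanation, offered as a sketch and not a confirmed diagnosis: clamd on port 3310 speaks its own TCP protocol, not HTTP, while GKE Ingress is an HTTP(S) load balancer and a BackendConfig health check only accepts HTTP, HTTPS, or HTTP2 as its type. An HTTP check against `/` would then never pass, which would match the "no healthy upstream" page and would also fit the observation that the LoadBalancer Service (a layer-4 path) works. Under that assumption, the fragment below keeps clamd behind a Service of type LoadBalancer on its native port. The Service name `clam-av-tcp` is hypothetical, and the internal load balancer annotation is an assumption about the cluster (drop it if an external address is really wanted).

```yaml
apiVersion: v1
kind: Service
metadata:
  name: clam-av-tcp            # hypothetical name, not from the thread
  annotations:
    # Assumption: the scanner should only be reachable inside the VPC.
    networking.gke.io/load-balancer-type: "Internal"
spec:
  type: LoadBalancer           # layer-4 pass-through; traffic is not parsed as HTTP
  selector:
    run: clam-av               # same pod label as the Deployment above
  ports:
  - name: clamd
    protocol: TCP
    port: 3310
    targetPort: 3310
```

clamd's TCP socket has no authentication, which is why this sketch prefers an internal address over a public one.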

snailcatcher commented 9 months ago

Hi there,

I had a similar issue when I tried to set up clamav as a container app. I solved it by simply waiting for clamd's TCP socket to be up and running. Maybe this approach helps you too. I don't know much about GCP, but I think they support TCP probes as well as HTTP probes.

...
readinessProbe:
  tcpSocket:
    port: 3310
  initialDelaySeconds: 20
  periodSeconds: 10
livenessProbe:
  tcpSocket:
    port: 3310
  periodSeconds: 20
...

I think this should help you: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-a-tcp-liveness-probe
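
On the earlier question about raising the thresholds: instead of stretching `initialDelaySeconds`, a `startupProbe` can hold back the readiness and liveness probes until clamd first opens its socket. This is a minimal sketch that extends the probes above. The 10-minute budget (60 failures at 10 s intervals) is an assumed value, not a measured one, and should be sized to how long the database download and load take in your cluster.

```yaml
# Goes under the clamav-container entry in the Deployment.
startupProbe:
  tcpSocket:
    port: 3310
  periodSeconds: 10
  failureThreshold: 60   # assumed budget: 60 x 10s = up to 10 minutes to start
readinessProbe:
  tcpSocket:
    port: 3310
  periodSeconds: 10
livenessProbe:
  tcpSocket:
    port: 3310
  periodSeconds: 20
```

These probes are run by the kubelet and only govern the pod's own lifecycle. They are separate from the health check that a load balancer in front of the Service performs.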

vienleidl commented 2 months ago

@snailcatcher I did the same on Azure Container Apps and enabled a log search alert, but I'm not sure whether an alert fires when the ClamAV container is stuck in the database-loading phase. See https://github.com/Cisco-Talos/clamav/issues/1282 for more.
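
A TCP probe only shows that the socket accepts connections. One way to check that clamd actually answers commands is an exec probe that pings the daemon, sketched below in Kubernetes probe syntax. It assumes `clamdscan` is on the PATH inside the image, that it reads the same `clamd.conf` as the daemon, and that the ClamAV version supports `--ping` (added in 0.103). Whether this catches the stuck state described in the linked issue is untested here.

```yaml
# Goes under the container entry, in place of the tcpSocket liveness probe.
livenessProbe:
  exec:
    # Assumption: clamdscan is available in the image and can locate clamd
    # through the shared clamd.conf. "--ping 1" makes a single attempt.
    command: ["clamdscan", "--ping", "1"]
  periodSeconds: 30
  timeoutSeconds: 10
  failureThreshold: 3
```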
