aws-containers / kubectl-detector-for-docker-socket

A Kubectl plugin that can detect if any of your workloads or manifest files are mounting the docker.sock volume
Apache License 2.0

Error scanning namespace workloads if there are batch jobs running in it #18

Open saholo21 opened 1 year ago

saholo21 commented 1 year ago

I am trying to scan a cluster that has different kinds of workloads (deployments, pods, statefulsets, batch jobs, etc.). However, when the scan finishes, I always get the same error: "jobs.batch not found. The following table may be incomplete due to errors detected during the run." The table returns only a single row, for the kube-system namespace, and none of the other workloads, which number more than 300. I believe this happens because some jobs are running when the scan starts but finish during the scan (as they are meant to), and the plugin interprets this as an issue and throws an error. Is there any workaround for this problem?

Input:

kubectl dds

Output:

error: [jobs.batch "job1" not found, jobs.batch "job2" not found, jobs.batch "job3" not found, jobs.batch "job4" not found]
Warning: The following table may be incomplete due to errors detected during the run
NAMESPACE     TYPE       NAME      STATUS
kube-system   daemonset  aws-node  mounted

rothgar commented 1 year ago

Do you have a yaml example of the workload you're running?

saholo21 commented 1 year ago

No, I don't have access to the batch job YAML. Is there any way to run kubectl dds only for certain types of workloads? For example, only for deployments, then only for statefulsets, and so on, to avoid the batch job scanning error.

rothgar commented 1 year ago

That might be difficult to implement because of the way it works: it scans all pods and then looks for the parent of each pod. It doesn't have a way to start with deployments and work its way down to the pods.

If I implemented this, what type of flag would you want? --scan-resource=deployment or --skip=job? It would get complicated to add both options, and I would need something that could be the default behavior, e.g. --scan-type=all, but either way I still have to scan all pods in the cluster and inspect what owns them.
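
To make the race concrete, here is a minimal client-go sketch of the list-all-pods-then-resolve-the-owner flow described above, assuming a kubeconfig on disk. It is not the plugin's actual code: the kubeconfig wiring, the skipJobs toggle (a stand-in for a hypothetical --skip=job flag), and the printed messages are illustrative assumptions.

package main

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a clientset from the local kubeconfig (illustrative setup, not the plugin's wiring).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	ctx := context.Background()

	skipJobs := false // stand-in for a hypothetical --skip=job flag

	// Step 1: list every pod in the cluster.
	pods, err := clientset.CoreV1().Pods("").List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// Step 2: for each pod, look up its owning Job. A Job that completed and was
	// cleaned up between the pod listing and this Get comes back NotFound, which
	// is where messages like `jobs.batch "job1" not found` originate.
	for _, pod := range pods.Items {
		for _, owner := range pod.OwnerReferences {
			if owner.Kind != "Job" || skipJobs {
				continue
			}
			_, err := clientset.BatchV1().Jobs(pod.Namespace).Get(ctx, owner.Name, metav1.GetOptions{})
			if apierrors.IsNotFound(err) {
				fmt.Printf("job %s/%s disappeared mid-scan (likely finished and was cleaned up)\n", pod.Namespace, owner.Name)
			} else if err != nil {
				fmt.Printf("error fetching job %s/%s: %v\n", pod.Namespace, owner.Name, err)
			}
		}
	}
}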

saholo21 commented 1 year ago

Understood. The flag that would fit this case best is --skip=job, because that's the only workload type I'm having issues with. However, do you know what could be happening? There are some jobs running when the scan starts that finish during the scan, as they are meant to, but the plugin reports this as an error. Is that expected behavior? Thanks for answering.

rothgar commented 1 year ago

I'm not sure what would be causing it without being able to replicate the problem or see the job spec with something like kubectl get job job1 --output yaml.

What version of Kubernetes are you using?

saholo21 commented 1 year ago

I was able to get one of the job workloads that's throwing the error. I am using Kubernetes version 1.23. Let me know if that helps.

apiVersion: batch/v1
kind: Job
metadata:
  creationTimestamp: "2023-09-05T11:55:32Z"
  generation: 1
  labels:
    controller-uid: 80fef74c-a01f-4059-b345-d9238c974bec
    job-name: populate-analytic-data-aws-28231914
  name: populate-analytic-data-aws-28231914
  namespace: default
  ownerReferences:
  - apiVersion: batch/v1
    blockOwnerDeletion: true
    controller: true
    kind: CronJob
    name: populate-analytic-data-aws
    uid: 4bb57997-3256-4197-b36d-3172c50732a8
  resourceVersion: "1177585793"
  uid: 80fef74c-a01f-4059-b345-d9238c974bec
spec:
  activeDeadlineSeconds: 10000
  backoffLimit: 3
  completionMode: NonIndexed
  completions: 1
  parallelism: 1
  selector:
    matchLabels:
      controller-uid: 80fef74c-a01f-4059-b345-d9238c974bec
  suspend: false
  template:
    metadata:
      creationTimestamp: null
      labels:
        controller-uid: 80fef74c-a01f-4059-b345-d9238c974bec
        job-name: populate-analytic-data-aws-28231914
    spec:
      containers:
      - args:
        - --botName
        - populate-analytic-data
        - --cassandra
        - cassandra-traffic-04.internal.company.com,cassandra-traffic-02.internal.company.com,cassandra-traffic-03.internal.company.com
        - --keyspace
        - traffic
        - --threads
        - "4"
        - --env
        - staging
        env:
        - name: ENV
          value: staging
        - name: log_level
          value: DEBUG
        image: 111111111111.dkr.ecr.us-east-1.amazonaws.com/populate-analytic-data:4.53-reporting
        imagePullPolicy: IfNotPresent
        name: docker
        resources:
          limits:
            cpu: 450m
            memory: 2000Mi
          requests:
            cpu: 250m
            memory: 1400Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Never
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  conditions:
  - lastProbeTime: "2023-09-05T14:42:12Z"
    lastTransitionTime: "2023-09-05T14:42:12Z"
    message: Job was active longer than specified deadline
    reason: DeadlineExceeded
    status: "True"
    type: Failed
  failed: 1
  startTime: "2023-09-05T11:55:32Z"
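
Worth noting from the manifest above: the ownerReferences block shows this Job is created by a CronJob (populate-analytic-data-aws), and the status shows it failed with DeadlineExceeded. One plausible explanation for the not-found errors (an assumption, not confirmed in this thread) is that the CronJob controller prunes finished Jobs according to its history limits, so a Job that exists when pods are listed can be gone by the time the scanner fetches it. A small client-go check along these lines could show how aggressively finished Jobs are cleaned up; the namespace and CronJob name are taken from the manifest above, and everything else is illustrative.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Illustrative diagnostic: inspect how quickly the owning CronJob cleans up finished Jobs.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	cj, err := clientset.BatchV1().CronJobs("default").Get(context.Background(),
		"populate-analytic-data-aws", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	show := func(name string, p *int32) {
		if p == nil {
			fmt.Printf("%s: unset (cluster default applies)\n", name)
			return
		}
		fmt.Printf("%s: %d\n", name, *p)
	}
	// Finished Jobs beyond these limits are deleted by the CronJob controller;
	// a TTL on the Job template deletes finished Jobs after a fixed delay.
	show("successfulJobsHistoryLimit", cj.Spec.SuccessfulJobsHistoryLimit)
	show("failedJobsHistoryLimit", cj.Spec.FailedJobsHistoryLimit)
	show("ttlSecondsAfterFinished (job template)", cj.Spec.JobTemplate.Spec.TTLSecondsAfterFinished)
}
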
saholo21 commented 1 year ago

Hi @rothgar, is there any update on this?

rothgar commented 1 year ago

Thank you for the example. I'm sorry I haven't been able to test this yet. I'm preparing for some work travel and conference talks, and have other priorities at work.

saholo21 commented 1 year ago

Hi @rothgar. Just a quick question to confirm something: when the error message only lists some jobs and the final warning says "The following table may be incomplete due to errors detected during the run", does that mean the result may be incomplete only because those jobs were not scanned (so it's unknown whether they mount docker.sock), or could the job errors also have stopped the scan of the other workloads (deployments, daemonsets, statefulsets, etc.)?

rothgar commented 1 year ago

It should continue with other jobs and workload types. It doesn't exit the app. It appends the error and continues. https://github.com/aws-containers/kubectl-detector-for-docker-socket/blob/main/main.go#L270-L273
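
For anyone wondering what append-and-continue looks like in practice, here is a rough sketch of that pattern (not the repository's actual code, which is at the link above). The resolveJobs helper, its arguments, and the IsNotFound handling are illustrative assumptions about one possible way to tolerate Jobs that finish mid-scan.

package main

import (
	"context"
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// resolveJobs fetches each referenced Job, collecting errors instead of stopping,
// so a few missing Jobs do not abort the rest of the scan. Hypothetical helper,
// not the plugin's implementation. refs maps job name to namespace.
func resolveJobs(ctx context.Context, clientset kubernetes.Interface, refs map[string]string) ([]*batchv1.Job, []error) {
	var jobs []*batchv1.Job
	var errs []error
	for name, namespace := range refs {
		job, err := clientset.BatchV1().Jobs(namespace).Get(ctx, name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			// The Job finished and was cleaned up after its pod was listed.
			// Silently skipping it here is one possible workaround for this issue.
			continue
		}
		if err != nil {
			// Record the error and keep going; the table is still printed,
			// with the "may be incomplete" warning seen in the report above.
			errs = append(errs, err)
			continue
		}
		jobs = append(jobs, job)
	}
	return jobs, errs
}

func main() {
	fmt.Println("resolveJobs above shows the pattern; building a real clientset is omitted for brevity")
}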