Closed: truongnht closed this issue 11 months ago.
After manually deleting the job, I noticed that the node-collector job spins up again, targeting the node which is no longer available.
@truongnht thanks for reporting, I'll review it and update you.
Hi @chen-keinan, I am curious whether you have any results from the investigation?
@truongnht due to KubeCon (America) prep and attendance it took longer; I'll get to it next week.
@truongnht are you using Karpenter?
yes, we are using Karpenter
I'm not a Karpenter expert; my guess is it could be related to the topologyKey podAntiAffinity.
Not sure how to fix it when using Karpenter, but you can disable infraassessment as a workaround.
I'd also suggest using a toleration if possible.
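For reference, disabling infraassessment via the Helm values could look roughly like this (a sketch; assuming the chart exposes `operator.infraAssessmentScannerEnabled`, please verify against your chart version):

```yaml
# values.yaml (sketch): turn off the infra assessment scanner so the operator
# stops creating node-collector jobs; the key name is an assumption, check your chart version
operator:
  infraAssessmentScannerEnabled: false
```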
We're running into a similar problem, not using Karpenter.
But Trivy starts a node-collector job selecting a control-plane node without the necessary toleration.
@chen-keinan, indeed we took the workaround of disabling infraassessment, however it would be better if this ticket were fixed.
Sure, one does not depend on the other.
I see this issue has been closed as completed, but this is still happening to us, and I don't know how #1644 was supposed to fix this?
We had to re-enable the node-selector, but Trivy still creates jobs without tolerations for control-plane nodes, and I assume for nodes tainted for other reasons as well?
Should I open a new issue?
@cwrau sure, please open a new issue and add the details there.
Note: you can choose whether to use the node selector; by default the node-collector runs on any deployed Node.
Ok? But how is Trivy supposed to collect node info while running on another node? 😅 I see the node-collector job is mounting stuff from the host?
That's why we enabled the node-selector, but using the node-selector the job won't be scheduled on the control-plane, due to taints
@cwrau not sure what your use-case is; this ability has been requested by other community members. Are you setting a toleration? Can you provide more info on your use-case?
this ability has been requested by other community members.
Yeah, and I don't know how they expect this feature to work without the node-selector 😅 Or, is it working without the node-selector? I can't imagine how, as it's mounting stuff from the host and all InfraAssessmentReports are the same.
are you setting a toleration?
No, is that required? I can find nothing interesting about the node-selector in the docs. And since Trivy decides to launch the jobs, I would've assumed that it would also figure out the tolerations to set, or skip these nodes.
can you provide more info on your use-case?
I don't know how to explain it in detail, I just want Trivy to scan the nodes 😅
Providing more info means: add your configuration (cm), logs, env type (cloud, on-prem), trivy-operator version, a screenshot from the stuck pod, or anything else which could help me understand what the problem is.
But first I suggest you set a toleration if you want the pod to be scheduled on a tainted Node.
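For example, something like this in the trivy-operator ConfigMap (a sketch; I believe the key is `scanJob.tolerations` and that it is also applied to node-collector jobs, but please check the configuration docs for your version):

```yaml
# trivy-operator ConfigMap data (sketch): tolerate the control-plane taint so the
# collector pod can be scheduled there; whether this key also covers node-collector is an assumption
scanJob.tolerations: '[{"key":"node-role.kubernetes.io/control-plane","operator":"Exists","effect":"NoSchedule"}]'
```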
add your configuration (cm)
compliance.failEntriesLimit: "10"
configAuditReports.scanner: Trivy
node.collector.imageRef: ghcr.io/aquasecurity/node-collector:0.1.1
node.collector.nodeSelector: "true"
nodeCollector.volumeMounts: '[{"mountPath":"/var/lib/etcd","name":"var-lib-etcd","readOnly":true},{"mountPath":"/var/lib/kubelet","name":"var-lib-kubelet","readOnly":true},{"mountPath":"/var/lib/kube-scheduler","name":"var-lib-kube-scheduler","readOnly":true},{"mountPath":"/var/lib/kube-controller-manager","name":"var-lib-kube-controller-manager","readOnly":true},{"mountPath":"/etc/systemd","name":"etc-systemd","readOnly":true},{"mountPath":"/lib/systemd/","name":"lib-systemd","readOnly":true},{"mountPath":"/etc/kubernetes","name":"etc-kubernetes","readOnly":true},{"mountPath":"/etc/cni/net.d/","name":"etc-cni-netd","readOnly":true}]'
nodeCollector.volumes: '[{"hostPath":{"path":"/var/lib/etcd"},"name":"var-lib-etcd"},{"hostPath":{"path":"/var/lib/kubelet"},"name":"var-lib-kubelet"},{"hostPath":{"path":"/var/lib/kube-scheduler"},"name":"var-lib-kube-scheduler"},{"hostPath":{"path":"/var/lib/kube-controller-manager"},"name":"var-lib-kube-controller-manager"},{"hostPath":{"path":"/etc/systemd"},"name":"etc-systemd"},{"hostPath":{"path":"/lib/systemd"},"name":"lib-systemd"},{"hostPath":{"path":"/etc/kubernetes"},"name":"etc-kubernetes"},{"hostPath":{"path":"/etc/cni/net.d/"},"name":"etc-cni-netd"}]'
report.recordFailedChecksOnly: "true"
scanJob.podTemplateContainerSecurityContext: '{"allowPrivilegeEscalation":false,"capabilities":{"drop":["ALL"]},"privileged":false,"readOnlyRootFilesystem":true,"runAsGroup":10000,"runAsNonRoot":true,"runAsUser":10000}'
scanJob.podTemplatePodSecurityContext: '{"seccompProfile":{"type":"RuntimeDefault"}}'
vulnerabilityReports.scanner: Trivy
logs
No helpful logs, only that the node is found and the job is getting scheduled;
DEBUG node-reconciler Getting node from cache {"node": {"name":"1111-teuto-scan-2207-control-plane-l42ss-gj8w9"}}
DEBUG node-reconciler Checking whether cluster Infra assessments report exists {"node": {"name":"1111-teuto-scan-2207-control-plane-l42ss-gj8w9"}}
DEBUG node-reconciler Checking whether Node info collector job have been scheduled {"node": {"name":"1111-teuto-scan-2207-control-plane-l42ss-gj8w9"}}
DEBUG node-reconciler Checking node collector jobs limit {"node": {"name":"1111-teuto-scan-2207-control-plane-l42ss-gj8w9"}, "count": 0, "limit": 3}
DEBUG node-reconciler Scheduling Node collector job {"node": {"name":"1111-teuto-scan-2207-control-plane-l42ss-gj8w9"}}
env type (cloud, on-prem)
private cloud -> on-prem
trivy-operator version
0.18.5
a screenshot from the stuck pod, or anything else which could help me understand what the problem is.
The job;
apiVersion: batch/v1
kind: Job
metadata:
annotations:
batch.kubernetes.io/job-tracking: ""
creationTimestamp: "2024-03-12T09:34:41Z"
generation: 1
labels:
app.kubernetes.io/managed-by: trivy-operator
node-info.collector: Trivy
trivy-operator.resource.kind: Node
trivy-operator.resource.name: 1111-teuto-scan-2207-control-plane-l42ss-gj8w9
name: node-collector-756ffb6f47
namespace: trivy
resourceVersion: "487568794"
uid: 13954e28-513e-47d9-b563-4ca968cc06b0
spec:
activeDeadlineSeconds: 900
backoffLimit: 0
completionMode: NonIndexed
completions: 1
parallelism: 1
selector:
matchLabels:
batch.kubernetes.io/controller-uid: 13954e28-513e-47d9-b563-4ca968cc06b0
suspend: false
template:
metadata:
creationTimestamp: null
labels:
app: node-collector
batch.kubernetes.io/controller-uid: 13954e28-513e-47d9-b563-4ca968cc06b0
batch.kubernetes.io/job-name: node-collector-756ffb6f47
controller-uid: 13954e28-513e-47d9-b563-4ca968cc06b0
job-name: node-collector-756ffb6f47
spec:
automountServiceAccountToken: true
containers:
- args:
- k8s
- --node
- 1111-teuto-scan-2207-control-plane-l42ss-gj8w9
command:
- node-collector
image: ghcr.io/aquasecurity/node-collector:0.1.1
imagePullPolicy: IfNotPresent
name: node-collector
resources:
limits:
cpu: 100m
memory: 100M
requests:
cpu: 50m
memory: 50M
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
privileged: false
readOnlyRootFilesystem: true
runAsGroup: 10000
runAsNonRoot: true
runAsUser: 10000
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/lib/etcd
name: var-lib-etcd
readOnly: true
- mountPath: /var/lib/kubelet
name: var-lib-kubelet
readOnly: true
- mountPath: /var/lib/kube-scheduler
name: var-lib-kube-scheduler
readOnly: true
- mountPath: /var/lib/kube-controller-manager
name: var-lib-kube-controller-manager
readOnly: true
- mountPath: /etc/systemd
name: etc-systemd
readOnly: true
- mountPath: /lib/systemd/
name: lib-systemd
readOnly: true
- mountPath: /etc/kubernetes
name: etc-kubernetes
readOnly: true
- mountPath: /etc/cni/net.d/
name: etc-cni-netd
readOnly: true
dnsPolicy: ClusterFirst
hostPID: true
nodeSelector:
kubernetes.io/hostname: 1111-teuto-scan-2207-control-plane-l42ss-gj8w9
restartPolicy: Never
schedulerName: default-scheduler
securityContext:
seccompProfile:
type: RuntimeDefault
serviceAccount: trivy-trivy-operator
serviceAccountName: trivy-trivy-operator
terminationGracePeriodSeconds: 30
volumes:
- hostPath:
path: /var/lib/etcd
type: ""
name: var-lib-etcd
- hostPath:
path: /var/lib/kubelet
type: ""
name: var-lib-kubelet
- hostPath:
path: /var/lib/kube-scheduler
type: ""
name: var-lib-kube-scheduler
- hostPath:
path: /var/lib/kube-controller-manager
type: ""
name: var-lib-kube-controller-manager
- hostPath:
path: /etc/systemd
type: ""
name: etc-systemd
- hostPath:
path: /lib/systemd
type: ""
name: lib-systemd
- hostPath:
path: /etc/kubernetes
type: ""
name: etc-kubernetes
- hostPath:
path: /etc/cni/net.d/
type: ""
name: etc-cni-netd
status:
active: 1
ready: 0
startTime: "2024-03-12T09:34:41Z"
uncountedTerminatedPods: {}
The pod;
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: "2024-03-12T09:34:41Z"
finalizers:
- batch.kubernetes.io/job-tracking
generateName: node-collector-756ffb6f47-
labels:
app: node-collector
batch.kubernetes.io/controller-uid: 13954e28-513e-47d9-b563-4ca968cc06b0
batch.kubernetes.io/job-name: node-collector-756ffb6f47
controller-uid: 13954e28-513e-47d9-b563-4ca968cc06b0
job-name: node-collector-756ffb6f47
name: node-collector-756ffb6f47-jpvvl
namespace: trivy
ownerReferences:
- apiVersion: batch/v1
blockOwnerDeletion: true
controller: true
kind: Job
name: node-collector-756ffb6f47
uid: 13954e28-513e-47d9-b563-4ca968cc06b0
resourceVersion: "487568797"
uid: addfcdb8-0182-4b5b-ad96-4b7cb2933494
spec:
automountServiceAccountToken: true
containers:
- args:
- k8s
- --node
- 1111-teuto-scan-2207-control-plane-l42ss-gj8w9
command:
- node-collector
image: ghcr.io/aquasecurity/node-collector:0.1.1
imagePullPolicy: Always
name: node-collector
resources:
limits:
cpu: 100m
memory: 100M
requests:
cpu: 50m
memory: 50M
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
privileged: false
readOnlyRootFilesystem: true
runAsGroup: 10000
runAsNonRoot: true
runAsUser: 10000
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/lib/etcd
name: var-lib-etcd
readOnly: true
- mountPath: /var/lib/kubelet
name: var-lib-kubelet
readOnly: true
- mountPath: /var/lib/kube-scheduler
name: var-lib-kube-scheduler
readOnly: true
- mountPath: /var/lib/kube-controller-manager
name: var-lib-kube-controller-manager
readOnly: true
- mountPath: /etc/systemd
name: etc-systemd
readOnly: true
- mountPath: /lib/systemd/
name: lib-systemd
readOnly: true
- mountPath: /etc/kubernetes
name: etc-kubernetes
readOnly: true
- mountPath: /etc/cni/net.d/
name: etc-cni-netd
readOnly: true
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-nn2j7
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
hostPID: true
nodeSelector:
kubernetes.io/hostname: 1111-teuto-scan-2207-control-plane-l42ss-gj8w9
preemptionPolicy: PreemptLowerPriority
priority: 0
restartPolicy: Never
schedulerName: default-scheduler
securityContext:
seccompProfile:
type: RuntimeDefault
serviceAccount: trivy-trivy-operator
serviceAccountName: trivy-trivy-operator
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- hostPath:
path: /var/lib/etcd
type: ""
name: var-lib-etcd
- hostPath:
path: /var/lib/kubelet
type: ""
name: var-lib-kubelet
- hostPath:
path: /var/lib/kube-scheduler
type: ""
name: var-lib-kube-scheduler
- hostPath:
path: /var/lib/kube-controller-manager
type: ""
name: var-lib-kube-controller-manager
- hostPath:
path: /etc/systemd
type: ""
name: etc-systemd
- hostPath:
path: /lib/systemd
type: ""
name: lib-systemd
- hostPath:
path: /etc/kubernetes
type: ""
name: etc-kubernetes
- hostPath:
path: /etc/cni/net.d/
type: ""
name: etc-cni-netd
- name: kube-api-access-nn2j7
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2024-03-12T09:34:41Z"
message: '0/6 nodes are available: 3 node(s) didn''t match Pod''s node affinity/selector,
3 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. preemption:
0/6 nodes are available: 6 Preemption is not helpful for scheduling..'
reason: Unschedulable
status: "False"
type: PodScheduled
phase: Pending
qosClass: Burstable
But first I suggest you set a toleration if you want the pod to be scheduled on a tainted Node.
Yeah, that would be a short-term solution.
But I think this is a bigger problem. You removed the need for node-selectors for the ClusterInfraAssessmentReports, but that just doesn't make sense. Trivy can't scan node A while being scheduled on node B.
Not enabling the node-selector completely invalidates the ClusterInfraAssessmentReports for nodes and gives a false sense of security/problems.
I might see some problems on "node A", search for hours for where they're coming from, how to update it, and why Trivy thinks that's the case even though I see it differently on the server, just to realize that Trivy actually scanned node B and saved the result under node A.
The real, long-term solution should be to always enable the node-selector, read the taints of each node, maybe check whether they should be ignored (e.g. for control-planes, random taints, stuff like that, but not "real" taints like the one from cordon), and create tolerations from that.
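For illustration, the mapping this proposal implies (a sketch; the control-plane taint is the one from the scheduler message above, other taints would follow the same pattern):

```yaml
# Sketch: a node carrying this taint ...
taints:
  - key: node-role.kubernetes.io/control-plane
    effect: NoSchedule
# ... would need the generated node-collector job to carry a matching toleration:
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
```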
@cwrau the param for using the node-selector is enabled by default. You can choose not to use it (by configuration), so in terms of scanning, every node will be collected by the node-collector.
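In the ConfigMap above this corresponds to the `node.collector.nodeSelector` key (set to "true" there); a sketch of turning it off, assuming "false" is the accepted value:

```yaml
# trivy-operator ConfigMap data (sketch): stop pinning the node-collector job to the
# node being scanned; the "false" value is an assumption, check the configuration reference
node.collector.nodeSelector: "false"
```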
@cwrau the param for using the node-selector is enabled by default. You can choose not to use it (by configuration), so in terms of scanning, every node will be collected by the node-collector.
Ah, perfect, then the only missing part would be the tolerations.
Or, if Trivy doesn't want to add tolerations by itself, it shouldn't try to schedule jobs for nodes with taints (that aren't covered by the tolerations).
This could be an enhancement.
Any updates on this?
@ltdeoliveira this can be easily fixed by adding a toleration to the node-collector scan job.
@chen-keinan Could you please provide an example? I'm installing the operator with the Helm chart.
@ltdeoliveira make sure nodeAffinity is not working in conjunction with the tolerations.
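A sketch of what that could look like in the Helm values (assuming the chart exposes `trivyOperator.scanJobTolerations`, which I believe maps to the `scanJob.tolerations` config key; verify against your chart version):

```yaml
# values.yaml (sketch): tolerations applied to scan jobs, including node-collector,
# so they can be scheduled onto tainted nodes; the key name is an assumption
trivyOperator:
  scanJobTolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
```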
I'm also having this problem with Karpenter nodes. Even after setting the tolerations and making sure there are no affinities, the node-collector job and pod try to deploy on Fargate profiles (even when forcing the use of a nodeSelector with a Karpenter node, which seems to work for trivy-operator but not for the node-collector). I had to disable infraassessment. Any update on this?
We upgraded trivy-operator from v16.0 to v16.4 (chart version 18.4) using ArgoCD. After the upgrade, trivy-operator spins up node-collector jobs. A few went fine, however we have one job which is scheduled on a cordoned node. Because of that, the node-collector stays pending until we terminate it ourselves. I've shared the logs here for reference. A dumb question: have you excluded cordoned nodes from scheduling, or is it a race condition that leads to this situation?