Closed KevinDW-Fluxys closed 6 months ago
@KevinDW-Fluxys you can set a toleration for the scan job with this param
@chen-keinan Unfortunately I already have this parameter set, which is also why my operator and the vulnerability scan jobs are running on the correct node. It seems like the node collector does not take this parameter into account (or something else is wrong).
@KevinDW-Fluxys can you please describe or fetch the node-collector job manifest, to confirm the toleration has been set correctly? Can you also please share the logs or the node-collector pod status?
There is no job being spawned. When I restart the trivy-operator pod it fires up some scan-vulnerabilityreport jobs, but no node-collector. If I remove the excludeNodes parameter I do get a job, which you can see below. The tolerations are there as expected, so that is probably not the problem, as I first suspected.
apiVersion: batch/v1
kind: Job
metadata:
  name: node-collector-647fddb8f4
  namespace: trivy-operator
  labels:
    app.kubernetes.io/managed-by: trivy-operator
    node-info.collector: Trivy
    trivy-operator.resource.kind: Node
    trivy-operator.resource.name: aks-connect-34676365-vmss000000
spec:
  parallelism: 1
  completions: 1
  activeDeadlineSeconds: 300
  backoffLimit: 0
  selector:
    matchLabels:
      batch.kubernetes.io/controller-uid: bfaadc70-637c-4b1e-913b-325ab8bf825d
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: node-collector
        batch.kubernetes.io/controller-uid: bfaadc70-637c-4b1e-913b-325ab8bf825d
        batch.kubernetes.io/job-name: node-collector-647fddb8f4
        controller-uid: bfaadc70-637c-4b1e-913b-325ab8bf825d
        job-name: node-collector-647fddb8f4
    spec:
      volumes:
        ...
      containers:
        - name: node-collector
          ...
      nodeSelector:
        kubernetes.io/hostname: aks-connect-34676365-vmss000000
      ...
      schedulerName: default-scheduler
      tolerations:
        - key: agentpool
          operator: Equal
          value: default
          effect: NoExecute
  completionMode: NonIndexed
  suspend: false
@KevinDW-Fluxys you mention that you set the excludeNodes param; can you explain why you wanted to use it?
You also mention that once you remove excludeNodes the job does appear; can you tell me the status of the node-collector pod?
@chen-keinan we have a situation with 2 nodepools (default & connect), and we only want to schedule on one of them (default). The nodepool we want to schedule on has a taint, and the other one doesn't. Both also carry a label with the name of the nodepool. That's why I want to exclude the other nodepool by passing its label via excludeNodes, and I want it to ignore the taint via the toleration.
The job is in status Pending with the following events:
Events:
  Type     Reason             Age    From                Message
  ----     ------             ----   ----                -------
  Warning  FailedScheduling   4m32s  default-scheduler   0/11 nodes are available: 2 node(s) had untolerated taint {agentpool: connect}, 9 node(s) didn't match Pod's node affinity/selector. preemption: 0/11 nodes are available: 11 Preemption is not helpful for scheduling..
  Normal   NotTriggerScaleUp  4m31s  cluster-autoscaler  pod didn't trigger scale-up: 2 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {agentpool: connect}
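The "untolerated taint" part of this message follows directly from the taint/toleration matching rules. Here is a minimal sketch in Go (simplified types, not the real Kubernetes API): with operator Equal the key, value, and effect must all match, so a toleration for agentpool=default does not cover the agentpool=connect taint, while an operator Exists toleration would.

```go
package main

import "fmt"

// Simplified models of a taint and a toleration (illustrative, not the
// real Kubernetes API types).
type Taint struct{ Key, Value, Effect string }
type Toleration struct{ Key, Operator, Value, Effect string }

// tolerates mirrors the scheduler's matching rule: the keys must match,
// a non-empty toleration effect must match the taint's effect, and with
// operator "Equal" the values must match too; "Exists" ignores the value.
func tolerates(t Taint, tol Toleration) bool {
	if tol.Key != t.Key {
		return false
	}
	if tol.Effect != "" && tol.Effect != t.Effect {
		return false
	}
	if tol.Operator == "Exists" {
		return true
	}
	return tol.Value == t.Value // operator "Equal"
}

func main() {
	connectTaint := Taint{Key: "agentpool", Value: "connect", Effect: "NoExecute"}

	// The toleration from the job manifest above: agentpool=default:NoExecute.
	defaultTol := Toleration{Key: "agentpool", Operator: "Equal", Value: "default", Effect: "NoExecute"}
	fmt.Println(tolerates(connectTaint, defaultTol)) // false: value mismatch, hence "untolerated taint"

	// A wider toleration with operator "Exists" covers both nodepools.
	existsTol := Toleration{Key: "agentpool", Operator: "Exists"}
	fmt.Println(tolerates(connectTaint, existsTol)) // true
}
```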
@KevinDW-Fluxys node-collector uses a nodeSelector, as it needs to run on every node. Maybe you want to set scanJobAffinity, if it can run in conjunction with tolerations.
Does nodeCollector not support tolerations? Or is scanJobTolerations also set on the node collector?
@chen-keinan I have added the following affinity, but now I have no node-collector jobs again.
scanJobAffinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: agentpool
              operator: In
              values:
                - default
I also noticed I had filled out the following parameter, which might be relevant:
scanJobNodeSelector:
  agentpool: default
Maybe it's good to give a summary of all related parameters:
operator:
  # -- scanNodeCollectorLimit the maximum number of node collector jobs create by the operator
  scanNodeCollectorLimit: 1
trivyOperator:
  # -- scanJobAffinity affinity to be applied to the scanner pods and node-collector
  scanJobAffinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: agentpool
                operator: In
                values:
                  - default
  # -- scanJobTolerations tolerations to be applied to the scanner pods and node-collector so that they can run on nodes with matching taints
  scanJobTolerations:
    - key: "agentpool"
      operator: "Equal"
      value: "default"
      effect: NoExecute
  # -- If you do want to specify tolerations, uncomment the following lines, adjust them as necessary, and remove the
  # square brackets after 'scanJobTolerations:'.
  # - key: "key1"
  #   operator: "Equal"
  #   value: "value1"
  #   effect: "NoSchedule"
  # -- scanJobNodeSelector nodeSelector to be applied to the scanner pods so that they can run on nodes with matching labels
  scanJobNodeSelector:
    agentpool: default
  # -- If you do want to specify nodeSelector, uncomment the following lines, adjust them as necessary, and remove the
  # square brackets after 'scanJobNodeSelector:'.
  # nodeType: worker
  # cpu: sandylake
  # teamOwner: operators
# -- tolerations set the operator tolerations
tolerations:
  - key: "agentpool"
    operator: "Equal"
    value: "default"
    effect: NoExecute
# -- affinity set the operator affinity
affinity: {}
nodeCollector:
  # -- useNodeSelector determine if to use nodeSelector (by auto detecting node name) with node-collector scan job
  useNodeSelector: true
  # -- excludeNodes comma-separated node labels that the node-collector job should exclude from scanning (example kubernetes.io/arch=arm64,team=dev)
  excludeNodes: "agentpool=connect"
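The intended excludeNodes semantics can be sketched as follows (illustrative Go; parseExcludeNodes and isExcluded are made-up names, not the trivy-operator implementation): the comma-separated key=value pairs are parsed into a map, and a node is skipped when its labels match any of those pairs.

```go
package main

import (
	"fmt"
	"strings"
)

// parseExcludeNodes turns a string like "k1=v1,k2=v2" into a map.
func parseExcludeNodes(s string) map[string]string {
	out := map[string]string{}
	for _, pair := range strings.Split(s, ",") {
		if k, v, ok := strings.Cut(pair, "="); ok {
			out[k] = v
		}
	}
	return out
}

// isExcluded reports whether a node's labels match any excluded key=value pair.
func isExcluded(nodeLabels, excludeNodes map[string]string) bool {
	for key, val := range excludeNodes {
		if nodeLabels[key] == val {
			return true
		}
	}
	return false
}

func main() {
	exclude := parseExcludeNodes("agentpool=connect")
	fmt.Println(isExcluded(map[string]string{"agentpool": "connect"}, exclude)) // true: node is skipped
	fmt.Println(isExcluded(map[string]string{"agentpool": "default"}, exclude)) // false: node is scanned
}
```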
@KevinDW-Fluxys the 2 params we discussed: toleration and affinity.
@chen-keinan
They are set, as you can see in my previous reply. Or should they be set differently?
@KevinDW-Fluxys currently scanJobNodeSelector will work only for the vulnerability scan-job, not for node-collector; it requires a small change to support it in node-collector. Other than that the settings look ok.
It would be very nice if the nodeSelector also supported the node-collector.
@chen-keinan
With this configuration I'm not getting any node-collector jobs spawned. If the config looks ok, I'm not sure what is going wrong. Any idea what I can do to make my setup work? The use case is in essence quite simple: make the node collector run on nodes with a certain taint, and not on nodes with a specific label.
To be honest I do not know; the right params are set, maybe something is conflicting.
@chen-keinan After some more tests I noticed that the nodeSelector on the job is being set to the hostname of a node that should not have been selected. I have taints and tolerations set to avoid this node, but its hostname is still added as a nodeSelector. Since the job also has the correct tolerations, it's logical that it can't be scheduled.
nodeSelector:
  kubernetes.io/hostname: aks-connect-34676365-vmss000000
I am also running it on a second cluster where we only have 1 nodepool, and there the job is not being spawned at all. I'm not sure how the nodeSelector is being generated, but the issue probably lies there.
I have also tried toggling the following parameter, but it does not seem to have any effect.
nodeCollector:
  # -- useNodeSelector determine if to use nodeSelector (by auto detecting node name) with node-collector scan job
  useNodeSelector: true
Setting this flag to false means node-collector will not be assigned to each Node, meaning node-collector can potentially be assigned to the same Node and not to every Node.
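The effect of this flag can be sketched like this (illustrative only; JobSpec and nodeCollectorJob are made-up names, not trivy-operator code): with useNodeSelector enabled, each collector job is pinned to one node via a kubernetes.io/hostname selector and gets the same node name as its --node argument; with it disabled, the scheduler is free to place the pods anywhere.

```go
package main

import "fmt"

// JobSpec is a stripped-down stand-in for the fields of the generated Job
// that matter here: the pod's nodeSelector and the container args.
type JobSpec struct {
	NodeSelector map[string]string
	Args         []string
}

// nodeCollectorJob sketches the per-node job generation: the --node arg
// always names the target node, and the hostname selector is only added
// when useNodeSelector is true.
func nodeCollectorJob(nodeName string, useNodeSelector bool) JobSpec {
	job := JobSpec{Args: []string{"k8s", "--node", nodeName}}
	if useNodeSelector {
		job.NodeSelector = map[string]string{"kubernetes.io/hostname": nodeName}
	}
	return job
}

func main() {
	pinned := nodeCollectorJob("aks-default-vmss000001", true)
	fmt.Println(pinned.NodeSelector["kubernetes.io/hostname"]) // aks-default-vmss000001

	// With the flag off there is no selector, so several collectors may
	// end up on the same node.
	free := nodeCollectorJob("aks-default-vmss000001", false)
	fmt.Println(len(free.NodeSelector)) // 0
}
```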
@chen-keinan Thanks, that makes sense. However it does not change the issue I'm having. Do you have any idea what might be wrong with the generation of the job and the setting of the hostname?
Can you please share again the error you get when the toleration is set? I'll try to investigate it.
@chen-keinan Sure, that would be this:
Events:
  Type     Reason             Age    From                Message
  ----     ------             ----   ----                -------
  Warning  FailedScheduling   4m32s  default-scheduler   0/11 nodes are available: 2 node(s) had untolerated taint {agentpool: connect}, 9 node(s) didn't match Pod's node affinity/selector. preemption: 0/11 nodes are available: 11 Preemption is not helpful for scheduling..
  Normal   NotTriggerScaleUp  4m31s  cluster-autoscaler  pod didn't trigger scale-up: 2 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {agentpool: connect}
As some extra information/summary: we have 2 clusters. All nodepools have a taint agentpool with their nodepool name.
When running with the above configuration on cluster A, no node-collector jobs are generated. When running with the same configuration on cluster B, a job is created with a nodeSelector on a connect node (see above) and a toleration for the default nodepool.
Therefore the error message makes sense: it wants to schedule on a connect node, but that taint is untolerated, and the other nodes don't match the selector.
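The scheduler's arithmetic can be reproduced with a small sketch (node names and counts mirror the error message above, not real cluster data): the selector pins the pod to one connect node, but only the default taint is tolerated, so none of the 11 nodes is feasible.

```go
package main

import "fmt"

// node is a minimal stand-in: its hostname and the value of its
// agentpool taint are all that matter for this scheduling decision.
type node struct {
	hostname string
	taintVal string
}

// feasible: a node fits only if it matches the hostname selector AND
// its taint value is the one the pod tolerates.
func feasible(n node, selectorHost, toleratedVal string) bool {
	return n.hostname == selectorHost && n.taintVal == toleratedVal
}

func main() {
	nodes := []node{
		{"aks-connect-vmss000000", "connect"},
		{"aks-connect-vmss000001", "connect"},
	}
	for i := 0; i < 9; i++ {
		nodes = append(nodes, node{fmt.Sprintf("aks-default-vmss%06d", i), "default"})
	}

	count := 0
	for _, n := range nodes {
		// The selector pins to a connect node, but only agentpool=default
		// is tolerated: the 2 connect nodes fail the taint check and the
		// 9 default nodes fail the selector.
		if feasible(n, "aks-connect-vmss000000", "default") {
			count++
		}
	}
	fmt.Printf("%d/%d nodes are available\n", count, len(nodes)) // 0/11 nodes are available
}
```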
@KevinDW-Fluxys do you mind also sharing your node configuration, so I can try to reproduce it?
kubectl get Node <node name> -o yaml
without exposing sensitive information.
@chen-keinan You can find a redacted version of the nodes below. I think the main thing to do is to have only nodes with a taint, and then try to get the node-collector job to schedule on nodes with a specific taint/label. As said before, we have a cluster with only 1 taint, where no job spawns, and a cluster with both taints, where a job spawns but with a nodeSelector on the node name of the pool where we don't want it.
apiVersion: v1
kind: Node
metadata:
  annotations:
    ...
  labels:
    agentpool: default
  name: aks-default-***-vmss000001
  ...
spec:
  ...
  taints:
    - effect: NoExecute
      key: agentpool
      value: default
status:
  ...
  nodeInfo:
    architecture: amd64
    bootID: ***
    containerRuntimeVersion: containerd://1.7.7-1
    kernelVersion: 5.15.0-1057-azure
    kubeProxyVersion: v1.28.5
    kubeletVersion: v1.28.5
    machineID: ***
    operatingSystem: linux
    osImage: Ubuntu 22.04.4 LTS
    systemUUID: ***
---
apiVersion: v1
kind: Node
metadata:
  annotations:
    ...
  labels:
    agentpool: connect
  name: aks-connect-***-vmss000001
  ...
spec:
  ...
  taints:
    - effect: NoExecute
      key: agentpool
      value: connect
status:
  ...
  nodeInfo:
    architecture: amd64
    bootID: ***
    containerRuntimeVersion: containerd://1.7.7-1
    kernelVersion: 5.15.0-1057-azure
    kubeProxyVersion: v1.28.5
    kubeletVersion: v1.28.5
    machineID: ***
    operatingSystem: linux
    osImage: Ubuntu 22.04.4 LTS
    systemUUID: ***
I found this issue because I had the same or a similar issue, where a node-collector pod was not scheduling to my control-plane (master) node, with an error about no matching tolerations for the node's taint.
This commit (and possibly a restart of the trivy-operator pod) got this sorted and working after a few mins, FWIW.
From the helm configuration,
trivyOperator:
  scanJobTolerations:
    - key: "node-role.kubernetes.io/master"
      operator: "Exists"
@KevinDW-Fluxys any luck with @billimek's suggestion?
@chen-keinan I haven't tried it, since it seems highly unlikely that this will work for us. His use case was getting something to run on the system nodepool, for which he had to tolerate a taint that was on the system nodepool. We are already tolerating our taints.
To be sure it wasn't a remark on the syntax, I have updated my syntax to match his and set it to simply check that the taint exists (which is already a much wider toleration), but with the same result:
scanJobTolerations:
  - key: "agentpool"
    operator: "Exists"
I also want to point out that the vulnerability scan jobs are scheduling as expected and responding to the tolerations as desired. It's only the node-collector that isn't scheduling as expected.
This is because the vulnerability scan-job doesn't care which Node it runs on, and the node collector does; if you unset the node selector flag you'll get the same results.
@chen-keinan As another addition, I noticed the following in the node-collector definition when it spawns:
apiVersion: v1
kind: Pod
metadata:
  name: node-collector-649bbb854f-s5swz
  namespace: trivy-operator
spec:
  volumes:
    ...
  containers:
    - name: node-collector
      image: ***/aquasecurity/node-collector:0.1.2
      command:
        - node-collector
      args:
        - k8s
        - '--node'
        - aks-connect-***-vmss000001
The container is being passed an arg with the name of the (wrong) node
Is the --node value different from the nodeSelector value?
@chen-keinan No, the nodeSelector value is exactly the same. If the nodeSelector is the deciding factor, then the calculation of that one is wrong, since this is a node with an untolerated taint.
If you can point me to where this logic is in the code, I can try to figure it out myself; although I'm not well versed in Go, I could still give it a go :)
Once you define a toleration it will apply to all Nodes trivy-operator reconciles. Are you suggesting to set the toleration only for Nodes with matching taints?
@chen-keinan advised me to have a look at the affinity and tolerations, which might have been conflicting. I have removed the affinity, and now it's working.
For anyone else wondering, my final configuration is:
trivyOperator:
  ...
  # -- scanJobAffinity affinity to be applied to the scanner pods and node-collector
  scanJobAffinity: []
  # -- scanJobTolerations tolerations to be applied to the scanner pods and node-collector so that they can run on nodes with matching taints
  scanJobTolerations:
    - key: "agentpool"
      operator: "Exists"
  # -- scanJobNodeSelector nodeSelector to be applied to the scanner pods so that they can run on nodes with matching labels
  scanJobNodeSelector:
    agentpool: default
This applies to both my scan jobs and node-collectors, but there is a PR which should be merged soon that will separate the two: https://github.com/aquasecurity/trivy-operator/pull/2006
What steps did you take and what happened: I have installed trivy-operator with the helm chart and configured the node collector as follows:
When I do this, the node-collector job does not appear. If I remove the excludeNodes="agentpool=connect", it tries to schedule on those nodes (which I don't want). The trivy-operator pods and scan-vulnerabilities jobs are running on the correct nodes.
My suspicion is that this is caused by the fact that these nodes have taints, and the node-collector does not have a tolerations parameter and does not take the tolerations parameter of the trivy-operator into account.
What did you expect to happen: Node collector jobs are scheduled on the correct nodes.
Anything else you would like to add:
Environment:
trivy-operator version: 0.19.3
kubectl version: 1.28.5