I am also facing the same issue.
We are using Argo Workflows v3.4.0 (chart version 0.18.0). Our Kubernetes version is 1.25.1.
This is a very old version of Argo (and an EoL version of k8s). I would try with the latest patch, i.e. 3.4.17.
all versions after 3.4
It's not clear that you tested any newer versions since you did not specify them at all.
Please check with priority
Without confirmation that it was tested at least on 3.4.17 and, per the issue template, on :latest, it will be auto-closed.
It would still be a P3, given how few users have not yet made the upgrade, as 3.3 has been unsupported for some time now.
We are using annotations to send logs to fluentbit, and Vault is also running as a sidecar container. When we upgraded from 3.3.2 to 3.4.0, we observed a strange issue: logs are no longer visible in Elasticsearch/Kibana.
I also can't reproduce this, and many users, including myself, use fluentbit with Argo >3.4, as it's the de facto standard log forwarder on k8s clusters.
image: 123456789.dkr.ecr.ap-south-1.amazonaws.com/test-migration:1.0.97886
This image also looks private, which is counter to the issue template, and would not be reproducible. Your other image, my-internal-service, also looks to be a replacement stub/mock as well as private. That would be doubly not reproducible.
There's also a lot going on there, much more than necessary for a minimal reproduction. Please provide a minimal reproduction with a public image that reproduces on at least 3.4.17.
Also, when I stop the Vault sidecar container, the logs come through
This would also make it lower priority given that it only happens in a certain scenario, affecting an even smaller group of users.
Priority is partly subjective, but largely based on impact to the userbase, which in your case, is quite segmented.
@agilgur5 Thanks for responding. We also tested with version 3.5.8; it has the same behaviour: we are not getting logs. We tried all versions starting from 2.12.9 to check where it stopped sending logs. We observed that Argo was working fine and we were able to get logs until v3.4.0. That means something added in 3.4.0 and later is causing the issue, and Emissary is the major change I observed starting from v3.4.0. We already tested on the latest version (3.5.8), but due to this strange behaviour we have rolled it back to 3.3.2 in our production environment.
I agree most users are using Elastic. In my scenario, I am using Elastic as well as Vault as a sidecar container. You can see the annotations. The images are private, but you can take any image. The only prerequisite is to have an EFK (fluentbit/Kibana) and Vault setup so that you can use the annotations like I did.
Let me know if you want more info on the same.
We also tested with version 3.5.8; it has the same behaviour: we are not getting logs. We tried all versions starting from 2.12.9 to check where it stopped sending logs.
Ok, can you reference that in the "Version" section? Or elsewhere in your description.
That means something added in 3.4.0 and later is causing the issue, and Emissary is the major change I observed starting from v3.4.0.
Yes, your observation is correct: Emissary runs an initContainer, a sidecar, and as a parent process of your main container's command as well. So it certainly can affect something like this in theory.
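For reference, a pod created by Emissary looks roughly like this sketch (abridged and simplified; exact images, args, and mounts vary by version):

```yaml
# Simplified sketch of an Emissary-created pod, not a complete manifest
spec:
  initContainers:
    - name: init                # injected by Argo: stages the argoexec binary into /var/run/argo
      image: quay.io/argoproj/argoexec:v3.4.17
      command: [argoexec, init]
  containers:
    - name: wait                # injected by Argo: sidecar that monitors the main container and reports status
      image: quay.io/argoproj/argoexec:v3.4.17
      command: [argoexec, wait]
    - name: main                # your container, with its command wrapped by the emissary parent process
      image: nginx
      command: [/var/run/argo/argoexec, emissary, --, echo, hello]
```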
Given that this hasn't been reported previously though (despite 3.4 being out for almost 2 years now and Emissary even longer), I imagine it is some specific combination of your services that causes this issue.
To be clear though, Emissary was available since Argo 3.1, was the default in 3.3, and then was the only executor in 3.4+. I assume you were using another executor in your config prior to the 3.4 upgrade? The config you provided here does not list anything, so if you were using that same config in 3.3, you'd be using Emissary in 3.3 as well.
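For reference, in 3.3 the executor was selected via the workflow-controller ConfigMap; a sketch of what such a pre-3.4 config would have looked like (the key was removed entirely in 3.4+):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
data:
  # Selected k8sapi/kubelet/pns/docker in <= 3.3; if absent, 3.3 defaulted to emissary.
  # This key no longer exists in 3.4+, where emissary is the only executor.
  containerRuntimeExecutor: k8sapi
```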
The images are private, but you can take any image.
Can you please provide a minimal reproduction with a public image, as the issue template asks?
The only prerequisite is to have an EFK (fluentbit/Kibana) and Vault setup so that you can use the annotations like I did.
I don't think Elastic or Kibana are relevant here, as those are not impacted by Argo and are post-processors after fluentbit. I would imagine that your logs are not getting parsed by fluentbit or aren't even making it to stdout. Can you please debug and confirm which one it is? Again, that is important for a minimal reproduction.
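As a starting point, something like the following would show whether the output reaches stdout at all (assuming the standard Emissary container names):

```sh
# If these print your application output, Argo delivered the logs to stdout
# and the problem is downstream in fluentbit parsing/forwarding.
kubectl logs <my-pod> -c main
kubectl logs <my-pod> -c wait

# Or via the Argo CLI:
argo logs <my-workflow> -c main
```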
You also did not provide the logs of the main or wait containers, although you said there were "no issues" in the logs. That would suggest that Argo passed the logs to stdout fine and the issue is actually somewhere in your other services. For instance, if your config is not handling Argo's initContainer or wait container.
On 3.3.2, we are using the k8sapi container runtime executor. I will try to provide the workflow and logs; allow me some time, and I will share evidence that I tried 3.5.8. The following questions still come to mind:
1. In version 3.3.2, I have Elastic and the Vault agent as sidecars. How is everything working as expected, with logs coming through?
2. If I move to v3.5.8 with the same Vault and Elastic setup, Vault does its job without issue, but why do the logs stop?
3. On the same 3.5.8, if we say the issue is with fluentbit, then when I intentionally stop my Vault injection (not allowing the Vault sidecar to run), how are the logs coming through?
4. I have a number of jobs using Workflows, but why are only the jobs which use a WorkflowTemplate affected? I mean to say, jobs which use only a CronWorkflow (without a WorkflowTemplate) are working fine; I am able to get logs and Vault is also doing its job.
Something is happening between Emissary, WorkflowTemplate, and sidecars. A mystery!
3. then when I intentionally stop my Vault injection (not allowing the Vault sidecar to run), how are the logs coming through?
Yea I don't know, that's odd that it affects it. If I had to guess, maybe the wait container or something didn't get properly added to the Pod, since it is also a sidecar? I.e. that the problem is not with the Emissary process itself, but with the Pod being created. It might also be timing-related, since Vault uses sidecar injection?
The output of kubectl get pod <my-pod> -o yaml might be helpful in checking that.
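For example, a quick way to list which containers actually made it into the Pod spec (a sketch; substitute your pod name):

```sh
# You'd expect Argo's "init" among the init containers and "wait" alongside
# "main" in the regular containers, plus Vault's injected containers.
kubectl get pod <my-pod> -o jsonpath='{.spec.initContainers[*].name}{"\n"}{.spec.containers[*].name}{"\n"}'
```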
4. jobs which use only a CronWorkflow (without a WorkflowTemplate) are working fine; I am able to get logs and Vault is also doing its job.
That's bizarre... Are you sure they're otherwise exactly the same? Using a WorkflowTemplate has no effect on the created Pod itself (it only affects the DAG logic in the Controller, which occurs prior to Pod creation). Maybe your fluentbit config excludes certain labels that your WorkflowTemplate adds?
If the same Elastic, the same Vault agent, and the same WorkflowTemplates work in version 3.3, upgrading to 3.5 should not cause an issue, but the issue appears only after the upgrade.
One more thing I observed: if we remove the initContainers section, logs are visible in Kibana. I think if the same set of my services works fine in 3.3 and not in the latest version, it means something in the upgraded version is causing issues with the sidecar or init container.
Check the document here; it says something about Emissary. Attaching the same snapshot for reference.
https://blog.argoproj.io/sunsetting-docker-executor-support-in-argo-workflows-77d352d601c
upgrading to 3.5 should not cause an issue, but the issue appears only after the upgrade.
I'm not sure why you're repeating yourself. Please answer all my previous questions and provide the output, logs, and reproductions asked for, which are necessary to debug. Without those essential debugging pieces, there is nothing actionable we can do. Not answering debugging questions from maintainers is not productive.
Please note that I am an unpaid volunteer maintainer and that reading and responding to all your comments takes time. Please try to make that an effective and efficient use of time.
something in the upgraded version is causing issues with the sidecar or init container
Yes, I said that may be the case above as well. I also said twice before that it could very well be your fluentbit config not picking up certain Pods or containers, since you appear to have logs in stdout fine. Given that other fluentbit users have not faced this before, that would be my prime suspicion.
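For example, an exclusion like the following in a tail input would silently skip those pods before any filter even runs (a hypothetical snippet, not your actual config):

```
[INPUT]
    Name          tail
    Path          /var/log/containers/*.log
    # A pattern like this would drop Argo pod logs at the source
    Exclude_Path  /var/log/containers/argo-*.log
```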
https://blog.argoproj.io/sunsetting-docker-executor-support-in-argo-workflows-77d352d601c
I know how Emissary works, and this repo contains its source code, so I'm not sure why you're linking a 3 year old notice saying to upgrade to it.
Here's a checklist of things needed and questions unanswered:

- [ ] A minimal reproduction with a public image
- [ ] "The output of kubectl get pod <my-pod> -o yaml might be helpful in checking that"
- [ ] "Maybe your fluentbit config excludes certain labels that your WorkflowTemplate adds?"

We cannot proceed without these.
I appreciate your time on this. Earlier, we were using the k8sapi executor. When we switched to 3.5.8, we removed it from the ConfigMap, as Emissary is the default one.
WorkflowTemplate:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: test-migration-v1-template
spec:
  templates:
    - name: test-migration-v1-template
      container:
        command:
          - echo
          - hello
        image: nginx
        name: test-migration-v1
      initContainers:
        - name: test-config-client
          image: busybox:latest
      metadata:
        annotations:
          co.elastic.logs/enabled: "true"
          co.elastic.logs/json.add_error_key: "true"
          co.elastic.logs/json.keys_under_root: "true"
          co.elastic.logs/json.message_key: log
          fluentbit.job.name: job-test-migration-v1
          vault.hashicorp.com/agent-image: hashicorp/vault:1.6.2
          vault.hashicorp.com/agent-inject: "true"
          vault.hashicorp.com/agent-inject-secret-secrets.txt: kv/apps/perf/test-migration
          vault.hashicorp.com/agent-inject-template-secrets.txt: |
            {{ with secret "kv/apps/dev/test-migration" }}{{ range $k, $v := .Data.data }} {{ $k }}={{ $v }}{{ end }}{{ end }}
          vault.hashicorp.com/role: perf
  securityContext:
    fsGroup: 65534
  serviceAccountName: test-migration
```
Create CronWorkflow:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: argo-job-test-migration-v1
spec:
  schedule: '* * * * *'
  workflowSpec:
    entrypoint: dag
    templates:
      - name: dag
        dag:
          tasks:
            - name: test-details-store
              templateRef:
                name: test-migration-v1-template
                template: test-migration-v1-template
```
Create service account:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: test-migration
```
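To run this reproduction, the manifests above would be applied and the CronWorkflow triggered manually (a sketch, assuming the argo CLI and hypothetical file names):

```sh
kubectl apply -f workflowtemplate.yaml -f cronworkflow.yaml -f serviceaccount.yaml

# Trigger a run immediately instead of waiting for the schedule:
argo submit --from cronwf/argo-job-test-migration-v1

# Watch the resulting pods come up:
kubectl get pods -l workflows.argoproj.io/workflow --watch
```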
argo job pod - manifest (kubectl describe pod)

Note: it doesn't matter whether the pod is running or not; I expect the logs to come even if the pod failed. Also, when I stop the init container, I start to get logs for the pod even when it has failed. Reproducing the issue is simple: you need some init container and a sidecar container, and the rest of the template can be created around them. I already shared the sample templates with which I reproduced it.
That is better, but you only completed a single one of the checkboxes I listed out for you above.
This reproduction is better, but still not minimal:

- It includes output-only fields: creationTimestamp, generation, resourceVersion, uid, labels, annotation, and status. Some of these will make it impossible to apply directly.
- It includes empty {} stanzas, unused volumes + volumeMounts (they do nothing in your manifest), and unused env.
- It includes synchronization, concurrencyPolicy, activeDeadlineSeconds, ttlStrategy, and an input that can just be hard-coded.
- The ServiceAccount only needs permissions for Argo (workflowtaskresults) and Vault (depending on your set-up).

Minimal means as small as possible. The minimum necessary to reproduce, with nothing extra.
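For illustration, a repro stripped down along those lines might look something like this (an untested sketch reusing names from the manifests above; the two Vault annotations are the minimum to trigger sidecar injection):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-migration-min-
spec:
  entrypoint: main
  serviceAccountName: test-migration
  templates:
    - name: main
      metadata:
        annotations:
          vault.hashicorp.com/agent-inject: "true"
          vault.hashicorp.com/role: perf
      container:
        image: nginx
        command: [echo, hello]
```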
I made several edits to your comment, please read them and note the differences. Please note how they substantially shrink the reproduction. You also had some poor formatting, please see the multiple significant edits I made to your comment. Those edits take a lot of time and are things you could have done yourself initially. In the future, when asking for assistance or reporting an issue to any OSS repo, please make those yourself.
You also did not answer my two fluentbit-related questions, which I suspect are critical to the root cause of this issue. If your output makes it to stdout, then it is likely not Argo-related, but specific to your fluentbit config.
argo job pod - manifest
This isn't a manifest; this is the output of kubectl describe pod, whereas I had asked for kubectl get pod -o yaml.
This may suffice though. The init, main, and wait containers are all there, as is your additional initContainer and your vault-agent-init and vault-agent containers.
Although this seems to be a running Pod, not a completed one.
Also, when I stop the init container, I start to get logs for the pod even when it has failed.
Does the init container not run to completion on its own? Is it always stuck? That would be a very important, critical detail that you did not previously mention.
From the events, it looks like it does not finish, since it started 109s ago.
That would be the issue if so (not log propagation); there are no logs because the Pod is stuck and all containers haven't run. The question would then be why the vault-agent-init container is stuck. fluentbit and logs as a whole would be entirely unrelated.
That container is not running Emissary as a parent process though, so it's purely running Vault code.
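One way to check where it is stuck (a sketch; requires jq):

```sh
# Shows each init container's state: a stuck vault-agent-init would still be
# "running" long after pod creation, while Argo's "init" should be terminated/Completed.
kubectl get pod <my-pod> -o json | jq '.status.initContainerStatuses[] | {name, state}'
```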
I made several edits to your comment, please read them and note the differences. Please note how they substantially shrink the reproduction. You also had some poor formatting, please see the multiple significant edits I made to your comment. Those edits take a lot of time and are things you could have done yourself initially. In the future, when asking for assistance or reporting an issue to any OSS repo, please make those yourself.
Thank you. It looks good; I have noted this.
You also did not answer my two fluentbit-related questions, which I suspect are critical to the root cause of this issue. If your output makes it to stdout, then it is likely not Argo-related, but specific to your fluentbit config.
During my daytime I will share my Fluent Bit config with you.
Although this seems to be a running Pod, not a completed one.
Yes, it is not a completed one, but I am expecting the failure logs as well. I am able to get the failure logs after removing the initContainers section.
Also, when I stop the init container, I start to get logs for the pod even when it has failed.
Does the init container not run to completion on its own? Is it always stuck? That would be a very important, critical detail that you did not previously mention. From the events, it looks like it does not finish, since it started 109s ago.
I mean to say, if I remove the initContainers section from the template, I get logs. Even if my init container completes, I am unable to get logs. I reproduced the issue this way with public images.
That would be the issue if so (not log propagation); there are no logs because the Pod is stuck and all containers haven't run. The question would then be why the vault-agent-init container is stuck. fluentbit and logs as a whole would be entirely unrelated. That container is not running Emissary as a parent process though, so it's purely running Vault code.
The Vault agent is stuck because it does not have that secret. Just for the reproduction, I did not create the secret; I was able to reproduce without creating the secret in Vault as well. I am aware my pod will not complete because of this. But I do see those failure logs in Kibana if I remove the initContainers section.
Thanks Anton. It's strange! When I removed this annotation, co.elastic.logs/json.keys_under_root: "true", I am getting logs. 🙂
But I am confused here: how does the same set of annotations work in version 3.3?
Do I need to make any logging-related changes in this version? Please let me know if you have any idea, because I am not sure if removing this annotation can cause any other issue.
Fluent Bit config:
```yaml
apiVersion: v1
kind: ConfigMap
data:
  custom_parsers.conf: |
    [PARSER]
        Name docker_no_time
        Format json
        Time_Keep Off
        Time_Key time
        Time_Format %Y-%m-%dT%H:%M:%S.%L
    [PARSER]
        Name cri
        Format regex
        Regex ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<message>.*)$
        Time_Key time
        Time_Format %Y-%m-%dT%H:%M:%S.%L%z
  fluent-bit.conf: |
    [SERVICE]
        Flush 1
        Daemon Off
        Parsers_File parsers.conf
        Parsers_File custom_parsers.conf
        HTTP_Server On
        HTTP_Listen 0.0.0.0
        HTTP_Port 2020
        # Log_Level debug
        storage.metrics on
    [INPUT]
        Name tail
        Path /var/log/containers/*.log
        Exclude_Path /var/log/containers/argo-job-*.log,/var/log/containers/argo-workflow-*.log
        Refresh_Interval 10
        multiline.parser docker, cri
        Tag jio_service.*
        Mem_Buf_Limit 50MB
        Skip_Long_Lines On
    [INPUT]
        Name tail
        Path /var/log/containers/argo-job-*.log
        Exclude_Path /var/log/containers/argo-workflow-*.log
        Refresh_Interval 10
        multiline.parser docker, cri
        Tag argo_job.*
        Mem_Buf_Limit 50MB
        Skip_Long_Lines On
    [INPUT]
        Name tail
        Path /var/log/containers/ingress-*.log
        Exclude_Path /var/log/containers/argo-workflow-*.log
        Refresh_Interval 10
        multiline.parser docker, cri
        Tag ingress.*
        Mem_Buf_Limit 100MB
        Buffer_Chunk_Size 1MB
        Buffer_Max_Size 5MB
        Skip_Long_Lines Off
    [FILTER]
        Name kubernetes
        Match jiao_service.*
        Labels On
        Annotations Off
        Merge_Log On
        Keep_Log On
        K8S-Logging.Parser On
        K8S-Logging.Exclude On
        Kube_Tag_Prefix neo_service.var.log.containers.
    [FILTER]
        Name kubernetes
        Match argo_job.*
        Labels On
        Annotations On
        Merge_Log On
        Keep_Log On
        K8S-Logging.Parser On
        K8S-Logging.Exclude Off
        Kube_Tag_Prefix argo_job.var.log.containers.
    [FILTER]
        Name kubernetes
        Match ingress.*
        Labels On
        Annotations On
        Merge_Log On
        Keep_Log On
        K8S-Logging.Parser On
        K8S-Logging.Exclude On
        Kube_Tag_Prefix ingress.var.log.containers.
    [OUTPUT]
        Name es
        Match jiao_service.*
        Host multi-client
        Port 9200
        HTTP_User elastic
        HTTP_Passwd test@123
        Logstash_Format On
        logstash_prefix_key $kubernetes['namespace_name'].$kubernetes['container_name']
        Retry_Limit 5
        tls off
        tls.verify off
        Suppress_Type_Name On
        Replace_Dots On
        Trace_Error on
        Generate_ID On
    [OUTPUT]
        Name es
        Match argo_job.*
        Host multi-client
        Port 9200
        HTTP_User elastic
        HTTP_Passwd test@123
        Logstash_Format On
        logstash_prefix_key $kubernetes['namespace_name'].$kubernetes['annotations']['fluentbit.job.name']
        Retry_Limit 5
        tls off
        tls.verify off
        Replace_Dots On
        Suppress_Type_Name On
        Generate_ID On
    [OUTPUT]
        Name es
        Alias Ingress
        Match ingress.*
        Host multi-client
        Port 9200
        HTTP_User elastic
        HTTP_Passwd test@123
        Logstash_Format On
        logstash_prefix_key $kubernetes['namespace_name'].$kubernetes['labels']['app.kubernetes.io/name']
        Retry_Limit 5
        tls off
        tls.verify off
        net.keepalive Off
        Suppress_Type_Name On
        Replace_Dots On
        Generate_ID On
```
Thanks Anton. It's strange! When I removed this annotation, co.elastic.logs/json.keys_under_root: "true", I am getting logs. 🙂
This is therefore a problem in fluentbit, and not in Argo Workflows.
Do I need to make any logging-related changes in this version? Please let me know if you have any idea, because I am not sure if removing this annotation can cause any other issue.
I suggest you research or ask around in the elastic/fluentbit community.
Still, one question comes to mind, if you can possibly answer it: if the same annotations work in Argo version 3.3, what is it in Argo Workflows 3.4+ that stops it from sending logs? And interestingly, even without removing the mentioned Elastic annotations, it works if I stop/remove the sidecar or init container in the pod. Thanks!
I don't know. Fluentbit is the one not sending logs, not workflows.
Upon further investigation, I discovered that there has been a change in the metadata (annotations) for WorkflowTemplate CRDs in version 3.4+ of Argo. This change has affected fluentbit's ability to process logs for jobs that specifically utilize a WorkflowTemplate. Additionally, due to the increased pod metadata, fluentbit now requires a larger buffer size (the default is 32KB) to temporarily store log data. The solution for users who face this issue is simply to add Buffer_Size 5MB to the kubernetes filter in the Fluent Bit config.
It's all about metadata; there is no need to remove any annotations or to make any change on the Argo end. Simply increase your buffer size:
```
[FILTER]
    Name kubernetes
    Match argo_job.*
    Labels On
    Annotations On
    Merge_Log On
    Keep_Log On
    Buffer_Size 5MB
    K8S-Logging.Parser On
    K8S-Logging.Exclude Off
    Kube_Tag_Prefix argo_job.var.log.containers.
```
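As a sanity check, you can compare the size of the pod object returned by the API server against the filter's Buffer_Size (a sketch; the kubernetes filter reads that API response into its buffer, default 32k, and discards the metadata if it does not fit):

```sh
kubectl get pod <my-pod> -o json | wc -c
```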
Note: it is neither a Fluent Bit issue nor an Argo issue; it is a case where one should fine-tune the configuration with respect to such changes.
During my daytime I will share my Fluent Bit config with you.
Note that I did not ask for your fluentbit config, I asked you to tick off the checkboxes I wrote out for you above.
Note: it is neither a Fluent Bit issue nor an Argo issue; it is a case where one should fine-tune the configuration with respect to such changes.
Specifically, this answers the 3rd one.
In the future, I would please ask, once again, that you answer questions from maintainers and follow directions on minimal repros and such. Doing so will also make you a better debugger and perhaps even able to root cause and resolve such issues on your own. That would also be a more efficient and effective use of your time as well as maintainer time, which is especially limited.
Pre-requisites

- [ ] I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.

What happened/what did you expect to happen?
We are using Argo Workflows v3.4.0 (chart version 0.18.0). Our Kubernetes version is 1.25.1.
For our applications, we have CronWorkflow and WorkflowTemplate objects. In the WorkflowTemplate, we are using annotations to send logs to fluentbit, and Vault is also running as a sidecar container.
When we upgraded from 3.3.2 to 3.4.0, we observed a strange issue: logs are no longer visible in Elasticsearch/Kibana. Previously it was working. There must be some change w.r.t. Emissary which is causing this trouble. Also, when I stop the Vault sidecar container, the logs come through to Kibana, but it is not possible for us to stop using Vault.
I have shared my CronWorkflow and WorkflowTemplate for reference.
Please check with priority, because this bug is present on all versions after 3.4.
Version
v3.4.0, v3.5.8
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
Logs from your workflow's wait container