ndejong opened 11 months ago
maybe caused by https://github.com/argoproj/argo-workflows/pull/11781 ?
cc @shmruin Would you like to take a look?
A possible workaround may be to use an expr expression, i.e. {{=steps.foobar.outputs.artifacts['main-logs']}}
Yes, this is because of #11781. It seems that starting to check resolveAllVariables causes the problem.
The original code didn't check this because, in checkValidWorkflowVariablePrefix, the untrimmed ` steps.` does not match `steps.`, so resolveAllVariables just returned nil without error (that check looks for the global variable prefix).
But with the trimming change, as #11871 does, the tag starts being checked, and steps.foobar.outputs.artifacts.main-logs does not pass it. I verified this with some small print statements, shown below.
Bad Request: templates.entrypoint-handler.steps failed to resolve {{ steps.foobar.outputs.artifacts.main-logs }}, scope is map[steps.foobar.exitCode:true steps.foobar.finishedAt:true steps.foobar.hostNodeName:true steps.foobar.id:true steps.foobar.outputs.result:true steps.foobar.startedAt:true steps.foobar.status:true steps.workflow-processor.status:true workflow.annotations:placeholder-5 workflow.annotations.json:placeholder-6 workflow.labels:placeholder-7 workflow.labels.json:placeholder-8 workflow.mainEntrypoint:placeholder-2 workflow.name:placeholder-0 workflow.namespace:placeholder-1 workflow.serviceAccountName:placeholder-3 workflow.status:placeholder-9 workflow.uid:placeholder-4], globalParam is map[workflow.annotations:placeholder-5 workflow.annotations.json:placeholder-6 workflow.labels:placeholder-7 workflow.labels.json:placeholder-8 workflow.mainEntrypoint:placeholder-2 workflow.name:placeholder-0 workflow.namespace:placeholder-1 workflow.serviceAccountName:placeholder-3 workflow.status:placeholder-9 workflow.uid:placeholder-4], tag is ' steps.foobar.outputs.artifacts.main-logs ', trimmedTag is 'steps.foobar.outputs.artifacts.main-logs'
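As a toy model of the behavior described above (a hypothetical simplification; the names and structure are illustrative, not Argo's actual resolveAllVariables implementation), the trimming change turns a silently skipped check into a failing one:

```go
package main

import (
	"fmt"
	"strings"
)

// resolvable is a toy model of the scope check discussed above.
func resolvable(tag string, trim bool, scope map[string]bool) bool {
	if trim {
		tag = strings.TrimSpace(tag)
	}
	// Tags that don't carry a known prefix were previously ignored
	// (treated as resolvable) rather than validated against the scope.
	if !strings.HasPrefix(tag, "steps.") {
		return true
	}
	return scope[tag]
}

func main() {
	scope := map[string]bool{"steps.foobar.outputs.result": true}
	tag := " steps.foobar.outputs.artifacts.main-logs " // note surrounding spaces

	// Before trimming: the untrimmed tag doesn't match the "steps."
	// prefix, so validation is skipped and the tag "passes".
	fmt.Println(resolvable(tag, false, scope)) // true

	// After trimming: the prefix matches, the tag is looked up in the
	// scope, is absent, and validation fails.
	fmt.Println(resolvable(tag, true, scope)) // false
}
```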
It looks like ...artifacts does not exist in the scope (but workflow.outputs.artifacts. does exist).
So steps.foobar.outputs.artifacts.main-logs is in neither the scope nor globalParams; it falls through the if-else cases, nothing matches, and that produces the error.
Does this mean the case checks in resolveAllVariables should be updated? I'd like to know the proper validation case under which something like steps.foobar.outputs.artifacts.main-logs should be handled.
Thanks for the in-depth investigation @shmruin!
Does it mean that the case checks of resolveAllVariables should be updated? I want to know proper validation check case that something like steps.foobar.outputs.artifacts.main-logs should be placed.
It seems like the actual execution has been allowing this for some time, so it indeed sounds like resolveAllVariables needs updating. Basically the validation did not match the actual execution.
There might be some other cases like this one that were missed due to the lack of trimming 😕 Basically one bug was hiding a few more validation bugs underneath.
It looks like the 'artifacts' keyword of outputs is quite special. Looking at the log again, it is not in the 'scope' but should work as a keyword. Maybe I can just generalize the 'workflow.outputs.artifacts.' prefix case to something like '*.outputs.artifacts'.
@agilgur5 , is main-log dealt with as some artifact without explicitly using outputs?
I checked that this error happens because the foobar step has no outputs.artifacts keyword.
If it is written like below, then that template works.
- name: foobar
  inputs:
    parameters:
      - { name: target }
  container:
    image: docker/whalesay
    command: [ "cowsay" ]
    args: [ "{{ inputs.parameters.target }}" ]
  outputs:
    artifacts:
      - name: main-log
        path: SOME_MAIN_LOG_PATH
See addOutputsToScope.
If this is the only such case, then I think we can make main-log pass the validation even when the outputs keyword is not used.
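As a rough sketch of the idea (hypothetical; the stand-in types and the real addOutputsToScope in Argo's validation code differ), declared output artifacts are registered in the scope under the step prefix, which is why explicitly declaring the artifact makes the reference validate:

```go
package main

import "fmt"

// Artifact, Outputs, and Template are simplified stand-ins for Argo's types.
type Artifact struct{ Name string }
type Outputs struct{ Artifacts []Artifact }
type Template struct{ Outputs Outputs }

// addOutputsToScope registers each declared output artifact under the
// step prefix, e.g. steps.foobar.outputs.artifacts.main-log.
func addOutputsToScope(prefix string, tmpl Template, scope map[string]bool) {
	for _, art := range tmpl.Outputs.Artifacts {
		scope[fmt.Sprintf("%s.outputs.artifacts.%s", prefix, art.Name)] = true
	}
}

func main() {
	scope := map[string]bool{}
	tmpl := Template{Outputs: Outputs{Artifacts: []Artifact{{Name: "main-log"}}}}
	addOutputsToScope("steps.foobar", tmpl, scope)
	// The explicitly declared artifact is now resolvable during validation.
	fmt.Println(scope["steps.foobar.outputs.artifacts.main-log"]) // true
}
```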
@agilgur5 , is main-log be dealt with some artifact without explicitly using output?
oh I actually missed that, good catch!
If so, this isn't documented anywhere I could find. That may be how archive logs function? @ndejong are you using archive logs to store your logs as artifacts?
Thanks for the time and attention on this -
@shmruin - unless I'm misunderstanding something here, I don't see how this works:

outputs:
  artifacts:
    - name: main-log
      path: SOME_MAIN_LOG_PATH

By using the .main-logs approach, as per our use case, we are accessing the terminal output of the Argo Workflow itself, not some artifact that is written within the container as this seems to suggest.
@agilgur5 - yes we are picking up other artifacts from both steps in our workflows; our ability to access the .main-logs in the later step(s) allows us to make decisions based on the terminal output(s) from the earlier step.
Is there some other way to access the Argo Workflow terminal output (i.e. .main-log) of a step in a later step that we are not understanding here?
@agilgur5 - yes we are picking up other artifacts from both steps in our workflows, our ability to access the .main-logs in the later step(s) allows us to make decisions based on the terminal output(s) from the earlier step.
Is this specifically via the archive logs feature?
If so, this isn't mentioned in your issue or explicitly used in the Workflows you provided -- could you also provide your workflow-controller-configmap and/or your artifact-repositories ConfigMap?
I imagine that you have an archive logs configuration there
Is there some other way to access the Argo Workflow terminal-output (ie .main-log) of a step in a later-step that we are not understanding here?
The documented way of doing this is accessing through outputs.result.
As far as I can currently tell, main-log seems to be undocumented behavior of the archive logs feature that you're making use of. There is no explicit reference to that in the entire codebase -- I believe that just happens to be what your archive logs artifact is named in your global or namespace ConfigMap rather than a specific keyword that Argo uses
Is this specifically via the archive logs feature? If so, this isn't mentioned in your issue or explicitly used in the Workflows you provided -- could you also provide your workflow-controller-configmap and/or your artifact-repositories ConfigMap? I imagine that you have an archive logs configuration there
Yes I see, it had not occurred to me this was a possibility here; yes, we use a local minio instance to provide temporary artifact storage in between steps; indeed the defaults section of our workflow-controller-configmap contains archiveLogs: true.
Our workflow-controller-configmap, trimmed down:
workflowDefaults:
  spec:
    activeDeadlineSeconds: 900
    archiveLogs: true
    securityContext:
      runAsGroup: 2888
      runAsNonRoot: true
      runAsUser: 2888
    serviceAccountName: argo-workflow-sa
    ttlStrategy:
      secondsAfterCompletion: 10800
... and our artifact-repositories:
s3:
  bucket: artifacts
  endpoint: minio.argo.svc.cluster.local:9000
  accessKeySecret:
    name: minio-credentials
    key: accesskey
  secretKeySecret:
    name: minio-credentials
    key: secretkey
There are some things that don't quite align though -
- We do not use the main-logs name anywhere in our codebase either.
- main-logs does appear in the Argo Workflow codebase for various _tests though (NB: the "s" in log"s").
- main-logs is somehow a default (from where?) that gets applied when we set archiveLogs: true without supplying a name in workflow-controller-configmap, as is the case with our setup.

I should also describe then:
- main-logs does not seem to appear in the argo-helm repo either.
- We discovered this technique of accessing main-logs after noticing it was referenced in the INPUTS/OUTPUTS tab of the related step in the Argo Workflows user interface, and followed that through until we discovered it was available as an artifact - we were not aware this was undocumented, it just made sense at the time.
The reason we use this main-logs artifact approach is that the parameter-based outputs.result approach fails when terminal outputs contain anything binary.
See the updated example below (on v3.4.11) that fails if you uncomment the 2x parameters items - it seems that the outputs.result parameter-passing approach will fail with a MESSAGE of cannot finish template replacement because the result was invalid JSON
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: bugreport-20231023a
spec:
  entrypoint: entrypoint-handler
  templates:
    - name: entrypoint-handler
      steps:
        - - name: step1
            template: step1-template
        - - name: step2
            template: step2-template
            arguments:
              artifacts: [ { name: artifact_from_step1, from: "{{ steps.step1.outputs.artifacts.main-logs }}" } ]
              # parameters: [ { name: parameter_from_step1, value: "{{ steps.step1.outputs.result }}" } ]
    - name: step1-template
      container:
        image: debian:stable-slim
        command: ["/bin/bash", "-c"]
        args:
          - |
            echo "Binary output to STDOUT"
            head -c256 /dev/urandom
            echo "Binary output to STDERR"
            >&2 head -c256 /dev/urandom
    - name: step2-template
      inputs:
        artifacts: [ { name: artifact_from_step1, path: "/tmp/artifact_from_step1" } ]
        # parameters: [ { name: parameter_from_step1 } ]
      container:
        image: debian:stable-slim
        command: ["/bin/bash", "-c"]
        args:
          - |
            cat "/tmp/artifact_from_step1"
To boil all that down into a summary:
- We use the main-logs artifacts approach because we discovered that we could not safely pass logs from one step to another using the parameter-based outputs.result approach.

Thanks for digging in and providing that detail @ndejong!
Since you confirmed it wasn't in your codebase but that you did have archiveLogs enabled, I searched a bit harder and eventually found the line in Argo's codebase where this is created. It's dynamically created as containerName + "-logs", so it did not quite pop up in a search for main-logs (and the tests had explicit name fields too) 😅. init, main, and wait are the default names for Argo's Executor containers (as you may have noticed).
@shmruin we may need to add GetMainContainerNames() + -logs to the validation logic when archiveLogs is turned on. I'm not sure if that requires any run-time knowledge or can be statically defined though.
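To make the naming concrete, here is a small sketch of the derivation described above (the container-name list follows the defaults named in this thread; GetMainContainerNames() and the exact derivation live in Argo's codebase):

```go
package main

import "fmt"

func main() {
	// Default executor container names in Argo Workflows; the archived-log
	// artifact name is derived as containerName + "-logs".
	for _, containerName := range []string{"init", "main", "wait"} {
		fmt.Println(containerName + "-logs")
	}
	// The user's workflow logs live in the "main" container, hence the
	// otherwise hard-to-search-for "main-logs" artifact name.
}
```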
This is still relying on undocumented behavior, so I am a little hesitant to include this, but it is currently a regression since that behavior was allowed before (despite being undocumented). @terrytangyuan any thoughts on allowing the undocumented behavior? I'm leaning toward allowing it
Is this "just" a matter of documentation or does it run deeper than that? (sorry, I'm not across the Argo code base)
As described above, we discovered the documented parameter outputs.result approach is not appropriate for our use case, whereas our funky "reach-back" artifact-based logs approach addressed our problem and reduced the scope for unexpected injection hazards in subsequent steps.
Would be happy to create documentation if that's all that's required.
Is this "just" a matter of documentation or does it run deeper than that? (sorry, I'm not across the Argo code base)
it would be deeper than that. it may have never really been intended to be exposed to users, I'm not sure. archiveLogs creates an artifact under-the-hood -- and you happened to figure out and use that internal functionality.
it being "undocumented behavior" is less that it lacks documentation and more that it may never have been a documented "feature" to begin with. as in, it may have just been intended as an internal, and not a user-facing feature.
you're currently using it as a feature, so there's a decision to be made here -- should we support that because it happened to be possible before? or should we not support that as that's an internal and was never actually part of the spec? meaning that your prior usage was unintended behavior.
That's why I tagged Terry as one of the current Leads of Workflows. I lean toward allowing that behavior, but would defer to Terry on a decision there.
Pre-requisites
What happened/what you expected to happen?
We have workflow templates that reference, as argument artifact inputs, the .main-logs output artifact from a previous step. This arrangement had been working well until a recent upgrade from v3.4.11 to v3.4.12 (the same also occurs when upgrading to v3.5.x).
The HTTP 400 error response content looks like this:
The supplied sample workflow will work fine in v3.4.11 and prior but fail in v3.4.12 (and v3.5.x)
Version
v3.4.12
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
Logs from your workflow's wait container