Open pmareke opened 1 month ago
Could you give some example API requests to make this easier to reproduce and debug?
As far as I know, the API logic is mostly unchanged in 3.5 other than the Unified Workflows List (i.e. #11121)
Sure @agilgur5:
/workflows/{self.namespace}/{workflow_name}
sending an invalid Bearer token -> 401 in v3.4.x and 404 in v3.5.x/workflows/{self.namespace}/{workflow_name}
sending a valid Bearer token without enough permissions -> 401 in v3.4.x and 404 in v3.5.xIt also change the response of the following endpoints:
/workflows/{self.namespace}/{workflow_name}/resume
with a valid token and a non existing workflows -> 404 in v3.4.x/workflows/{self.namespace}/{workflow_name}/stop
with a valid token and a non existing workflows -> 404 in v3.4.xSome more follow-up questions:
- Are the first two also non-existent Workflows?
NO
- Do the namespaces exist?
YES
- What is the response to the last two? A 200?
Not sure, I could't see, I need to rollback the upgrade as fast as possible π.
- Also what auth mode(s) are you using?
The Authorization header with the Bearer token.
The most important info I can give you is that without changing nothing on my side a lot of tests, which validates the ArgoWorkflow API responses, start to fail.
Rolling back to the v3.4 version everything back to work again.
NO
Ok, so it's giving a 404 instead of a 401 even when the Workflow exists π€
Not sure, I could't see, I need to rollback the upgrade as fast as possible π.
I imagine you were able to test locally or on a staging/dev env if you tried :latest
, so that's where you could repro.
The Authorization header with the Bearer token.
There's two of those; client
is Bearer ****
and sso
is Bearer v2:****
Also you can have multiple enabled, so using one does not preclude the other. That's why I asked, as a response from one auth method could be different in another auth method (see also https://github.com/argoproj/argo-workflows/pull/13372 et al). In particular, server
is effectively no auth (if your Server has appropriate RBAC)
Rolling back to the v3.4 version everything back to work again.
Yea I got that, and I did mark this as a regression according to that, I'm just not recalling any intentional change to API responses, so it's likely an unintentional bug somewhere, which is a lot harder to track down π« More information helps with a repro to try to debug that
My best guess would be this is an unintentional side effect of #13021.
That's v3.5.8+ (#12736 is v3.5.7+) and only affects the list endpoint, vs this is happening on individual workflows, and even on stops. So I would think #11121 would have caused it rather
Ok, so it's giving a 404 instead of a 401 even when the Workflow exists π€
Yes.
I imagine you were able to test locally or on a staging/dev env if you tried
:latest
, so that's where you could repro.
Yes, sure. I'll update again run the tests and give you more info.
There's two of those;
client
isBearer ****
andsso
isBearer v2:****
In my case I'm using the Bearer ****
one.
I'm just not recalling any intentional change to API responses, so it's likely an unintentional bug somewhere, which is a lot harder to track down π«
Yes, I was taking a look to the latest releases notes and I didn't see too any change related π.
Ok so in the following endpoints:
PUT /workflows/{self.namespace}/{workflow_name}/resume
with a valid token and a non existing workflow.PUT /workflows/{self.namespace}/{workflow_name}/stop
with a valid token and a non existing workflow.The response now does not include the workflow name as it did in previous versions:
But I don't think this is nothing to worry about, my tests are validating the text error and that's an implementation detail and I need to change it.
The changes then are related with:
GET/workflows/{self.namespace}/{workflow_name}
sending an invalid Bearer token -> 401 in v3.4.x and 404 in v3.5.xGET /workflows/{self.namespace}/{workflow_name}
sending a valid Bearer token without enough permissions -> 401 in v3.4.x and 404 in v3.5.xTaking a look to the code I have the feeling that maybe the changes comes from the cli, does it makes any sense to you?
I have the feeling that maybe the changes comes from the cli, does it makes any sense to you?
No. The CLI doesn't affect responses from the API, it's just a client.
Ok after digging through the blame
for the GetWorkflow
method, pretty sure I found the root cause. My intuition was roughly correct; there was a precursor to #11121, which was https://github.com/argoproj/argo-workflows/pull/11055 and #11674. Those two PRs mean archive unification apparently affected GetWorkflow
too and the logic there is a little incorrect with the archive fallback.
I'll follow-up more tomorrow and this week, there might be some other bugs related to that as well.
Correction, #11674 was correct, but #13021 / #12736 indeed made some alterations to the same exact place apparently, these lines specifically. This does mean it affects v3.5.7+ specifically
I should have a fix out soon
Pre-requisites
:latest
image tag (i.e.quay.io/argoproj/workflow-controller:latest
) and can confirm the issue still exists on:latest
. If not, I have explained why, in detail, in my description below.What happened? What did you expect to happen?
In the latest images some endpoints change their status responses.
When I call this endpoints:
I start returning a 404 when in previous version (3.4.x) the status code was a 401.
Version(s)
v3.5.7+
Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.
Logs from the workflow controller
Logs from in your workflow's wait container