argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0
15.03k stars 3.2k forks source link

v3.5.7+ changes some http response status codes from GET Workflow #13558

Open pmareke opened 1 month ago

pmareke commented 1 month ago

Pre-requisites

What happened? What did you expect to happen?

In the latest images some endpoints change their status responses.

When I call this endpoints:

I start returning a 404 when in previous version (3.4.x) the status code was a 401.

Version(s)

v3.5.7+

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

It's an http call, not related with the workflow itself.

Logs from the workflow controller

Nothing to show.

Logs from in your workflow's wait container

Nothing to show.
agilgur5 commented 1 month ago

Could you give some example API requests to make this easier to reproduce and debug?

As far as I know, the API logic is mostly unchanged in 3.5 other than the Unified Workflows List (i.e. #11121)

pmareke commented 1 month ago

Sure @agilgur5:

It also change the response of the following endpoints:

agilgur5 commented 1 month ago

Some more follow-up questions:

  1. Are the first two also non-existent Workflows?
  2. Do the namespaces exist?
  3. What is the response to the last two? A 200?
  4. Also what auth mode(s) are you using?
pmareke commented 1 month ago
  1. Are the first two also non-existent Workflows?

NO

  1. Do the namespaces exist?

YES

  1. What is the response to the last two? A 200?

Not sure, I could't see, I need to rollback the upgrade as fast as possible 😞.

  1. Also what auth mode(s) are you using?

The Authorization header with the Bearer token.

The most important info I can give you is that without changing nothing on my side a lot of tests, which validates the ArgoWorkflow API responses, start to fail.

Rolling back to the v3.4 version everything back to work again.

agilgur5 commented 1 month ago

NO

Ok, so it's giving a 404 instead of a 401 even when the Workflow exists πŸ€”

Not sure, I could't see, I need to rollback the upgrade as fast as possible 😞.

I imagine you were able to test locally or on a staging/dev env if you tried :latest, so that's where you could repro.

The Authorization header with the Bearer token.

There's two of those; client is Bearer **** and sso is Bearer v2:**** Also you can have multiple enabled, so using one does not preclude the other. That's why I asked, as a response from one auth method could be different in another auth method (see also https://github.com/argoproj/argo-workflows/pull/13372 et al). In particular, server is effectively no auth (if your Server has appropriate RBAC)

Rolling back to the v3.4 version everything back to work again.

Yea I got that, and I did mark this as a regression according to that, I'm just not recalling any intentional change to API responses, so it's likely an unintentional bug somewhere, which is a lot harder to track down 🫠 More information helps with a repro to try to debug that

Joibel commented 1 month ago

My best guess would be this is an unintentional side effect of #13021.

agilgur5 commented 1 month ago

That's v3.5.8+ (#12736 is v3.5.7+) and only affects the list endpoint, vs this is happening on individual workflows, and even on stops. So I would think #11121 would have caused it rather

pmareke commented 1 month ago

Ok, so it's giving a 404 instead of a 401 even when the Workflow exists πŸ€”

Yes.

I imagine you were able to test locally or on a staging/dev env if you tried :latest, so that's where you could repro.

Yes, sure. I'll update again run the tests and give you more info.

There's two of those; client is Bearer **** and sso is Bearer v2:****

In my case I'm using the Bearer **** one.

I'm just not recalling any intentional change to API responses, so it's likely an unintentional bug somewhere, which is a lot harder to track down 🫠

Yes, I was taking a look to the latest releases notes and I didn't see too any change related πŸ˜”.

pmareke commented 1 month ago

Ok so in the following endpoints:

The response now does not include the workflow name as it did in previous versions:

image

But I don't think this is nothing to worry about, my tests are validating the text error and that's an implementation detail and I need to change it.

The changes then are related with:

Taking a look to the code I have the feeling that maybe the changes comes from the cli, does it makes any sense to you?

agilgur5 commented 1 month ago

I have the feeling that maybe the changes comes from the cli, does it makes any sense to you?

No. The CLI doesn't affect responses from the API, it's just a client.

agilgur5 commented 1 month ago

Ok after digging through the blame for the GetWorkflow method, pretty sure I found the root cause. My intuition was roughly correct; there was a precursor to #11121, which was https://github.com/argoproj/argo-workflows/pull/11055 and #11674. Those two PRs mean archive unification apparently affected GetWorkflow too and the logic there is a little incorrect with the archive fallback.

I'll follow-up more tomorrow and this week, there might be some other bugs related to that as well.

agilgur5 commented 1 month ago

Correction, #11674 was correct, but #13021 / #12736 indeed made some alterations to the same exact place apparently, these lines specifically. This does mean it affects v3.5.7+ specifically

I should have a fix out soon