argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

`header key "x-geoip-city" contains value with non-printable ASCII characters` #12721

Open lucasfcnunes opened 6 months ago

lucasfcnunes commented 6 months ago

Pre-requisites

What happened/what did you expect to happen?

Description

I'm a developer from Brazil and I'm facing a problem with the Argo Workflows API. Some cities in my country have special characters in their names, such as São Paulo, Belém, or Florianópolis; these letters carry diacritical marks that indicate pronunciation. However, when I send requests to the Argo Workflows API from these cities, I get an HTTP 500 error, meaning the server encountered an unexpected condition that prevented it from fulfilling the request. I suspect this is because the API does not handle non-ASCII characters well. This could also affect other languages that use different alphabets, such as Chinese, Japanese, or Hindi. I would like to know how to fix this issue or whether there is a workaround for it.

PSs:

Error Message

{"code":13,"message":"header key \"x-geoip-city\" contains value with non-printable ASCII characters"}


Version

v3.5.4

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

NA (web)

Logs from the workflow controller

NA (web)

Logs from your workflow's wait container

NA (web)
agilgur5 commented 6 months ago

Workaround would be to exclude Argo when adding this header. Normally this would only be selectively added to upstreams that use it, rather than unconditionally added.

Error Message

Are there any other relevant logs from the Server?

I suspect that this is because the API does not handle non-ASCII characters well.

I'd check if it handles them inside of Workflows or inside of user info or something... although I'm not sure these are valid characters in all contexts

This could also affect other languages that use different alphabets, such as Chinese

We do have quite a few Chinese users as well as a couple of Chinese contributors (including one that works on Alibaba Cloud's Managed Argo offering) that haven't reported this error. Also there is a recent PR for i18n support without such an error: #12682

lucasfcnunes commented 5 months ago

A workaround (on ingress-nginx) is adding the annotation nginx.ingress.kubernetes.io/configuration-snippet to the ingress, with a proxy_set_header directive that sets the x-geoip-city header to something else, as follows:

# values.yaml.gotmpl (helmfile/helm)
# (...)
server:
  enabled: true
  ingress:
    enabled: true
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-production
      external-dns.alpha.kubernetes.io/hostname: &fqdn '{{ .Values.argo_workflows_host }}'
      # TODO: re-add "x-geoip-city" after solving https://github.com/argoproj/argo-workflows/issues/12721
      nginx.ingress.kubernetes.io/configuration-snippet: |-
        proxy_set_header X-GeoIp-City "Unknown";
# (...)

References:

CosmicToast commented 4 months ago

This is caused by https://github.com/grpc/grpc-go/pull/4886. It triggers on some UTF-8 characters, since the validation runs byte-by-byte and multi-byte UTF-8 sequences fall outside printable ASCII. In particular, this is triggered by Tailscale ingress when it reports my name in tailscale-user-name. In this case, it likely triggers because the GeoIP city has UTF-8 characters in it.

By my understanding, there is no formal spec forbidding this usage per se, though the HTTP spec says that new headers SHOULD restrict themselves to US-ASCII printables.
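To make the failure mode concrete, here is a small sketch of the kind of check grpc-go performs on outgoing metadata values (simplified; the function name is mine, not grpc-go's):

```go
package main

import "fmt"

// hasNonPrintableASCII reports whether v contains any byte outside the
// printable ASCII range (0x20..0x7E). grpc-go rejects outgoing metadata
// values that fail a check like this; any multi-byte UTF-8 sequence,
// such as the "ã" in "São Paulo", necessarily trips it.
func hasNonPrintableASCII(v string) bool {
	for i := 0; i < len(v); i++ {
		if v[i] < 0x20 || v[i] > 0x7E {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hasNonPrintableASCII("São Paulo")) // "ã" encodes as two bytes > 0x7E
	fmt.Println(hasNonPrintableASCII("Unknown"))   // plain ASCII passes
}
```

This is why the nginx workaround of replacing the value with "Unknown" avoids the error: the substituted value is plain printable ASCII.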

CosmicToast commented 4 months ago

As an update, it appears that this actually is argo's fault in the end. The abovementioned verification (which is the only possible source for this error) is in grpc-go's sending component.

In order for this HTTP header to make it over to be sent via grpc and the error displayed on the webui, argo must be blindly copying the received HTTP request to create a server-side follow-up. This is further confirmed by being present in both the webui and the cli.

The action item is to find where exactly grpc requests are created and sanitize them whenever user-input is involved. This is actually a security concern because… users can add/change arbitrary headers when they send requests.

lucasfcnunes commented 4 months ago

@agilgur5

Following what @CosmicToast said, maybe we should:

agilgur5 commented 4 months ago

A workaround (on ingress-nginx) is adding the annotation nginx.ingress.kubernetes.io/configuration-snippet to the ingress with the proxy_set_header directives to set the x-geoip-city header to something else

Regarding the workaround, as I wrote above, I think it would make more sense to not send these headers to all upstreams, and rather to only send certain headers when the upstream supports them. I.e. allowlist upstreams instead of sending unconditionally to all upstreams

agilgur5 commented 4 months ago

Thanks @CosmicToast for investigating this!

In order for this HTTP header to make it over to be sent via grpc and the error displayed on the webui, argo must be blindly copying the received HTTP request to create a server-side follow-up.

The action item is to find where exactly grpc requests are created and sanitize them whenever user-input is involved.

Hmm I did find this line: https://github.com/argoproj/argo-workflows/blob/0b625a61a3473f9045d35a6bc0ae8a99db73e2ca/server/apiserver/argoserver.go#L368

That is unconditional, but it is a mapping from request headers to gRPC client metadata, so I would think only certain headers apply. And those are request headers, not response headers. That seems to have been added in #3488 for webhook support
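For context, an unconditional mapping like that line is equivalent to installing an incoming-header matcher that accepts everything. A sketch (the function name is mine; grpc-gateway calls such a matcher once per header key, and returning true forwards the header as gRPC client metadata):

```go
package main

import "fmt"

// passThroughMatcher mimics a grpc-gateway incoming-header matcher that
// forwards every HTTP request header into gRPC client metadata unchanged.
// Returning (key, true) for all keys disables the default filtering,
// so headers with non-ASCII values reach grpc-go and trigger the error.
func passThroughMatcher(key string) (string, bool) {
	return key, true
}

func main() {
	// Every header survives, including ones gRPC itself will later reject.
	for _, h := range []string{"x-geoip-city", "grpc-test", "authorization"} {
		k, ok := passThroughMatcher(h)
		fmt.Println(k, ok)
	}
}
```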

But I'm not sure and am not that familiar with grpc-gateway (we're also on v1, see also https://github.com/argoproj/argo-workflows/issues/12565#issuecomment-2026688627).

This is actually a security concern because… users can add/change arbitrary headers when they send requests.

That depends if there's no further filtering of headers (e.g. by grpc-gateway itself) and if something uses those headers.

But I'm also not sure there's a blind copy somewhere. argoserver.go is where most of the HTTP and gRPC handling occurs. If you could take a look, could definitely use some more knowledgeable eyes there. Nothing else stands out to me there outside of that line above, which seems fine as it's all request headers/client metadata. I also wonder if grpc-gateway v1 might be the issue.

CosmicToast commented 4 months ago

I can take a look when I have a bit of spare time. I'm not too familiar with argo internals/specifics though.

@agilgur5 could you give me some as-minimal-as-possible set of instructions for me to build and run argo in such a way that I can send it API requests (like the CLI would send), as well as an example of an HTTP request the API might send under normal circumstances?

Then I'll be able to mess with the codebase and test my changes to fully diagnose things while changing headers arbitrarily.

agilgur5 commented 4 months ago

I'm not too familiar with argo internals/specifics though.

The gRPC and HTTP specific handlers are pretty much limited to that single file, argoserver.go, as I wrote above. So fortunately you don't need to know much Argo. Also the rest of the Server, generally speaking, is a relatively straightforward Go CRUD-ish API, mostly located under server/. It's the first place I actually contributed to (#10927)

@agilgur5 could you give me some as-minimal-as-possible set of instructions for me to build and run argo in such a way that I can send it API requests (like the CLI would send)

  1. make cli
  2. ./dist/argo server runs the Server
    • The Server is literally part of the CLI as you can run it locally (and that used to be the only way to run it, which far predates me)
    • also the whole CLI (under cmd/) is a pretty straightforward cobra CLI as well (one of the places we recommend for "good first issues")
    • Note that you probably want to connect the Server to a k8s cluster with a kubeconfig, see the help

as well as an example of an HTTP request the API might send under normal circumstances?

  1. You can send API requests directly through the CLI to the Server, just specify ARGO_SERVER with the server's address (e.g. localhost:2746). If you want to use HTTP directly (vs gRPC), can set ARGO_HTTP1=true. So if you know how to use the CLI, you can continue using it to make things simple.
    • Can also connect to the UI that the Server hosts and use the browser debugging console there.
    • Otherwise I don't manually craft requests (and most of the schema is generated from the protobufs), so I'd have to reverse one if you need it (either manually or via CLI debug logs or via browser console)

Setting debug logging might be helpful for both as well

CosmicToast commented 4 months ago

The above wasn't exactly what I was looking for, so I struggled a little bit in setting it up. I'll give the full reproduction instructions of the type I was looking for before to help everyone else reproduce if needed.

My conclusion is that your analysis was correct while your hunch was not – that line is indeed the cause. As mentioned previously, the error is coming out of the gRPC sending component – this means outgoing request. The flow appears to be like so:

  1. Argo receives an HTTP request. The HTTP request has headers.
  2. grpc-gateway tries to proxy the HTTP request into an outgoing gRPC request. It is aware of the security issues of allowing arbitrary headers through into gRPC client metadata, and therefore filters those using the default filtering rule.
  3. Argo overrides the default filtering rule to allow all headers through "as-is".
  4. If a header contains non-ASCII bytes, it is still let through, and causes the error.

Why this is a security concern:

  1. Any time you allow arbitrary user data to an internal service, this is a security concern. For example, you can never know how the parser might react. This isn't too serious, but then...
  2. Since Argo allows any header through, it will also allow reserved headers starting with grpc-. These are often used for authentication and other internals. The spec explicitly disallows sending those.

Reproduction case:

  1. Get minikube started. Install it and run minikube start. This is required because argo will refuse to run locally without a connected kubernetes cluster.
  2. kubectl apply -f manifests/quick-start-minimal.yaml – this is required to set up everything argo needs to run. It will also launch an upstream version of argo in the cluster, but we don't care about it.
  3. We do not need to set up any authentication to demonstrate this. Otherwise, this is when it would be done.
  4. make cli in order to build the argo cli and be able to start the server locally.
  5. ./dist/argo server -n argo -v to launch the server.
  6. We can now reproduce: curl -k https://localhost:2746/api/v1/workflows/argo -H grpc-test:é. This will return error code 13 with the abovementioned description.

This affects all headers. Another example is curl -k https://localhost:2746/api/v1/workflows/argo -H any-header:é. However, the acceptance of grpc- headers is (as mentioned previously) a security concern. If I was less lazy I would probably file a CVE for it, but I really don't want to do that on a Saturday.

If we comment out the line you mentioned, we get the expected Unauthorized / Token Not Valid error.

Note that since the problem is with the values and the grpc-gateway code does not allow us to filter on those (but only the keys), there is no possible way to deny incorrect headers. To resolve this, Argo must allowlist any headers that may be needed by the downstream gRPC consumer (I have no idea what that is, but whatever it is, it may be compromised otherwise).

As a bare minimum, anything starting with grpc- should be disallowed for security purposes, but that wouldn't resolve this issue. For this issue, a strict allowlist should be set. It should also be investigated as to whether the line is actually required for webhooks (I have doubts as to this).
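The "bare minimum" fix described above could look like the following sketch (my own illustration, not proposed Argo code): forward headers as before, but always drop keys with the reserved grpc- prefix. As noted, this alone would not resolve the non-ASCII-value error.

```go
package main

import (
	"fmt"
	"strings"
)

// denyReservedMatcher forwards headers as-is except for keys with the
// reserved "grpc-" prefix, which the gRPC spec disallows clients from
// sending. Values are still unvalidated, so non-ASCII values would
// still cause error code 13.
func denyReservedMatcher(key string) (string, bool) {
	if strings.HasPrefix(strings.ToLower(key), "grpc-") {
		return "", false
	}
	return key, true
}

func main() {
	fmt.Println(denyReservedMatcher("grpc-test"))    // dropped
	fmt.Println(denyReservedMatcher("x-geoip-city")) // still forwarded
}
```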

agilgur5 commented 4 months ago

The above wasn't exactly what I was looking for, so I struggled a little bit in setting it up.

Sorry about that! You said "as-minimal-as-possible" so I skipped all the k8s steps and assumed you had a cluster available.

My conclusion is that your analysis was correct while your hunch was not – that line is indeed the cause.

2. [...] arbitrary headers through into gRPC client metadata

So this is client metadata, that's why I thought it shouldn't affect it. If you talk to the gRPC server directly you should be able to add the same client metadata (without needing to go through an HTTP conversion), no?

As mentioned previously, the error is coming out of the gRPC sending component – this means outgoing request.

Right, if this is the case, that just means we're allowing conversion of HTTP headers to improper gRPC client metadata. That's a small bug, but not a security issue; the gRPC server is already accessible directly to any clients. We cannot control what clients do, and they can pass buggy client metadata too. So long as the gRPC server's logic isn't compromised with an arbitrary copy, there isn't a security issue.

2. [...] it will also allow reserved headers starting with grpc-. These are often used for authentication and other internals. The spec explicitly disallows sending those.

I'd ask for some links to this as I'm not super familiar with gRPC, but if this is a security issue, it should really go through disclosure and we should avoid adding more details in the public thread.

The link I provided above to the grpc-gateway docs does not state any such security issue, says "client" and "request headers", and even says "to pass through all the HTTP headers" (although the example doesn't quite do that). The default header matcher also explicitly allows Grpc-Metadata-* headers.

The gRPC Metadata docs I found do mention the spec disallowing grpc- and its usage for authentication, but again, those are client headers -- the clients already have access to those headers. Allowing grpc- through would be a bug in the conversion, although I'd be surprised if gRPC still allowed it even with the override; I'd expect it to either ignore them or send an error response.

If I was less lazy I would probably file a CVE for it, but I really don't want to do that on a saturday.

If this is a possible security issue, we should go through disclosure. Even if it turns out not to be, the entire purpose of the process is to not leak potentially harmful details publicly and coordinate properly.

Notably I'm a finder, reporter, and patcher myself and have experience running a security team (InfraSec, AppSec, and ITSec, although the majority of my experience is in InfraSec). I also ensured Draft GHSAs were supported in this repo recently (#12747 et al).

If you're too lazy to do that, I can't really stop you, but probably the simplest alternative I can offer is to DM me on CNCF Slack where I can do the coordination etc myself (and would still credit you as finder).

  1. Any time you allow arbitrary user data to an internal service [...]

For this issue, a strict allowlist should be set. It should also be investigated as to whether the line is actually required for webhooks (I have doubts as to this).

To be clear, I don't disagree with any of this, this is all pretty standard security procedure. I didn't write the code though and it's a few years old, so now we have the more complex problem of fixing something without breaking intended usage. If this is a security issue, it's even more problematic with something that old.

The original PR #3488 is not very detailed, but the original issue #2667 does have some more details. In particular, I found this public Google Doc that immediately lists a few different headers to support arbitrary providers.

If we don't know all the headers beforehand, the simplest fix would be to have an env var allowlist on the Server of possible headers. We should also technically filter that of any explicitly disallowed headers. That would still be breaking and may potentially break things other than webhooks (as it's been around for a while now and could be used for many things now unfortunately), but is small enough in scope to get through if we don't have any better options.

CosmicToast commented 4 months ago

So this is client metadata, that's why I thought it shouldn't affect it.

Right it seems like there's some degree of misunderstanding. Let me clarify a bit more here. The normal logic pathway goes like so:

  1. HTTP request goes to Argo. This contains a request header.
  2. Before reaching argo, it goes through TLS termination and reverse proxying.
  3. The reverse proxy inserts headers (in this case, X-GeoIP-City, but there's other examples available too).
  4. Argo receives the HTTP request and copies all request headers unconditionally into outgoing client metadata.
  5. This fails, because some HTTP request headers may contain non-ascii-printable data.

The problematic threat model goes something like this: if argo is exposed but kubernetes is not (presuming kubernetes is the gRPC endpoint that argo is communicating with post-factum, if it isn't, then whatever that is), you can potentially mess with its internals as an unprivileged user by making argo pass through restricted client metadata. Additionally, it's possible for users to get MITMd with a similar attack.

If you talk to the gRPC server directly you should be able to add the same client metadata (without needing to go through an HTTP conversion), no?

Correct! However, this is not always true. Some users might expose argo without exposing the underlying gRPC server under the assumption that argo will perform validation and cleanup internally, which is the action item.

I'd ask for some links to this as I'm not super familiar with gRPC, but if this is a security issue, it should really go through disclosure and we should avoid adding more details in the public thread.

It's not super high priority, imo, since the set of people that might be affected isn't THAT high. And disclosure has essentially already happened when this issue was created (since there's enough information there to figure this all out :) ). You can also choose to see it as a weakening of the security perimeter, which is similarly not that big a deal. Just, you know, something that should be patched.

The link I provided above to the grpc-gateway docs does not state any such security issue, says "client" and "request headers", and even says "to pass through all the HTTP headers" (although the example doesn't quite do that). The default header matcher also explicitly allows Grpc-Metadata-* headers.

Indeed, the example is quite different and actually fine.

As for the default matcher, it explicitly allows Grpc-Metadata-* headers, but then removes grpc-metadata- from them. So a grpc-metadata-foo header will become the foo client metadata. This is quite important :)
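That prefix-stripping behavior can be sketched like this (a simplified stand-in for grpc-gateway v1's DefaultHeaderMatcher; the real one also canonicalizes keys and forwards a fixed set of permanent HTTP headers under a grpcgateway- prefix):

```go
package main

import (
	"fmt"
	"strings"
)

const metadataHeaderPrefix = "Grpc-Metadata-"

// defaultLikeMatcher is a simplified sketch of grpc-gateway v1's default
// behavior: Grpc-Metadata-* request headers are forwarded into client
// metadata with the prefix stripped; everything else is dropped.
func defaultLikeMatcher(key string) (string, bool) {
	if strings.HasPrefix(key, metadataHeaderPrefix) {
		return key[len(metadataHeaderPrefix):], true
	}
	return "", false
}

func main() {
	fmt.Println(defaultLikeMatcher("Grpc-Metadata-Foo")) // metadata key "Foo"
	// The double-prefix quirk noted later in the thread:
	fmt.Println(defaultLikeMatcher("Grpc-Metadata-Grpc-Test")) // metadata key "Grpc-Test"
	fmt.Println(defaultLikeMatcher("X-GeoIp-City"))            // dropped by default
}
```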

If you're too lazy to do that, I can't really stop you, but probably the simplest alternative I can offer is to DM me on CNCF Slack where I can do the coordination etc myself (and would still credit you as finder).

I don't really use slack outside of work, and you do technically have all of the information needed here. If you really want to have a further discussion in private, feel free to email me. FWIW my background is in infrastructure architecture and systems development, so I have very different tolerances, though my spouse was a cybersec researcher and mostly agrees.

If we don't know all the headers beforehand, the simplest fix would be to have an env var allowlist on the Server of possible headers. We should also technically filter that of any explicitly disallowed headers.

This sounds like the correct and complete solution. I would have a BREAKING changelog notification, with text along the lines of:

If you use Argo Workflows with webhooks, you may need to modify the X environment file, as HTTP headers will no longer be passed through as-is by default.
See further details in [issue] and [pr].
agilgur5 commented 4 months ago

The problematic threat model goes something like this: if argo is exposed but kubernetes is not (presuming kubernetes is the gRPC endpoint that argo is communicating with post-factum, if it isn't, then whatever that is)

Correct! However, this is not always true. Some users might expose argo without exposing the underlying gRPC server

Perhaps another misunderstanding -- the underlying gRPC server is Argo itself. Argo is natively gRPC and uses grpc-gateway to add an HTTP interface to itself.

As such, anything the gRPC server exposes is intended to be exposed by the HTTP interface as well.

Does that explain why I don't see this as a security issue at this time? To me it seems like just an HTTP -> gRPC conversion bug, which is substantially lower priority.

(Also the k8s apiserver does not support gRPC clients (see also https://github.com/kubernetes/kubernetes/issues/87421), so Argo couldn't connect to it that way even if it wanted to or erroneously tried to.)

As for the default matcher, it explicitly allows Grpc-Metadata-* headers, but then removes grpc-metadata- from them. So a grpc-metadata-foo header will become the foo client metadata. This is quite important :)

Yes, it also does the opposite as well. Or, in other words, that very much seems like a correctness issue, not a security issue.

I would have a BREAKING changelog notification, with text along the lines of:

Yea of course, the primary difference is that if there's a security issue, we'd backport it as a patch, regardless of it being breaking. Whereas, if it's a plain bug, I might opt to take an incremental approach where some parts can be backported (e.g. explicit disallows, correctness fixes) and others are left to a breaking minor release (e.g. required allowlist).

CosmicToast commented 4 months ago

Perhaps another misunderstanding -- the underlying gRPC server is Argo itself.

Oh indeed that is a misunderstanding then. I was thinking Argo made requests to the Kubernetes CRI shim and was sending that over. This indeed would have been problematic, but as it is I agree with you – it's merely a service degradation against perfectly valid HTTP requests.

Yes, it also does the opposite as well.

Looking at it, it also has a very fun side-effect: Grpc-Metadata-Grpc-Something would similarly be forwarded as Grpc-Something 😄. That doesn't seem very good to me (and may be more applicable to applications other than Argo, but based on your input above we indeed don't care).

In summary, I do apologize for my misunderstanding of what the underlying gRPC server is. In this context, if you can reach Argo by HTTP, then you can also reach it by gRPC, and there is no security concern. Though this may not hold for some other users of grpc-gateway, I'm not even close to familiar enough with it to give any serious commentary. Maybe to be brought up with them later (?)

agilgur5 commented 4 months ago

I was thinking Argo made requests to the Kubernetes CRI shim

No, Argo doesn't touch the CRI, that's lower-level than Argo's internals. The Executors doing a bit of process mangling is the lowest it goes, and we try to limit its dependence when possible (one reason why the Executor has evolved a bit over time).

Looking at it, it also has a very fun side-effect of Grpc-Metadata-Grpc-Something would similarly forward Grpc-Something 😄. That doesn't seem very good to me (and may be more applicable to applications other than Argo, but based on your input above we indeed don't care).

Yea that does seem overly simplistic and potentially a bug in grpc-gateway. Though that's partially why I assume the gRPC Server itself will deny requests with grpc- in them; i.e. it's reserved on the server-side in some way. But I'm not sure.

Though this may not hold for some other users of grpc-gateway, I'm not even close to familiar enough with it to give any serious commentary. Maybe to be brought up with them later (?)

Yea, I would imagine gRPC folks would consider that a user error if someone deployed it to connect to a different service that isn't exposed and ended up exposing things they shouldn't. But there's potentially mitigating docs or configs they could add in such cases. But I'm definitely speculating outside of my currently limited gRPC knowledge.

lucasfcnunes commented 2 weeks ago

@agilgur5 @CosmicToast

Do you guys know how to solve this?

agilgur5 commented 2 weeks ago

The workaround you mentioned above works and the one I mentioned just above it is appropriate. This isn't an Argo header and isn't a security issue, so it's very low priority