Cert injection is not done properly in authn-k8s

orenbm commented 4 years ago

Summary

Currently, we close the WebSocket connection after we send the injection request to kubectl. This is not correct as this way we do not wait for kubectl to send its response.

This has 2 implications:

If the injection request has succeeded we will return a 200 OK response to the 'inject_client_cert' request even if the cert has not finished to be written to the pod. This can lead to a race-condition where the client will try to read the cert before it is there.
If the injection request has failed in the k8s API, we will not write the error to the Conjur logs, and we will falsely return a 200 OK response to the 'inject_client_cert' request

We need to verify that the cert is written to the client's container before returning a 200 OK response.

Steps to Reproduce

Steps to reproduce the behavior:

Simulate an error on the client that will fail the injection (e.g do not give the user permissions on /etc/conjur/ssl)
Run the authn-client
See that you still got a 200 OK response from inject_client_cert although the cert was not copied.

Expected Results

The inject_client_cert request fails if the cert injection fails (and logs are written in the server indicating the reason)
The inject_client_cert request succeeds only after the cert is written

Actual Results (including error logs, if applicable)

The inject_client_cert request succeeds fast.

Reproducible

[x] Always

eladkug commented 4 years ago

This bug should be fixed as part of this bug

izgeri commented 4 years ago

@orenbm won't waiting to return 200 until the cert injection is complete cause the API request to hang?

Our current design is basically (as I understand it):

Receive client cert injection request
Verify it's a valid request
Return 200 if valid, 40x if not
Kick off client cert injection out of band
Client checks for cert if it received a 200 response

I think if we're seriously considering a better pattern than our current one we should consider something like what the Open Service Broker spec uses:

Receive client cert injection request
Verify it's a valid request
Return 202 Accepted (see example) if valid, 40x if not
Kick off client cert injection out of band, and store info about state of request (TBD how)
Client sends requests to polling route to determine if request was successful (see [example[(https://github.com/openservicebrokerapi/servicebroker/blob/v2.13/spec.md#polling-last-operation)), proceeds when it receives a 200 or 400 response

If we are going to revise the flow here, I think @cyberark/aam-architects should weigh in on the impact of having the API hang mid-request.

orenbm commented 4 years ago

thanks @izgeri . Actually the suggested approach here is the current one, at least as seen in the code. Unless i misunderstand the code, we verify that the cert is injected before returning 200 OK. the issue is that we are not doing it properly. you can see the flow here. let me know if you understand it different that me. That's why i have treated this as a bug and not as a behaviour change.

However, i am open to re-thinking this approach and designing it better to not wait for a response but to write the errors to the log nevertheless (which is not happening today).

cyberark / conjur