canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0
4.34k stars 930 forks source link

Race between untrusted operations and events websocket listener #12994

Open masnax opened 7 months ago

masnax commented 7 months ago

An issue causing intermittent test failures in the clustering image refresh tests was fixed by https://github.com/canonical/lxd/pull/12754/commits/e7b3c92f2c5ea0b9de9668f06a17a3fa77b239c0

The problem was a race where an operation would complete after setting up the events websocket listener, but before the listener is added to the event handler, and so we would block while waiting for the operation complete event. The fix works by reducing the time between the creation of the listener and adding it to the handler.

While this works, I think it is a bit obscure and overly couples the client with the order of operations on the server. It made more sense to me to just ensure we set up the client and all its listeners before we make any request to create an operation, which those listeners are meant to track. While testing this over here: https://github.com/canonical/lxd/pull/12885 I found a separate race that is connected to the one solved above, resulting in this error: https://github.com/canonical/lxd/actions/runs/7920353180/job/21623203927?pr=12885#step:13:15409

This happens if we run lxc image copy localhost:${fingerprint} lxd2: --mode=pushto copy an image across 2 remotes.

Internally, we make the following POST /1.0/images/{fingerprint}/export to localhost here. From this endpoint we make a CreateImage request to lxd2 without a client keypair. The request is untrusted since it's run from the server, but when we add lxd2 as a remote, it only trusts the client, not the server.

The race comes in here when CreateImage performs the request which spawns an operation on lxd2. Around this time, it also sets up an event listener which requires trust. If the operation completes before we set up the listener then everything works fine, but if the listener manages to start up first, then the request will fail with an unauthorized error.

I think a simple solution in this case would be to just ensure that we don't spawn an events listener if the client hasn't been assigned a keypair. We can infer from this that the request we are trying to make is going to be untrusted, so we shouldn't try to separately connect to a trusted endpoint.

tomponline commented 7 months ago

I think a simple solution in this case would be to just ensure that we don't spawn an events listener if the client hasn't been assigned a keypair. We can infer from this that the request we are trying to make is going to be untrusted, so we shouldn't try to separately connect to a trusted endpoint.

This makes sense. Although I am wondering why the image copy succeeds if the two servers dont trust each other and there is no temporary key?

masnax commented 7 months ago

I think a simple solution in this case would be to just ensure that we don't spawn an events listener if the client hasn't been assigned a keypair. We can infer from this that the request we are trying to make is going to be untrusted, so we shouldn't try to separately connect to a trusted endpoint.

This makes sense. Although I am wondering why the image copy succeeds if the two servers dont trust each other and there is no temporary key?

There is a secret set in the request header or operation metadata which is manually verified in the corresponding images and operations endpoints, if that's what you mean by using a temporary key.

I looked into handling the secret in the events endpoints as well, but I couldn't figure out an elegant solution for verifying it. Best I thought of is to intercept any events going through the websocket and only propagate the events which contain an operation that has the secret in its metadata. That seems overly complex to me.

tomponline commented 7 months ago

@masnax ok in that case im not following what you mean by:

I think a simple solution in this case would be to just ensure that we don't spawn an events listener if the client hasn't been assigned a keypair

masnax commented 7 months ago

@masnax ok in that case im not following what you mean by:

I think a simple solution in this case would be to just ensure that we don't spawn an events listener if the client hasn't been assigned a keypair

The client function which spawns the racy event listener is CreateImage. It sets up an http.Request with the secret in its header. Performing the request spawns an operation.

The request that spawns the operation bypasses the the normal auth process of comparing the request peer certs to the locally trusted certificates, and instead verifies the secret, which was sent to that server beforehand.

CreateImage returns an operation struct. When we call operation.Wait, we create an events websocket listener which will listen for any type of event (logging, operations, etc). This listener hits the /1.0/events endpoint which only supports the traditional auth method.

masnax commented 7 months ago

So what I'm suggesting is to not leave spawning the listener until after the operation is created and we call operation.Wait, but instead create it before we make the request that spawns the operation. We do this under the constraint that the underlying ProtocolLXD actually has a client keypair assigned to it, and otherwise set skipListener to true.

tomponline commented 7 months ago

@masnax ok lets have a look at a PR that does this when you get a chance. Thanks

tomponline commented 6 months ago

Filtering events by what the user can see, ala https://github.com/lxc/incus/issues/292 would be helpful here.

@markylaing says

Ultimately, filtering events by what the user is able to view would solve the PR. Perhaps we can add an operation_secret query parameter and allow untrusted requests on the events endpoint.