apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
13.9k stars 3.39k forks source link

[C++][FlightRPC] FlightClient.authenticate is not thread safe #41919

Open jcferretti opened 1 month ago

jcferretti commented 1 month ago

Describe the bug, including details regarding any error messages, version, and platform.

This is the cython code (pyx) implementation of flight_client.authenticate: https://github.com/apache/arrow/blob/6a28035c2b49b432dc63f5ee7524d76b4ed2d762/python/pyarrow/_flight.pyx#L1440

   1440     def authenticate(self, auth_handler, options: FlightCallOptions = None):
   1441         """Authenticate to the server.
   1442
   1443         Parameters
   1444         ----------
   1445         auth_handler : ClientAuthHandler
   1446             The authentication mechanism to use.
   1447         options : FlightCallOptions
   1448             Options for this call.
   1449         """
   1450         cdef:
   1451             unique_ptr[CClientAuthHandler] handler
   1452             CFlightCallOptions* c_options = FlightCallOptions.unwrap(options)
   1453
   1454         if not isinstance(auth_handler, ClientAuthHandler):
   1455             raise TypeError(
   1456                 "FlightClient.authenticate takes a ClientAuthHandler, "
   1457                 "not '{}'".format(type(auth_handler)))
   1458         handler.reset((<ClientAuthHandler> auth_handler).to_handler())
   1459         with nogil:
   1460             check_flight_status(
   1461                 self.client.get().Authenticate(deref(c_options),
   1462                                                move(handler)))

Note the object being stored inside the handler object is returned by to_handler defined later on the same file: https://github.com/apache/arrow/blob/6a28035c2b49b432dc63f5ee7524d76b4ed2d762/python/pyarrow/_flight.pyx#L2501

   2482 cdef class ClientAuthHandler(_Weakrefable):
   2483     """Authentication plugin for a client."""
   2484 
   [...]
   2500 
   2501     cdef PyClientAuthHandler* to_handler(self):
   2502         cdef PyClientAuthHandlerVtable vtable
   2503         vtable.authenticate = _client_authenticate
   2504         vtable.get_token = _get_token
   2505         return new PyClientAuthHandler(self, vtable)

The call to Authenticate on _flight.pyx line 1461 is to grpc_client.cc line 860:

https://github.com/apache/arrow/blob/6a28035c2b49b432dc63f5ee7524d76b4ed2d762/cpp/src/arrow/flight/transport/grpc/grpc_client.cc#L860

    860   Status Authenticate(const FlightCallOptions& options,
    861                       std::unique_ptr<ClientAuthHandler> auth_handler) override {
    862     auth_handler_ = std::move(auth_handler);
    863     ClientRpc rpc(options);
    864     return AuthenticateInternal(rpc);
    865   }

Many calls in the same file use the auth_handler_ member of the struct

https://github.com/apache/arrow/blob/6a28035c2b49b432dc63f5ee7524d76b4ed2d762/cpp/src/arrow/flight/transport/grpc/grpc_client.cc#L1078

that is being assigned on line 862 above, eg, DoPut:

https://github.com/apache/arrow/blob/6a28035c2b49b432dc63f5ee7524d76b4ed2d762/cpp/src/arrow/flight/transport/grpc/grpc_client.cc#L1002

    997   Status DoPut(const FlightCallOptions& options,
    998                std::unique_ptr<internal::ClientDataStream>* out) override {
    999     using GrpcStream = ::grpc::ClientReaderWriter<pb::FlightData, pb::PutResult>;
   1000
   1001     auto rpc = std::make_shared<ClientRpc>(options);
   1002     RETURN_NOT_OK(rpc->SetToken(auth_handler_.get()));
   1003     std::shared_ptr<GrpcStream> stream = stub_->DoPut(&rpc->context);
   1004     *out = std::make_unique<GrpcClientPutStream>(std::move(rpc), std::move(stream));
   1005     return Status::OK();
   1006   }

SetToken on the same file:

https://github.com/apache/arrow/blob/6a28035c2b49b432dc63f5ee7524d76b4ed2d762/cpp/src/arrow/flight/transport/grpc/grpc_client.cc#L87

     86   /// \brief Add an auth token via an auth handler
     87   Status SetToken(ClientAuthHandler* auth_handler) {
     88     if (auth_handler) {
     89       std::string token;
     90       RETURN_NOT_OK(auth_handler->GetToken(&token));
     91       context.AddMetadata(kGrpcAuthHeader, token);
     92     }
     93     return Status::OK();
     94   }

There is a race between calling auth_handler_.get() to get out a raw pointer out of the auth_handler_ shared_ptr and using it inside SetToken, and another thread changing the value of auth_handler_ via the its assignment operator call in Authenticate, which can trigger the deletion of the previously held pointer value. That deletion can happen in another thread after the call to auth_handler_.get() to get the raw pointer value and before SetToken using that raw pointer value.

My team builds a server and client libraries based on flight. One of our customers ran into an issue while using hundreds of concurrent sessions to our service. Our client library was using calls to FlightClient.authenticate triggered by a timer once every 5 minutes; that concurrently with hundreds of calls to DoPut triggered the problem. We were able to reproduce the problem by artificially increasing the frequency of calls to FlightClient.authenticate to 3 seconds; we saw the same problem with either concurrent DoPut or DoGet. We saw the problem on pyarrow 16.0.0 on Ubuntu Linux 22.04.

More details about the symptoms we saw here: https://github.com/deephaven/deephaven-core/pull/5489

Component(s)

FlightRPC

jcferretti commented 1 month ago

Likely related: https://github.com/apache/arrow/issues/38565

kou commented 1 month ago

This will avoid the crash but it seems that it may still use wrong authenticate handler.

jcferretti commented 1 month ago

This will avoid the crash but it seems that it may still use wrong authenticate handler.

By "this" I assume you mean the suggestion I made in the linked issue to change SetToken to take a shared_ptr argument by value instead of a raw pointer.

it may still use wrong authenticate handler.

In what my team is trying to do there is no "wrong" authentication handler. One thread is trying to use the currently registered handler, and another thread is trying to change that handler to a new one, which may or may not use the same authentication token. If the old handler is used, that will use a valid token. If the new handler is used, that will use a valid token (which may or may not be a different authentication token; our server rotates authentication tokens after a period of time, and keeps the old one valid for a while after it has been rotated, because there is no way to prevent several RPCs in flight from different threads from racing with each other, unless the client stopped sending RPCs altogether before trying to get a new token, which it doesn't in our case).

Independently of what my team is trying to do and more generally, if some threads are trying to use the flight client to do operations like DoPut and DoGet, and another thread is changing the authentication handler, there is no way to guarantee an ordering unless the user themselves do some kind of order protection between those threads. What I am trying to elaborate here is that the idea that there is a "right" authentication handler only makes sense if a particular DoGet or DoPut or any other flight operation was done in a context with an expectation on what handler would be used; for that expectation to hold, a user would have to do their own synchronization to ensure that desired ordering, eg, stop doing DoGets or DoPuts or any other operations, change the handler, and then resume. As far as I can tell, there is nothing the flight client itself can, or should do. What I think the flight client can do is to ensure a pointer that was already deleted is not used; since things running from multiple threads can happen in any ordering, if one thread is changing the authentication handler and another is calling DoPut, the thread calling DoPut can't have an expectation on whether the old handler or the new handler will be used. What it can have is an expectation that either of those will be used correctly (as opposed to crashing).

kou commented 1 month ago

By "this" I assume you mean the suggestion I made in the linked issue to change SetToken to take a shared_ptr argument by value instead of a raw pointer.

Right.

our server rotates authentication tokens after a period of time, and keeps the old one valid for a while after it has been rotated

If a server does it and a client is used only for one user, it doesn't use wrong authentication as you said.

But, in general, FlightClient can't assume them. For example, a client may be shared with multiple users.

I don't object this approach because it's better than crashing. Could you open a PR?

But I think that we still need to update our documentation even with this change: https://github.com/apache/arrow/issues/38565#issuecomment-1800131268

jcferretti commented 1 month ago

I don't object this approach because it's better than crashing. Could you open a PR?

Thanks. Done: https://github.com/apache/arrow/pull/41927