Closed fmiguelez closed 1 year ago
We have tested this on 2.7.0, 2.7.1 and 2.7.2
Nice description and proposal Would you have time to contribute an implementation?
@eolivelli I tried looking for exact piece of code where this happens but I find code a little bit difficult for me to follow (at least on github). Probably I can find a solution with some help from someone pointing me in the right direction.
The issue had no activity for 30 days, mark with Stale label.
@nodece my understanding is this issue was not fully fixed with https://github.com/apache/pulsar/issues/17517 and https://github.com/apache/pulsar/pull/17831. Is it still pending on https://github.com/apache/pulsar/pull/18130?
Is it still pending on https://github.com/apache/pulsar/pull/18130?
@gengmao Right.
Is it still pending on #18130?
@gengmao Right.
@nodece @codelipenghui I think that there's a major gap with the solution in #17517, #17831 or #18130 if the intention is to fix token refresh with the Pulsar Proxy.
One detail of the proxy is that after the proxy has connected, it switches to a mode where plain bytes are proxied without decoding the protocol messages.
This is where the frameDecoder is removed: https://github.com/apache/pulsar/blob/82237d3684fe506bcb6426b3b23f413422e6e4fb/pulsar-proxy/src/main/java/org/apache/pulsar/proxy/server/DirectProxyHandler.java#L447-L449
There's no information when a Pulsar binary protocol message starts and ends, and it's not possible to intercept or inject any messages between the client and the broker in the current solution. If any messages are injected, there's a chance that it corrupts the traffic since the message might get injected in a middle of another message since the proxy isn't doing any framing.
Solving this issue will be tricky to solve (while keeping the same performance characteristics). It needs more design discussions and documenting the design.
It feels that there's no real need to inject traffic in the proxy or do any special token refresh. The proxy stops authenticating and simply passes the bytes back and forth after the connection has been established.
Older changes #1707 and #1169 could provide more context.
Thank @lhotari for your explanation.
One detail of the proxy is that after the proxy has connected, it switches to a mode where plain bytes are proxied without decoding the protocol messages.
I know this, so the proxy authentication data should not contain an expiration time, and then just refresh the client authentication data, and we also should notice we have a client on the proxy, which is used to lookup the topic URL and metadata, sometimes the client is ignored.
https://github.com/apache/pulsar/pull/17831 didn't refresh the proxy's client authentication data, here should be the user's authentication data.
The proxy's client passes two authentication data, one from the self, and one from the actual user.
https://github.com/apache/pulsar/pull/18130 fixes a bug that updates the authentication variable, this should be an important fix.
I agree with @lhotari that the solution in https://github.com/apache/pulsar/pull/17831 will lead to corruption.
To take a step back, I think we need to clarify what the actual issue. Once the client's connection through the proxy is established with the broker, the broker's auth refresh challenge protocol takes over, assuming everything is configured correctly.
What appears to be the problem is that the clientAuthData
in the ProxyConnection
is never updated. That means the clientCnxSupplier
will have stale data when making connections to new brokers. I'll try to put together a PR to show a solution.
I agree with @lhotari that the solution in #17831 will lead to corruption.
I am not sure that I think this any more.
The only connections that appear to be the problem are the ones with a ProxyLookupRequests
state. Notably, #17831 only applies to these connections:
I think there is still an additional issue we'll need to solve. I need to sign off now, but I have a partial fix and will continue working on it in a few hours.
One problem I see is that the auth data is hard coded:
As a result, when a client connects for a lookup with an existing client/proxy connection but triggers a new proxy/broker connection, the auth data may be expired.
I have a draft solution for this issue here: https://github.com/apache/pulsar/pull/19026. I am still working on a test.
I know this, so the proxy authentication data should not contain an expiration time, and then just refresh the client authentication data, and we also should notice we have a client on the proxy, which is used to lookup the topic URL and metadata, sometimes the client is ignored.
@nodece When the proxy switches to the mode where plain bytes are proxies, the proxy should never use the authentication data anymore. This is why it's unnecessary to keep it up-to-date after the proxy switches the connection to proxying mode.
What would be helpful is to describe the scenario and sketch sequence diagrams of the scenario where refreshing of the authentication tokens would be needed.
We don't currently have a sufficient description of the problem so that there would be a reasonable solution to the problem.
The connection pool in the proxy is bound to the connection from the client to the proxy. On the client side, the connection from the client to the proxy is keyed by the "logical address" and the "physical address". Java client: https://github.com/apache/pulsar/blob/82237d3684fe506bcb6426b3b23f413422e6e4fb/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ConnectionPool.java#L189-L207 Pulsar Go client: https://github.com/apache/pulsar-client-go/blob/d92fb1407d3d39c8a498dd7c7abdc0bbb3fc7e1a/pulsar/internal/connection_pool.go#L75-L76 Pulsar C++ client: https://github.com/apache/pulsar-client-cpp/blob/fa3ac76eddb08c3eb9d865332214af5aa5a5fe88/lib/ConnectionPool.cc#L64-L65
Because of this reason, the connection pool held within the ProxyConnection should contain only a single connection to do lookups. https://github.com/apache/pulsar/blob/f3608074f537471545926a84df8a1d061b30692c/pulsar-proxy/src/main/java/org/apache/pulsar/proxy/server/ProxyConnection.java#L86-L87
I wonder if we could come up with a better description of the problem and the solution.
I fixed the lookup action by https://github.com/apache/pulsar/pull/17831.
When the proxy switches to the mode where plain bytes are proxies, the proxy should never use the authentication data anymore. This is why it's unnecessary to keep it up-to-date after the proxy switches the connection to proxying mode.
You're referring to the final process, not lookup process, there are two connections between connect to broker
and lookup
.
I fixed the lookup action by #17831.
@nodece Please add the proper problem description to PR #17831. It will be easier to review the PR when there's a clear description of the context.
@nodece since #17831 is already merged, I guess PR #18130 is where there should be proper problem description. Please improve it.
You're referring to the final process, not lookup process, there are two connections between
connect to broker
andlookup
.
@nodece I'm not exactly sure what you mean with this comment.
The ProxyConnection is one of these states:
enum State {
Init,
// Connecting between user client and proxy server.
// Mutual authn needs verify between client and proxy server several times.
Connecting,
// Proxy the lookup requests to a random broker
// Follow redirects
ProxyLookupRequests,
// Connecting to the broker
ProxyConnectingToBroker,
// If we are proxying a connection to a specific broker, we
// are just forwarding data between the 2 connections, without
// looking into it
ProxyConnectionToBroker,
Closing,
Closed,
}
Yes, there's a separate connection for lookups and after the state goes to ProxyConnectingToBroker/ProxyConnectionToBroker, the connection(s) used for lookups, will no longer be needed.
This is what I meant with my comment "When the proxy switches to the mode where plain bytes are proxies, the proxy should never use the authentication data anymore. This is why it's unnecessary to keep it up-to-date after the proxy switches the connection to proxying mode."
I see that you know this, but my point is that it's not described in the PR descriptions. It's really hard to guess what the intention is if there's no code comments or description in the PR of the intent.
For example, iterating through the connections in the connection pool (used for lookup connections) doesn't seem like the correct approach at all: https://github.com/apache/pulsar/pull/17831/files#diff-bbca5aac9dede7618187e91b91ffd0c6c8ffb836cd79ff2d104439e8cf5fc0daR545-R561
Perhaps this works for most cases where there's a single connection in the connection pool. What if there are multiple connections? It would also be helpful to add integration tests where this would happen.
Thank you for your detailed explanation and guidance.
Perhaps this works for most cases where there's a single connection in the connection pool. What if there are multiple connections? It would also be helpful to add integration tests where this would happen.
This scenario would be one client connecting different brokers by proxy, which means they use the same authentication data on the broker, and I send the new authentication data to each broker, maybe there is a performance issue here.
I identified an addition issue with #17831. Specifically, we go to Connecting
state when the broker challenges the client's auth. That is a problem because it could lead to the client's connection getting closed by the proxy if the proxy receives an "unexpected" protocol message.
Here is my comment on the PR: https://github.com/apache/pulsar/pull/17831#discussion_r1080836860
Describe the bug We have Pulsar deployed as a single cluster with a Pulsar Proxy, 3 brokers, 3 bookies and 3 zookeepers.
We use token authentication. Tokens expire every 5 minutes and they are refreshed 1 minute before they expire. We use
Supplier
interface inPulsarClient
to provide a valid token.The thing is that if we connect directly to brokers token refreshing works perfectly (we use a Docker Swarm and deploy broker as a service with multiple instances, so swarm manages balancing of connections). No
ClientCnx
erros are displayed (see #8922) and newReader
,Consumer
orProducer
can be created from same client instance.Te problem comes when we use Pulsar Proxy. It seems that Pulsar Proxy keeps a connection pool against the broker while it manages connections from Pulsar Client. Connection pool against brokers eventually requires creating new connections, same as connections between Client and Proxy. Connection between Client and Proxy always has valid tokens, since when a new connection in pool is required
Supplier
inPulsarClient
returns a currently valid token.The connection between the proxy and the brokers seems to be another story as when at least 5 minutes have elapsed since first token was used for first
Producer
,Reader
orConsumer
new connections may fail between Proxy and brokers. But not in all cases.Below are some logs from a custom test that consumes data from a topic using a
Reader
and produces data every 30 seconds (using a newProducer
each time). The token is refreshed 2 times but after second one creating newProducer
fails.Look at
RefreshableCredentialsSupplier
logs that inform when a token is refreshed and after it is destroyed when was it last accessed byPulsarClient
.If we look at corresponding broker logs (broker3) we see that connection is being stablished using an expired token (at a time AFTER the refresh):
Notice the time 2021-06-03T10:46:56Z that is the expiration time of the second token. First refresh worked OK as broker did not fail past first 5 minutes. Error occurs on broker at
12:47:47.109
reporting an expired token at 2021-06-03T10:46:56Z. However that token had already been refreshed one minute before at2021-06-03 12:45:59.301
(local time is CEST or GMT+2):After that moment following producers were created at following moments (using third token)
2021-06-03 12:46:11.809
,2021-06-03 12:46:43.532
,2021-06-03 12:47:15.302
. However at2021-06-03 12:47:38.882
client reported connection failed (using second token) for error occurring on broker at12:47:47.109
.Expected behavior
All previous tokens belonging to same client must be refreshed at proxy connection pool once they are obtained through connections from client.
This is a blocker issue that prevents us from using Pulsar Proxy on our clusters.