HaraldGustafsson opened this issue 6 years ago
Now verified the same issue on a full Pulsar setup.
I think the problem might be related to concurrent updates of the namespace policy. We can probably handle the bad version and retry, or put all the updates on the same namespace in an ordered executor.
Yes, the ZooKeeper bad version implies that Pulsar has concurrent updates. At least a retry is needed. I don't know if a queue would help: do you only have one entry point? Otherwise you would still have potential concurrent updates. Harald
Hi, just a comment: isn't it more usual to return 503 Service Unavailable for a retry scenario? A conflict status implies that the client has sent a faulty request, which is not the case here, and it can be hard to distinguish this conflict from an actual conflict without also parsing the response text. Or is this ZooKeeper conflict handled inside Pulsar with a retry? Harald
On Fri, 28 Dec 2018, 19:15, Sijie Guo <notifications@github.com> wrote:
Closed #2952 (https://github.com/apache/pulsar/issues/2952) via 1fa81ca (https://github.com/apache/pulsar/commit/1fa81ca54f688938c83a0be3c98bebc7589be668).
@HaraldGustafsson When a bad version happens, it means there are concurrent updates to the namespace policy. It is hard to determine what the real conflict is and what kind of conflict it is, so returning a CONFLICT status code sounds like the right approach to me. IMO, Pulsar applications that use the Pulsar admin API should catch this exception and retry, because those applications are in a better position to resolve the conflict. Does that align with what you are thinking?
I understand the reasoning, but from the API it is not obvious that changing permissions on different topics in the same namespace can cause this concurrency conflict (and should be retried), unless you know that all topic permissions are stored in the namespace policy. At least I did not know it before digging a bit deeper. Once I realised this, I had to queue such operations on the client side and add retries, since multiple workers might attempt changes at the same time. So when setting up many topics, with several configurations like retention, TTL, permissions, etc., all the calls need to be queued and each pays a round-trip delay. In my setup, for each admin request we receive, we need to issue around 10 admin requests to Pulsar, and even more with retries. Would it be possible to introduce a larger namespace policy update method that changes multiple things at once? In my case the conflict was often caused by changing several things in the namespace policy.
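As an aside, here is a rough sketch of the client-side workaround described above, assuming the Java admin client: queue all the calls on a single-threaded executor so they never race with each other from this client, and retry on 409 Conflict. The class name, retry count, and sleep values are arbitrary choices for illustration, not anything prescribed by Pulsar:

```java
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.client.admin.PulsarAdminException;
import org.apache.pulsar.common.policies.data.AuthAction;

// Hypothetical helper: serializes permission updates and retries on conflicts.
public class SerializedPermissionUpdater {
    private final PulsarAdmin admin;
    // A single thread means updates to the same namespace policy never run
    // concurrently from this client.
    private final ExecutorService namespaceQueue = Executors.newSingleThreadExecutor();

    public SerializedPermissionUpdater(PulsarAdmin admin) {
        this.admin = admin;
    }

    public void grantPermission(String topic, String role, Set<AuthAction> actions) {
        namespaceQueue.submit(() -> {
            for (int attempt = 1; attempt <= 5; attempt++) {
                try {
                    admin.topics().grantPermission(topic, role, actions);
                    return;
                } catch (PulsarAdminException e) {
                    // 409 Conflict (ZooKeeper BadVersion) => another update won the race; try again.
                    if (e.getStatusCode() != 409 || attempt == 5) {
                        throw new RuntimeException(e);
                    }
                    try {
                        Thread.sleep(100L * attempt);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            }
        });
    }
}
```

This removes races between this client's own workers, but conflicts with other clients or brokers can still occur, so the retry is still needed.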
@HaraldGustafsson I see your point now. Let me reopen this issue and think through a better solution for addressing such conflicts.
@HaraldGustafsson From my point of view, a retry may not be a good solution, since the retry can also fail again. Maybe a queue, or allowing only one request at a time, is a better idea; as far as I know, kafka-manager changes policy like this.
@sijie what do you think about this?
@foreversunyao -
Maybe a queue, or allowing only one request at a time, is a better idea; as far as I know, kafka-manager changes policy like this.
I was thinking about the same thing. That sgtm.
@sijie I'll try to add one global write lock for this request first; I think it could be enough for policy changes.
@foreversunyao sgtm
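For concreteness, a minimal sketch of the kind of write lock discussed above, assuming one in-process lock per namespace; the class and method names are hypothetical, not existing Pulsar broker code:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical per-namespace lock registry guarding policy read-modify-write cycles.
public class NamespacePolicyLocks {
    private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    public void withNamespaceLock(String namespace, Runnable policyUpdate) {
        ReentrantLock lock = locks.computeIfAbsent(namespace, ns -> new ReentrantLock());
        lock.lock();
        try {
            // Read the policies, modify them, and write them back while holding the lock.
            policyUpdate.run();
        } finally {
            lock.unlock();
        }
    }
}
```

Note that an in-process lock only serializes updates that go through one broker; updates handled by different brokers could still race on the same znode, which is presumably why a ZooKeeper-based mutex comes up next in the thread.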
@sijie For this issue, I am considering creating a persistent znode in ZooKeeper (if it does not exist), with the namespace name as the znode name, as a way to implement a lock for permission changes. But I am not sure which path I should use; what do you think about this?
current path: [zookeeper, counters, ledgers, managed-ledgers, schemas, namespace, admin, loadbalance]
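As an illustration of the proposed mutex znode, a per-namespace lock could be taken roughly like this with the plain ZooKeeper client. The lock path /namespace-policy-locks is an assumption (it is not one of the existing paths listed above, and its parent node is assumed to exist), and the sketch uses an ephemeral node instead of the proposed persistent node, so the lock is released automatically if the holder's session dies:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical sketch of a per-namespace mutex backed by a znode.
public class ZkNamespaceMutex {
    private final ZooKeeper zk;

    public ZkNamespaceMutex(ZooKeeper zk) {
        this.zk = zk;
    }

    /** Try to take the lock for a namespace; returns false if someone else holds it. */
    public boolean tryLock(String namespace) throws KeeperException, InterruptedException {
        String path = "/namespace-policy-locks/" + namespace.replace('/', '_');
        try {
            // Ephemeral node disappears automatically if the lock holder's session dies.
            zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return true;
        } catch (KeeperException.NodeExistsException e) {
            return false;
        }
    }

    public void unlock(String namespace) throws KeeperException, InterruptedException {
        zk.delete("/namespace-policy-locks/" + namespace.replace('/', '_'), -1);
    }
}
```

A caller that fails to acquire the lock still has to wait and retry, or give up after a timeout, which is the concern raised further down in the thread.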
@HaraldGustafsson @sijie @merlimat Hi, I made a PR https://github.com/apache/pulsar/pull/3398/ for this issue, which tries to add a znode as a mutex (to be honest, it is only a shortcut, effectively a queue of size one, and may not be the best way).
Right now, I think we could have two other ways to handle the retry besides this one:
Any ideas? Thanks.
Any news here? I have topics with ~10 different roles, each with different permissions. I manage them with the Pulsar Terraform provider, which basically does 10 POSTs in a row. In a new cluster with no load it was almost certain that already the second POST would fail. Also, the Terraform provider internally uses pulsarctl, which does not have any retry mechanism, although it could, at least while the internals of the server are not fixed.
I am in the exact same situation as @jaak-pruulmann-sympower - have you found any workaround? I just run terraform apply repeatedly until all the changes go through, which under many circumstances takes a long time.
I'm interested in contributing a fix for this since it's a pretty frequent pain point for me; I like the idea of handling the BadVersionException by retrying with a back-off. For consistency-oriented systems like ZooKeeper I'm not sure there's another workaround besides removing the point of contention entirely. Maybe a lock would work too, but that's effectively the same thing as retrying on BadVersionExceptions (i.e. you can still hit a timeout while acquiring the lock). This would be my first contribution, so any guidance would be appreciated.
edit: just noticed a ZooKeeper sequential node was mentioned; that still needs a timeout, since the caller has to wait for its queue item to be processed, and if the queue gets stuck the caller needs to time out. Still a fan of just retrying on BadVersionException to handle the majority of cases, where maybe tens or hundreds of updates occur concurrently.
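A sketch of that retry-with-back-off idea on the caller side; the helper, attempt limit, and delays are made up for illustration and are not part of Pulsar or pulsarctl (a broker-side fix would instead retry on the ZooKeeper BadVersionException before it gets mapped to the 409 that clients see):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.pulsar.client.admin.PulsarAdminException;

// Hypothetical retry wrapper for admin calls that may fail with 409 Conflict.
public final class ConflictRetry {

    public static <T> T withBackoff(Callable<T> adminCall, int maxAttempts, long baseDelayMs)
            throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return adminCall.call();
            } catch (PulsarAdminException e) {
                if (e.getStatusCode() != 409 || attempt >= maxAttempts) {
                    throw e; // not a conflict, or out of attempts
                }
                // Exponential back-off with jitter so concurrent writers spread out.
                long delay = baseDelayMs * (1L << (attempt - 1));
                Thread.sleep(delay + ThreadLocalRandom.current().nextLong(delay));
            }
        }
    }
}
```

An admin call such as a permission grant could then be wrapped in withBackoff with, say, five attempts and a 100 ms base delay; the jitter keeps many concurrent writers from retrying in lock-step.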
Expected behavior
Do a REST call to the admin interface for setting topic permissions like:
body:
Should work every time and return status 204.
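For illustration only, since the exact URL and body are not preserved above: with the Java admin client, a topic-permission grant looks roughly like the sketch below. The service URL, topic, role, and actions are placeholders, not the values from the original report; the REST equivalent posts the list of actions to the topic's permissions endpoint and returns 204 on success.

```java
import java.util.EnumSet;
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.AuthAction;

public class GrantExample {
    public static void main(String[] args) throws Exception {
        // Placeholder admin service URL, topic, and role.
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("https://localhost:8443")
                .build()) {
            admin.topics().grantPermission(
                    "persistent://my-tenant/my-namespace/my-topic",
                    "my-role",
                    EnumSet.of(AuthAction.produce, AuthAction.consume));
        }
    }
}
```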
Actual behavior
When this is done under heavy load, setting up permissions for many topics in parallel, some calls will result in the exception logged below and status code 500.
Steps to reproduce
Set up a tenant and namespace, then run the set-permission call for many different topics in that namespace in parallel. I use the same role and actions for all of them.
System configuration
pulsar: 2.2.0
This error occurred when running the broker in standalone mode in a Docker container supplied on Docker Hub, with authorisation and authentication via client certs. I have not tested on our full Pulsar setup yet.
Logs