apache / pulsar

Apache Pulsar - distributed pub-sub messaging system
https://pulsar.apache.org/
Apache License 2.0

Rebalancing functions does nothing #12928

Open devinbost opened 2 years ago

devinbost commented 2 years ago

Describe the bug

Hitting the endpoint to rebalance functions does not appear to work consistently in Pulsar 2.7.2.

To Reproduce

Steps to reproduce the behavior:

First, we look at the function assignments:

$ curl fab08.example.domain.com:8080/admin/v2/worker/assignments -H "Authorization: Bearer eyJhb...mNog" | python -m json.tool

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1723 100 1723 0 0 210k 0 --:--:-- --:--:-- --:--:-- 210k

{

"c-pulsar-pcdc1-green-test-fw-fab09.example.domain.com-8080": [
"amplitude/processing/random-2:5",
"amplitude/processing/random-2:4",
"amplitude/processing/random-2:3",
"amplitude/processing/random-2:2",
"amplitude/processing/random-2:9",
"amplitude/processing/random-2:8",
"amplitude/processing/random-2:7",
"amplitude/processing/random-2:6",
"amplitude/processing/random-2:1",
"amplitude/processing/random-2:0",
"amplitude/processing/random-1:23",
"amplitude/processing/random-1:21",
"amplitude/processing/random-1:22",
"amplitude/processing/random-1:20",
"amplitude/processing/random-1:14",
"amplitude/processing/random-1:6",
"amplitude/processing/random-1:5",
"amplitude/processing/random-1:15",
"amplitude/processing/random-1:4",
"amplitude/processing/random-1:12",
"amplitude/processing/random-1:13",
"amplitude/processing/random-1:3",
"amplitude/processing/random-2:23",
"amplitude/processing/random-1:10",
"amplitude/processing/random-2:22",
"amplitude/processing/random-1:11",
"amplitude/processing/random-1:9",
"amplitude/processing/random-2:21",
"amplitude/processing/random-1:8",
"amplitude/processing/random-2:20",
"amplitude/processing/random-1:7",
"amplitude/processing/random-1:18",
"amplitude/processing/random-1:19",
"amplitude/processing/random-1:16",
"amplitude/processing/random-1:17",
"amplitude/processing/random-1:2",
"amplitude/processing/random-1:1",
"amplitude/processing/random-1:0",
"amplitude/processing/random-2:16",
"amplitude/processing/random-2:15",
"amplitude/processing/random-2:14",
"amplitude/processing/random-2:13",
"amplitude/processing/random-2:12",
"amplitude/processing/random-2:11",
"amplitude/processing/random-2:10",
"amplitude/processing/random-2:19",
"amplitude/processing/random-2:18",
"amplitude/processing/random-2:17"
]

}

Next, we trigger functions to rebalance:

$ curl fab08.example.domain.com:8080/admin/v2/worker/rebalance -X PUT -H "Authorization: Bearer eyJ...mNog"

Checking function assignments again after a few minutes shows no changes, as demonstrated below:

$ curl fab08.example.domain.com:8080/admin/v2/worker/assignments -H "Authorization: Bearer eyJ...mNog" | python -m json.tool

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1723 100 1723 0 0 560k 0 --:--:-- --:--:-- --:--:-- 560k

{

"c-pulsar-pcdc1-green-test-fw-fab09.example.domain.com-8080": [
"amplitude/processing/random-2:5",
"amplitude/processing/random-2:4",
"amplitude/processing/random-2:3",
"amplitude/processing/random-2:2",
"amplitude/processing/random-2:9",
"amplitude/processing/random-2:8",
"amplitude/processing/random-2:7",
"amplitude/processing/random-2:6",
"amplitude/processing/random-2:1",
"amplitude/processing/random-2:0",
"amplitude/processing/random-1:23",
"amplitude/processing/random-1:21",
"amplitude/processing/random-1:22",
"amplitude/processing/random-1:20",
"amplitude/processing/random-1:14",
"amplitude/processing/random-1:6",
"amplitude/processing/random-1:5",
"amplitude/processing/random-1:15",
"amplitude/processing/random-1:4",
"amplitude/processing/random-1:12",
"amplitude/processing/random-1:13",
"amplitude/processing/random-1:3",
"amplitude/processing/random-2:23",
"amplitude/processing/random-1:10",
"amplitude/processing/random-2:22",
"amplitude/processing/random-1:11",
"amplitude/processing/random-1:9",
"amplitude/processing/random-2:21",
"amplitude/processing/random-1:8",
"amplitude/processing/random-2:20",
"amplitude/processing/random-1:7",
"amplitude/processing/random-1:18",
"amplitude/processing/random-1:19",
"amplitude/processing/random-1:16",
"amplitude/processing/random-1:17",
"amplitude/processing/random-1:2",
"amplitude/processing/random-1:1",
"amplitude/processing/random-1:0",
"amplitude/processing/random-2:16",
"amplitude/processing/random-2:15",
"amplitude/processing/random-2:14",
"amplitude/processing/random-2:13",
"amplitude/processing/random-2:12",
"amplitude/processing/random-2:11",
"amplitude/processing/random-2:10",
"amplitude/processing/random-2:19",
"amplitude/processing/random-2:18",
"amplitude/processing/random-2:17"
]

}
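Comparing two 48-entry listings by eye is error-prone. As a minimal sketch (not part of the original report), the two responses could be compared programmatically once they are parsed from JSON. The function names and the shortened `worker-a` snapshots are hypothetical; only the response shape, a map from worker ID to a list of `tenant/namespace/function:instance` strings, is taken from the output above.

```python
import json


def instance_counts(assignments):
    """Map each worker ID to the number of function instances assigned to it."""
    return {worker: len(instances) for worker, instances in assignments.items()}


def assignment_changes(before, after):
    """Return {worker: (gained, lost)} for every worker whose instance set changed.

    Ordering inside each list is ignored, so a reordered but otherwise
    identical response counts as "no change".
    """
    changes = {}
    for worker in sorted(set(before) | set(after)):
        old = set(before.get(worker, []))
        new = set(after.get(worker, []))
        if old != new:
            changes[worker] = (sorted(new - old), sorted(old - new))
    return changes


# Hypothetical, shortened snapshots in the same shape as the real responses.
before = json.loads('{"worker-a": ["t/ns/fn:0", "t/ns/fn:1"]}')
after = json.loads('{"worker-a": ["t/ns/fn:1", "t/ns/fn:0"]}')

print(instance_counts(before))            # instances per worker
print(assignment_changes(before, after))  # empty dict means nothing moved
```

An empty result from `assignment_changes` corresponds to the situation described here: the rebalance request was sent, but no instance moved to another worker.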

After some experimentation, I found that rebalancing does get triggered when I target the function worker leader, but it's not clear whether this happens consistently.

Expected behavior

Triggering function rebalancing should work consistently regardless of which broker receives the request. Also, any failure should be reported in the logs. In the current implementation, the targeted broker logged nothing about the rebalance request except when it succeeded. The broker that receives the rebalance request should log enough to show when something goes wrong.

devinbost commented 2 years ago

@jerrypeng FYI

eolivelli commented 2 years ago

I believe the problem is that you are missing "-L", which tells curl to follow redirects.

I have a test cluster with two workers, on ports 8080 and 6751. The worker on port 8080 is the "leader".

If I issue this command:

$ curl -v -L -X PUT http://localhost:6751/admin/v2/worker/rebalance

This is the output:

*   Trying ::1...
* TCP_NODELAY set
* Connection failed
* connect to ::1 port 6751 failed: Connection refused
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 6751 (#0)
> PUT /admin/v2/worker/rebalance HTTP/1.1
> Host: localhost:6751
> User-Agent: curl/7.64.1
> Accept: */*
> 
< HTTP/1.1 307 Temporary Redirect
< Date: Wed, 01 Dec 2021 11:08:54 GMT
< Location: http://localhost:8080/admin/v2/worker/rebalance
< Content-Length: 0
< Server: Jetty(9.4.43.v20210629)
< 
* Connection #0 to host localhost left intact
* Issue another request to this URL: 'http://localhost:8080/admin/v2/worker/rebalance'
* Found bundle for host localhost: 0x7fb020e05590 [can pipeline]
* Could pipeline, but not asked to!
*   Trying ::1...
* TCP_NODELAY set
* Connection failed
* connect to ::1 port 8080 failed: Connection refused
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8080 (#1)
> PUT /admin/v2/worker/rebalance HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.64.1
> Accept: */*
> 
< HTTP/1.1 204 No Content
< Date: Wed, 01 Dec 2021 11:08:54 GMT
< broker-address: localhost
< Server: Jetty(9.4.43.v20210629)
< 
* Connection #1 to host localhost left intact
* Closing connection 0
* Closing connection 1

After that, the logs on the "leader" correctly report that the "rebalance" is being performed.
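For clients other than curl, the same behavior has to be implemented explicitly, because many HTTP libraries do not re-send a PUT on a redirect by default. The following is a minimal sketch using only the Python standard library; the function name `put_following_redirects` and its parameters are hypothetical, and it assumes the non-leader worker answers with a 307 (or 308) and a `Location` header, as in the trace above.

```python
import http.client
from urllib.parse import urljoin, urlsplit


def put_following_redirects(url, headers=None, max_redirects=5):
    """Send an empty PUT and re-send it to the Location of any 307/308 reply.

    Returns (final_url, status). Note that the same headers, including any
    Authorization header, are sent to the redirect target; only do this
    when that target is trusted.
    """
    headers = dict(headers or {})
    for _ in range(max_redirects + 1):
        parts = urlsplit(url)
        conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                    else http.client.HTTPConnection)
        conn = conn_cls(parts.netloc, timeout=10)
        try:
            path = parts.path or "/"
            if parts.query:
                path += "?" + parts.query
            conn.request("PUT", path, headers=headers)
            resp = conn.getresponse()
            resp.read()  # drain the (normally empty) body
            status = resp.status
            location = resp.getheader("Location")
        finally:
            conn.close()
        if status in (307, 308) and location:
            # 307/308 keep the method, so the PUT is repeated at the new URL.
            url = urljoin(url, location)
            continue
        return url, status
    raise RuntimeError("too many redirects")
```

Against the test cluster described above, calling `put_following_redirects("http://localhost:6751/admin/v2/worker/rebalance")` would be expected to end at the worker on port 8080, assuming that worker is still the leader.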

github-actions[bot] commented 2 years ago

The issue had no activity for 30 days, mark with Stale label.