Alluxio Version: 2.9.4
Describe the bug
In our production environment, we saw a large number of errors like:
"Unexpected error invoking REST endpoint: Failed to cache: Failed to connect to remote block worker: GrpcServerAddress{HostName=xxx-worker-20.xxx.com, SocketAddress=xxx-worker-20.xxx.com/13.26.4.81:29999
However, the worker xxx-worker-20.xxx.com was actually healthy and running normally. After deeper troubleshooting, we found the cause: there is a bug in how the clientPoolKey is built in some cases.
When the client needs a BlockWorkerClient, it calls FileSystemContext#acquireBlockWorkerClientInternal.
For the ClientPoolKey of mBlockWorkerClientPoolMap, as long as the IP is the same, the existing ClientPool is reused.
As a result, the client gets its worker connection from the client pool map, which in some cases can point to an incorrect worker.
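To make this concrete, here is a minimal, self-contained Java sketch of the pooling pattern described above. It is not the actual Alluxio source; the WorkerClientPool class, the acquire method, and the hostnames/IP are illustrative. It only shows that a pool map keyed by the worker's socket address treats any two workers with the same IP and port as the same entry:

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.SocketAddress;
import java.util.concurrent.ConcurrentHashMap;

public class PoolKeyDemo {
  // Stand-in for a per-worker client pool; the real pool holds gRPC connections.
  static class WorkerClientPool {
    final String createdFor;
    WorkerClientPool(String createdFor) { this.createdFor = createdFor; }
  }

  // Pools cached by socket address, mirroring the "reuse by IP" behavior described above.
  static final ConcurrentHashMap<SocketAddress, WorkerClientPool> mPoolMap = new ConcurrentHashMap<>();

  static WorkerClientPool acquire(String hostName, byte[] ip, int port) throws Exception {
    InetSocketAddress key = new InetSocketAddress(InetAddress.getByAddress(hostName, ip), port);
    // computeIfAbsent returns the cached pool whenever the address (IP + port) matches,
    // regardless of which hostname the caller asked for.
    return mPoolMap.computeIfAbsent(key, k -> new WorkerClientPool(hostName));
  }

  public static void main(String[] args) throws Exception {
    byte[] recycledIp = {13, 26, 4, 81};
    WorkerClientPool p1 = acquire("worker1.example.com", recycledIp, 29999);
    WorkerClientPool p20 = acquire("worker20.example.com", recycledIp, 29999);
    // Same IP and port => same map entry, so worker20 gets worker1's (stale) pool.
    System.out.println(p1 == p20);        // true
    System.out.println(p20.createdFor);   // worker1.example.com
  }
}
```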
In our production Kubernetes network environment, IPs may be reused. For example, the pod worker1 may be deleted for some reason and its IP recycled; when we scale out later, a new worker (e.g. worker20) may be assigned worker1's recycled IP.
When the client tries to connect to worker20, it fails because the client pool cache returns worker1's cached BlockWorkerClientPool, which is actually invalid. The root cause is that worker20 has the same IP as the invalid worker1 entry that was cached in the client pool.
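At the JDK level, the reason the two workers collapse into one cache entry is that a resolved InetAddress ignores the hostname in equals() and hashCode(), so socket addresses built for worker1 and worker20 compare equal whenever they share an IP and port. A small runnable check (hostnames and the IP are made-up examples):

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;

public class SameIpDifferentHost {
  public static void main(String[] args) throws Exception {
    byte[] ip = {13, 26, 4, 81};
    // Two socket addresses for different hostnames that happen to share one recycled IP.
    InetSocketAddress oldWorker =
        new InetSocketAddress(InetAddress.getByAddress("worker1.example.com", ip), 29999);
    InetSocketAddress newWorker =
        new InetSocketAddress(InetAddress.getByAddress("worker20.example.com", ip), 29999);
    // equals()/hashCode() ignore the hostname once the address is resolved,
    // so any map keyed by the socket address treats the two workers as identical.
    System.out.println(oldWorker.equals(newWorker));                   // true
    System.out.println(oldWorker.hashCode() == newWorker.hashCode());  // true
  }
}
```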
To Reproduce
In an IP-reuse scenario, it can be reproduced easily:
delete worker1
scale out workerN so that it is assigned worker1's recycled IP
the client cannot establish a connection to workerN
Expected behavior
The newly scaled-out workerN should serve normally, and the client should be able to establish a correct connection with workerN.
Urgency
This only happens in the IP-reuse scenario, so it is not very urgent.
Are you planning to fix it
Yes, I fixed it in our production environment in a simple way.
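As an illustration only (not necessarily the exact change applied in production), one simple direction is to make the pool key take the worker hostname into account in addition to the socket address, so that a new worker inheriting a recycled IP gets its own client pool. A minimal sketch, with a hypothetical class name and fields:

```java
import java.net.SocketAddress;
import java.util.Objects;

/**
 * Hypothetical pool key that includes the worker hostname in addition to the
 * socket address, so two workers sharing a recycled IP no longer collide.
 * Illustrative only; not the actual Alluxio ClientPoolKey.
 */
public final class HostAwarePoolKey {
  private final String mHostName;
  private final SocketAddress mSocketAddress;

  public HostAwarePoolKey(String hostName, SocketAddress socketAddress) {
    mHostName = hostName;
    mSocketAddress = socketAddress;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof HostAwarePoolKey)) {
      return false;
    }
    HostAwarePoolKey that = (HostAwarePoolKey) o;
    // Hostname participates in equality, so worker1 and worker20 map to different
    // pools even when they share the same IP and port.
    return mHostName.equals(that.mHostName) && mSocketAddress.equals(that.mSocketAddress);
  }

  @Override
  public int hashCode() {
    return Objects.hash(mHostName, mSocketAddress);
  }
}
```

Another option along the same lines would be to evict the cached pool entry when connections to its worker fail, so a stale pool cannot keep being returned for a recycled IP.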