fix using unavailable worker cached in client-pool

Alluxio Version: 2.9.4

Describe the bug

In our production environment we hit many errors of the form "Unexpected error invoking REST endpoint: Failed to cache: Failed to connect to remote block worker: GrpcServerAddress{HostName=xxx-worker-20.xxx.com, SocketAddress=xxx-worker-20.xxx.com/13.26.4.81:29999"

However, the worker xxx-worker-20.xxx.com was actually healthy and running normally. After deeper troubleshooting we traced the problem to a bug in how ClientPoolKey is used in some cases.

When a BlockWorkerClient is needed, FileSystemContext#acquireBlockWorkerClientInternal is called.

In that method, the key of mBlockWorkerClientPoolMap is a ClientPoolKey. As long as the IP is the same, the existing ClientPool is reused.
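The reuse can be illustrated with a minimal, self-contained sketch. It is not Alluxio's code: `DemoPool`, `PoolKeyDemo`, the host names and the 10.0.0.5 address are hypothetical, and the map is keyed by a plain `InetSocketAddress` on the assumption that the real key compares only IP and port in the same way.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a per-worker connection pool (not an Alluxio class). */
final class DemoPool {
  final String createdForHost;

  DemoPool(String createdForHost) {
    this.createdForHost = createdForHost;
  }
}

final class PoolKeyDemo {
  // Assumption: pools are looked up by socket address (IP + port) only.
  static final Map<InetSocketAddress, DemoPool> POOLS = new ConcurrentHashMap<>();

  /** Builds a socket address with the given host name and IPv4 bytes, without a DNS lookup. */
  static InetSocketAddress addr(String host, int a, int b, int c, int d, int port) {
    try {
      InetAddress ip =
          InetAddress.getByAddress(host, new byte[] {(byte) a, (byte) b, (byte) c, (byte) d});
      return new InetSocketAddress(ip, port);
    } catch (UnknownHostException e) {
      throw new IllegalArgumentException(e);
    }
  }

  /** Returns the cached pool for this address, creating one only if none exists yet. */
  static DemoPool acquire(String host, InetSocketAddress address) {
    return POOLS.computeIfAbsent(address, k -> new DemoPool(host));
  }

  public static void main(String[] args) {
    DemoPool first = acquire("worker-1", addr("worker-1", 10, 0, 0, 5, 29999));
    // A different host that received the same IP and port.
    DemoPool second = acquire("worker-20", addr("worker-20", 10, 0, 0, 5, 29999));
    System.out.println("same pool reused: " + (first == second));
    System.out.println("pool was created for: " + second.createdForHost);
  }
}
```

Because `InetSocketAddress` equality ignores the host name when an IP address is present, the second lookup returns the pool that was created for the first host.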

The client then takes its worker connection from the client pool map, which in some cases holds a pool for the wrong worker.

In our production Kubernetes network, IPs may be reused. For example, the pod worker1 may be deleted for some reason and its IP recycled. When we scale out the next time, a new worker (e.g. worker20) may be assigned worker1's recycled IP.

When the client tries to connect to worker20, it fails because the client pool cache returns worker1's cached BlockWorkerClientPool, which is no longer valid. This happens because worker20 has the same IP as the now-invalid worker1 that is still cached in the client pool.
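This description does not say how the fix works, so the following is only one conceivable mitigation and not necessarily what this PR does: include the worker's host name in the cache key, so that a new worker on a recycled IP gets its own pool. `HostAwareKey`, `HostAwarePoolDemo`, the host names and the 10.0.0.5 address are hypothetical, and a string label stands in for a real pool.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical cache key that distinguishes workers by host name as well as address. */
final class HostAwareKey {
  private final String host;
  private final InetSocketAddress address;

  HostAwareKey(String host, InetSocketAddress address) {
    this.host = host;
    this.address = address;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof HostAwareKey)) {
      return false;
    }
    HostAwareKey that = (HostAwareKey) o;
    return Objects.equals(host, that.host) && Objects.equals(address, that.address);
  }

  @Override
  public int hashCode() {
    return Objects.hash(host, address);
  }
}

final class HostAwarePoolDemo {
  // The String value is a label standing in for a real connection pool.
  static final Map<HostAwareKey, String> POOLS = new ConcurrentHashMap<>();

  /** Builds a socket address on the shared demo IP 10.0.0.5 without a DNS lookup. */
  static InetSocketAddress addr(String host, int port) {
    try {
      return new InetSocketAddress(InetAddress.getByAddress(host, new byte[] {10, 0, 0, 5}), port);
    } catch (UnknownHostException e) {
      throw new IllegalArgumentException(e);
    }
  }

  /** Returns the pool label cached for this (host, address) pair, creating it if absent. */
  static String acquire(String host, InetSocketAddress address) {
    return POOLS.computeIfAbsent(new HostAwareKey(host, address), k -> "pool-for-" + host);
  }

  public static void main(String[] args) {
    System.out.println(acquire("worker-1", addr("worker-1", 29999)));
    System.out.println(acquire("worker-20", addr("worker-20", 29999)));
  }
}
```

With this key the two workers no longer share an entry, although the stale entry for the deleted worker still remains in the map until something evicts it.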

To Reproduce

When an IP is reused, the bug is easy to reproduce:

Delete worker1.
Scale out a new workerN that is assigned worker1's IP.
Connections to workerN cannot be established.

Expected behavior

The newly scaled-out workerN serves requests normally, and the client establishes a correct connection to workerN.

Urgency

This only occurs when an IP is reused, so it is not very urgent.

Are you planning to fix it

Yes. I have fixed it in our production environment in a simple way.

fix using unavailable worker cached in client-pool #18693