Alluxio / alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud
https://www.alluxio.io
Apache License 2.0
fix using unavailable worker cached in client-pool #18693

Open yws-tracy opened 2 months ago

yws-tracy commented 2 months ago

Alluxio Version: 2.9.4

Describe the bug

In our production environment, we saw many errors like this:

Unexpected error invoking REST endpoint: Failed to cache: Failed to connect to remote block worker: GrpcServerAddress{HostName=xxx-worker-20.xxx.com, SocketAddress=xxx-worker-20.xxx.com/13.26.4.81:29999

However, the worker xxx-worker-20.xxx.com was actually healthy and running normally. After in-depth troubleshooting, we found the cause: a bug in ClientPoolKey that shows up in some cases.

When a BlockWorkerClient is needed, FileSystemContext#acquireBlockWorkerClientInternal is called.

[Four screenshots were attached here in the original issue.]

As shown above, for ClientPoolKey, the key of mBlockWorkerClientPoolMap, two keys are treated as equal as long as the IP is the same, so the existing ClientPool is reused.

The client then gets its worker connection from the client pool map, which in some cases may hold a pool for the wrong worker.
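The collision described above can be illustrated with plain JDK classes. This is a minimal sketch, not Alluxio code: it assumes the pool key's equality ultimately delegates to a resolved `InetSocketAddress`, as the description suggests. The class name `PoolKeyCollisionDemo`, the helper names, and the host name `xxx-worker-1.xxx.com` for worker1 are hypothetical; the IP and port are taken from the error message.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Map;

class PoolKeyCollisionDemo {
  // Builds a resolved socket address for the given host name without a DNS lookup.
  // Every host gets the same IP and port, mimicking a recycled pod IP.
  static InetSocketAddress resolved(String host) {
    try {
      byte[] ip = {13, 26, 4, 81};
      return new InetSocketAddress(InetAddress.getByAddress(host, ip), 29999);
    } catch (UnknownHostException e) {
      throw new IllegalStateException(e);
    }
  }

  // Stand-in for the pool map: the value is just a label for the cached pool.
  static String poolFoundForWorker20() {
    Map<InetSocketAddress, String> pools = new HashMap<>();
    pools.put(resolved("xxx-worker-1.xxx.com"), "pool created for worker1");
    // A resolved InetSocketAddress compares by IP and port only, so this lookup
    // hits the entry that was stored for worker1.
    return pools.get(resolved("xxx-worker-20.xxx.com"));
  }

  public static void main(String[] args) {
    boolean equal = resolved("xxx-worker-1.xxx.com").equals(resolved("xxx-worker-20.xxx.com"));
    System.out.println(equal);                  // true
    System.out.println(poolFoundForWorker20()); // pool created for worker1
  }
}
```

The host name plays no part in the comparison, which is why a lookup for a new worker can return a pool that was created for a deleted one.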

In our production Kubernetes network environment, IPs may be reused. For example, the pod worker1 may be deleted for some reason and its IP recycled; the next time we scale out, a new worker (e.g., worker20) may reuse the recycled IP of worker1.

When the client tries to establish a connection to worker20, it fails because the client pool cache returns the BlockWorkerClientPool cached for worker1, which is no longer valid. The reason is that worker20 has the same IP as the invalid worker1 that was cached in the client pool.
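One possible way to avoid this kind of collision is to let the host name take part in the key's equality. The sketch below is only an illustration under that assumption. It is not the fix applied in our production environment and not Alluxio's actual ClientPoolKey; `HostAwarePoolKeyDemo`, `HostAwareKey`, the user name `alluxio`, and the worker1 host name are all hypothetical.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class HostAwarePoolKeyDemo {
  /** Hypothetical pool key: host name, socket address and user all take part in equality. */
  static final class HostAwareKey {
    private final String hostName;
    private final InetSocketAddress address;
    private final String user;

    HostAwareKey(String hostName, InetSocketAddress address, String user) {
      this.hostName = hostName;
      this.address = address;
      this.user = user;
    }

    @Override
    public boolean equals(Object o) {
      if (this == o) {
        return true;
      }
      if (!(o instanceof HostAwareKey)) {
        return false;
      }
      HostAwareKey that = (HostAwareKey) o;
      return hostName.equals(that.hostName)
          && address.equals(that.address)
          && Objects.equals(user, that.user);
    }

    @Override
    public int hashCode() {
      return Objects.hash(hostName, address, user);
    }
  }

  // Builds a key for the given host; every host shares the same recycled IP and port.
  static HostAwareKey keyFor(String host) {
    try {
      byte[] ip = {13, 26, 4, 81};
      InetSocketAddress addr = new InetSocketAddress(InetAddress.getByAddress(host, ip), 29999);
      return new HostAwareKey(host, addr, "alluxio");
    } catch (UnknownHostException e) {
      throw new IllegalStateException(e);
    }
  }

  // With the host name in the key, worker20 misses worker1's entry and gets its own pool.
  static String poolFoundForWorker20() {
    Map<HostAwareKey, String> pools = new HashMap<>();
    pools.put(keyFor("xxx-worker-1.xxx.com"), "pool created for worker1");
    return pools.computeIfAbsent(
        keyFor("xxx-worker-20.xxx.com"), k -> "pool created for " + k.hostName);
  }

  public static void main(String[] args) {
    System.out.println(poolFoundForWorker20()); // pool created for xxx-worker-20.xxx.com
  }
}
```

A key like this only helps when the replacement worker has a different host name, and the stale entry for worker1 still stays in the map until something evicts it.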

To Reproduce

In the IP-reuse scenario, it can be reproduced easily:

  1. Delete worker1.
  2. Scale out workerN so that it reuses worker1's IP.
  3. Connections to workerN cannot be established.

Expected behavior

The newly scaled-out workerN can serve normally, and the client can establish a correct connection with workerN.

Urgency

It only occurs in the IP-reuse scenario, so it is not very urgent.

Are you planning to fix it

Yes, I have fixed it in our production environment in a simple way.