Alluxio / alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud
https://www.alluxio.io
Apache License 2.0
6.8k stars, 2.94k forks

[alluxioworker]: Worker restart affects monitoring data and client synchronization of worker status #17549

Open ccy00808 opened 1 year ago

ccy00808 commented 1 year ago

Alluxio Version: 2.7.0

Describe the bug When a worker is restarted, its metadata is not removed from the master's collection of active workers. When the worker re-registers, that worker is locked for the duration of registration. As a result, when monitoring fetches worker-related data or a client synchronizes worker status, the request contends for the lock, cannot obtain it, and has to wait. Service is restored only after registration completes: about 2 to 4 minutes for 5 million blocks (500W), and about 2 hours for 25 million blocks (2500W).
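To make the contention concrete, here is a minimal sketch of the pattern described above. It is not Alluxio code: `RegistrationLockDemo`, `workerLock`, and both methods are hypothetical names, and a plain `ReentrantLock` stands in for whatever lock the master actually takes on the worker.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class RegistrationLockDemo {
    // Hypothetical stand-in for the per-worker lock held on the master.
    private final ReentrantLock workerLock = new ReentrantLock();

    /** Simulates a long registration: holds the lock until 'finish' is counted down. */
    Thread startRegistration(CountDownLatch lockHeld, CountDownLatch finish) {
        Thread t = new Thread(() -> {
            workerLock.lock();
            try {
                lockHeld.countDown();   // signal that registration now owns the lock
                finish.await();         // stands in for processing millions of blocks
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                workerLock.unlock();
            }
        });
        t.start();
        return t;
    }

    /** Simulates a metrics or client read; false means the lock was not obtained in time. */
    boolean tryReadWorkerStatus(long timeoutMs) throws InterruptedException {
        if (!workerLock.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
            return false;
        }
        try {
            return true;
        } finally {
            workerLock.unlock();
        }
    }

    public static void main(String[] args) throws Exception {
        RegistrationLockDemo demo = new RegistrationLockDemo();
        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch finish = new CountDownLatch(1);
        Thread registration = demo.startRegistration(lockHeld, finish);
        lockHeld.await();
        System.out.println("during registration: " + demo.tryReadWorkerStatus(100));
        finish.countDown();
        registration.join();
        System.out.println("after registration: " + demo.tryReadWorkerStatus(100));
    }
}
```

In this sketch the reader gives up after a timeout. In the behaviour reported here the readers appear to wait until registration finishes, which is why the delay grows with the block count.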

To Reproduce Have a worker cache 10 million blocks (1000W), restart the worker, and repeatedly run "wget http://host:19999/metrics/prometheus/" to check whether complete monitoring data can be obtained.
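The repeated check above could also be scripted so that response times are recorded. The following is a rough sketch using the JDK's `java.net.http` client (Java 11+). The class name `MetricsProbe`, the 10-second timeout, the 5-second interval, and the 60 iterations are all assumptions chosen for illustration; `host` is the placeholder from the command above.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class MetricsProbe {
    /** Outcome of one request: HTTP status (-1 on error or timeout), latency, body size. */
    static final class Result {
        final int status;
        final long elapsedMs;
        final int bodyLength;

        Result(int status, long elapsedMs, int bodyLength) {
            this.status = status;
            this.elapsedMs = elapsedMs;
            this.bodyLength = bodyLength;
        }

        /** Treats a probe as healthy if it returned 200 with a non-empty body within maxMs. */
        boolean ok(long maxMs) {
            return status == 200 && bodyLength > 0 && elapsedMs <= maxMs;
        }
    }

    static Result probe(HttpClient client, URI uri, Duration timeout) {
        HttpRequest request = HttpRequest.newBuilder(uri).timeout(timeout).GET().build();
        long start = System.nanoTime();
        try {
            HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
            long ms = (System.nanoTime() - start) / 1_000_000;
            return new Result(response.statusCode(), ms, response.body().length());
        } catch (IOException e) {
            // Includes HttpTimeoutException when the master does not answer in time.
            return new Result(-1, (System.nanoTime() - start) / 1_000_000, 0);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return new Result(-1, (System.nanoTime() - start) / 1_000_000, 0);
        }
    }

    public static void main(String[] args) throws Exception {
        // Pass the real master URL as the first argument; the default is the placeholder.
        URI uri = URI.create(args.length > 0 ? args[0] : "http://host:19999/metrics/prometheus/");
        HttpClient client = HttpClient.newHttpClient();
        for (int i = 0; i < 60; i++) {
            Result r = probe(client, uri, Duration.ofSeconds(10));
            System.out.printf("status=%d elapsedMs=%d bytes=%d ok=%b%n",
                r.status, r.elapsedMs, r.bodyLength, r.ok(10_000));
            Thread.sleep(5_000);
        }
    }
}
```

Start it just before restarting the worker. A run of `status=-1` lines or very large `elapsedMs` values during re-registration would match the behaviour reported here.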

Expected behavior Data can be obtained normally

Urgency Normal

Are you planning to fix it Yes

jiacheliu3 commented 1 year ago

Yes, your thought makes sense. I think one way is to make the BlockMaster remove the worker from the collection when re-registering a worker. Please let us know when you have the PR out, thanks!
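As a rough illustration of that idea (not the actual BlockMaster code), a registry could take the worker out of the active collection when re-registration begins and put it back once registration finishes, so readers never see a worker that is mid-registration. Everything here, including `WorkerRegistrySketch`, its method names, and the use of a block count as the stored value, is hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

public class WorkerRegistrySketch {
    // Hypothetical model: active workers, keyed by worker ID, with their block counts.
    private final Map<Long, Long> activeWorkers = new ConcurrentHashMap<>();
    // Workers that are currently re-registering and hidden from readers.
    private final Set<Long> registeringWorkers = ConcurrentHashMap.newKeySet();

    /** Called when a worker starts (re-)registering: remove it from the active set. */
    void beginRegistration(long workerId) {
        registeringWorkers.add(workerId);
        activeWorkers.remove(workerId);
    }

    /** Called when registration completes: make the worker visible again. */
    void finishRegistration(long workerId, long blockCount) {
        activeWorkers.put(workerId, blockCount);
        registeringWorkers.remove(workerId);
    }

    /** Read path for monitoring and clients: a snapshot of the active worker IDs only. */
    Set<Long> activeWorkerIds() {
        return new TreeSet<>(activeWorkers.keySet());
    }

    boolean isRegistering(long workerId) {
        return registeringWorkers.contains(workerId);
    }

    public static void main(String[] args) {
        WorkerRegistrySketch registry = new WorkerRegistrySketch();
        registry.finishRegistration(1L, 10L);
        registry.finishRegistration(2L, 20L);
        registry.beginRegistration(1L);                  // worker 1 restarts
        System.out.println(registry.activeWorkerIds());  // worker 1 is hidden while registering
        registry.finishRegistration(1L, 5_000_000L);
        System.out.println(registry.activeWorkerIds());  // worker 1 is visible again
    }
}
```

The trade-off in this sketch is that the restarting worker briefly disappears from monitoring output instead of stalling it.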

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.