Open andyzheung opened 1 year ago
@ssz1997 Could you take a look at this issue? Are we doing something expensive during the CR decommission?
@andyzheung From the master log, I see that the master hit an OOM but did not kill or restart itself, so the service was effectively down. The cluster deletion then hung because the master was unavailable, which led to the timeout.
Thanks for reporting. We will investigate why the master did not kill/restart itself after the OOM.
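To illustrate how a Java process can stay up after an OOM, here is a minimal sketch. It assumes the failure mode is an OutOfMemoryError thrown inside one thread while the rest of the JVM keeps running; the thread only confirms that the master did not exit, not this exact mechanism. The class name `OomDemo`, the method `survivesWorkerOom`, and the thread name are hypothetical, and the error is simulated rather than caused by real heap exhaustion.

```java
// Sketch only: a simulated OutOfMemoryError kills one thread, not the JVM.
class OomDemo {
    /** Returns true if the calling thread keeps running after a worker thread dies with an OOM. */
    static boolean survivesWorkerOom() {
        Thread worker = new Thread(() -> {
            // Simulated; a real OOM is raised by the JVM when an allocation fails.
            throw new OutOfMemoryError("simulated heap exhaustion");
        }, "hypothetical-worker");
        // Replace the default stack trace with a one-line message for the demo.
        worker.setUncaughtExceptionHandler((t, e) ->
            System.err.println(t.getName() + " died: " + e));
        worker.start();
        try {
            worker.join(); // wait for the worker to terminate
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            return false;
        }
        // The worker is gone, but this thread (and the process) is still alive.
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("JVM alive after worker OOM: " + survivesWorkerOom());
    }
}
```

HotSpot JVMs (8u92 and later) accept `-XX:+ExitOnOutOfMemoryError`, which makes the process exit on the first OOM raised by the JVM so that a supervisor such as Kubernetes can restart it. That flag does not react to the simulated error above, and whether the Alluxio master in this deployment sets it is not confirmed here.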
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.
Alluxio Version: k8s: v1.21.0 fluid: v0.8.1 alluxio: v2.8.1
Describe the bug: Fluid + Alluxio caches CephFS data on a GPU cluster. Deleting the Alluxio runtime controller takes too long.
To Reproduce: Use Fluid + Alluxio to cache a very large CephFS root directory, possibly around 100 TB of data. The AlluxioRuntime CR then cannot be deleted.
Expected behavior: The Alluxio runtime can be deleted more quickly.
Urgency: The Alluxio master and worker cannot be deleted; the command times out.
Are you planning to fix it: Could Alluxio provide a faster way to clean the cache? When I log in to the GPU cluster, rm -rf /xxxx is very fast.
Additional context
alluxioruntime.txt
alluxio master 11C856E3-4FAF-4b53-9B2B-74FB1584D03A.txt