Alluxio / alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud
https://www.alluxio.io
Apache License 2.0

Problem using Fluid + Alluxio to cache CephFS data #16757

Open andyzheung opened 1 year ago

andyzheung commented 1 year ago

Alluxio Version: Kubernetes v1.21.0, Fluid v0.8.1, Alluxio v2.8.1

Describe the bug: We use Fluid + Alluxio to cache CephFS data on a GPU cluster. Deleting the Alluxio runtime through the controller takes too long.

To Reproduce: Use Fluid + Alluxio to cache a very large CephFS root directory, possibly holding around 100 TB of data. The AlluxioRuntime CR then cannot be deleted.

Expected behavior: The Alluxio runtime can be deleted more quickly.

Urgency: The Alluxio master and worker cannot be deleted; the command times out.

Are you planning to fix it: Can Alluxio provide a faster way to clean the cache? When I log in to the GPU cluster, rm -rf /xxxx is very fast.

Additional context

alluxioruntime.txt

alluxio master 11C856E3-4FAF-4b53-9B2B-74FB1584D03A.txt

beinan commented 1 year ago

@ssz1997 could you take a look at this issue? Are we doing something expensive during the CR decommission?

ssz1997 commented 1 year ago

@andyzheung From the master log, I see that the master hit an OOM but did not kill or restart itself, so the service is effectively down. The cluster deletion then hangs because the master is unavailable, which leads to the timeout.

Thanks for reporting. We will investigate why the master did not kill or restart itself after the OOM.
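
For readers wondering how a JVM process can stay up after an OOM: an OutOfMemoryError ends only the thread that hit it, and by default the JVM itself keeps running. The sketch below illustrates that general JVM behavior with a simulated error; it is not Alluxio code, and the class and thread names are made up for illustration. HotSpot offers the flags -XX:+ExitOnOutOfMemoryError and -XX:+CrashOnOutOfMemoryError to force the process down so that a supervisor such as Kubernetes can restart it. Whether the master in this report ran with either flag is not stated in the thread.

```java
import java.util.concurrent.atomic.AtomicReference;

// Illustrative only: class and thread names are hypothetical, not taken from Alluxio.
class OomSurvivalDemo {

    /**
     * Starts a worker thread that dies with a simulated OutOfMemoryError and
     * reports whether the calling thread (and so the JVM) kept running.
     */
    static boolean jvmSurvivesWorkerOom() throws InterruptedException {
        AtomicReference<Throwable> seen = new AtomicReference<>();

        Thread worker = new Thread(() -> {
            // Simulated: a real OOM would be raised by a failed allocation.
            throw new OutOfMemoryError("simulated heap exhaustion");
        }, "simulated-master-thread");

        // Record the error instead of printing the default stack trace.
        worker.setUncaughtExceptionHandler((thread, error) -> seen.set(error));

        worker.start();
        worker.join();

        // Reaching this line means the JVM did not exit when the worker died.
        return !worker.isAlive() && seen.get() instanceof OutOfMemoryError;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("JVM still running after worker OOM: " + jvmSurvivesWorkerOom());
    }
}
```

A process in this state can still look alive to the operating system while the threads that serve requests are gone, which would match a deletion that waits on the master until it times out.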

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.