fluid-cloudnative / fluid

Fluid, elastic data abstraction and acceleration for BigData/AI applications in cloud. (Project under CNCF)
https://fluid-cloudnative.github.io/
Apache License 2.0
[BUG] Read file slow with alluxioruntime #1127

Open robin1900 opened 3 years ago

robin1900 commented 3 years ago

What is your environment (Kubernetes version, Fluid version, etc.)

Fluid 0.60 with AlluxioRuntime, using Lustre as the backend storage.

Describe the bug

Run in a pod with the dataset bound:

time cp alluxio-2.6.2-bin.tar.gz /dev/null

real 1m36.357s
user 0m0.014s
sys 0m1.047s

The file is about 1.8 GB. Why is reading it so slow the first time?
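To put the timing above in throughput terms, here is a minimal sketch that converts the `real` value printed by `time` into seconds and divides the file size by it. It assumes the file is 1.8 GiB (the report only says "about 1.8GB"); the helper names `parse_time_output` and `throughput_mib_per_s` are illustrative and not part of Fluid or Alluxio.

```python
import re


def parse_time_output(value: str) -> float:
    """Convert a bash `time` duration such as '1m36.357s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s", value.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {value!r}")
    minutes = int(match.group(1) or 0)
    seconds = float(match.group(2))
    return minutes * 60 + seconds


def throughput_mib_per_s(size_gib: float, elapsed_s: float) -> float:
    """Average throughput in MiB/s for `size_gib` GiB read in `elapsed_s` seconds."""
    return size_gib * 1024 / elapsed_s


# Values from the report; the 1.8 GiB size is an approximation.
elapsed = parse_time_output("1m36.357s")
rate = throughput_mib_per_s(1.8, elapsed)
print(f"{elapsed:.3f} s elapsed, {rate:.1f} MiB/s")
```

Under that size assumption, the first read through the Alluxio mount averages roughly 19 MiB/s.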


apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: lustre
spec:
  mounts:


apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: lustre
spec:
  # Add fields here
  replicas: 3
  data:
    replicas: 2
  tieredstore:
    levels:

cheyang commented 3 years ago

Have you tried copying this file from Lustre to the local file system? How long does it take?

robin1900 commented 3 years ago

Yes. It only takes 2 seconds.

cheyang commented 3 years ago

You can check throughput with the Alluxio metrics CLI: https://docs.alluxio.io/os/user/stable/en/operation/Admin-CLI.html#metrics. Did you use the same machine for both copies, and did you clear the page cache? If it took 2 seconds to copy a 1.8 GiB file, that is nearly 1 GiB/second. Can you confirm this?
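To run the comparison the same way on both paths, a minimal sketch of a timed sequential read is shown below. It is not part of Fluid or Alluxio; `timed_read` is a hypothetical helper, and any path passed to it (for example the file on the dataset mount and the same file on the Lustre mount) is an assumption about the local setup. The sketch does not clear the page cache itself: on Linux that is typically done as root by running `sync` and then writing `3` to `/proc/sys/vm/drop_caches` before each measurement.

```python
import time
from typing import Tuple


def timed_read(path: str, chunk_size: int = 4 * 1024 * 1024) -> Tuple[int, float]:
    """Read `path` sequentially; return (bytes_read, elapsed_seconds)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 avoids an extra Python-level buffer; the kernel page cache
    # is still involved unless it was dropped beforehand.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start


def describe(path: str) -> str:
    """Format size, elapsed time, and average throughput for one read of `path`."""
    size, elapsed = timed_read(path)
    mib = size / (1024 * 1024)
    rate = mib / elapsed if elapsed > 0 else float("inf")
    return f"{path}: {mib:.1f} MiB in {elapsed:.3f} s ({rate:.1f} MiB/s)"
```

Calling `describe()` once on each path, from the same machine and with the cache cleared before each call, gives two numbers that can be compared directly.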

robin1900 commented 3 years ago

Yes, I cleared the page cache. It takes 4 seconds from Lustre. The Lustre cluster is small, with 3 nodes, and is based on SSDs, so it is fast.

robin1900 commented 3 years ago

Here are the metrics for copying the file (I cleared the metrics before the copy):

Cluster.BytesReadDirect (Type: COUNTER, Value: 0B)
Cluster.BytesReadDirectThroughput (Type: GAUGE, Value: 0B/MIN)
Cluster.BytesReadDomain (Type: COUNTER, Value: 0B)
Cluster.BytesReadDomainThroughput (Type: GAUGE, Value: 0B/MIN)
Cluster.BytesReadLocal (Type: COUNTER, Value: 0B)
Cluster.BytesReadLocalThroughput (Type: GAUGE, Value: 0B/MIN)
Cluster.BytesReadRemote (Type: COUNTER, Value: 84.17GB)
Cluster.BytesReadRemoteThroughput (Type: GAUGE, Value: 42.09GB/MIN)
Cluster.BytesReadUfsAll (Type: COUNTER, Value: 0B)
Cluster.BytesReadUfsThroughput (Type: GAUGE, Value: 0B/MIN)
Cluster.BytesWrittenDomain (Type: COUNTER, Value: 0B)
Cluster.BytesWrittenDomainThroughput (Type: GAUGE, Value: 0B/MIN)
Cluster.BytesWrittenLocal (Type: COUNTER, Value: 0B)
Cluster.BytesWrittenLocalThroughput (Type: GAUGE, Value: 0B/MIN)
Cluster.BytesWrittenRemote (Type: COUNTER, Value: 0B)
Cluster.BytesWrittenRemoteThroughput (Type: GAUGE, Value: 0B/MIN)
Cluster.BytesWrittenUfsAll (Type: COUNTER, Value: 0B)
Cluster.BytesWrittenUfsThroughput (Type: GAUGE, Value: 0B/MIN)
Cluster.CapacityFree (Type: GAUGE, Value: 571,518,496,412)
Cluster.CapacityFreeTierHDD (Type: GAUGE, Value: 0)
Cluster.CapacityFreeTierMEM (Type: GAUGE, Value: 91,555,901,084)
Cluster.CapacityFreeTierSSD (Type: GAUGE, Value: 479,962,595,328)
Cluster.CapacityTotal (Type: GAUGE, Value: 579,820,584,960)
Cluster.CapacityTotalTierHDD (Type: GAUGE, Value: 0)
Cluster.CapacityTotalTierMEM (Type: GAUGE, Value: 96,636,764,160)
Cluster.CapacityTotalTierSSD (Type: GAUGE, Value: 483,183,820,800)
Cluster.CapacityUsed (Type: GAUGE, Value: 8,302,088,548)
Cluster.CapacityUsedTierHDD (Type: GAUGE, Value: 0)
Cluster.CapacityUsedTierMEM (Type: GAUGE, Value: 5,080,863,076)
Cluster.CapacityUsedTierSSD (Type: GAUGE, Value: 3,221,225,472)
Cluster.RootUfsCapacityFree (Type: GAUGE, Value: 193,345,646,592)
Cluster.RootUfsCapacityTotal (Type: GAUGE, Value: 211,243,999,232)
Cluster.RootUfsCapacityUsed (Type: GAUGE, Value: 17,898,352,640)
Cluster.Workers (Type: GAUGE, Value: 3)
Master.AbsentCacheHits (Type: COUNTER, Value: 0)
Master.AbsentCacheInvalidations (Type: COUNTER, Value: 0)
Master.AbsentCacheMisses (Type: COUNTER, Value: 0)
Master.AbsentCacheSize (Type: GAUGE, Value: 0)
Master.CompleteFileOps (Type: COUNTER, Value: 0)
Master.ConnectFromMaster.UFS:%2Fjournal%2FMetricsMaster.UFS_TYPE:local (Type: TIMER, Value: 0)
Master.Create.UFS:%2Fjournal%2FMetricsMaster.UFS_TYPE:local (Type: TIMER, Value: 0)
Master.CreateDirectoryOps (Type: COUNTER, Value: 0)
Master.CreateFileOps (Type: COUNTER, Value: 0)
Master.DeletePathOps (Type: COUNTER, Value: 0)
Master.DirectoriesCreated (Type: COUNTER, Value: 0)
Master.EdgeCacheEvictions (Type: COUNTER, Value: 0)
Master.EdgeCacheHits (Type: COUNTER, Value: 2,136,970)
Master.EdgeCacheLoadTimes (Type: COUNTER, Value: 0)
Master.EdgeCacheMisses (Type: COUNTER, Value: 0)
Master.EdgeCacheSize (Type: GAUGE, Value: 123,438)
Master.EdgeLockPoolSize (Type: GAUGE, Value: 123,439)
Master.FileBlockInfosGot (Type: COUNTER, Value: 0)
Master.FileInfosGot (Type: COUNTER, Value: 15)
Master.FilesCompleted (Type: COUNTER, Value: 0)
Master.FilesCreated (Type: COUNTER, Value: 0)
Master.FilesFreed (Type: COUNTER, Value: 0)
Master.FilesPersisted (Type: COUNTER, Value: 0)
Master.FilesPinned (Type: GAUGE, Value: 0)
Master.FreeFileOps (Type: COUNTER, Value: 0)
Master.GetAcl.UFS:%2Fjournal%2FMetricsMaster.UFS_TYPE:local (Type: TIMER, Value: 0)
Master.GetAcl.User:root.UFS:%2Fjournal%2FMetricsMaster.UFS_TYPE:local (Type: TIMER, Value: 0)
Master.GetBlockMasterInfo.User:root (Type: TIMER, Value: 88)
Master.GetFileBlockInfoOps (Type: COUNTER, Value: 0)
Master.GetFileInfoOps (Type: COUNTER, Value: 4)
Master.GetFileLocations.User:root.UFS:%2Fjournal%2FMetricsMaster.UFS_TYPE:local (Type: TIMER, Value: 0)
Master.GetNewBlockOps (Type: COUNTER, Value: 0)
Master.GetSpace.UFS:%2Fjournal%2FMetricsMaster.UFS_TYPE:local (Type: TIMER, Value: 0)
Master.GetSpace.User:root.UFS:%2Fjournal%2FMetricsMaster.UFS_TYPE:local (Type: TIMER, Value: 9)
Master.GetStatus.User:root (Type: TIMER, Value: 2)
Master.GetStatus.User:root.UFS:%2Fjournal%2FMetricsMaster.UFS_TYPE:local (Type: TIMER, Value: 0)
Master.GetWorkerInfoList.User:root (Type: TIMER, Value: 2)
Master.InodeCacheEvictions (Type: COUNTER, Value: 0)
Master.InodeCacheHits (Type: COUNTER, Value: 13,135,456)
Master.InodeCacheLoadTimes (Type: COUNTER, Value: 0)
Master.InodeCacheMisses (Type: COUNTER, Value: 0)
Master.InodeCacheSize (Type: GAUGE, Value: 123,439)
Master.InodeLockPoolSize (Type: GAUGE, Value: 123,439)
Master.JournalFlushTimer (Type: TIMER, Value: 0)
Master.LastBackupEntriesCount (Type: GAUGE, Value: -1)
Master.LastBackupRestoreCount (Type: GAUGE, Value: -1)
Master.LastBackupRestoreTimeMs (Type: GAUGE, Value: -1)
Master.LastBackupTimeMs (Type: GAUGE, Value: -1)
Master.ListStatus.UFS:%2Fjournal%2FMetricsMaster.UFS_TYPE:local (Type: TIMER, Value: 15)
Master.ListStatus.User:root (Type: TIMER, Value: 2)
Master.ListStatus.User:root.UFS:%2Fjournal%2FMetricsMaster.UFS_TYPE:local (Type: TIMER, Value: 0)
Master.ListingCacheEvictions (Type: COUNTER, Value: 0)
Master.ListingCacheHits (Type: COUNTER, Value: 2)
Master.ListingCacheLoadTimes (Type: COUNTER, Value: 0)
Master.ListingCacheMisses (Type: COUNTER, Value: 0)
Master.ListingCacheSize (Type: GAUGE, Value: 124,005)
Master.MountOps (Type: COUNTER, Value: 0)
Master.NewBlocksGot (Type: COUNTER, Value: 0)
Master.PathsDeleted (Type: COUNTER, Value: 0)
Master.PathsMounted (Type: COUNTER, Value: 0)
Master.PathsRenamed (Type: COUNTER, Value: 0)
Master.PathsUnmounted (Type: COUNTER, Value: 0)
Master.PerUfsOpConnectFromMaster.UFS:%2Fjournal%2FMetricsMaster (Type: GAUGE, Value: 0)
Master.PerUfsOpCreate.UFS:%2Fjournal%2FMetricsMaster (Type: GAUGE, Value: 0)
Master.PerUfsOpGetFileLocations.UFS:%2Fjournal%2FMetricsMaster (Type: GAUGE, Value: 0)
Master.PerUfsOpGetSpace.UFS:%2Fjournal%2FMetricsMaster (Type: GAUGE, Value: 9)
Master.PerUfsOpGetStatus.UFS:%2Fjournal%2FMetricsMaster (Type: GAUGE, Value: 0)
Master.PerUfsOpListStatus.UFS:%2Fjournal%2FMetricsMaster (Type: GAUGE, Value: 15)
Master.PerUfsSavedOpGET_FILE_INFO.UFS:%2FunderFSStorage (Type: COUNTER, Value: 13)
Master.PerUfsSavedOpLIST_STATUS.UFS:%2FunderFSStorage (Type: COUNTER, Value: 2)
Master.RenamePathOps (Type: COUNTER, Value: 0)
Master.SetAclOps (Type: COUNTER, Value: 0)
Master.SetAttributeOps (Type: COUNTER, Value: 0)
Master.TotalPaths (Type: GAUGE, Value: 123,439)
Master.UfsJournalInitialReplayTimeMs (Type: GAUGE, Value: -1)
Master.UfsSessionCount-Ufs:%2Fjournal%2FMetricsMaster (Type: COUNTER, Value: 0)
Master.UfsSessionCount-Ufs:%2FunderFSStorage (Type: COUNTER, Value: 0)
Master.UnmountOps (Type: COUNTER, Value: 0)
Master.blockHeartbeat.User:root (Type: TIMER, Value: 468)
Master.clearMetrics.User:root (Type: TIMER, Value: 0)
Master.close.UFS:%2Fjournal%2FMetricsMaster.UFS_TYPE:local (Type: TIMER, Value: 0)
Master.commitBlock.User:root (Type: TIMER, Value: 111)
Master.getConfigHash (Type: TIMER, Value: 12)
Master.getConfigHash.User:root (Type: TIMER, Value: 33)
Master.getConfiguration (Type: TIMER, Value: 2)
Master.getConfiguration.User:root (Type: TIMER, Value: 103)
Master.getMasterInfo.User:root (Type: TIMER, Value: 88)
Master.getMetrics.User:root (Type: TIMER, Value: 3)
Master.getPinnedFileIds.User:root (Type: TIMER, Value: 468)
Master.getUfsInfo.User:root (Type: TIMER, Value: 1)
Master.getWorkerId.User:root (Type: TIMER, Value: 0)
Master.registerWorker.User:root (Type: TIMER, Value: 0)

cheyang commented 3 years ago

I suggest repeating the experiment and checking Cluster.BytesReadUfsThroughput (Type: GAUGE, Value: 0B/MIN). Thanks. @apc999, any suggestions?
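When repeating the experiment, it may be easier to pull out only the read counters than to scan the full dump. The following is a minimal sketch that parses text in the `Name (Type: T, Value: V)` format pasted in this thread; `parse_metrics` is a hypothetical helper, not an Alluxio API, and the sample string is copied from the output posted earlier in the thread.

```python
import re

# Matches entries such as: Cluster.BytesReadUfsAll (Type: COUNTER, Value: 0B)
METRIC_RE = re.compile(r"([\w.:%-]+) \(Type:\s*(\w+),\s*Value:\s*([^)]*)\)")


def parse_metrics(text: str) -> dict:
    """Return {metric_name: (type, value)} parsed from a pasted metrics dump."""
    # Collapse line breaks first so entries wrapped across lines still match.
    normalized = " ".join(text.split())
    return {
        name: (kind, value.strip())
        for name, kind, value in METRIC_RE.findall(normalized)
    }


sample = """Cluster.BytesReadRemote (Type: COUNTER, Value: 84.17GB)
Cluster.BytesReadRemoteThroughput (Type: GAUGE, Value: 42.09GB/MIN)
Cluster.BytesReadUfsAll (Type: COUNTER, Value: 0B)
Cluster.BytesReadUfsThroughput (Type: GAUGE, Value: 0B/MIN)
Cluster.Workers (Type:
GAUGE, Value: 3)"""

metrics = parse_metrics(sample)
for name in sorted(metrics):
    if name.startswith("Cluster.BytesRead"):
        kind, value = metrics[name]
        print(f"{name}: {value} ({kind})")
```

Comparing `Cluster.BytesReadUfsAll` and `Cluster.BytesReadUfsThroughput` before and after a cold read shows whether any bytes were recorded as coming from the under file system during the copy.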