Closed LuQQiu closed 2 years ago
Try to scale down the problem: run with 1 master, 1 worker, and 5 Fuse pods (scaled down from 8 workers and 40 Fuse pods). The load is the same for this worker and each Fuse pod: each Fuse pod still reads about 50GB of data, and the worker-to-Fuse-pod ratio is similar. With this setup, reading from cache is faster than reading from UFS, which is the expected behavior.
Rerunning the test with 1 master, 8 workers, and 40 Fuse pods:
Master - very low CPU usage (below 10%), not much network traffic
Worker - medium CPU usage (about 10-30%), many disk reads, lots of network receive and send traffic
Fuse - low CPU usage (3-30%), some writes to disk(?), receives a lot of data over the network
Reading from cache takes about 30 min, compared to about 15 min from UFS.
From the master jstack: all 128 RPC executor threads are waiting for tasks; the master is pretty idle.
alluxio.master.rpc.executor.core.pool.size=128
master-rpc-executor-TPE-thread-1
stackTrace:
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007efddbc297a8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
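This "parked in LinkedBlockingQueue.poll" pattern can be counted mechanically from a dump. A self-contained sketch (the dump below is a fabricated two-thread sample, not the real jstack; point the greps at your own capture instead):

```shell
# Fabricated sample dump: one idle RPC executor thread, one busy one.
cat > /tmp/jstack-master-sample.txt <<'EOF'
"master-rpc-executor-TPE-thread-1" daemon
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)

"master-rpc-executor-TPE-thread-2" daemon
   java.lang.Thread.State: RUNNABLE
        at alluxio.master.file.DefaultFileSystemMaster.getFileInfo
EOF

total=$(grep -c '^"master-rpc-executor' /tmp/jstack-master-sample.txt)
# An executor thread parked in LinkedBlockingQueue.poll is waiting for work.
idle=$(grep -A3 '^"master-rpc-executor' /tmp/jstack-master-sample.txt \
       | grep -c 'LinkedBlockingQueue\.poll')
echo "total=$total idle=$idle"   # total=2 idle=1
```

On a real dump, `idle` close to `total` (here, close to 128) confirms the master is mostly waiting for requests.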
From the Fuse jstack: of the 19 fuse reading threads, 3 are blocked waiting to lock <0x00007fdd22b18a00> (a alluxio.client.file.AlluxioFileInStream) and 3 are blocked waiting to receive data from the network.
Next steps:
@Nizifan
Comparing Cluster_BytesReadUfsAll and Cluster_BytesReadRemote is not an apples-to-apples comparison. They are recorded at different levels in the code base.
BlockReadHandler (BytesReadRemote recorded at this level) -> getDataBuffer -> UFSBlockReader(BytesReadUfs recorded at this level)
BytesReadRemote includes more logic (locking, executor execution), while BytesReadUfs simply measures getting data from the UFS block.
It is better to compare performance from the client side directly (e.g. what are the actual read performance differences between cold read and warm read at the application level). But the difference between BytesReadRemote and BytesReadUfs does provide some information about delays in acquiring locks or executor threads.
jstack-worker.txt jstack-master.txt Get the jstack while running a read from the Alluxio cache.
From the worker jstack: there is only one BlockReader thread, so only one file is currently being read from this worker.
BlockDataReaderExecutor-1046
stackTrace:
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.FileDispatcherImpl.read(FileDispatcherImpl.java:46)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:159)
- locked <0x00007efa2115d2c8> (a java.lang.Object)
at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:253)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1133)
at alluxio.worker.block.io.LocalFileBlockReader.transferTo(LocalFileBlockReader.java:105)
at alluxio.worker.block.io.DelegatingBlockReader.transferTo(DelegatingBlockReader.java:64)
at alluxio.worker.grpc.BlockReadHandler$DataReader.getDataBuffer(BlockReadHandler.java:518)
at alluxio.worker.grpc.BlockReadHandler$DataReader.runInternal(BlockReadHandler.java:405)
at alluxio.worker.grpc.BlockReadHandler$DataReader.run(BlockReadHandler.java:367)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at alluxio.worker.grpc.GrpcExecutors$ImpersonateThreadPoolExecutor.lambda$execute$0(GrpcExecutors.java:125)
at alluxio.worker.grpc.GrpcExecutors$ImpersonateThreadPoolExecutor$$Lambda$338/1544229824.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Locked ownable synchronizers:
- <0x00007efa0f52b288> (a java.util.concurrent.ThreadPoolExecutor$Worker)
From the fuse jstack:
There are 17 libfuse reading threads ("Thread-
There are only two grpc-default-executor threads, and both are waiting for data to arrive.
There are 8 alluxio-client-netty-event-loop-RPC threads and 8 alluxio-client-netty-event-loop-STREAMING threads.
Why does this worker have only one BlockReader thread (i.e., why is it serving only one file read request)?
@Nizifan Some suggestions for investigating this issue:
(1) Take worker jstacks in the middle of the two reads. If the worker cache is balanced, choose two workers at random; if it is unbalanced, choose the one with the most cache and the one with the least cache.
(2) Take multiple fuse jstacks in the middle of the two reads.
(3) For the fuse process, jstack may not be enough, since many threads are simply in RUNNABLE state, which doesn't give much information. Thread profiling with async_profiler or VisualVM (which may be hard to use in a K8s environment) is helpful. If you have extra time, can you capture thread profiles of the read-from-UFS case and the read-from-cache case using async_profiler?
For more information about thread profiling with async-profiler, see https://github.com/jvm-profiling-tools/async-profiler
Profiling the CPU time:
cd async-profiler && \
./profiler.sh -e cpu -d <profiling duration> -f cpu.html `jps | grep AlluxioFuse | awk '{print $1}'`
Launch the command after your program is running and the read from UFS/cache has started. The profiling duration should cover most of the test time, e.g. 15 min for the read from UFS and 30 min for the read from cache.
Profiling wall-clock time may also be helpful:
cd async-profiler && \
./profiler.sh -e wall -t -d <profiling duration> -f cpu.html `jps | grep AlluxioFuse | awk '{print $1}'`
It is better to have two profiling results for comparison: the normal read from UFS as the baseline, and the read from cache as the case to debug.
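The "take multiple fuse jstacks" suggestion above can be scripted. A minimal sketch; COUNT, INTERVAL, and OUTDIR are made-up values, and the capture loop only runs when jps actually finds an AlluxioFuse JVM:

```shell
# Take periodic jstacks of the AlluxioFuse process during a read test.
COUNT=5        # number of dumps to take (assumption)
INTERVAL=60    # seconds between dumps (assumption)
OUTDIR=/tmp/fuse-jstacks
mkdir -p "$OUTDIR"
planned=$(seq 1 "$COUNT" | wc -l)
echo "capturing $planned dumps, one every ${INTERVAL}s"
# Capture only if an AlluxioFuse JVM is running (needs jps/jstack on PATH):
PID=$(jps 2>/dev/null | awk '/AlluxioFuse/ {print $1}')
if [ -n "$PID" ]; then
  for i in $(seq 1 "$COUNT"); do
    jstack "$PID" > "$OUTDIR/jstack-fuse-$i.txt"
    sleep "$INTERVAL"
  done
fi
```

Spacing the dumps over the read window makes it possible to see whether blocked threads are transient or stuck on the same monitor across samples.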
jstack-worker-cache1: 30 BlockDataReaderExecutor threads: 6 blocked (authentication.AuthenticatedClientUser.set), 6 Runnable, 1 getBlockLock, 4 read from UFS, 5 waiting for task, 4 reading file, 3 other
jstack-worker-cache2: 4 BlockDataReaderExecutor threads: 3 waiting for task, 1 reading file
jstack-worker-cache3: 12 BlockDataReaderExecutor threads: 3 waiting for task, 9 reading file
jstack-worker-ufs1: 6 BlockDataReaderExecutor threads: 4 blocked, 2 running
jstack-worker-ufs2: 7 total: 3 waiting, 4 reading
jstack-worker-ufs3: 7 total: 1 waiting, 6 reading
From the jstacks I could not find obvious problems. Based on your previous feedback, maybe some of the workers/fuse pods read fast while some are relatively slow?
Tested in Alluxio. Dataset: 10K files of 130MB each, 1.3TB in total.
Concurrency
Node | Training_Process/Node | Data_Loading_Process/Training_Process | Total concurrency |
---|---|---|---|
8 | 5 | 8 | 320 |
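The total concurrency column is the product of the three factors; a quick arithmetic check (shell):

```shell
# Total concurrency = nodes x training processes per node
#                     x data-loading processes per training process
nodes=8
train_per_node=5
loaders_per_train=8
total=$((nodes * train_per_node * loaders_per_train))
echo "$total"   # 320
```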
Alluxio read performance:
Batch size: 8
# | Data in Alluxio | Fuse metadata cached | Kernel cached | Average batch read time (s) | Node throughput | Cluster throughput | Cluster throughput (GB/2min) |
---|---|---|---|---|---|---|---|
1 | No | No | No | 33.56 | 1.22 files/s, 159.4 MB/s | 9.76 files/s, 1.27 GB/s | 152 |
2 | Yes | No | No | 23.27 | 1.72 files/s, 223.4 MB/s | 13.76 files/s, 1.785 GB/s | 214 |
3 | Yes | No | Yes | 22.12 | 1.8 files/s, 235 MB/s | 14.46 files/s, 1.88 GB/s | 225 |
4 | Yes | Yes | No | 23.35 | 1.7 files/s, 222.7 MB/s | 13.7 files/s, 1.781 GB/s | 214 |
6 | Yes | Yes | Yes | 16.12 | 2.98 files/s, 387.3 MB/s | 23.8 files/s, 3.1 GB/s | 370 |
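As a sanity check on row 1 (my arithmetic, assuming ~130 MB per file from the dataset description and decimal units, 1 GB = 1000 MB), the cluster throughput and GB-per-2-min columns are consistent with each other:

```shell
# 9.76 files/s at ~130 MB per file -> GB/s, then GB per 2-minute window
row1=$(awk 'BEGIN {
  gb_per_s = 9.76 * 130 / 1000
  printf "%.2f GB/s, %.0f GB per 2 min", gb_per_s, gb_per_s * 120
}')
echo "$row1"   # 1.27 GB/s, 152 GB per 2 min
```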
The performance of read from cache doesn't show obvious bugs/problems. According to @Nizifan, their HDFS UFS can reach memory speed under low pressure. The setup uses SSD for Alluxio worker storage, so the gap may simply be because reading from SSD is slower than reading from HDFS directly. It would be helpful to directly test the throughput of the local SSD vs HDFS without Alluxio involved.
Consider closing this issue for now. @Nizifan, if you have future performance requirements or new discoveries, feel free to reopen this issue. Thanks!
Alluxio Version: 2.7.2
Describe the bug
Alluxio cluster is launched with 1 master and 8 workers. 40 training nodes are launched to read data from 40 co-located Fuse pods. Each Fuse pod reads 1/40 of the total data. The dataset is 2TB in total: about 30k files of roughly 70MB each, so each Fuse pod reads about 50GB of data.
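The sizes are self-consistent (my arithmetic, decimal units; integer division makes the per-file figure approximate):

```shell
total_gb=2000   # 2TB dataset
files=30000     # about 30k files
pods=40         # 40 Fuse pods
mb_per_file=$((total_gb * 1000 / files))   # ~66 MB, i.e. roughly the stated 70MB
gb_per_pod=$((total_gb / pods))            # 50 GB per Fuse pod
echo "$mb_per_file $gb_per_pod"
```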
The whole cache was cleaned at around 15:00.
The first round of cold reads from Alluxio is very fast, taking about 15 minutes on average (consistent).
In the second round of warm reads from Alluxio, the speed drops to about 50% of the cold read speed, taking about 40 minutes on average (not consistent).
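In per-pod terms (my arithmetic: each Fuse pod reads 50 GB, decimal units), the two rounds correspond to roughly:

```shell
# 50 GB per pod in 15 min (cold read) vs 40 min (warm read)
rates=$(awk 'BEGIN {
  gb = 50
  printf "cold: %.1f MB/s per pod, warm: %.1f MB/s per pod",
         gb * 1000 / (15 * 60), gb * 1000 / (40 * 60)
}')
echo "$rates"   # cold: 55.6 MB/s per pod, warm: 20.8 MB/s per pod
```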
Bytes read UFS = increase(Cluster_BytesReadUfsAll{fluid_runtime=~"$fluid_runtime",namespace=~"$namespace"}[2m])
Bytes Read Cache = increase(Cluster_BytesReadRemote{fluid_runtime=~"$fluid_runtime",namespace=~"$namespace"}[2m]) - increase(Cluster_BytesReadUfsAll{fluid_runtime=~"$fluid_runtime",namespace=~"$namespace"}[2m])
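The cache-bytes subtraction with hypothetical 2-minute counter increases (numbers made up for illustration, not from this issue) works out as:

```shell
bytes_read_remote=120000000000   # hypothetical increase(Cluster_BytesReadRemote[2m])
bytes_read_ufs=45000000000       # hypothetical increase(Cluster_BytesReadUfsAll[2m])
bytes_read_cache=$((bytes_read_remote - bytes_read_ufs))
echo "$bytes_read_cache"         # bytes served from worker cache in that window
```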