Alluxio / alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud
https://www.alluxio.io
Apache License 2.0

Reading from cache much slower than reading from UFS directly #15026

Closed. LuQQiu closed this issue 2 years ago.

LuQQiu commented 2 years ago

Alluxio Version: 2.7.2

Describe the bug

The Alluxio cluster is launched with 1 master and 8 workers. 40 training nodes read data through 40 co-located Fuse pods, each Fuse pod reading 1/40 of the total data. The dataset is 2 TB in total, about 30k files of roughly 70 MB each, so each Fuse pod reads about 50 GB.

The whole cache was cleaned at around 15:00.

The first-round cold read from Alluxio is very fast, taking 15 minutes on average (consistent). In the second-round warm read from Alluxio, the speed drops to about 50% of the cold-read speed, taking 40 minutes on average (not consistent).

Bytes Read UFS = increase(Cluster_BytesReadUfsAll{fluid_runtime=~"$fluid_runtime",namespace=~"$namespace"}[2m])
Bytes Read Cache = increase(Cluster_BytesReadRemote{fluid_runtime=~"$fluid_runtime",namespace=~"$namespace"}[2m]) - increase(Cluster_BytesReadUfsAll{fluid_runtime=~"$fluid_runtime",namespace=~"$namespace"}[2m])



LuQQiu commented 2 years ago

Tried to scale down the problem: 1 master, 1 worker, and 5 Fuse pods (scaled down from 8 workers and 40 Fuse pods). The load is the same for this worker and for each Fuse pod: each Fuse pod still reads about 50 GB of data, and the worker-to-Fuse-pod ratio is similar. With this setup, reading from cache is faster than reading from UFS, which is the expected behavior.

LuQQiu commented 2 years ago

Reran the test with 1 master, 8 workers, and 40 Fuse pods:

Master - very low CPU usage (below 10%), not much network traffic
Worker - medium CPU usage (about 10-30%), many disk reads, heavy network receive and send
Fuse - low CPU usage (3-30%), some writes to disk(?), receives a lot of data over the network

Reading from cache takes about 30 min, compared to 15 min from UFS.

From the master jstack: all 128 RPC executor threads are waiting for tasks; the master is pretty idle.

alluxio.master.rpc.executor.core.pool.size=128 
master-rpc-executor-TPE-thread-1
stackTrace:
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007efddbc297a8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
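
For reference, a minimal sketch of how dumps like the ones in this thread can be captured (the process names are the standard Alluxio ones shown by jps; adjust to your deployment):

jstack "$(jps | grep AlluxioMaster | awk '{print $1}')" > jstack-master.txt
jstack "$(jps | grep AlluxioWorker | awk '{print $1}')" > jstack-worker.txt
jstack "$(jps | grep AlluxioFuse | awk '{print $1}')" > jstack-fuse.txt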

Fuse: 19 fuse reading threads; 3/19 are blocked waiting to lock <0x00007fdd22b18a00> (a alluxio.client.file.AlluxioFileInStream), and 3/19 are blocked waiting to receive data from the network.
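
A quick way to tally blocked threads in such a dump (the file name is hypothetical; the patterns match standard jstack output):

grep -c 'java.lang.Thread.State: BLOCKED' jstack-fuse.txt
grep 'waiting to lock' jstack-fuse.txt | sort | uniq -c | sort -rn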

Next steps:

LuQQiu commented 2 years ago

@Nizifan Comparing Cluster_BytesReadUfsAll and Cluster_BytesReadRemote is not an apples-to-apples comparison. They are recorded at different levels in the code base.

BlockReadHandler (BytesReadRemote recorded at this level) -> getDataBuffer -> UFSBlockReader (BytesReadUfs recorded at this level)

BytesReadRemote includes more logic (locking, executor scheduling), while BytesReadUfs simply measures getting data from the UFS block.

It is better to compare the performance from the client side directly (e.g., what is the actual read performance difference between cold read and warm read at the application level?).
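
As a rough sketch of such a client-side measurement, time a read of the same file through the Fuse mount for the cold round and again for the warm round (the mount point and file path are assumptions):

time dd if=/mnt/alluxio-fuse/dataset/part-00000 of=/dev/null bs=4M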

But the difference between BytesReadRemote and BytesReadUfs does provide some information about the delay of acquiring locks or executor threads.

LuQQiu commented 2 years ago

jstack-worker.txt jstack-master.txt - jstacks taken while reading from the Alluxio cache.

From the worker jstack: there is only one BlockReader thread, i.e. only one file is currently being read from this worker.

BlockDataReaderExecutor-1046

stackTrace:
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.FileDispatcherImpl.read(FileDispatcherImpl.java:46)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:159)
- locked <0x00007efa2115d2c8> (a java.lang.Object)
at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:253)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1133)
at alluxio.worker.block.io.LocalFileBlockReader.transferTo(LocalFileBlockReader.java:105)
at alluxio.worker.block.io.DelegatingBlockReader.transferTo(DelegatingBlockReader.java:64)
at alluxio.worker.grpc.BlockReadHandler$DataReader.getDataBuffer(BlockReadHandler.java:518)
at alluxio.worker.grpc.BlockReadHandler$DataReader.runInternal(BlockReadHandler.java:405)
at alluxio.worker.grpc.BlockReadHandler$DataReader.run(BlockReadHandler.java:367)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at alluxio.worker.grpc.GrpcExecutors$ImpersonateThreadPoolExecutor.lambda$execute$0(GrpcExecutors.java:125)
at alluxio.worker.grpc.GrpcExecutors$ImpersonateThreadPoolExecutor$$Lambda$338/1544229824.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Locked ownable synchronizers:
- <0x00007efa0f52b288> (a java.util.concurrent.ThreadPoolExecutor$Worker)

From the fuse jstack, there are 17 libfuse reading threads ("Thread-"). Threads 6 and 13, threads 7 and 12, and threads 8 and 10 are reading the same files, so libfuse is probably reading 3 to 14 files through this Fuse pod.

There are only two grpc-default-executor threads, both waiting for data to arrive, plus 8 alluxio-client-netty-event-loop-RPC threads and 8 alluxio-client-netty-event-loop-STREAMING threads.
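
The per-pool thread counts above can be tallied directly from the dumps by their name prefixes (the fuse dump file name is hypothetical):

grep -c 'BlockDataReaderExecutor' jstack-worker.txt
grep -c 'grpc-default-executor' jstack-fuse.txt
grep -c 'alluxio-client-netty-event-loop' jstack-fuse.txt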

LuQQiu commented 2 years ago

Why does this worker have only one BlockReader thread (serving only one file read request)?

LuQQiu commented 2 years ago

@Nizifan Some suggestions for investigating this issue

  1. After the read from UFS, check whether the worker cached size is balanced: did each worker cache 1/8 of the total data size?
  2. During the read from UFS and the read from the Alluxio cache (for comparison): (1) take multiple worker jstacks in the middle of the two reads - if the worker cache is balanced, choose two random workers; if it is unbalanced, choose the one with the most cache and the one with the least cache; (2) take multiple fuse jstacks in the middle of the two reads; (3) for the fuse process, jstack may not be enough, since many threads are just in the Runnable state, which doesn't give much information - thread profiling with async-profiler or VisualVM (which may be hard to use in a K8s environment) is helpful. If you have extra time, can you help take the read-from-UFS and read-from-cache thread profiles using async-profiler? A minimal shell sketch covering suggestions 1 and 2 follows this list.
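
A hedged sketch, assuming shell access to the worker pods and standard Alluxio process names (the loop count and interval are arbitrary; output file names mirror the attachments later in this thread):

# Check whether the cache is balanced across workers (per-worker used capacity)
alluxio fsadmin report capacity

# Capture several worker jstacks spaced apart while the read is running
for i in 1 2 3; do
  jstack "$(jps | grep AlluxioWorker | awk '{print $1}')" > "jstack-worker-cache$i.txt"
  sleep 60
done
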
LuQQiu commented 2 years ago

For more information about thread profiling with async-profiler, see https://github.com/jvm-profiling-tools/async-profiler. Profiling the CPU time:

cd async-profiler && \
./profiler.sh -e cpu -d <profiling duration> -f cpu.html `jps | grep AlluxioFuse | awk '{print $1}'`

Launch the command after your program is running and the read from UFS/cache has started. The profiling duration should cover most of the test time, e.g., 15 min for the read from UFS and 30 min for the read from cache.

Profiling the wall-clock time may also be helpful:

cd async-profiler && \
./profiler.sh -e wall -t -d <profiling duration> -f wall.html `jps | grep AlluxioFuse | awk '{print $1}'`

It is better to have two profiling results for comparison (the normal case, read from UFS, versus the case to debug, read from cache).
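
In a K8s environment, a rough sketch of running the profiler inside a Fuse pod and copying the result out (the pod name and the async-profiler install path are assumptions):

kubectl exec <fuse-pod> -- sh -c \
  'cd /opt/async-profiler && ./profiler.sh -e cpu -d 1800 -f /tmp/cpu.html $(jps | grep AlluxioFuse | awk "{print \$1}")'
kubectl cp <fuse-pod>:/tmp/cpu.html ./cpu-read-from-cache.html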

Nizifan commented 2 years ago

jstack-worker-ufs1.txt jstack-worker-ufs2.txt jstack-worker-ufs3.txt fuse-jstack-ufs-1.txt fuse-jstack-ufs-2.txt fuse-jstack-ufs-3.txt

LuQQiu commented 2 years ago

fuse-jstack-cache-3.txt fuse-jstack-cache-2.txt fuse-jstack-cache-1.txt jstack-worker-cache3.txt jstack-worker-cache2.txt jstack-worker-cache1.txt

LuQQiu commented 2 years ago

jstack-worker-cache1: 30 BlockDataReaderExecutor threads - 6 blocked on authentication.AuthenticatedClientUser.set, 6 Runnable, 1 in getBlockLock, 4 reading from UFS, 5 waiting for tasks, 4 reading files, 3 in other states

jstack-worker-cache2: 4 BlockDataReaderExecutor threads - 3 waiting for tasks, 1 reading a file

jstack-worker-cache3: 12 BlockDataReaderExecutor threads - 3 waiting for tasks, 9 reading files

LuQQiu commented 2 years ago

jstack-worker-ufs1: 6 BlockDataReaderExecutor threads - 4 blocked, 2 running
jstack-worker-ufs2: 7 threads total - 3 waiting, 4 reading
jstack-worker-ufs3: 7 threads total - 1 waiting, 6 reading

LuQQiu commented 2 years ago

From the jstacks, I could not find obvious problems. Based on your previous feedback, maybe some of the workers/Fuse pods read fast while others are relatively slow?

LuQQiu commented 2 years ago

Tested in Alluxio. Dataset: 10k files of 130 MB each, 1.3 TB in total.

Concurrency

| Nodes | Training processes / node | Data-loading processes / training process | Total concurrency |
| ----- | ------------------------- | ----------------------------------------- | ----------------- |
| 8     | 5                         | 8                                          | 320               |
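
That is, total concurrency = 8 nodes × 5 training processes × 8 data-loading processes = 320 concurrent readers.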



Alluxio read performance:

Batch size: 8

| # | Data in Alluxio | Fuse metadata cached | Kernel cached | Avg batch read time (s) | Node throughput           | Cluster throughput         | Cluster throughput (GB/2min) |
| - | --------------- | -------------------- | ------------- | ----------------------- | ------------------------- | -------------------------- | ---------------------------- |
| 1 | No              | No                   | No            | 33.56                   | 1.22 files/s (159.4 MB/s) | 9.76 files/s (1.27 GB/s)   | 152                          |
| 2 | Yes             | No                   | No            | 23.27                   | 1.72 files/s (223.4 MB/s) | 13.76 files/s (1.785 GB/s) | 214                          |
| 3 | Yes             | No                   | Yes           | 22.12                   | 1.8 files/s (235 MB/s)    | 14.46 files/s (1.88 GB/s)  | 225                          |
| 4 | Yes             | Yes                  | No            | 23.35                   | 1.7 files/s (222.7 MB/s)  | 13.7 files/s (1.781 GB/s)  | 214                          |
| 6 | Yes             | Yes                  | Yes           | 16.12                   | 2.98 files/s (387.3 MB/s) | 23.8 files/s (3.1 GB/s)    | 370                          |
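
As a sanity check, the columns are consistent: 2.98 files/s × 130 MB/file ≈ 387 MB/s, and cluster throughput is roughly 8× node throughput (8 × 387.3 MB/s ≈ 3.1 GB/s).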


The read-from-cache performance doesn't show obvious bugs/problems. According to @Nizifan, their HDFS UFS can reach memory speed under low load. The setup uses SSD for Alluxio worker storage, so the gap may simply be because reading from the local SSD is slower than reading from HDFS directly. It would be helpful to directly test the throughput of the local SSD versus HDFS without Alluxio involved.
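
A minimal sketch of such a test, assuming shell access to a worker node (the block-file and HDFS paths are placeholders):

# Read a cached block file from the worker's SSD storage, bypassing Alluxio
# (drop the page cache first so this measures the disk, not memory; needs root)
sync && echo 3 > /proc/sys/vm/drop_caches
time dd if=/ssd/alluxioworker/<block-file> of=/dev/null bs=4M

# Read a comparable file directly from HDFS, bypassing Alluxio
time hdfs dfs -cat /path/to/<file> > /dev/null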

LuQQiu commented 2 years ago

Closing this issue for now. @Nizifan, if you have future performance requirements or new discoveries, feel free to reopen this issue. Thanks!