Alluxio / alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud
https://www.alluxio.io
Apache License 2.0

Short-circuit reads and writes have lower performance than cross-node reads and writes #17967

Open wangw-david opened 1 year ago

wangw-david commented 1 year ago

Is your feature request related to a problem? Please describe. I deployed Alluxio version 2.9.0 (alluxio/alluxio-dev:2.9.0) with Fluid 0.9.1, using two nodes, A and B. Here is my vdbench test configuration:

fsd=fsd1,anchor=/test,width=1,depth=1,files=1,size=10G,openflags=o_direct
fwd=fwd1,fsd=fsd,operation=write,xfersize=4M,fileio=sequential,fileselect=sequential,threads=64
fwd=fwd2,fsd=fsd,operation=read,xfersize=4M,fileio=sequential,fileselect=sequential,threads=1
rd=rd1,fwd=fwd1,fwdrate=max,format=(restart,only),maxdata=10G,elapsed=300,warmup=5,interval=1
rd=rd2,fwd=fwd2,fwdrate=max,format=restart,maxdata=10G,elapsed=300,warmup=5,interval=1
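(For reference, vdbench takes such a parameter file via its -f flag and is pointed at the Alluxio FUSE mount through the anchor= path above; the parameter file and output directory names below are only illustrative, not my exact paths:)

./vdbench -f alluxio_test.conf -o vdbench_output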

Alluxio uses memory (tmpfs, 100G quota) as tieredstore level 0, the read type is CACHE, the write type is ASYNC_THROUGH, and the UFS is S3.
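(For reference, the tieredstore and client defaults described above correspond roughly to the following alluxio-site.properties entries; this is only a sketch, and the tmpfs path and S3 bucket below are placeholders rather than my actual values:)

alluxio.worker.tieredstore.levels=1
alluxio.worker.tieredstore.level0.alias=MEM
alluxio.worker.tieredstore.level0.dirs.mediumtype=MEM
alluxio.worker.tieredstore.level0.dirs.path=/dev/shm
alluxio.worker.tieredstore.level0.dirs.quota=100GB
alluxio.user.file.readtype.default=CACHE
alluxio.user.file.writetype.default=ASYNC_THROUGH
alluxio.master.mount.table.root.ufs=s3://<bucket>/<path>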

I tested the following scenarios:

  1. Both worker and fuse are on node A: readBW: 688.2MB/s, writeBW: 296.94MB/s

PS: I judge that this scenario uses short-circuit reads and writes based on the following configuration and monitoring items:

config:
alluxio.user.short.circuit.enabled=true
alluxio.user.short.circuit.preferred=false

monitoring items:
Cluster.BytesReadLocal (Type: COUNTER, Value: 12.80GB)
Cluster.BytesReadLocalThroughput (Type: GAUGE, Value: 11.19MB/MIN)
Cluster.BytesWrittenLocal (Type: COUNTER, Value: 10.00GB)
Cluster.BytesWrittenLocalThroughput (Type: GAUGE, Value: 8.74MB/MIN)
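(These counters can be read from the web UI or, for example, with the metrics report command; a sketch assuming the Alluxio CLI is available in the pod:)

bin/alluxio fsadmin report metrics | grep -E 'BytesReadLocal|BytesWrittenLocal'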

  2. The worker is on node A, and the fuse is on node B: readBW: 29.49MB/s, writeBW: 379.26MB/s

To address the low read performance, I added the following client-side cache configuration:

alluxio.user.client.cache.enabled: "true"
alluxio.user.client.cache.store.type: MEM
alluxio.user.client.cache.size: 1GB
alluxio.user.client.cache.page.size: 4MB

Repeated the test (memory as tieredstore level 0, 100G quota):

  1. Both worker and fuse are on node A: readBW: 510.9MB/s, writeBW: 368.05MB/s

  2. The worker is on node A, and the fuse is on node B: readBW: 605.78MB/s, writeBW: 422.69MB/s

Then with NVMe as tieredstore level 0 (100G quota):

  1. Both worker and fuse are on node A: readBW: 1067MB/s, writeBW: 697.71MB/s

  2. The worker is on node A, and the fuse is on node B: readBW: 1079MB/s, writeBW: 1077.1MB/s

Question 1: Why are remote reads and writes faster than short-circuit reads and writes? The result is the same after switching the cache to an NVMe disk: cross-node reads and writes are still faster than short-circuit reads and writes. PS: My network bandwidth is very high, so the network is not the bottleneck.

Question 2: Why is the acceleration from NVMe-based caching higher than from memory-based caching?

Configuration of the worker:

alluxio.integration.worker.resource.cpu=1
alluxio.integration.worker.resource.mem=1024MB
alluxio.integration.yarn.workers.per.host.max=1
alluxio.job.master.lost.worker.interval=1sec
alluxio.job.master.worker.heartbeat.interval=1sec
alluxio.job.master.worker.timeout=60sec
alluxio.job.worker.bind.host=0.0.0.0
alluxio.job.worker.data.port=22527
alluxio.job.worker.hostname=
alluxio.job.worker.rpc.port=22956
alluxio.job.worker.threadpool.size=32
alluxio.job.worker.throttling=false
alluxio.job.worker.web.bind.host=0.0.0.0
alluxio.job.worker.web.port=20983
alluxio.master.lost.worker.detection.interval=10sec
alluxio.master.lost.worker.file.detection.interval=5min
alluxio.master.worker.connect.wait.time=5sec
alluxio.master.worker.info.cache.refresh.time=10sec
alluxio.master.worker.register.lease.count=25
alluxio.master.worker.register.lease.enabled=true
alluxio.master.worker.register.lease.respect.jvm.space=true
alluxio.master.worker.register.lease.ttl=1min
alluxio.master.worker.register.stream.response.timeout=10min
alluxio.master.worker.timeout=5min
alluxio.user.block.worker.client.pool.gc.threshold=300sec
alluxio.user.block.worker.client.pool.max=1024
alluxio.user.block.worker.client.pool.min=512
alluxio.user.network.netty.worker.threads=
alluxio.user.network.rpc.netty.worker.threads=0
alluxio.user.network.streaming.netty.worker.threads=0
alluxio.user.worker.list.refresh.interval=2min
alluxio.worker.allocator.class=alluxio.worker.block.allocator.MaxFreeAllocator
alluxio.worker.bind.host=0.0.0.0
alluxio.worker.block.annotator.class=alluxio.worker.block.annotator.LRUAnnotator
alluxio.worker.block.annotator.lrfu.attenuation.factor=2.0
alluxio.worker.block.annotator.lrfu.step.factor=0.25
alluxio.worker.block.heartbeat.interval=1sec
alluxio.worker.block.heartbeat.timeout=${alluxio.worker.master.connect.retry.timeout}
alluxio.worker.block.master.client.pool.size=1024
alluxio.worker.block.store.type=FILE
alluxio.worker.container.hostname=
alluxio.worker.data.folder=/alluxioworker/
alluxio.worker.data.folder.permissions=rwxrwxrwx
alluxio.worker.data.folder.tmp=.tmp_blocks
alluxio.worker.data.server.domain.socket.address=
alluxio.worker.data.server.domain.socket.as.uuid=false
alluxio.worker.data.tmp.subdir.max=1024
alluxio.worker.evictor.class=
alluxio.worker.free.space.timeout=10sec
alluxio.worker.fuse.enabled=false
alluxio.worker.hostname=
alluxio.worker.jvm.monitor.enabled=true
alluxio.worker.keytab.file=
alluxio.worker.management.backoff.strategy=ANY
alluxio.worker.management.block.transfer.concurrency.limit=2
alluxio.worker.management.load.detection.cool.down.time=10sec
alluxio.worker.management.task.thread.count=4
alluxio.worker.management.tier.align.enabled=true
alluxio.worker.management.tier.align.range=100
alluxio.worker.management.tier.align.reserved.bytes=1GB
alluxio.worker.management.tier.promote.enabled=true
alluxio.worker.management.tier.promote.quota.percent=90
alluxio.worker.management.tier.promote.range=100
alluxio.worker.management.tier.swap.restore.enabled=true
alluxio.worker.master.connect.retry.timeout=1hour
alluxio.worker.master.periodical.rpc.timeout=5min
alluxio.worker.network.async.cache.manager.queue.max=512
alluxio.worker.network.async.cache.manager.threads.max=8
alluxio.worker.network.block.reader.threads.max=2048
alluxio.worker.network.block.writer.threads.max=1024
alluxio.worker.network.flowcontrol.window=2MB
alluxio.worker.network.keepalive.time=30sec
alluxio.worker.network.keepalive.timeout=30sec
alluxio.worker.network.max.inbound.message.size=4MB
alluxio.worker.network.netty.boss.threads=1
alluxio.worker.network.netty.channel=EPOLL
alluxio.worker.network.netty.shutdown.quiet.period=2sec
alluxio.worker.network.netty.watermark.high=32KB
alluxio.worker.network.netty.watermark.low=8KB
alluxio.worker.network.netty.worker.threads=4
alluxio.worker.network.permit.keepalive.time=30s
alluxio.worker.network.reader.buffer.pooled=true
alluxio.worker.network.reader.buffer.size=32MB
alluxio.worker.network.reader.max.chunk.size.bytes=2MB
alluxio.worker.network.shutdown.timeout=15sec
alluxio.worker.network.writer.buffer.size.messages=8
alluxio.worker.network.zerocopy.enabled=true
alluxio.worker.page.store.async.restore.enabled=true
alluxio.worker.page.store.async.write.enabled=false
alluxio.worker.page.store.async.write.threads=16
alluxio.worker.page.store.dirs=/tmp/alluxio_cache
alluxio.worker.page.store.eviction.retries=10
alluxio.worker.page.store.evictor.class=alluxio.client.file.cache.evictor.LRUCacheEvictor
alluxio.worker.page.store.evictor.lfu.logbase=2.0
alluxio.worker.page.store.evictor.nondeterministic.enabled=false
alluxio.worker.page.store.local.store.file.buckets=1000
alluxio.worker.page.store.overhead=0.1
alluxio.worker.page.store.page.size=1MB
alluxio.worker.page.store.quota.enabled=false
alluxio.worker.page.store.sizes=512MB
alluxio.worker.page.store.timeout.duration=-1
alluxio.worker.page.store.timeout.threads=32
alluxio.worker.page.store.type=LOCAL
alluxio.worker.principal=
alluxio.worker.ramdisk.size=360332323498
alluxio.worker.register.lease.enabled=${alluxio.master.worker.register.lease.enabled}
alluxio.worker.register.lease.retry.max.duration=${alluxio.worker.master.connect.retry.timeout}
alluxio.worker.register.lease.retry.sleep.max=10sec
alluxio.worker.register.lease.retry.sleep.min=1sec
alluxio.worker.register.stream.batch.size=1000000
alluxio.worker.register.stream.complete.timeout=5min
alluxio.worker.register.stream.deadline=15min
alluxio.worker.register.stream.enabled=true
alluxio.worker.register.stream.response.timeout=${alluxio.master.worker.register.stream.response.timeout}
alluxio.worker.remote.io.slow.threshold=10s
alluxio.worker.reviewer.class=alluxio.worker.block.reviewer.ProbabilisticBufferReviewer
alluxio.worker.reviewer.probabilistic.hardlimit.bytes=64MB
alluxio.worker.reviewer.probabilistic.softlimit.bytes=256MB
alluxio.worker.rpc.executor.core.pool.size=100
alluxio.worker.rpc.executor.fjp.async=true
alluxio.worker.rpc.executor.fjp.min.runnable=1
alluxio.worker.rpc.executor.fjp.parallelism=8
alluxio.worker.rpc.executor.keepalive=60sec
alluxio.worker.rpc.executor.max.pool.size=1000
alluxio.worker.rpc.executor.tpe.allow.core.threads.timeout=true
alluxio.worker.rpc.executor.tpe.queue.type=LINKED_BLOCKING_QUEUE_WITH_CAP
alluxio.worker.rpc.executor.type=TPE
alluxio.worker.rpc.port=22668
alluxio.worker.session.timeout=1min
alluxio.worker.startup.timeout=10min
alluxio.worker.storage.checker.enabled=true
alluxio.worker.tieredstore.block.lock.readers=1000
alluxio.worker.tieredstore.block.locks=1000
alluxio.worker.tieredstore.free.ahead.bytes=0
alluxio.worker.tieredstore.level0.alias=SSD
alluxio.worker.tieredstore.level0.dirs.mediumtype=SSD
alluxio.worker.tieredstore.level0.dirs.path=/mnt/data/fluid-alluxio/fluid-test/alluxio-s3
alluxio.worker.tieredstore.level0.dirs.quota=100GB
alluxio.worker.tieredstore.level0.watermark.high.ratio=0.95
alluxio.worker.tieredstore.level0.watermark.low.ratio=0.7
alluxio.worker.tieredstore.level1.alias=
alluxio.worker.tieredstore.level1.dirs.mediumtype=
alluxio.worker.tieredstore.level1.dirs.path=
alluxio.worker.tieredstore.level1.dirs.quota=
alluxio.worker.tieredstore.level1.watermark.high.ratio=0.95
alluxio.worker.tieredstore.level1.watermark.low.ratio=0.7
alluxio.worker.tieredstore.level2.alias=
alluxio.worker.tieredstore.level2.dirs.mediumtype=
alluxio.worker.tieredstore.level2.dirs.path=
alluxio.worker.tieredstore.level2.dirs.quota=
alluxio.worker.tieredstore.level2.watermark.high.ratio=0.95
alluxio.worker.tieredstore.level2.watermark.low.ratio=0.7
alluxio.worker.tieredstore.levels=1
alluxio.worker.ufs.block.open.timeout=5min
alluxio.worker.ufs.instream.cache.enabled=true
alluxio.worker.ufs.instream.cache.expiration.time=5min
alluxio.worker.ufs.instream.cache.max.size=5000
alluxio.worker.web.bind.host=0.0.0.0
alluxio.worker.web.hostname=
alluxio.worker.web.port=22314
alluxio.worker.whitelist=/

Describe the solution you'd like: Can someone explain the reason to me? The Alluxio documentation says that short-circuit read and write performance is the highest.


jja725 commented 1 year ago

Hi, short-circuit is really not supported with k8s. Also, short-circuit read/write is deprecated functionality, so we would not recommend using it any more.

wangw-david commented 1 year ago

@jja725 Thanks for your answer. I would like to know what conditions are needed for short-circuit reads and writes. I see that the worker and the fuse mount the same cache directory, both use hostNetwork, and the fuse container runs in privileged mode. According to the monitoring data, reads and writes do happen locally.
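(For context: as far as I understand, local short-circuit in containerized deployments is normally wired up through a shared domain socket rather than raw host paths; a sketch of the relevant properties, where the socket path is only an illustrative value:)

alluxio.user.short.circuit.enabled=true
alluxio.worker.data.server.domain.socket.address=/opt/domain
alluxio.worker.data.server.domain.socket.as.uuid=true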

As for short-circuit read/write being deprecated, do you know from which version that starts? I see that the latest 2.9.3 documentation does not mention this.

wangw-david commented 1 year ago

@Kai-Zhang Hello Mr. Zhang, I have added the worker configuration above. When deploying Alluxio with Fluid, these are basically the default configurations.