Alluxio / alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud
https://www.alluxio.io
Apache License 2.0

High CPU usage on Master node if the cache had been filled up to 40GB #11244

Closed phuong-leeo closed 1 year ago

phuong-leeo commented 4 years ago

Alluxio Version: V2.2.0

AWS EMR version: 5.25.0

Presto version: 0.220

Cluster specs: 1 Master (c5.4xlarge), 1 Core (r5a.xlarge), 10 Task (r5a.xlarge)

Describe the bug: After Alluxio had cached approximately 40GB of data, the CPU load on the Master node became high. We thought it was caused by storing the metastore on ROCKS, so we switched to HEAP, but that did not solve the issue. We did not face this kind of situation on Alluxio v1.8.

Error log

2020-04-04 14:15:11,027 WARN  MasterJournalContext - Journal flush failed. retrying...
java.io.IOException: Timed out after waiting 30000 milliseconds for journal entries to be processed
        at alluxio.master.journal.raft.RaftJournalWriter.flush(RaftJournalWriter.java:102)
        at alluxio.master.journal.AsyncJournalWriter.doFlush(AsyncJournalWriter.java:295)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
        at io.atomix.catalyst.concurrent.BlockingFuture.get(BlockingFuture.java:54)
        at alluxio.master.journal.raft.RaftJournalWriter.flush(RaftJournalWriter.java:93)
        ... 2 more
2020-04-04 14:15:11,029 ERROR MasterJournalContext - Fatal error: Journal flush failed after 9 attempts
2020-04-04 14:15:11,029 ERROR MasterJournalContext - Fatal error: Journal flush failed after 7 attempts
2020-04-04 14:15:11,029 ERROR MasterJournalContext - Fatal error: Journal flush failed after 7 attempts
2020-04-04 14:15:11,029 WARN  MasterJournalContext - Journal flush failed. retrying...
java.io.IOException: Timed out after waiting 30000 milliseconds for journal entries to be processed
        at alluxio.master.journal.raft.RaftJournalWriter.flush(RaftJournalWriter.java:102)
        at alluxio.master.journal.AsyncJournalWriter.doFlush(AsyncJournalWriter.java:295)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
        at io.atomix.catalyst.concurrent.BlockingFuture.get(BlockingFuture.java:54)
        at alluxio.master.journal.raft.RaftJournalWriter.flush(RaftJournalWriter.java:93)
        ... 2 more
2020-04-04 14:15:11,031 WARN  SleepingTimer - Master Lost Files Detection last execution took 19747 ms. Longer than the interval 10000
2020-04-04 14:15:11,031 WARN  SleepingTimer - Master Lost Worker Detection last execution took 19747 ms. Longer than the interval 10000
2020-04-04 14:15:14,813 ERROR MasterJournalContext - Fatal error: Journal flush failed after 8 attempts
2020-04-04 14:15:51,541 WARN  ServletHandler - Error for /metrics/prometheus/
java.lang.OutOfMemoryError: GC overhead limit exceeded
2020-04-04 14:15:59,798 INFO  BackupLeaderRole - Closing backup-leader role.
2020-04-04 14:16:17,802 WARN  MasterJournalContext - Journal flush failed. retrying...
java.io.IOException: Timed out after waiting 30000 milliseconds for journal entries to be processed
        at alluxio.master.journal.raft.RaftJournalWriter.flush(RaftJournalWriter.java:102)
        at alluxio.master.journal.AsyncJournalWriter.doFlush(AsyncJournalWriter.java:295)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
        at io.atomix.catalyst.concurrent.BlockingFuture.get(BlockingFuture.java:54)
        at alluxio.master.journal.raft.RaftJournalWriter.flush(RaftJournalWriter.java:93)
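
For anyone triaging a master log like the one above, a quick tally of messages per level and logger can help show which component is failing most often and whether an `OutOfMemoryError` appears alongside the journal flush failures. The following is only a minimal sketch: the regular expression assumes the `timestamp LEVEL  Logger - message` layout seen in this excerpt, and the `summarize` helper is a hypothetical name, not part of Alluxio.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpt above:
# 2020-04-04 14:15:11,027 WARN  MasterJournalContext - Journal flush failed. retrying...
LOG_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+(INFO|WARN|ERROR)\s+(\S+) - (.*)$"
)


def summarize(lines):
    """Count (level, logger) pairs and collect OutOfMemoryError lines."""
    counts = Counter()
    oom = []
    for line in lines:
        line = line.rstrip("\n")
        m = LOG_LINE.match(line)
        if m:
            _, level, logger, _ = m.groups()
            counts[(level, logger)] += 1
        elif "java.lang.OutOfMemoryError" in line:
            # Stack-trace style lines carry no timestamp, so check them separately.
            oom.append(line.strip())
    return counts, oom


# A few lines copied from the log in this issue.
sample = [
    "2020-04-04 14:15:11,027 WARN  MasterJournalContext - Journal flush failed. retrying...",
    "java.io.IOException: Timed out after waiting 30000 milliseconds for journal entries to be processed",
    "2020-04-04 14:15:11,029 ERROR MasterJournalContext - Fatal error: Journal flush failed after 9 attempts",
    "2020-04-04 14:15:51,541 WARN  ServletHandler - Error for /metrics/prometheus/",
    "java.lang.OutOfMemoryError: GC overhead limit exceeded",
]

counts, oom = summarize(sample)
for (level, logger), n in counts.most_common():
    print(level, logger, n)
print(oom)
```

To run it against a real file, pass an open file handle instead of `sample`, for example `summarize(open("master.log"))`, where `master.log` stands in for the actual log path.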

alluxio-site.properties

alluxio.conf.dir=/opt/alluxio-2.2.0/conf
alluxio.conf.validation.enabled=false
alluxio.debug=false
alluxio.extensions.dir=/opt/alluxio-2.2.0/extensions
alluxio.fuse.cached.paths.max=500
alluxio.fuse.debug.enabled=false
alluxio.fuse.fs.name=alluxio-fuse
alluxio.fuse.maxwrite.bytes=128KB
alluxio.fuse.user.group.translation.enabled=false
alluxio.home=/opt/alluxio-2.2.0
alluxio.integration.master.resource.cpu=1
alluxio.integration.master.resource.mem=1024MB
alluxio.integration.mesos.alluxio.jar.url=http://downloads.alluxio.io/downloads/files/2.2.0/alluxio-2.2.0-bin.tar.gz
alluxio.integration.mesos.jdk.path=jdk1.8.0_151
alluxio.integration.mesos.jdk.url=LOCAL
alluxio.integration.mesos.master.name=AlluxioMaster
alluxio.integration.mesos.master.node.count=1
alluxio.integration.mesos.principal=alluxio
alluxio.integration.mesos.role=*
alluxio.integration.mesos.secret=
alluxio.integration.mesos.user=
alluxio.integration.mesos.worker.name=AlluxioWorker
alluxio.integration.worker.resource.cpu=1
alluxio.integration.worker.resource.mem=1024MB
alluxio.integration.yarn.workers.per.host.max=1
alluxio.job.master.bind.host=0.0.0.0
alluxio.job.master.client.threads=1024
alluxio.job.master.embedded.journal.addresses=
alluxio.job.master.embedded.journal.port=20003
alluxio.job.master.finished.job.purge.count=-1
alluxio.job.master.finished.job.retention.time=300sec
alluxio.job.master.hostname=10.53.0.142
alluxio.job.master.job.capacity=100000
alluxio.job.master.lost.worker.interval=1sec
alluxio.job.master.rpc.addresses=
alluxio.job.master.rpc.port=20001
alluxio.job.master.web.bind.host=0.0.0.0
alluxio.job.master.web.hostname=10.53.0.142
alluxio.job.master.web.port=20002
alluxio.job.master.worker.heartbeat.interval=1sec
alluxio.job.master.worker.timeout=60sec
alluxio.job.worker.bind.host=0.0.0.0
alluxio.job.worker.data.port=30002
alluxio.job.worker.hostname=
alluxio.job.worker.rpc.port=30001
alluxio.job.worker.threadpool.size=10
alluxio.job.worker.throttling=false
alluxio.job.worker.web.bind.host=0.0.0.0
alluxio.job.worker.web.port=30003
alluxio.jvm.monitor.info.threshold=1sec
alluxio.jvm.monitor.sleep.interval=1sec
alluxio.jvm.monitor.warn.threshold=10sec
alluxio.locality.compare.node.ip=false
alluxio.locality.node=
alluxio.locality.order=node,rack
alluxio.locality.rack=
alluxio.locality.script=alluxio-locality.sh
alluxio.logger.type=USER_LOGGER
alluxio.logs.dir=/opt/alluxio-2.2.0/logs
alluxio.logserver.hostname=
alluxio.logserver.logs.dir=/opt/alluxio-2.2.0/logs
alluxio.logserver.port=45600
alluxio.logserver.threads.max=2048
alluxio.logserver.threads.min=512
alluxio.master.audit.logging.enabled=false
alluxio.master.audit.logging.queue.capacity=10000
alluxio.master.backup.abandon.timeout=2min
alluxio.master.backup.connect.interval.max=10sec
alluxio.master.backup.connect.interval.min=1sec
alluxio.master.backup.delegation.enabled=false
alluxio.master.backup.directory=/alluxio_backups
alluxio.master.backup.entry.buffer.count=10000
alluxio.master.backup.heartbeat.interval=1sec
alluxio.master.backup.transport.timeout=5sec
alluxio.master.bind.host=0.0.0.0
alluxio.master.cluster.metrics.update.interval=1min
alluxio.master.daily.backup.enabled=false
alluxio.master.daily.backup.files.retained=3
alluxio.master.daily.backup.time=05:00
alluxio.master.embedded.journal.addresses=
alluxio.master.embedded.journal.appender.batch.size=512KB
alluxio.master.embedded.journal.bind.host=
alluxio.master.embedded.journal.election.timeout=10s
alluxio.master.embedded.journal.heartbeat.interval=3s
alluxio.master.embedded.journal.port=19200
alluxio.master.embedded.journal.shutdown.timeout=10sec
alluxio.master.embedded.journal.storage.level=DISK
alluxio.master.embedded.journal.transport.max.inbound.message.size=100MB
alluxio.master.embedded.journal.transport.request.timeout.ms=5sec
alluxio.master.embedded.journal.triggered.snapshot.wait.timeout=2hour
alluxio.master.embedded.journal.write.timeout=30sec
alluxio.master.file.access.time.journal.flush.interval=1h
alluxio.master.file.access.time.update.precision=1d
alluxio.master.file.access.time.updater.shutdown.timeout=1sec
alluxio.master.filesystem.liststatus.result.message.length=10000
alluxio.master.format.file.prefix=_format_
alluxio.master.heartbeat.timeout=10min
alluxio.master.hostname=10.53.0.142
alluxio.master.journal.checkpoint.period.entries=2000000
alluxio.master.journal.flush.batch.time=5ms
alluxio.master.journal.flush.timeout=5min
alluxio.master.journal.folder=/opt/alluxio-2.2.0/journal
alluxio.master.journal.gc.period=2min
alluxio.master.journal.gc.threshold=5min
alluxio.master.journal.init.from.backup=
alluxio.master.journal.log.size.bytes.max=10MB
alluxio.master.journal.retry.interval=1sec
alluxio.master.journal.tailer.shutdown.quiet.wait.time=5sec
alluxio.master.journal.tailer.sleep.time=1sec
alluxio.master.journal.temporary.file.gc.threshold=30min
alluxio.master.journal.tolerate.corruption=false
alluxio.master.journal.type=EMBEDDED
alluxio.master.journal.ufs.option=
alluxio.master.jvm.monitor.enabled=false
alluxio.master.keytab.file=
alluxio.master.lock.pool.concurrency.level=100
alluxio.master.lock.pool.high.watermark=1000000
alluxio.master.lock.pool.initsize=1000
alluxio.master.lock.pool.low.watermark=500000
alluxio.master.log.config.report.heartbeat.interval=1h
alluxio.master.lost.worker.file.detection.interval=10sec
alluxio.master.metastore=HEAP
alluxio.master.metastore.dir=/opt/alluxio-2.2.0/metastore
alluxio.master.metastore.inode.cache.evict.batch.size=1000
alluxio.master.metastore.inode.cache.high.water.mark.ratio=0.85
alluxio.master.metastore.inode.cache.low.water.mark.ratio=0.8
alluxio.master.metastore.inode.cache.max.size=10000000
alluxio.master.metastore.inode.enumerator.buffer.count=10000
alluxio.master.metastore.inode.inherit.owner.and.group=true
alluxio.master.metastore.inode.iteration.crawler.count=8
alluxio.master.metastore.iterator.readahead.size=64MB
alluxio.master.metrics.service.threads=5
alluxio.master.metrics.time.series.interval=5min
alluxio.master.mount.table.root.alluxio=/
alluxio.master.mount.table.root.option=
alluxio.master.mount.table.root.readonly=false
alluxio.master.mount.table.root.shared=true
alluxio.master.mount.table.root.ufs=/opt/alluxio-2.2.0/underFSStorage
alluxio.master.periodic.block.integrity.check.interval=1hr
alluxio.master.periodic.block.integrity.check.repair=false
alluxio.master.persistence.blacklist=
alluxio.master.persistence.checker.interval=1s
alluxio.master.persistence.initial.interval=1s
alluxio.master.persistence.max.interval=1hr
alluxio.master.persistence.max.total.wait.time=1day
alluxio.master.persistence.scheduler.interval=1s
alluxio.master.principal=
alluxio.master.replication.check.interval=1min
alluxio.master.rpc.addresses=
alluxio.master.rpc.executor.core.pool.size=0
alluxio.master.rpc.executor.keepalive=60sec
alluxio.master.rpc.executor.max.pool.size=500
alluxio.master.rpc.executor.min.runnable=1
alluxio.master.rpc.executor.parallelism=16
alluxio.master.rpc.port=19998
alluxio.master.security.impersonation.client.groups=*
alluxio.master.security.impersonation.client.users=*
alluxio.master.serving.thread.timeout=5m
alluxio.master.skip.root.acl.check=false
alluxio.master.standby.heartbeat.interval=2min
alluxio.master.startup.block.integrity.check.enabled=true
alluxio.master.tieredstore.global.level0.alias=MEM
alluxio.master.tieredstore.global.level1.alias=SSD
alluxio.master.tieredstore.global.level2.alias=HDD
alluxio.master.tieredstore.global.levels=3
alluxio.master.tieredstore.global.mediumtype=MEM, SSD, HDD
alluxio.master.ttl.checker.interval=1hour
alluxio.master.ufs.active.sync.event.rate.interval=60sec
alluxio.master.ufs.active.sync.initial.sync.enabled=true
alluxio.master.ufs.active.sync.interval=30sec
alluxio.master.ufs.active.sync.max.activities=10
alluxio.master.ufs.active.sync.max.age=10
alluxio.master.ufs.active.sync.poll.timeout=10sec
alluxio.master.ufs.active.sync.retry.timeout=10sec
alluxio.master.ufs.active.sync.thread.pool.size=3
alluxio.master.ufs.block.location.cache.capacity=1000000
alluxio.master.ufs.managed.blocking.enabled=
alluxio.master.ufs.path.cache.capacity=100000
alluxio.master.ufs.path.cache.threads=64
alluxio.master.unsafe.direct.persist.object.enabled=true
alluxio.master.update.check.enabled=true
alluxio.master.update.check.interval=7day
alluxio.master.web.bind.host=0.0.0.0
alluxio.master.web.hostname=
alluxio.master.web.port=19999
alluxio.master.whitelist=/
alluxio.master.worker.connect.wait.time=5sec
alluxio.master.worker.info.cache.refresh.time=10sec
alluxio.master.worker.timeout=5min
alluxio.metrics.conf.file=/opt/alluxio-2.2.0/conf/metrics.properties
alluxio.metrics.context.shutdown.timeout=1sec
alluxio.network.connection.auth.timeout=30sec
alluxio.network.connection.health.check.timeout=5sec
alluxio.network.connection.server.shutdown.timeout=60sec
alluxio.network.connection.shutdown.timeout=60sec
alluxio.network.host.resolution.timeout=5sec
alluxio.proxy.s3.deletetype=ALLUXIO_AND_UFS
alluxio.proxy.s3.multipart.temporary.dir.suffix=_s3_multipart_tmp
alluxio.proxy.s3.writetype=CACHE_THROUGH
alluxio.proxy.stream.cache.timeout=1hour
alluxio.proxy.web.bind.host=0.0.0.0
alluxio.proxy.web.hostname=
alluxio.proxy.web.port=39999
alluxio.secondary.master.metastore.dir=/opt/alluxio-2.2.0/secondary-metastore
alluxio.security.authentication.custom.provider.class=
alluxio.security.authentication.type=SIMPLE
alluxio.security.authorization.permission.enabled=true
alluxio.security.authorization.permission.supergroup=supergroup
alluxio.security.authorization.permission.umask=022
alluxio.security.group.mapping.cache.timeout=1min
alluxio.security.group.mapping.class=alluxio.security.group.provider.ShellBasedUnixGroupsMapping
alluxio.security.login.impersonation.username=_NONE_
alluxio.security.login.username=
alluxio.security.stale.channel.purge.interval=3day
alluxio.site.conf.dir=/opt/alluxio-2.2.0/conf/,/home/hadoop/.alluxio/,/etc/alluxio/
alluxio.table.catalog.path=/catalog
alluxio.table.catalog.udb.sync.timeout=1h
alluxio.table.enabled=true
alluxio.table.transform.manager.job.history.retention.time=300sec
alluxio.table.transform.manager.job.monitor.interval=10000
alluxio.test.deprecated.key=
alluxio.test.mode=false
alluxio.tmp.dirs=/tmp
alluxio.underfs.allow.set.owner.failure=false
alluxio.underfs.cleanup.enabled=false
alluxio.underfs.cleanup.interval=1day
alluxio.underfs.eventual.consistency.retry.base.sleep=50ms
alluxio.underfs.eventual.consistency.retry.max.num=20
alluxio.underfs.eventual.consistency.retry.max.sleep=30sec
alluxio.underfs.gcs.default.mode=0700
alluxio.underfs.gcs.directory.suffix=/
alluxio.underfs.gcs.owner.id.to.username.mapping=
alluxio.underfs.hdfs.configuration=/opt/alluxio-2.2.0/conf/core-site.xml:/opt/alluxio-2.2.0/conf/hdfs-site.xml
alluxio.underfs.hdfs.impl=org.apache.hadoop.hdfs.DistributedFileSystem
alluxio.underfs.hdfs.prefixes=hdfs://,glusterfs:///
alluxio.underfs.hdfs.remote=false
alluxio.underfs.kodo.connect.timeout=50sec
alluxio.underfs.kodo.downloadhost=
alluxio.underfs.kodo.endpoint=
alluxio.underfs.kodo.requests.max=64
alluxio.underfs.listing.length=1000
alluxio.underfs.object.store.breadcrumbs.enabled=true
alluxio.underfs.object.store.mount.shared.publicly=true
alluxio.underfs.object.store.multi.range.chunk.size=128MB
alluxio.underfs.object.store.service.threads=20
alluxio.underfs.oss.connection.max=1024
alluxio.underfs.oss.connection.timeout=50sec
alluxio.underfs.oss.connection.ttl=-1
alluxio.underfs.oss.socket.timeout=50sec
alluxio.underfs.s3.admin.threads.max=20
alluxio.underfs.s3.bulk.delete.enabled=true
alluxio.underfs.s3.default.mode=0777
alluxio.underfs.s3.directory.suffix=/
alluxio.underfs.s3.disable.dns.buckets=false
alluxio.underfs.s3.endpoint=s3.eu-central-1.amazonaws.com
alluxio.underfs.s3.inherit.acl=false
alluxio.underfs.s3.intermediate.upload.clean.age=3day
alluxio.underfs.s3.list.objects.v1=false
alluxio.underfs.s3.owner.id.to.username.mapping=
alluxio.underfs.s3.proxy.host=
alluxio.underfs.s3.proxy.port=
alluxio.underfs.s3.request.timeout=1min
alluxio.underfs.s3.secure.http.enabled=false
alluxio.underfs.s3.server.side.encryption.enabled=false
alluxio.underfs.s3.signer.algorithm=
alluxio.underfs.s3.socket.timeout=50sec
alluxio.underfs.s3.streaming.upload.enabled=false
alluxio.underfs.s3.streaming.upload.partition.size=64MB
alluxio.underfs.s3.threads.max=40
alluxio.underfs.s3.upload.threads.max=20
alluxio.underfs.version=2.7
alluxio.underfs.web.connnection.timeout=60s
alluxio.underfs.web.header.last.modified=EEE, dd MMM yyyy HH:mm:ss zzz
alluxio.underfs.web.parent.names=Parent Directory,..,../
alluxio.underfs.web.titles=Index of,Directory listing for
alluxio.user.app.id=
alluxio.user.block.avoid.eviction.policy.reserved.size.bytes=0MB
alluxio.user.block.master.client.pool.gc.interval=120sec
alluxio.user.block.master.client.pool.gc.threshold=120sec
alluxio.user.block.master.client.pool.size.max=10
alluxio.user.block.master.client.pool.size.min=0
alluxio.user.block.remote.read.buffer.size.bytes=8MB
alluxio.user.block.size.bytes.default=128MB
alluxio.user.block.worker.client.pool.gc.threshold=300sec
alluxio.user.block.worker.client.pool.size=1024
alluxio.user.block.worker.client.read.retry=5
alluxio.user.block.write.location.policy.class=alluxio.client.block.policy.LocalFirstPolicy
alluxio.user.client.cache.dir=/tmp/alluxio_cache
alluxio.user.client.cache.enabled=false
alluxio.user.client.cache.evictor.class=alluxio.client.file.cache.evictor.LRUCacheEvictor
alluxio.user.client.cache.local.store.file.buckets=1000
alluxio.user.client.cache.page.size=1MB
alluxio.user.client.cache.size=512MB
alluxio.user.client.cache.store.type=LOCAL
alluxio.user.conf.cluster.default.enabled=true
alluxio.user.conf.sync.interval=1min
alluxio.user.date.format.pattern=MM-dd-yyyy HH:mm:ss:SSS
alluxio.user.file.buffer.bytes=8MB
alluxio.user.file.copyfromlocal.block.location.policy.class=alluxio.client.block.policy.RoundRobinPolicy
alluxio.user.file.create.ttl=-1
alluxio.user.file.create.ttl.action=DELETE
alluxio.user.file.delete.unchecked=false
alluxio.user.file.master.client.pool.gc.interval=120sec
alluxio.user.file.master.client.pool.gc.threshold=10
alluxio.user.file.master.client.pool.size.max=10
alluxio.user.file.master.client.pool.size.min=0
alluxio.user.file.metadata.load.type=ONCE
alluxio.user.file.metadata.sync.interval=-1
alluxio.user.file.passive.cache.enabled=false
alluxio.user.file.persist.on.rename=false
alluxio.user.file.persistence.initial.wait.time=0
alluxio.user.file.readtype.default=CACHE_PROMOTE
alluxio.user.file.replication.durable=1
alluxio.user.file.replication.max=-1
alluxio.user.file.replication.min=0
alluxio.user.file.sequential.pread.threshold=2MB
alluxio.user.file.ufs.tier.enabled=false
alluxio.user.file.waitcompleted.poll=1sec
alluxio.user.file.write.tier.default=0
alluxio.user.file.writetype.default=MUST_CACHE
alluxio.user.hostname=
alluxio.user.local.reader.chunk.size.bytes=8MB
alluxio.user.local.writer.chunk.size.bytes=64KB
alluxio.user.logs.dir=/opt/alluxio-2.2.0/logs/user
alluxio.user.metadata.cache.enabled=false
alluxio.user.metadata.cache.expiration.time=10min
alluxio.user.metadata.cache.max.size=100000
alluxio.user.metrics.collection.enabled=false
alluxio.user.metrics.heartbeat.interval=10sec
alluxio.user.network.data.timeout=30sec
alluxio.user.network.flowcontrol.window=2MB
alluxio.user.network.keepalive.time=9223372036854775807
alluxio.user.network.keepalive.timeout=30sec
alluxio.user.network.max.inbound.message.size=100MB
alluxio.user.network.netty.channel=EPOLL
alluxio.user.network.netty.worker.threads=0
alluxio.user.network.reader.buffer.size.messages=16
alluxio.user.network.reader.chunk.size.bytes=1MB
alluxio.user.network.writer.buffer.size.messages=16
alluxio.user.network.writer.chunk.size.bytes=1MB
alluxio.user.network.writer.close.timeout=30min
alluxio.user.network.writer.flush.timeout=30min
alluxio.user.network.zerocopy.enabled=true
alluxio.user.rpc.retry.base.sleep=50ms
alluxio.user.rpc.retry.max.duration=2min
alluxio.user.rpc.retry.max.sleep=3sec
alluxio.user.short.circuit.enabled=true
alluxio.user.short.circuit.preferred=false
alluxio.user.ufs.block.location.all.fallback.enabled=true
alluxio.user.ufs.block.read.concurrency.max=2147483647
alluxio.user.ufs.block.read.location.policy=alluxio.client.block.policy.LocalFirstPolicy
alluxio.user.ufs.block.read.location.policy.deterministic.hash.shards=1
alluxio.user.worker.list.refresh.interval=2min
alluxio.version=2.2.0
alluxio.web.cors.enabled=false
alluxio.web.file.info.enabled=true
alluxio.web.refresh.interval=15s
alluxio.web.resources=/opt/alluxio-2.2.0/webui/
alluxio.web.threads=1
alluxio.work.dir=/opt/alluxio-2.2.0
alluxio.worker.allocator.class=alluxio.worker.block.allocator.MaxFreeAllocator
alluxio.worker.bind.host=0.0.0.0
alluxio.worker.block.heartbeat.interval=1sec
alluxio.worker.block.heartbeat.timeout=1hour
alluxio.worker.block.master.client.pool.size=11
alluxio.worker.data.folder=/alluxioworker/
alluxio.worker.data.folder.permissions=rwxrwxrwx
alluxio.worker.data.folder.tmp=.tmp_blocks
alluxio.worker.data.server.class=alluxio.worker.grpc.GrpcDataServer
alluxio.worker.data.server.domain.socket.address=
alluxio.worker.data.server.domain.socket.as.uuid=false
alluxio.worker.data.tmp.subdir.max=1024
alluxio.worker.evictor.class=alluxio.worker.block.evictor.LRUEvictor
alluxio.worker.evictor.lrfu.attenuation.factor=2.0
alluxio.worker.evictor.lrfu.step.factor=0.25
alluxio.worker.file.buffer.size=1MB
alluxio.worker.free.space.timeout=10sec
alluxio.worker.hostname=
alluxio.worker.jvm.monitor.enabled=false
alluxio.worker.keytab.file=
alluxio.worker.master.connect.retry.timeout=1hour
alluxio.worker.memory.size=1G
alluxio.worker.network.async.cache.manager.threads.max=8
alluxio.worker.network.block.reader.threads.max=2048
alluxio.worker.network.block.writer.threads.max=1024
alluxio.worker.network.flowcontrol.window=2MB
alluxio.worker.network.keepalive.time=30sec
alluxio.worker.network.keepalive.timeout=30sec
alluxio.worker.network.max.inbound.message.size=1GB
alluxio.worker.network.netty.boss.threads=1
alluxio.worker.network.netty.channel=EPOLL
alluxio.worker.network.netty.shutdown.quiet.period=2sec
alluxio.worker.network.netty.watermark.high=32KB
alluxio.worker.network.netty.watermark.low=8KB
alluxio.worker.network.netty.worker.threads=0
alluxio.worker.network.reader.buffer.size=4MB
alluxio.worker.network.reader.max.chunk.size.bytes=2MB
alluxio.worker.network.shutdown.timeout=15sec
alluxio.worker.network.writer.buffer.size.messages=8
alluxio.worker.network.zerocopy.enabled=true
alluxio.worker.principal=
alluxio.worker.rpc.port=29999
alluxio.worker.session.timeout=1min
alluxio.worker.storage.checker.enabled=true
alluxio.worker.tieredstore.block.lock.readers=1000
alluxio.worker.tieredstore.block.locks=1000
alluxio.worker.tieredstore.level0.alias=MEM
alluxio.worker.tieredstore.level0.dirs.mediumtype=MEM,SSD
alluxio.worker.tieredstore.level0.dirs.path=/mnt/ramdisk,/mnt/ssd1
alluxio.worker.tieredstore.level0.dirs.quota=1G,20G
alluxio.worker.tieredstore.level0.watermark.high.ratio=0.95
alluxio.worker.tieredstore.level0.watermark.low.ratio=0.7
alluxio.worker.tieredstore.level1.alias=
alluxio.worker.tieredstore.level1.dirs.mediumtype=
alluxio.worker.tieredstore.level1.dirs.path=
alluxio.worker.tieredstore.level1.dirs.quota=
alluxio.worker.tieredstore.level1.watermark.high.ratio=0.95
alluxio.worker.tieredstore.level1.watermark.low.ratio=0.7
alluxio.worker.tieredstore.level2.alias=
alluxio.worker.tieredstore.level2.dirs.mediumtype=
alluxio.worker.tieredstore.level2.dirs.path=
alluxio.worker.tieredstore.level2.dirs.quota=
alluxio.worker.tieredstore.level2.watermark.high.ratio=0.95
alluxio.worker.tieredstore.level2.watermark.low.ratio=0.7
alluxio.worker.tieredstore.levels=1
alluxio.worker.tieredstore.reserver.interval=1sec
alluxio.worker.ufs.block.open.timeout=5min
alluxio.worker.ufs.instream.cache.enabled=true
alluxio.worker.ufs.instream.cache.expiration.time=5min
alluxio.worker.ufs.instream.cache.max.size=5000
alluxio.worker.web.bind.host=0.0.0.0
alluxio.worker.web.hostname=
alluxio.worker.web.port=30000
alluxio.zookeeper.address=
alluxio.zookeeper.auth.enabled=true
alluxio.zookeeper.connection.timeout=15s
alluxio.zookeeper.election.path=/alluxio/election
alluxio.zookeeper.enabled=false
alluxio.zookeeper.job.election.path=/job_election
alluxio.zookeeper.job.leader.path=/job_leader
alluxio.zookeeper.leader.connection.error.policy=SESSION
alluxio.zookeeper.leader.inquiry.retry=10
alluxio.zookeeper.leader.path=/alluxio/leader
alluxio.zookeeper.session.timeout=60s
aws.accessKeyId=
aws.secretKey=
fs.cos.access.key=
fs.cos.app.id=
fs.cos.connection.max=1024
fs.cos.connection.timeout=50sec
fs.cos.region=
fs.cos.secret.key=
fs.cos.socket.timeout=50sec
fs.gcs.accessKeyId=
fs.gcs.secretAccessKey=
fs.kodo.accesskey=
fs.kodo.secretkey=
fs.oss.accessKeyId=
fs.oss.accessKeySecret=
fs.oss.endpoint=
fs.swift.auth.method=
fs.swift.auth.url=
fs.swift.password=
fs.swift.region=
fs.swift.simulation=
fs.swift.tenant=
fs.swift.user=

[Screenshots: alluxio-capacity, load-ave]
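
Since the configuration dump above is long, it may be easier to read after filtering it down to the journal-related keys. The sketch below parses `key=value` lines and selects keys by prefix; `parse_properties` and `with_prefix` are hypothetical helper names. One observation, offered as an assumption rather than a confirmed diagnosis: the 30000 ms timeout in the log matches the `alluxio.master.embedded.journal.write.timeout=30sec` value in the dump.

```python
def parse_properties(text):
    """Parse simple key=value lines; blank lines and '#' comments are skipped."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '=' themselves.
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def with_prefix(props, prefix):
    """Return the entries whose key starts with the given prefix, sorted by key."""
    return {k: v for k, v in sorted(props.items()) if k.startswith(prefix)}


# A handful of lines copied from the dump in this issue.
sample = """\
alluxio.master.embedded.journal.write.timeout=30sec
alluxio.master.journal.flush.timeout=5min
alluxio.master.journal.type=EMBEDDED
alluxio.master.metastore=HEAP
alluxio.worker.memory.size=1G
"""

props = parse_properties(sample)
journal = with_prefix(props, "alluxio.master.journal.")
embedded = with_prefix(props, "alluxio.master.embedded.journal.")
print(journal)
print(embedded)
```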

To Reproduce: Fill the cache with at least 40GB of data.

Expected behavior: Reduced CPU load on the Master node, and an explanation of why the journal flush failures happened.

Urgency: high

ZacBlanco commented 4 years ago

@phuong-leeo, if you are still experiencing this issue, could you take a jstack of the master process while the CPU is high?

Even better, you could try to create a flamegraph.

A jstack is probably the easiest method right now.
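
Once a dump exists, one common way to use it is to match the busiest OS threads against the dump: `top -H -p <pid>` lists per-thread CPU with decimal thread IDs, while each thread header in `jstack <pid>` output carries the same ID in hex as `nid=0x...`. The sketch below does that lookup. It assumes the Java 8 header layout, and the sample dump and the "hot" thread ID are made up for illustration, not taken from this cluster.

```python
import re

# Java 8 jstack thread headers look like:
# "name" #12 daemon prio=5 os_prio=0 tid=0x... nid=0x1a2b runnable [0x...]
HEADER = re.compile(r'^"(?P<name>.*)".*\bnid=(?P<nid>0x[0-9a-fA-F]+)')


def threads_by_nid(jstack_text):
    """Map each native thread ID (as a decimal int) to its Java thread name."""
    result = {}
    for line in jstack_text.splitlines():
        m = HEADER.match(line)
        if m:
            result[int(m.group("nid"), 16)] = m.group("name")
    return result


# Hypothetical jstack excerpt, for illustration only.
sample_dump = '''\
"main" #1 prio=5 os_prio=0 tid=0x00007f0a1400a800 nid=0x1a2b runnable [0x00007f0a1c5f2000]
   java.lang.Thread.State: RUNNABLE
"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007f0a14020000 nid=0x1a2c runnable
'''

# Hypothetical decimal thread IDs as they would appear in `top -H -p <pid>`.
hot_tids = [6700]

names = threads_by_nid(sample_dump)
for tid in hot_tids:
    print(tid, hex(tid), names.get(tid, "<not found in dump>"))
```

Taking the `top -H` snapshot and the jstack at roughly the same moment matters, since thread activity can shift between the two captures.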

phuong-leeo commented 4 years ago

Thanks @ZacBlanco, I will give it a try.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.

jja725 commented 1 year ago

Will close it for now, feel free to reopen it and contact us if this is still valid.