Alluxio / alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud
https://www.alluxio.io
Apache License 2.0

When Spark begins writing to Alluxio, it cannot connect to the worker #13930

Closed lilyzhoupeijie closed 3 years ago

lilyzhoupeijie commented 3 years ago

Alluxio Version: 2.6.0

Describe the bug
We use Spark to write to Alluxio. After writing for a few minutes, the Spark tasks begin to fail. The Spark job is as follows:

from __future__ import print_function

import sys
from random import random
from operator import add

from pyspark.sql import SparkSession
import datetime

if __name__ == "__main__":

    pathSa3 = "alluxio://graytest1-master-0.kf-partition:20069/graytest1/2021-08-03/"
    spark = SparkSession\
        .builder\
        .appName("datasettest")\
        .config("spark.sql.shuffle.partitions", 9600) \
        .enableHiveSupport() \
        .getOrCreate()
    data = spark.sql("select * from table.tableName where ds in ('2021-08-03')")
    data.write.mode("overwrite").format("tfrecord").option("recordType", "Example").save(pathSa3)

The AlluxioRuntime (Fluid) config is as follows:

apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: graytest1
  namespace: kf-partition
spec:
  replicas: 1
  startReplicas: 2
  maxReplicas: 5
  tieredstore:
    levels:

The effective Alluxio configuration properties:

alluxio.conf.dir=/opt/alluxio-2.6.0/conf alluxio.conf.validation.enabled=false alluxio.debug=false alluxio.extensions.dir=/opt/alluxio-2.6.0/extensions alluxio.fuse.auth.policy.class=alluxio.fuse.auth.SystemUserGroupAuthPolicy alluxio.fuse.auth.policy.custom.group= alluxio.fuse.auth.policy.custom.user= alluxio.fuse.cached.paths.max=1000000 alluxio.fuse.debug.enabled=true alluxio.fuse.fs.name=alluxio-fuse alluxio.fuse.jnifuse.enabled=true alluxio.fuse.logging.threshold=1000ms alluxio.fuse.maxwrite.bytes=128KB alluxio.fuse.shared.caching.reader.enabled=true alluxio.fuse.umount.timeout=1min alluxio.fuse.user.group.translation.enabled=true alluxio.home=/opt/alluxio-2.6.0 alluxio.integration.master.resource.cpu=1 alluxio.integration.master.resource.mem=1024MB alluxio.integration.worker.resource.cpu=1 alluxio.integration.worker.resource.mem=1024MB alluxio.integration.yarn.workers.per.host.max=1 alluxio.job.master.bind.host=0.0.0.0 alluxio.job.master.client.threads=1024 alluxio.job.master.embedded.journal.addresses= alluxio.job.master.embedded.journal.port=20003 alluxio.job.master.finished.job.purge.count=-1 alluxio.job.master.finished.job.retention.time=30sec alluxio.job.master.hostname=172.29.255.225 alluxio.job.master.job.capacity=100000 alluxio.job.master.lost.worker.interval=1sec alluxio.job.master.rpc.addresses= alluxio.job.master.rpc.port=20073 alluxio.job.master.web.bind.host=0.0.0.0 alluxio.job.master.web.hostname=172.29.255.225 alluxio.job.master.web.port=20074 alluxio.job.master.worker.heartbeat.interval=1sec alluxio.job.master.worker.timeout=60sec alluxio.job.worker.bind.host=0.0.0.0 alluxio.job.worker.data.port=20077 alluxio.job.worker.hostname= alluxio.job.worker.rpc.port=20075 alluxio.job.worker.threadpool.size=164 alluxio.job.worker.throttling=false alluxio.job.worker.web.bind.host=0.0.0.0 alluxio.job.worker.web.port=20076 alluxio.jvm.monitor.info.threshold=1sec alluxio.jvm.monitor.sleep.interval=1sec alluxio.jvm.monitor.warn.threshold=10sec 
alluxio.locality.compare.node.ip=false alluxio.locality.node= alluxio.locality.order=node,rack alluxio.locality.rack= alluxio.locality.script=alluxio-locality.sh alluxio.logger.type=USER_LOGGER alluxio.logs.dir=/opt/alluxio-2.6.0/logs alluxio.logserver.hostname= alluxio.logserver.logs.dir=/opt/alluxio-2.6.0/logs alluxio.logserver.port=45600 alluxio.logserver.threads.max=2048 alluxio.logserver.threads.min=512 alluxio.master.async.persist.size.validation=true alluxio.master.audit.logging.enabled=true alluxio.master.audit.logging.queue.capacity=100000 alluxio.master.backup.abandon.timeout=1min alluxio.master.backup.connect.interval.max=30sec alluxio.master.backup.connect.interval.min=1sec alluxio.master.backup.delegation.enabled=false alluxio.master.backup.directory=/alluxio_backups alluxio.master.backup.entry.buffer.count=10000 alluxio.master.backup.heartbeat.interval=2sec alluxio.master.backup.state.lock.exclusive.duration=0ms alluxio.master.backup.state.lock.forced.duration=15min alluxio.master.backup.state.lock.interrupt.cycle.enabled=true alluxio.master.backup.state.lock.interrupt.cycle.interval=30sec alluxio.master.backup.suspend.timeout=1min alluxio.master.backup.transport.timeout=30sec alluxio.master.bind.host=0.0.0.0 alluxio.master.cluster.metrics.update.interval=1min alluxio.master.daily.backup.enabled=false alluxio.master.daily.backup.files.retained=3 alluxio.master.daily.backup.state.lock.grace.mode=FORCED alluxio.master.daily.backup.state.lock.sleep.duration=10m alluxio.master.daily.backup.state.lock.timeout=12h alluxio.master.daily.backup.state.lock.try.duration=30s alluxio.master.daily.backup.time=05:00 alluxio.master.embedded.journal.addresses= alluxio.master.embedded.journal.appender.batch.size=512KB alluxio.master.embedded.journal.bind.host= alluxio.master.embedded.journal.catchup.retry.wait=1s alluxio.master.embedded.journal.election.timeout=10s alluxio.master.embedded.journal.entry.size.max=10MB alluxio.master.embedded.journal.flush.size.max=160MB 
alluxio.master.embedded.journal.heartbeat.interval=3s alluxio.master.embedded.journal.port=19200 alluxio.master.embedded.journal.retry.cache.expiry.time=60s alluxio.master.embedded.journal.shutdown.timeout=10sec alluxio.master.embedded.journal.snapshot.replication.chunk.size=4MB alluxio.master.embedded.journal.storage.level=DISK alluxio.master.embedded.journal.transport.max.inbound.message.size=100MB alluxio.master.embedded.journal.transport.request.timeout.ms=5sec alluxio.master.embedded.journal.triggered.snapshot.wait.timeout=2hour alluxio.master.embedded.journal.write.local.first.enabled=true alluxio.master.embedded.journal.write.remote.enabled=false alluxio.master.embedded.journal.write.timeout=30sec alluxio.master.file.access.time.journal.flush.interval=1h alluxio.master.file.access.time.update.precision=1d alluxio.master.file.access.time.updater.shutdown.timeout=1sec alluxio.master.filesystem.liststatus.result.message.length=10000 alluxio.master.format.file.prefix=format alluxio.master.heartbeat.timeout=10min alluxio.master.hostname=172.29.255.225 alluxio.master.journal.catchup.protect.enabled=true alluxio.master.journal.checkpoint.period.entries=2000000 alluxio.master.journal.exit.on.demotion=false alluxio.master.journal.flush.batch.time=5ms alluxio.master.journal.flush.timeout=5min alluxio.master.journal.folder=/journal alluxio.master.journal.gc.period=2min alluxio.master.journal.gc.threshold=5min alluxio.master.journal.init.from.backup= alluxio.master.journal.log.size.bytes.max=500MB alluxio.master.journal.retry.interval=1sec alluxio.master.journal.space.monitor.interval=10min alluxio.master.journal.space.monitor.percent.free.threshold=10 alluxio.master.journal.tailer.shutdown.quiet.wait.time=5sec alluxio.master.journal.tailer.sleep.time=1sec alluxio.master.journal.temporary.file.gc.threshold=30min alluxio.master.journal.tolerate.corruption=false alluxio.master.journal.type=UFS alluxio.master.journal.ufs.option= alluxio.master.jvm.monitor.enabled=true 
alluxio.master.keytab.file= alluxio.master.lock.pool.concurrency.level=100 alluxio.master.lock.pool.high.watermark=1000000 alluxio.master.lock.pool.initsize=1000 alluxio.master.lock.pool.low.watermark=500000 alluxio.master.log.config.report.heartbeat.interval=1h alluxio.master.lost.worker.detection.interval=10sec alluxio.master.lost.worker.file.detection.interval=5min alluxio.master.metadata.sync.concurrency.level=128 alluxio.master.metadata.sync.executor.pool.size=128 alluxio.master.metadata.sync.report.failure=true alluxio.master.metadata.sync.ufs.prefetch.pool.size=128 alluxio.master.metastore=ROCKS alluxio.master.metastore.dir=/opt/alluxio-2.6.0/metastore alluxio.master.metastore.inode.cache.evict.batch.size=1000 alluxio.master.metastore.inode.cache.high.water.mark.ratio=0.85 alluxio.master.metastore.inode.cache.low.water.mark.ratio=0.8 alluxio.master.metastore.inode.cache.max.size=10000000 alluxio.master.metastore.inode.enumerator.buffer.count=10000 alluxio.master.metastore.inode.inherit.owner.and.group=true alluxio.master.metastore.inode.iteration.crawler.count=4 alluxio.master.metastore.iterator.readahead.size=64MB alluxio.master.metrics.file.size.distribution.buckets=1KB,1MB,10MB,100MB,1GB,10GB alluxio.master.metrics.heap.enabled=true alluxio.master.metrics.service.threads=5 alluxio.master.metrics.time.series.interval=5min alluxio.master.mount.table.root.alluxio=/ alluxio.master.mount.table.root.option= alluxio.master.mount.table.root.readonly=false alluxio.master.mount.table.root.shared=true alluxio.master.mount.table.root.ufs=/underFSStorage alluxio.master.network.max.inbound.message.size=100MB alluxio.master.periodic.block.integrity.check.interval=1hr alluxio.master.periodic.block.integrity.check.repair=false alluxio.master.persistence.blacklist=.staging,_temporary,.tmp alluxio.master.persistence.checker.interval=1s alluxio.master.persistence.initial.interval=1s alluxio.master.persistence.max.interval=1hr 
alluxio.master.persistence.max.total.wait.time=1day alluxio.master.persistence.scheduler.interval=1s alluxio.master.principal= alluxio.master.replication.check.interval=1min alluxio.master.rpc.addresses= alluxio.master.rpc.executor.core.pool.size=128 alluxio.master.rpc.executor.keepalive=60sec alluxio.master.rpc.executor.max.pool.size=1024 alluxio.master.rpc.executor.min.runnable=1 alluxio.master.rpc.executor.parallelism=8 alluxio.master.rpc.port=20069 alluxio.master.security.impersonation.root.groups= alluxio.master.security.impersonation.root.users= alluxio.master.serving.thread.timeout=5m alluxio.master.shell.backup.state.lock.grace.mode=TIMEOUT alluxio.master.shell.backup.state.lock.sleep.duration=0 alluxio.master.shell.backup.state.lock.timeout=1m alluxio.master.shell.backup.state.lock.try.duration=1m alluxio.master.skip.root.acl.check=false alluxio.master.standby.heartbeat.interval=2min alluxio.master.startup.block.integrity.check.enabled=true alluxio.master.tieredstore.global.level0.alias=MEM alluxio.master.tieredstore.global.level1.alias=SSD alluxio.master.tieredstore.global.level2.alias=HDD alluxio.master.tieredstore.global.levels=3 alluxio.master.tieredstore.global.mediumtype=MEM, SSD, HDD alluxio.master.ttl.checker.interval=1hour alluxio.master.ufs.active.sync.event.rate.interval=60sec alluxio.master.ufs.active.sync.initial.sync.enabled=true alluxio.master.ufs.active.sync.interval=30sec alluxio.master.ufs.active.sync.max.activities=10 alluxio.master.ufs.active.sync.max.age=10 alluxio.master.ufs.active.sync.poll.batch.size=1024 alluxio.master.ufs.active.sync.poll.timeout=10sec alluxio.master.ufs.active.sync.retry.timeout=10sec alluxio.master.ufs.active.sync.thread.pool.size=2 alluxio.master.ufs.block.location.cache.capacity=1000000 alluxio.master.ufs.journal.max.catchup.time=10min alluxio.master.ufs.managed.blocking.enabled= alluxio.master.ufs.path.cache.capacity=100000 alluxio.master.ufs.path.cache.threads=64 
alluxio.master.unsafe.direct.persist.object.enabled=true alluxio.master.update.check.enabled=true alluxio.master.update.check.interval=7day alluxio.master.web.bind.host=0.0.0.0 alluxio.master.web.hostname= alluxio.master.web.port=20070 alluxio.master.whitelist=/ alluxio.master.worker.connect.wait.time=5sec alluxio.master.worker.info.cache.refresh.time=10sec alluxio.master.worker.timeout=5min alluxio.metrics.conf.file=/opt/alluxio-2.6.0/conf/metrics.properties alluxio.metrics.context.shutdown.timeout=1sec alluxio.network.connection.auth.timeout=30sec alluxio.network.connection.health.check.timeout=5sec alluxio.network.connection.server.shutdown.timeout=60sec alluxio.network.connection.shutdown.graceful.timeout=45sec alluxio.network.connection.shutdown.timeout=15sec alluxio.network.host.resolution.timeout=5sec alluxio.network.ip.address.used=false alluxio.proxy.s3.deletetype=ALLUXIO_AND_UFS alluxio.proxy.s3.multipart.temporary.dir.suffix=_s3_multipart_tmp alluxio.proxy.s3.writetype=CACHE_THROUGH alluxio.proxy.stream.cache.timeout=1hour alluxio.proxy.web.bind.host=0.0.0.0 alluxio.proxy.web.hostname= alluxio.proxy.web.port=20148 alluxio.secondary.master.metastore.dir=/opt/alluxio-2.6.0/secondary-metastore alluxio.security.authentication.custom.provider.class= alluxio.security.authentication.type=SIMPLE alluxio.security.authorization.permission.enabled=true alluxio.security.authorization.permission.supergroup=supergroup alluxio.security.authorization.permission.umask=027 alluxio.security.group.mapping.cache.timeout=1min alluxio.security.group.mapping.class=alluxio.security.group.provider.ShellBasedUnixGroupsMapping alluxio.security.login.impersonation.username=NONE alluxio.security.login.username= alluxio.security.stale.channel.purge.interval=365d alluxio.site.conf.dir=/opt/alluxio-2.6.0/conf/,/root/.alluxio/,/etc/alluxio/ alluxio.table.catalog.path=/catalog alluxio.table.catalog.udb.sync.timeout=1h alluxio.table.enabled=true 
alluxio.table.journal.partitions.chunk.size=500 alluxio.table.transform.manager.job.history.retention.time=300sec alluxio.table.transform.manager.job.monitor.interval=10000 alluxio.table.udb.hive.clientpool.MAX=256 alluxio.table.udb.hive.clientpool.min=16 alluxio.test.deprecated.key= alluxio.test.mode=false alluxio.tmp.dirs=/tmp alluxio.underfs.allow.set.owner.failure=false alluxio.underfs.cephfs.auth.id=admin alluxio.underfs.cephfs.auth.key= alluxio.underfs.cephfs.auth.keyfile= alluxio.underfs.cephfs.auth.keyring=/etc/ceph/ceph.client.admin.keyring alluxio.underfs.cephfs.conf.file=/etc/ceph/ceph.conf alluxio.underfs.cephfs.conf.options= alluxio.underfs.cephfs.localize.reads=false alluxio.underfs.cephfs.mds.namespace= alluxio.underfs.cephfs.mon.host=0.0.0.0 alluxio.underfs.cephfs.mount.gid=0 alluxio.underfs.cephfs.mount.point=/ alluxio.underfs.cephfs.mount.uid=0 alluxio.underfs.cleanup.enabled=false alluxio.underfs.cleanup.interval=1day alluxio.underfs.eventual.consistency.retry.base.sleep=50ms alluxio.underfs.eventual.consistency.retry.max.num=20 alluxio.underfs.eventual.consistency.retry.max.sleep=30sec alluxio.underfs.gcs.default.mode=0700 alluxio.underfs.gcs.directory.suffix=/ alluxio.underfs.gcs.owner.id.to.username.mapping= alluxio.underfs.gcs.retry.delay.multiplier=2 alluxio.underfs.gcs.retry.initial.delay=1000 alluxio.underfs.gcs.retry.jitter=true alluxio.underfs.gcs.retry.max=60 alluxio.underfs.gcs.retry.max.delay=60000 alluxio.underfs.gcs.retry.total.duration=300000 alluxio.underfs.gcs.version=2 alluxio.underfs.hdfs.configuration=/opt/alluxio-2.6.0/conf/core-site.xml:/opt/alluxio-2.6.0/conf/hdfs-site.xml alluxio.underfs.hdfs.impl=org.apache.hadoop.hdfs.DistributedFileSystem alluxio.underfs.hdfs.prefixes=hdfs://,glusterfs:/// alluxio.underfs.hdfs.remote=true alluxio.underfs.kodo.connect.timeout=50sec alluxio.underfs.kodo.downloadhost= alluxio.underfs.kodo.endpoint= alluxio.underfs.kodo.requests.max=64 alluxio.underfs.listing.length=1000 
alluxio.underfs.logging.threshold=10s alluxio.underfs.object.store.breadcrumbs.enabled=false alluxio.underfs.object.store.mount.shared.publicly=false alluxio.underfs.object.store.multi.range.chunk.size=16MB alluxio.underfs.object.store.service.threads=20 alluxio.underfs.object.store.skip.parent.directory.creation=true alluxio.underfs.oss.connection.max=1024 alluxio.underfs.oss.connection.timeout=50sec alluxio.underfs.oss.connection.ttl=-1 alluxio.underfs.oss.socket.timeout=50sec alluxio.underfs.s3.admin.threads.max=1000 alluxio.underfs.s3.bulk.delete.enabled=true alluxio.underfs.s3.connection.ttl=-1 alluxio.underfs.s3.default.mode=0700 alluxio.underfs.s3.directory.suffix=/ alluxio.underfs.s3.disable.dns.buckets=false alluxio.underfs.s3.endpoint= alluxio.underfs.s3.inherit.acl=true alluxio.underfs.s3.intermediate.upload.clean.age=3day alluxio.underfs.s3.list.objects.v1=false alluxio.underfs.s3.max.error.retry= alluxio.underfs.s3.owner.id.to.username.mapping= alluxio.underfs.s3.proxy.host= alluxio.underfs.s3.proxy.port= alluxio.underfs.s3.request.timeout=1min alluxio.underfs.s3.secure.http.enabled=false alluxio.underfs.s3.server.side.encryption.enabled=false alluxio.underfs.s3.signer.algorithm= alluxio.underfs.s3.socket.timeout=50sec alluxio.underfs.s3.streaming.upload.enabled=false alluxio.underfs.s3.streaming.upload.partition.size=64MB alluxio.underfs.s3.threads.max=1200 alluxio.underfs.s3.upload.threads.max=20 alluxio.underfs.version=3.3.0 alluxio.underfs.web.connnection.timeout=60s alluxio.underfs.web.header.last.modified=EEE, dd MMM yyyy HH:mm:ss zzz alluxio.underfs.web.parent.names=Parent Directory,..,../ alluxio.underfs.web.titles=Index of,Directory listing for alluxio.user.app.id= alluxio.user.block.avoid.eviction.policy.reserved.size.bytes=2GB alluxio.user.block.master.client.pool.gc.interval=120sec alluxio.user.block.master.client.pool.gc.threshold=10min alluxio.user.block.master.client.pool.size.max=1024 alluxio.user.block.master.client.pool.size.min=0 
alluxio.user.block.read.metrics.enabled=false alluxio.user.block.read.retry.max.duration=2min alluxio.user.block.read.retry.sleep.base=250ms alluxio.user.block.read.retry.sleep.max=2sec alluxio.user.block.remote.read.buffer.size.bytes=8MB alluxio.user.block.size.bytes.default=16MB alluxio.user.block.worker.client.pool.gc.threshold=300sec alluxio.user.block.worker.client.pool.max=10000 alluxio.user.block.worker.client.pool.min=512 alluxio.user.block.write.location.policy.class=alluxio.client.block.policy.RoundRobinPolicy alluxio.user.client.cache.async.restore.enabled=true alluxio.user.client.cache.async.write.enabled=true alluxio.user.client.cache.async.write.threads=16 alluxio.user.client.cache.dir=/tmp/alluxio_cache alluxio.user.client.cache.enabled=false alluxio.user.client.cache.eviction.retries=10 alluxio.user.client.cache.evictor.class=alluxio.client.file.cache.evictor.LRUCacheEvictor alluxio.user.client.cache.evictor.lfu.logbase=2.0 alluxio.user.client.cache.evictor.nondeterministic.enabled=false alluxio.user.client.cache.local.store.file.buckets=1000 alluxio.user.client.cache.page.size=1MB alluxio.user.client.cache.quota.enabled=false alluxio.user.client.cache.size=512MB alluxio.user.client.cache.store.overhead= alluxio.user.client.cache.store.type=LOCAL alluxio.user.client.cache.timeout.duration=-1 alluxio.user.client.cache.timeout.threads=32 alluxio.user.conf.cluster.default.enabled=true alluxio.user.conf.sync.interval=1min alluxio.user.date.format.pattern=MM-dd-yyyy HH:mm:ss:SSS alluxio.user.file.buffer.bytes=8MB alluxio.user.file.copyfromlocal.block.location.policy.class=alluxio.client.block.policy.RoundRobinPolicy alluxio.user.file.create.ttl=-1 alluxio.user.file.create.ttl.action=FREE alluxio.user.file.delete.unchecked=false alluxio.user.file.master.client.pool.gc.interval=120sec alluxio.user.file.master.client.pool.gc.threshold=120sec alluxio.user.file.master.client.pool.size.max=1024 alluxio.user.file.master.client.pool.size.min=0 
alluxio.user.file.metadata.load.type=ONCE alluxio.user.file.metadata.sync.interval=-1 alluxio.user.file.passive.cache.enabled=false alluxio.user.file.persist.on.rename=true alluxio.user.file.persistence.initial.wait.time=0 alluxio.user.file.readtype.default=CACHE alluxio.user.file.replication.durable=1 alluxio.user.file.replication.max=1 alluxio.user.file.replication.min=0 alluxio.user.file.reserved.bytes=16MB alluxio.user.file.sequential.pread.threshold=2MB alluxio.user.file.target.media= alluxio.user.file.ufs.tier.enabled=false alluxio.user.file.waitcompleted.poll=1sec alluxio.user.file.write.tier.default=0 alluxio.user.file.writetype.default=ASYNC_THROUGH alluxio.user.hostname= alluxio.user.local.reader.chunk.size.bytes=4MB alluxio.user.local.writer.chunk.size.bytes=64KB alluxio.user.logging.threshold=1000ms alluxio.user.logs.dir=/opt/alluxio-2.6.0/logs/user alluxio.user.master.polling.timeout=30sec alluxio.user.metadata.cache.enabled=true alluxio.user.metadata.cache.expiration.time=2min alluxio.user.metadata.cache.max.size=6000000 alluxio.user.metrics.collection.enabled=true alluxio.user.metrics.heartbeat.interval=10sec alluxio.user.network.data.timeout= alluxio.user.network.flowcontrol.window= alluxio.user.network.keepalive.time= alluxio.user.network.keepalive.timeout= alluxio.user.network.max.inbound.message.size= alluxio.user.network.netty.channel= alluxio.user.network.netty.worker.threads= alluxio.user.network.reader.buffer.size.messages= alluxio.user.network.reader.chunk.size.bytes= alluxio.user.network.rpc.flowcontrol.window=2MB alluxio.user.network.rpc.keepalive.time=9223372036854775807 alluxio.user.network.rpc.keepalive.timeout=30sec alluxio.user.network.rpc.max.connections=1 alluxio.user.network.rpc.max.inbound.message.size=100MB alluxio.user.network.rpc.netty.channel=EPOLL alluxio.user.network.rpc.netty.worker.threads=0 alluxio.user.network.streaming.flowcontrol.window=2MB alluxio.user.network.streaming.keepalive.time=9223372036854775807 
alluxio.user.network.streaming.keepalive.timeout=30sec alluxio.user.network.streaming.max.connections=64 alluxio.user.network.streaming.max.inbound.message.size=100MB alluxio.user.network.streaming.netty.channel=EPOLL alluxio.user.network.streaming.netty.worker.threads=0 alluxio.user.network.writer.buffer.size.messages= alluxio.user.network.writer.chunk.size.bytes= alluxio.user.network.writer.close.timeout= alluxio.user.network.writer.flush.timeout= alluxio.user.network.zerocopy.enabled= alluxio.user.rpc.retry.base.sleep=50ms alluxio.user.rpc.retry.max.duration=2min alluxio.user.rpc.retry.max.sleep=3sec alluxio.user.short.circuit.enabled=true alluxio.user.short.circuit.preferred=false alluxio.user.skip.authority.check=false alluxio.user.streaming.data.read.timeout=300sec alluxio.user.streaming.data.write.timeout=1h alluxio.user.streaming.reader.buffer.size.messages=16 alluxio.user.streaming.reader.chunk.size.bytes=4MB alluxio.user.streaming.reader.close.timeout=5s alluxio.user.streaming.writer.buffer.size.messages=16 alluxio.user.streaming.writer.chunk.size.bytes=1MB alluxio.user.streaming.writer.close.timeout=30min alluxio.user.streaming.writer.flush.timeout=30min alluxio.user.streaming.zerocopy.enabled=true alluxio.user.ufs.block.location.all.fallback.enabled=true alluxio.user.ufs.block.read.concurrency.max=2147483647 alluxio.user.ufs.block.read.location.policy=alluxio.client.block.policy.DeterministicHashPolicy alluxio.user.ufs.block.read.location.policy.deterministic.hash.shards=1 alluxio.user.unsafe.direct.local.io.enabled=false alluxio.user.update.file.accesstime.disabled=true alluxio.user.worker.list.refresh.interval=2min alluxio.version=2.6.0 alluxio.web.cors.enabled=false alluxio.web.file.info.enabled=true alluxio.web.refresh.interval=15s alluxio.web.resources=/opt/alluxio-2.6.0/webui/ alluxio.web.threads=1 alluxio.web.ui.enabled=false alluxio.work.dir=/opt/alluxio-2.6.0 alluxio.worker.allocator.class=alluxio.worker.block.allocator.MaxFreeAllocator 
alluxio.worker.bind.host=0.0.0.0 alluxio.worker.block.annotator.class=alluxio.worker.block.annotator.LRUAnnotator alluxio.worker.block.annotator.lrfu.attenuation.factor=2.0 alluxio.worker.block.annotator.lrfu.step.factor=0.25 alluxio.worker.block.heartbeat.interval=1sec alluxio.worker.block.heartbeat.timeout=1hour alluxio.worker.block.master.client.pool.size=1024 alluxio.worker.container.hostname= alluxio.worker.data.folder=/alluxioworker/ alluxio.worker.data.folder.permissions=rwxrwxrwx alluxio.worker.data.folder.tmp=.tmp_blocks alluxio.worker.data.server.domain.socket.address= alluxio.worker.data.server.domain.socket.as.uuid=false alluxio.worker.data.tmp.subdir.max=1024 alluxio.worker.evictor.class= alluxio.worker.free.space.timeout=10sec alluxio.worker.fuse.enabled=true alluxio.worker.fuse.mount.alluxio.path=/ alluxio.worker.fuse.mount.options= alluxio.worker.fuse.mount.point=/runtime-mnt/alluxio/kf-partition/graytest1/alluxio-fuse alluxio.worker.hostname= alluxio.worker.jvm.monitor.enabled=true alluxio.worker.keytab.file= alluxio.worker.management.backoff.strategy=ANY alluxio.worker.management.block.transfer.concurrency.limit=2 alluxio.worker.management.load.detection.cool.down.time=10sec alluxio.worker.management.task.thread.count=4 alluxio.worker.management.tier.align.enabled=true alluxio.worker.management.tier.align.range=100 alluxio.worker.management.tier.align.reserved.bytes=1G alluxio.worker.management.tier.promote.enabled=true alluxio.worker.management.tier.promote.quota.percent=90 alluxio.worker.management.tier.promote.range=100 alluxio.worker.management.tier.swap.restore.enabled=true alluxio.worker.master.connect.retry.timeout=1hour alluxio.worker.master.periodical.rpc.timeout=5min alluxio.worker.network.async.cache.manager.queue.max=512 alluxio.worker.network.async.cache.manager.threads.max=8 alluxio.worker.network.block.reader.threads.max=2048 alluxio.worker.network.block.writer.threads.max=10000 alluxio.worker.network.flowcontrol.window=200MB 
alluxio.worker.network.keepalive.time=30sec alluxio.worker.network.keepalive.timeout=30sec alluxio.worker.network.max.inbound.message.size=200MB alluxio.worker.network.netty.boss.threads=200 alluxio.worker.network.netty.channel=EPOLL alluxio.worker.network.netty.shutdown.quiet.period=2sec alluxio.worker.network.netty.watermark.high=32KB alluxio.worker.network.netty.watermark.low=8KB alluxio.worker.network.netty.worker.threads=2000 alluxio.worker.network.reader.buffer.size=4MB alluxio.worker.network.reader.max.chunk.size.bytes=2MB alluxio.worker.network.shutdown.timeout=15sec alluxio.worker.network.writer.buffer.size.messages=100 alluxio.worker.network.zerocopy.enabled=true alluxio.worker.principal= alluxio.worker.ramdisk.size=7158278826 alluxio.worker.remote.io.slow.threshold=10s alluxio.worker.reviewer.class=alluxio.worker.block.reviewer.ProbabilisticBufferReviewer alluxio.worker.reviewer.probabilistic.hardlimit.bytes=64MB alluxio.worker.reviewer.probabilistic.softlimit.bytes=256MB alluxio.worker.rpc.port=20071 alluxio.worker.session.timeout=1min alluxio.worker.storage.checker.enabled=true alluxio.worker.tieredstore.block.lock.readers=1000 alluxio.worker.tieredstore.block.locks=1000 alluxio.worker.tieredstore.free.ahead.bytes=0 alluxio.worker.tieredstore.level0.alias=SSD alluxio.worker.tieredstore.level0.dirs.mediumtype=SSD alluxio.worker.tieredstore.level0.dirs.path=/data/arsenal/storage/graytest1 alluxio.worker.tieredstore.level0.dirs.quota=1TB alluxio.worker.tieredstore.level0.watermark.high.ratio=0.95 alluxio.worker.tieredstore.level0.watermark.low.ratio=0.9 alluxio.worker.tieredstore.level1.alias= alluxio.worker.tieredstore.level1.dirs.mediumtype= alluxio.worker.tieredstore.level1.dirs.path= alluxio.worker.tieredstore.level1.dirs.quota= alluxio.worker.tieredstore.level1.watermark.high.ratio=0.95 alluxio.worker.tieredstore.level1.watermark.low.ratio=0.7 alluxio.worker.tieredstore.level2.alias= alluxio.worker.tieredstore.level2.dirs.mediumtype= 
alluxio.worker.tieredstore.level2.dirs.path= alluxio.worker.tieredstore.level2.dirs.quota= alluxio.worker.tieredstore.level2.watermark.high.ratio=0.95 alluxio.worker.tieredstore.level2.watermark.low.ratio=0.7 alluxio.worker.tieredstore.levels=1 alluxio.worker.ufs.block.open.timeout=5min alluxio.worker.ufs.instream.cache.enabled=true alluxio.worker.ufs.instream.cache.expiration.time=5min alluxio.worker.ufs.instream.cache.max.size=5000 alluxio.worker.web.bind.host=0.0.0.0 alluxio.worker.web.hostname= alluxio.worker.web.port=20072 alluxio.zookeeper.address= alluxio.zookeeper.auth.enabled=true alluxio.zookeeper.connection.timeout=15s alluxio.zookeeper.election.path=/alluxio/election alluxio.zookeeper.enabled=false alluxio.zookeeper.job.election.path=/job_election alluxio.zookeeper.job.leader.path=/job_leader alluxio.zookeeper.leader.connection.error.policy=SESSION alluxio.zookeeper.leader.inquiry.retry=10 alluxio.zookeeper.leader.path=/alluxio/leader alluxio.zookeeper.session.timeout=60s aws.accessKeyId= aws.secretKey= fs.azure.account.oauth2.client.endpoint= fs.azure.account.oauth2.client.id= fs.azure.account.oauth2.client.secret= fs.cos.access.key= fs.cos.app.id= fs.cos.connection.max=1024 fs.cos.connection.timeout=50sec fs.cos.region= fs.cos.secret.key= fs.cos.socket.timeout=50sec fs.gcs.accessKeyId= fs.gcs.credential.path= fs.gcs.secretAccessKey= fs.kodo.accesskey= fs.kodo.secretkey= fs.oss.accessKeyId= fs.oss.accessKeySecret= fs.oss.endpoint= fs.swift.auth.method= fs.swift.auth.url= fs.swift.password= fs.swift.region= fs.swift.simulation= fs.swift.tenant= fs.swift.user=
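Because the dump above is one long run of space-separated `key=value` pairs, the entries that bear on this failure are hard to spot. A minimal Python sketch for pulling them out follows. It assumes every key starts with `alluxio.`, `aws.`, or `fs.` (true of this dump, but an assumption in general); `parse_flat_properties` and the `sample` excerpt are hypothetical names used only for illustration.

```python
import re

# Split only on whitespace that is immediately followed by something that
# looks like a property key, so values containing spaces
# (e.g. "MEM, SSD, HDD") stay intact.
# Assumption: keys begin with "alluxio.", "aws." or "fs.".
_KEY_BOUNDARY = re.compile(r"\s+(?=(?:alluxio|aws|fs)\.[\w.]+=)")


def parse_flat_properties(dump):
    """Turn a flattened 'k1=v1 k2=v2 ...' dump into a dict of strings."""
    props = {}
    for chunk in _KEY_BOUNDARY.split(dump.strip()):
        if not chunk:
            continue
        key, _, value = chunk.partition("=")
        props[key] = value.strip()
    return props


# A short excerpt copied from the dump above.
sample = (
    "alluxio.master.rpc.port=20069 "
    "alluxio.master.tieredstore.global.mediumtype=MEM, SSD, HDD "
    "alluxio.network.connection.auth.timeout=30sec "
    "alluxio.worker.hostname= "
    "alluxio.worker.rpc.port=20071"
)

props = parse_flat_properties(sample)
for name in (
    "alluxio.worker.rpc.port",
    "alluxio.worker.hostname",
    "alluxio.network.connection.auth.timeout",
):
    print(name, "=", repr(props[name]))
```

In this dump, `alluxio.worker.rpc.port=20071` is the port in the failing worker address from the Spark report, and `alluxio.network.connection.auth.timeout=30sec` lines up with the 30000 ms wait in the innermost exception.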

Spark reports the following:

21/08/13 11:40:14 WARN TaskSetManager: Lost task 286.1 in stage 0.0 (TID 872) (172.29.172.170 executor 10): org.apache.spark.SparkException: Task failed while writing rows.
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:296)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Failed to cache: Failed to connect to remote block worker: GrpcServerAddress{HostName=172.29.96.41, SocketAddress=172.29.96.41/172.29.96.41:20071}
    at alluxio.client.file.AlluxioFileOutStream.handleCacheWriteException(AlluxioFileOutStream.java:300)
    at alluxio.client.file.AlluxioFileOutStream.writeInternal(AlluxioFileOutStream.java:263)
    at alluxio.client.file.AlluxioFileOutStream.write(AlluxioFileOutStream.java:217)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
    at com.linkedin.spark.shaded.org.tensorflow.hadoop.util.TFRecordWriter.write(TFRecordWriter.java:38)
    at com.linkedin.spark.shaded.org.tensorflow.hadoop.util.TFRecordWriter.write(TFRecordWriter.java:45)
    at com.linkedin.spark.datasources.tfrecord.TFRecordOutputWriter.write(TFRecordOutputWriter.scala:35)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:140)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:278)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1473)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:286)
    ... 9 more
Caused by: java.io.IOException: Failed to connect to remote block worker: GrpcServerAddress{HostName=172.29.96.41, SocketAddress=172.29.96.41/172.29.96.41:20071}
    at alluxio.client.block.stream.BlockWorkerClient$Factory.create(BlockWorkerClient.java:62)
    at alluxio.client.block.stream.BlockWorkerClientPool.createNewResource(BlockWorkerClientPool.java:72)
    at alluxio.client.block.stream.BlockWorkerClientPool.createNewResource(BlockWorkerClientPool.java:35)
    at alluxio.resource.DynamicResourcePool.acquire(DynamicResourcePool.java:319)
    at alluxio.resource.DynamicResourcePool.acquire(DynamicResourcePool.java:288)
    at alluxio.client.file.FileSystemContext.acquireBlockWorkerClientInternal(FileSystemContext.java:547)
    at alluxio.client.file.FileSystemContext.acquireBlockWorkerClient(FileSystemContext.java:527)
    at alluxio.client.block.stream.GrpcDataWriter.create(GrpcDataWriter.java:98)
    at alluxio.client.block.stream.DataWriter$Factory.create(DataWriter.java:85)
    at alluxio.client.block.AlluxioBlockStore.getOutStream(AlluxioBlockStore.java:254)
    at alluxio.client.block.AlluxioBlockStore.getOutStream(AlluxioBlockStore.java:293)
    at alluxio.client.file.AlluxioFileOutStream.getNextBlock(AlluxioFileOutStream.java:284)
    at alluxio.client.file.AlluxioFileOutStream.writeInternal(AlluxioFileOutStream.java:250)
    ... 21 more
Caused by: alluxio.exception.status.UnavailableException: Failed to connect to remote server GrpcServerAddress{HostName=172.29.96.41, SocketAddress=172.29.96.41/172.29.96.41:20071}. GrpcChannelKey{ClientType=DefaultBlockWorkerClient-Stream, ClientHostname=datasettest-c380c27b3d95f837-exec-10, ServerAddress=GrpcServerAddress{HostName=172.29.96.41, SocketAddress=172.29.96.41/172.29.96.41:20071}, ChannelId=714fb22b-166f-46c8-b166-0346763e3b49}
    at alluxio.grpc.GrpcChannelBuilder.build(GrpcChannelBuilder.java:146)
    at alluxio.client.block.stream.DefaultBlockWorkerClient.<init>(DefaultBlockWorkerClient.java:95)
    at alluxio.client.block.stream.BlockWorkerClient$Factory.create(BlockWorkerClient.java:59)
    ... 33 more
Caused by: alluxio.exception.status.UnavailableException: Waited 30000 milliseconds (plus 89787 nanoseconds delay) for alluxio.shaded.client.com.google.common.util.concurrent.SettableFuture@638cb839[status=PENDING]
    at alluxio.security.authentication.AuthenticatedChannelClientDriver.waitUntilChannelAuthenticated(AuthenticatedChannelClientDriver.java:194)
    at alluxio.security.authentication.AuthenticatedChannelClientDriver.startAuthenticatedChannel(AuthenticatedChannelClientDriver.java:164)
    at alluxio.security.authentication.ChannelAuthenticator.authenticate(ChannelAuthenticator.java:103)
    at alluxio.grpc.GrpcChannelBuilder.build(GrpcChannelBuilder.java:129)
    ... 35 more
Caused by: java.util.concurrent.TimeoutException: Waited 30000 milliseconds (plus 89787 nanoseconds delay) for alluxio.shaded.client.com.google.common.util.concurrent.SettableFuture@638cb839[status=PENDING]
    at alluxio.shaded.client.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:506)
    at alluxio.shaded.client.com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:109)
    at alluxio.security.authentication.AuthenticatedChannelClientDriver.waitUntilChannelAuthenticated(AuthenticatedChannelClientDriver.java:181)
    ... 38 more

To Reproduce Run the Spark job above against the Alluxio config above; the bug reproduces every time.

Expected behavior Spark should write to Alluxio successfully.

Urgency This blocks our use of Alluxio.

LuQQiu commented 3 years ago

alluxio.master.hostname=172.29.255.225

Caused by: alluxio.exception.status.UnavailableException: Failed to connect to remote server GrpcServerAddress{HostName=172.29.96.41, SocketAddress=172.29.96.41/172.29.96.41:20071}. GrpcChannelKey{ClientType=DefaultBlockWorkerClient-Stream, ClientHostname=datasettest-c380c27b3d95f837-exec-10, ServerAddress=GrpcServerAddress{HostName=172.29.96.41, SocketAddress=172.29.96.41/172.29.96.41:20071}, ChannelId=714fb22b-166f-46c8-b166-0346763e3b49}
at alluxio.grpc.GrpcChannelBuilder.build(GrpcChannelBuilder.java:146)
at alluxio.client.block.stream.DefaultBlockWorkerClient.<init>(DefaultBlockWorkerClient.java:95)

failed to connect to GrpcServerAddress{HostName=172.29.96.41,

Looks like graytest1-master-0.kf-partition resolved to 172.29.96.41, but the master starts at 172.29.255.225. @lilyzhoupeijie can you confirm that?
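One way to check this from inside a Spark executor pod is to resolve the hostname and compare the result with the address the master reports. The following is a minimal Python sketch, not part of Alluxio; `resolve_ipv4` and `resolves_to` are hypothetical helper names, and the hostname and IP in the usage note are the ones quoted in this thread.

```python
import socket

def resolve_ipv4(hostname):
    """Return the sorted, de-duplicated IPv4 addresses that hostname resolves to."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})

def resolves_to(hostname, expected_ip):
    """True if expected_ip is among the addresses hostname resolves to."""
    return expected_ip in resolve_ipv4(hostname)
```

For example, `resolves_to("graytest1-master-0.kf-partition", "172.29.255.225")` run on an executor would show whether the client sees the same master address as the Alluxio config. An unresolvable name raises `socket.gaierror`, which is itself a useful signal.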

lilyzhoupeijie commented 3 years ago

Thanks for your help. This issue has been resolved.

lilyzhoupeijie commented 3 years ago

Spark writes to Alluxio, but after some data has been written the client can no longer connect to the worker and Spark tasks fail, although the SparkApplication can still succeed.

Alluxio Version: 2.6.0

The Spark log is as follows:

Caused by: alluxio.exception.status.UnavailableException: Failed to connect to remote server GrpcServerAddress{HostName=, SocketAddress=/*:20051}. GrpcChannelKey{ClientType=DefaultBlockWorkerClient-Stream, ClientHostname=datasettest-765a5d7b59157eaa-exec-113, ServerAddress=GrpcServerAddress{HostName=***, SocketAddress=/***:20051}, ChannelId=d6143bf0-5a3e-480e-9231-bf01912a0747}
    at alluxio.grpc.GrpcChannelBuilder.build(GrpcChannelBuilder.java:146)
    at alluxio.client.block.stream.DefaultBlockWorkerClient.<init>(DefaultBlockWorkerClient.java:95)
    at alluxio.client.block.stream.BlockWorkerClient$Factory.create(BlockWorkerClient.java:59)
    ... 33 more
Caused by: alluxio.exception.status.UnavailableException: io exception
    at alluxio.exception.status.AlluxioStatusException.from(AlluxioStatusException.java:155)
    at alluxio.exception.status.AlluxioStatusException.fromStatusRuntimeException(AlluxioStatusException.java:223)
    at alluxio.exception.status.AlluxioStatusException.fromThrowable(AlluxioStatusException.java:208)
    at alluxio.security.authentication.AuthenticatedChannelClientDriver.waitUntilChannelAuthenticated(AuthenticatedChannelClientDriver.java:187)
    at alluxio.security.authentication.AuthenticatedChannelClientDriver.startAuthenticatedChannel(AuthenticatedChannelClientDriver.java:164)
    at alluxio.security.authentication.ChannelAuthenticator.authenticate(ChannelAuthenticator.java:103)
    at alluxio.grpc.GrpcChannelBuilder.build(GrpcChannelBuilder.java:129)
    ... 35 more
Caused by: alluxio.shaded.client.io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
    at alluxio.shaded.client.io.grpc.Status.asRuntimeException(Status.java:535)
    at alluxio.shaded.client.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:478)
    at alluxio.shaded.client.io.grpc.internal.DelayedClientCall$DelayedListener$3.run(DelayedClientCall.java:463)
    at alluxio.shaded.client.io.grpc.internal.DelayedClientCall$DelayedListener.delayOrExecute(DelayedClientCall.java:427)
    at alluxio.shaded.client.io.grpc.internal.DelayedClientCall$DelayedListener.onClose(DelayedClientCall.java:460)
    at alluxio.shaded.client.io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:553)
    at alluxio.shaded.client.io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:68)
    at alluxio.shaded.client.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:739)
    at alluxio.shaded.client.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:718)
    at alluxio.shaded.client.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at alluxio.shaded.client.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
    ... 3 more
Caused by: alluxio.shaded.client.io.netty.channel.ConnectTimeoutException: connection timed out: /172.29.159.161:20051
    at alluxio.shaded.client.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$2.run(AbstractEpollChannel.java:575)
    at alluxio.shaded.client.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
    at alluxio.shaded.client.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
    at alluxio.shaded.client.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
    at alluxio.shaded.client.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
    at alluxio.shaded.client.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
    at alluxio.shaded.client.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at alluxio.shaded.client.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    ... 1 more

apc999 commented 3 years ago

@LuQQiu or @yuzhu are you working with Peijie on this?

yuzhu commented 3 years ago

@apc999 I am working with Peijie on this. We are checking various resources such as thread pools and socket ulimits, but have not found anything so far. The default connection timeout is 30 seconds, which should be plenty to establish a connection.
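To separate plain network reachability from Alluxio-level problems, a bare TCP probe against the worker port can be run from an executor pod. The sketch below is an assumption-laden illustration rather than an Alluxio tool: `probe_tcp` is a hypothetical helper, and it only tests the TCP connect step, not the gRPC authentication handshake that timed out in the first trace.

```python
import socket
import time

def probe_tcp(host, port, timeout_s=30.0):
    """Try to open a TCP connection; return (ok, elapsed_seconds, error_text)."""
    start = time.monotonic()
    try:
        # create_connection raises OSError on refusal, unreachable host, or timeout.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True, time.monotonic() - start, ""
    except OSError as err:
        return False, time.monotonic() - start, str(err)
```

Calling `probe_tcp("172.29.159.161", 20051)` (the address from the trace above) repeatedly while the Spark job runs would show whether connects fail only under load. A probe that fails after roughly the full timeout suggests the worker is not accepting connections, while an immediate refusal suggests nothing is listening on the port.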

yuzhu commented 3 years ago

@lilyzhoupeijie could you also check whether the OOM killer could have killed the worker process? Especially since you mentioned Alluxio was using a lot of memory.
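A rough way to do this check is to scan the node's kernel log (for example the output of `dmesg`) for OOM-killer messages. The Python sketch below assumes the common Linux wording "Out of memory: Killed process <pid> (<name>)"; the exact text varies between kernel versions, so treat the pattern as an assumption. `find_oom_kills` is a hypothetical helper name.

```python
import re

# Matches both "Out of memory: Kill process ..." (older kernels) and
# "Memory cgroup out of memory: Killed process ..." (cgroup-limited containers).
OOM_KILL = re.compile(r"[Oo]ut of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")

def find_oom_kills(kernel_log_text):
    """Return a list of (pid, process_name) for each OOM-killer line in the text."""
    return [(int(m.group(1)), m.group(2)) for m in OOM_KILL.finditer(kernel_log_text)]
```

On Kubernetes, the container's last termination reason reported by `kubectl describe pod` (for example `OOMKilled`) is another place to look.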

yuzhu commented 3 years ago

This issue was resolved. The cause was that the memory allocated to the container was barely more than the combined heap and direct memory allocated to Alluxio, leaving very little for the other parts of the JVM. Increasing the container memory limit or reducing the number of threads resolved the issue.
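The budget described here can be sketched as simple arithmetic: whatever the container limit leaves after heap and direct memory has to cover thread stacks, metaspace, and other native allocations. The Python sketch below is illustrative only. `container_headroom_mb` is a hypothetical helper, the 1 MB per-thread stack is the usual 64-bit Linux JVM default and is an assumption here, and thread stacks are counted at their full reserved size, so the result is a conservative estimate.

```python
def container_headroom_mb(limit_mb, heap_mb, direct_mb, threads=0, stack_mb_per_thread=1.0):
    """Memory (MB) left in the container after heap, direct memory, and thread stacks.

    A small or negative result means the JVM's other native allocations
    can push the container over its limit.
    """
    committed_mb = heap_mb + direct_mb + threads * stack_mb_per_thread
    return limit_mb - committed_mb

# Illustrative numbers only, not taken from this issue.
tight = container_headroom_mb(limit_mb=4096, heap_mb=2048, direct_mb=1024, threads=1000)
roomy = container_headroom_mb(limit_mb=6144, heap_mb=2048, direct_mb=1024, threads=500)
```

With these made-up values, `tight` leaves only a few dozen megabytes, while raising the limit and halving the thread count in `roomy` leaves a couple of gigabytes, which mirrors the two fixes described above.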