Alluxio / alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud
https://www.alluxio.io
Apache License 2.0

alluxio.exception.status.UnavailableException #16903

Closed lingchuanzhou closed 1 year ago

lingchuanzhou commented 1 year ago

We are evaluating Alluxio to improve data access performance for AI training. During testing and verification we ran into some problems; we tried several troubleshooting approaches without success, and we hope the community can offer suggestions or help.

1. Alluxio deployment environment

2. Alluxio configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: alluxio-config
  namespace: alluxio
  labels:
    name: alluxio-config
    app: alluxio
    chart: alluxio-0.6.53
    release: alluxio
    heritage: Helm
data:
  ALLUXIO_JAVA_OPTS: |-
    -Dalluxio.master.hostname=alluxio-master-0 -Dalluxio.master.journal.type=UFS -Dalluxio.master.journal.folder=/journal -Dalluxio.security.stale.channel.purge.interval=365d -Dalluxio.master.mount.table.root.ufs=s3://alluxio/s3b -Dalluxio.underfs.s3.endpoint=http://xx.xx.xx.xx:12001 -Dalluxio.underfs.s3.disable.dns.buckets=true -Dalluxio.underfs.s3.inherit.acl=false -Ds3a.accessKeyId=xxx -Ds3a.secretKey=xxx -Dalluxio.security.authentication.type=NOSASL -Dalluxio.security.authorization.permission.enabled=false -Dalluxio.master.jvm.monitor.enabled=true -Dalluxio.user.short.circuit.enabled=true -Dalluxio.worker.data.server.domain.socket.address=/opt/domain -Dalluxio.worker.data.server.domain.socket.as.uuid=true -Dalluxio.worker.network.async.cache.manager.threads.max=256 -Dalluxio.worker.network.block.reader.threads.max=64 -Dalluxio.worker.network.async.cache.manager.queue.max=1024 -Dalluxio.job.worker.threadpool.size=48 -Dalluxio.worker.jvm.monitor.enabled=true -Dalluxio.user.metrics.collection.enabled=true -Dalluxio.user.streaming.data.read.timeout=500s -Dalluxio.user.streaming.data.write.timeout=500s -Dalluxio.user.streaming.reader.close.timeout=500s -Dalluxio.user.streaming.writer.close.timeout=500s -Dalluxio.user.streaming.writer.flush.timeout=600s -Dalluxio.user.file.writetype.default=ASYNC_THROUGH -Dalluxio.master.metadata.sync.executor.pool.size=12 -Dalluxio.worker.session.timeout=5min -Dalluxio.debug=true -Dalluxio.network.connection.health.check.timeout=60s -XX:+PrintGCDetails -XX:+PrintTenuringDistribution -XX:+PrintGCTimeStamps -Dalluxio.underfs.s3.request.timeout=5min -Dalluxio.underfs.s3.socket.timeout=500sec -Dalluxio.underfs.s3.upload.threads.max=64 -Dalluxio.master.metadata.sync.concurrency.level=12 -Dalluxio.master.metadata.sync.executor.pool.size=24 -Dalluxio.proxy.s3.header.metadata.max.size=1024KB -Dalluxio.worker.tieredstore.levels=2 -Dalluxio.worker.tieredstore.level0.alias=MEM -Dalluxio.worker.tieredstore.level0.dirs.mediumtype=MEM -Dalluxio.worker.tieredstore.level0.dirs.path=/dev/shm -Dalluxio.worker.tieredstore.level0.dirs.quota=20GB -Dalluxio.worker.tieredstore.level0.watermark.high.ratio=0.95 -Dalluxio.worker.tieredstore.level0.watermark.low.ratio=0.7 -Dalluxio.worker.tieredstore.level1.alias=HDD -Dalluxio.worker.tieredstore.level1.dirs.mediumtype=HDD -Dalluxio.worker.tieredstore.level1.dirs.path=/mnt/alluxio-data -Dalluxio.worker.tieredstore.level1.dirs.quota=500GB -Dalluxio.worker.tieredstore.level1.watermark.high.ratio=0.95 -Dalluxio.worker.tieredstore.level1.watermark.low.ratio=0.7
  ALLUXIO_MASTER_JAVA_OPTS: |-
    -Dalluxio.master.hostname=${ALLUXIO_MASTER_HOSTNAME} -server -Xmx16G -XX:+UseG1GC -XX:G1HeapRegionSize=32M -XX:+UseGCOverheadLimit
  ALLUXIO_JOB_MASTER_JAVA_OPTS: |-
    -Dalluxio.master.hostname=${ALLUXIO_MASTER_HOSTNAME}
  ALLUXIO_WORKER_JAVA_OPTS: |-
    -Dalluxio.worker.hostname=${ALLUXIO_WORKER_HOSTNAME} -Dalluxio.user.hostname=${ALLUXIO_WORKER_HOSTNAME} -Dalluxio.worker.rpc.port=29999 -Dalluxio.worker.web.port=30000 -Dalluxio.user.short.circuit.enabled=true -Dalluxio.worker.data.server.domain.socket.address=/opt/domain -Dalluxio.worker.data.server.domain.socket.as.uuid=true -Dalluxio.worker.container.hostname=${ALLUXIO_WORKER_CONTAINER_HOSTNAME} -Dalluxio.worker.ramdisk.size=20GB -Dalluxio.worker.tieredstore.levels=2 -Dalluxio.worker.tieredstore.level0.alias=MEM -Dalluxio.worker.tieredstore.level0.dirs.mediumtype=MEM -Dalluxio.worker.tieredstore.level0.dirs.path=/dev/shm -Dalluxio.worker.tieredstore.level0.dirs.quota=20GB -Dalluxio.worker.tieredstore.level0.watermark.high.ratio=0.95 -Dalluxio.worker.tieredstore.level0.watermark.low.ratio=0.7 -Dalluxio.worker.tieredstore.level1.alias=HDD -Dalluxio.worker.tieredstore.level1.dirs.mediumtype=HDD -Dalluxio.worker.tieredstore.level1.dirs.path=/mnt/alluxio-data -Dalluxio.worker.tieredstore.level1.dirs.quota=500GB -Dalluxio.worker.tieredstore.level1.watermark.high.ratio=0.95 -Dalluxio.worker.tieredstore.level1.watermark.low.ratio=0.7 -server -Xmx16G -XX:+UseG1GC -XX:G1HeapRegionSize=32M -XX:+UseGCOverheadLimit -XX:+ExplicitGCInvokesConcurrent -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/alluxio-runtime-data -XX:+ExitOnOutOfMemoryError
  ALLUXIO_PROXY_JAVA_OPTS: |-

  ALLUXIO_JOB_WORKER_JAVA_OPTS: |-
    -Dalluxio.worker.hostname=${ALLUXIO_WORKER_HOSTNAME} -Dalluxio.job.worker.rpc.port=30001 -Dalluxio.job.worker.data.port=30002 -Dalluxio.job.worker.web.port=30003 -Dalluxio.worker.network.async.cache.manager.threads.max=128
  ALLUXIO_FUSE_JAVA_OPTS: |-
    -Dalluxio.user.hostname=${ALLUXIO_CLIENT_HOSTNAME} -Dalluxio.fuse.mount.alluxio.path=/ -Dalluxio.fuse.mount.point=<nil> -Dalluxio.fuse.mount.options=allow_other -XX:MaxDirectMemorySize=2g -Dalluxio.user.streaming.data.timeout=120s
  ALLUXIO_WORKER_TIEREDSTORE_LEVEL0_DIRS_PATH: /dev/shm
  ALLUXIO_WORKER_TIEREDSTORE_LEVEL1_DIRS_PATH: /mnt/data

3. Reading data with SparkSQL

input: 58G of Parquet data. SQL: very simple, just select the data without any ETL, then write it back. output: 58G of Parquet data.

spark.sparkContext().hadoopConfiguration().set("fs.s3a.connection.ssl.enabled", "false");
spark.sparkContext().hadoopConfiguration().set("fs.s3a.connection.establish.timeout", "5000000");
spark.sparkContext().hadoopConfiguration().set("fs.s3a.connection.timeout", "200000000");
spark.sparkContext().hadoopConfiguration().set("fs.s3a.attempts.maximum","0");
spark.sparkContext().hadoopConfiguration().set("fs.s3a.threads.max","50");
spark.sparkContext().hadoopConfiguration().set("fs.s3a.max.total.tasks","2000");
spark.sparkContext().hadoopConfiguration().set("fs.s3a.paging.maximum","1000");
spark.sparkContext().hadoopConfiguration().set("fs.s3a.threads.keepalivetime","300");

Dataset<Row> delta01 = spark.read().parquet(inputPath.split(",")); // or: .parquet(inputPath)
delta01.registerTempTable("data_table");
spark.sql("REFRESH TABLE data_table");
Dataset<Row> sql = spark.sql(inputSql);
sql.take(10);
if (isWrite.equals("1")) {
    sql.write()
            .format("parquet")
            .mode("append")
            .option("mergeSchema", "true")
            .save(outputPath);
}
// Data processed; exit normally
spark.close();

4. Alluxio and Spark error messages

4.1 Spark errors

23/02/16 14:32:43 INFO TaskSetManager: Starting task 398.3 in stage 4.0 (TID 2094) (10.42.208.91, executor 5, partition 398, ANY, 5136 bytes) taskResourceAssignments Map()
23/02/16 14:32:43 WARN TaskSetManager: Lost task 409.1 in stage 4.0 (TID 2093) (10.42.208.91 executor 5): java.io.IOException: Failed to cache: io exception (GrpcDataWriter{request=type: ALLUXIO_BLOCK
id: 664075763712
tier: 0
medium_type: ""
pin_on_create: true
space_to_reserve: 67108864
, address=WorkerNetAddress{host=172.23.152.148, containerHost=10.42.208.85, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=/opt/domain/5dd53476-0047-4cd7-9f11-f704e3636c18, tieredIdentity=TieredIdentity(node=172.23.152.148, rack=null)}})
        at alluxio.client.file.AlluxioFileOutStream.handleCacheWriteException(AlluxioFileOutStream.java:323)
        at alluxio.client.file.AlluxioFileOutStream.writeInternal(AlluxioFileOutStream.java:286)
        at alluxio.client.file.AlluxioFileOutStream.write(AlluxioFileOutStream.java:240)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
        at java.base/java.io.DataOutputStream.write(Unknown Source)
        at java.base/java.io.FilterOutputStream.write(Unknown Source)
        at org.apache.parquet.hadoop.util.HadoopPositionOutputStream.write(HadoopPositionOutputStream.java:45)
        at org.apache.parquet.hadoop.ParquetFileWriter.start(ParquetFileWriter.java:394)
        at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:483)
        at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:420)
        at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:409)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:36)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:150)
        at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:161)
        at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:146)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:290)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$16(FileFormatWriter.scala:229)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
Caused by: alluxio.exception.status.UnavailableException: io exception (GrpcDataWriter{request=type: ALLUXIO_BLOCK
id: 664075763712
tier: 0
medium_type: ""
pin_on_create: true
space_to_reserve: 67108864
, address=WorkerNetAddress{host=172.23.152.148, containerHost=10.42.208.85, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=/opt/domain/5dd53476-0047-4cd7-9f11-f704e3636c18, tieredIdentity=TieredIdentity(node=172.23.152.148, rack=null)}})
        at alluxio.exception.status.AlluxioStatusException.from(AlluxioStatusException.java:161)
        at alluxio.client.block.stream.GrpcBlockingStream.toAlluxioStatusException(GrpcBlockingStream.java:303)
        at alluxio.client.block.stream.GrpcBlockingStream.checkError(GrpcBlockingStream.java:284)
        at alluxio.client.block.stream.GrpcBlockingStream.send(GrpcBlockingStream.java:105)
        at alluxio.client.block.stream.GrpcDataWriter.<init>(GrpcDataWriter.java:171)
        at alluxio.client.block.stream.GrpcDataWriter.create(GrpcDataWriter.java:92)
        at alluxio.client.block.stream.DataWriter$Factory.create(DataWriter.java:94)
        at alluxio.client.block.BlockStoreClient.getOutStream(BlockStoreClient.java:288)
        at alluxio.client.block.BlockStoreClient.getOutStream(BlockStoreClient.java:332)
        at alluxio.client.file.AlluxioFileOutStream.getNextBlock(AlluxioFileOutStream.java:307)
        at alluxio.client.file.AlluxioFileOutStream.writeInternal(AlluxioFileOutStream.java:273)
        ... 23 more
Caused by: alluxio.shaded.client.io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
        at alluxio.shaded.client.io.grpc.Status.asRuntimeException(Status.java:535)
        at alluxio.shaded.client.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:478)
        at alluxio.shaded.client.io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:553)
        at alluxio.shaded.client.io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:68)
        at alluxio.shaded.client.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:739)
        at alluxio.shaded.client.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:718)
        at alluxio.shaded.client.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at alluxio.shaded.client.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        ... 3 more
Caused by: alluxio.shaded.client.io.netty.channel.AbstractChannel$AnnotatedConnectException: connect(..) failed: Connection refused: /opt/domain/5dd53476-0047-4cd7-9f11-f704e3636c18
Caused by: java.net.ConnectException: connect(..) failed: Connection refused
        at alluxio.shaded.client.io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)
        at alluxio.shaded.client.io.netty.channel.unix.Socket.connect(Socket.java:239)
        at alluxio.shaded.client.io.netty.channel.epoll.AbstractEpollChannel.doConnect0(AbstractEpollChannel.java:735)
        at alluxio.shaded.client.io.netty.channel.epoll.AbstractEpollChannel.doConnect(AbstractEpollChannel.java:720)
        at alluxio.shaded.client.io.netty.channel.epoll.EpollDomainSocketChannel.doConnect(EpollDomainSocketChannel.java:87)
        at alluxio.shaded.client.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.connect(AbstractEpollChannel.java:562)
        at alluxio.shaded.client.io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1342)
        at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:548)
        at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:533)
        at alluxio.shaded.client.io.netty.channel.ChannelDuplexHandler.connect(ChannelDuplexHandler.java:54)
        at alluxio.shaded.client.io.grpc.netty.WriteBufferingAndExceptionHandler.connect(WriteBufferingAndExceptionHandler.java:155)
        at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:548)
        at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.access$1000(AbstractChannelHandlerContext.java:61)
        at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext$9.run(AbstractChannelHandlerContext.java:538)
        at alluxio.shaded.client.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
        at alluxio.shaded.client.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
        at alluxio.shaded.client.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
        at alluxio.shaded.client.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at alluxio.shaded.client.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at java.base/java.lang.Thread.run(Unknown Source)
23/02/16 14:33:06 INFO ShutdownHookManager: Shutdown hook called
23/02/16 14:33:06 INFO ShutdownHookManager: Deleting directory /tmp/spark-b3ec6a15-3ffb-46db-ba8c-e0e3f2e05109
23/02/16 14:33:06 INFO ShutdownHookManager: Deleting directory /var/data/spark-f11de609-0541-4b6c-b48d-efcf47987e39/spark-f99d372d-ebf9-4825-aa00-621d2280c11f

4.2 Alluxio errors

2023-02-16 06:32:45,604 WARN  BlockReadHandler - Exception occurred while reading data for read request BlockReadRequest{chunkSize=1048576, end=40535408, id=2483027969, openUfsBlockOptions=ufs_path: "s3://alluxio/s3b/http_log/testsmall/dt=2022-07-21/part-00015-0f14585a-bf96-4d14-961a-06e6958ee708.c000.snappy.parquet"
offset_in_file: 67108864
block_size: 40535408
maxUfsReadConcurrency: 2147483647
mountId: 1
no_cache: false
, promote=false, sessionId=8095197842126660955, start=40535400, positionShort=false}. session 8095197842126660955: alluxio.underfs.s3a.AlluxioS3Exception: Failed to open key: s3b/http_log/testsmall/dt=2022-07-21/part-00015-0f14585a-bf96-4d14-961a-06e6958ee708.c000.snappy.parquet bucket: alluxio attempts: 1 error: key??? (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: tx0000000000000002bb0b8-0063edcd1b-40d6bca-default; S3 Extended Request ID: 40d6bca-default-default; Proxy: null)
2023-02-16 06:32:52,783 WARN  BlockReadHandler - Exception occurred while reading data for read request BlockReadRequest{chunkSize=1048576, end=64537603, id=10032775168, openUfsBlockOptions=ufs_path: "s3://alluxio/s3b/http_log/testsmall/dt=2022-07-21/part-00015-b2ab3ec1-6de9-405a-b028-80607fef5604.c000.snappy.parquet"
offset_in_file: 0
block_size: 64537603
maxUfsReadConcurrency: 2147483647
mountId: 1
no_cache: false
, promote=false, sessionId=1807945275015254374, start=64537595, positionShort=false}. session 1807945275015254374: alluxio.underfs.s3a.AlluxioS3Exception: Failed to open key: s3b/http_log/testsmall/dt=2022-07-21/part-00015-b2ab3ec1-6de9-405a-b028-80607fef5604.c000.snappy.parquet bucket: alluxio attempts: 1 error: key??? (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: tx0000000000000002bb601-0063edcd22-40d6bca-default; S3 Extended Request ID: 40d6bca-default-default; Proxy: null)
2023-02-16 06:32:56,163 WARN  BlockReadHandler - Exception occurred while reading data for read request BlockReadRequest{chunkSize=1048576, end=27040088, id=6509559808, openUfsBlockOptions=ufs_path: "s3://alluxio/s3b/http_log/testsmall/dt=2022-07-21/part-00015-d04ed4e7-fc34-4da9-9c7f-c1cda88cb8b6.c000.snappy.parquet"
offset_in_file: 0
block_size: 27040088
maxUfsReadConcurrency: 2147483647
mountId: 1
no_cache: false
, promote=false, sessionId=2363135523617118145, start=27040080, positionShort=false}. session 2363135523617118145: alluxio.underfs.s3a.AlluxioS3Exception: Failed to open key: s3b/http_log/testsmall/dt=2022-07-21/part-00015-d04ed4e7-fc34-4da9-9c7f-c1cda88cb8b6.c000.snappy.parquet bucket: alluxio attempts: 1 error: key??? (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: tx0000000000000002bb6cf-0063edcd26-40d6bca-default; S3 Extended Request ID: 40d6bca-default-default; Proxy: null)
2023-02-16 06:32:56,264 WARN  ServerBootstrap - Unknown channel option 'SO_KEEPALIVE' for channel '[id: 0xaec5a145]'

5. Help requested

We have investigated from multiple angles: insufficient Pod memory, insufficient Alluxio cache space, problems with the underlying storage (switching to MinIO produced the same errors), a JVM out-of-memory condition, and incorrect timeout settings. None of these resolved the connection timeout / connection refused errors above, and after too many retries Spark eventually shuts down.

We hope the community can help resolve these issues.

dbw9580 commented 1 year ago

The client got a connection refused error when trying to reach the worker via the Unix domain socket. The error means no server was listening on that address. Possible reasons are:

  1. The client and the worker see the socket file path in different filesystem namespaces. Make sure the containers are configured so that they actually share the same /opt/domain directory (see the sketch after this list).
  2. The worker crashed and restarted, so the socket path changed, but the client was trying to connect using a cached, expired path. If this is the case, try turning off UUIDs as socket file names, and hardcode the path in both the workers' and the clients' configuration.
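
For case 1, a minimal sketch of what "sharing the same /opt/domain" looks like in practice, assuming a hostPath volume; the pod name, volume name, host path, and image below are illustrative, not taken from this deployment:

```yaml
# Hypothetical sketch: the Spark pod mounts the same hostPath directory that
# the Alluxio worker uses for its domain socket, so both sides see one socket file.
apiVersion: v1
kind: Pod
metadata:
  name: spark-executor-example        # illustrative name
spec:
  containers:
    - name: spark-executor
      image: my-spark-image:latest    # placeholder image
      volumeMounts:
        - name: alluxio-domain
          mountPath: /opt/domain      # same mount path as the Alluxio worker
  volumes:
    - name: alluxio-domain
      hostPath:
        path: /tmp/alluxio-domain     # must match the worker's hostPath on each node
        type: DirectoryOrCreate
```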
lingchuanzhou commented 1 year ago

I confirmed the first case: the files seen inside each worker are exactly the same as the files under /tmp on the host. It is the second case: the worker did indeed restart. Do you mean I should stop using the UUID, or disable short circuit entirely? I have tried disabling short circuit, but it didn't help. Small batches of data are fine: 30G runs smoothly, but once it approaches 50G the errors start. Also, I use two namespaces: Spark and Alluxio each live in their own namespace, using hostPath mode. Does your usage require the Spark and Alluxio pods to be in the same namespace?

lingchuanzhou commented 1 year ago

Do you mean this?

-Dalluxio.worker.data.server.domain.socket.address=/opt/domain/domainsocket -Dalluxio.worker.data.server.domain.socket.as.uuid=false

I have tried it, but it still doesn't work,

and it reports this error:

2023-02-17 09:32:09,791 ERROR GrpcDataServer - Alluxio worker gRPC server failed to start on /opt/domain/domainsocket
java.io.IOException: Failed to bind to address /opt/domain/domainsocket
        at io.grpc.netty.NettyServer.start(NettyServer.java:328)
        at io.grpc.internal.ServerImpl.start(ServerImpl.java:179)
        at io.grpc.internal.ServerImpl.start(ServerImpl.java:90)
        at alluxio.retry.RetryUtils.retry(RetryUtils.java:44)
        at alluxio.grpc.GrpcServer.start(GrpcServer.java:77)
        at alluxio.worker.grpc.GrpcDataServer.<init>(GrpcDataServer.java:112)
        at alluxio.worker.AlluxioWorkerProcess.<init>(AlluxioWorkerProcess.java:148)
        at alluxio.worker.WorkerProcess$Factory.create(WorkerProcess.java:45)
        at alluxio.worker.WorkerProcess$Factory.create(WorkerProcess.java:37)
        at alluxio.worker.AlluxioWorker.main(AlluxioWorker.java:70)
Caused by: io.netty.channel.unix.Errors$NativeIoException: bind(..) failed: Address already in use
2023-02-17 09:32:09,794 ERROR AlluxioWorker - Fatal error: Failed to create worker process
java.lang.RuntimeException: java.lang.RuntimeException: Alluxio worker gRPC server failed to start on /opt/domain/domainsocket
        at alluxio.worker.AlluxioWorkerProcess.<init>(AlluxioWorkerProcess.java:154)
        at alluxio.worker.WorkerProcess$Factory.create(WorkerProcess.java:45)
        at alluxio.worker.WorkerProcess$Factory.create(WorkerProcess.java:37)
        at alluxio.worker.AlluxioWorker.main(AlluxioWorker.java:70)
Caused by: java.lang.RuntimeException: Alluxio worker gRPC server failed to start on /opt/domain/domainsocket
        at alluxio.worker.grpc.GrpcDataServer.<init>(GrpcDataServer.java:117)
        at alluxio.worker.AlluxioWorkerProcess.<init>(AlluxioWorkerProcess.java:148)
        ... 3 more
Caused by: java.io.IOException: Failed to bind to address /opt/domain/domainsocket
        at io.grpc.netty.NettyServer.start(NettyServer.java:328)
        at io.grpc.internal.ServerImpl.start(ServerImpl.java:179)
        at io.grpc.internal.ServerImpl.start(ServerImpl.java:90)
        at alluxio.retry.RetryUtils.retry(RetryUtils.java:44)
        at alluxio.grpc.GrpcServer.start(GrpcServer.java:77)
        at alluxio.worker.grpc.GrpcDataServer.<init>(GrpcDataServer.java:112)
        ... 4 more
Caused by: io.netty.channel.unix.Errors$NativeIoException: bind(..) failed: Address already in use

What should I do?
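
For context, "bind(..) failed: Address already in use" on a Unix domain socket usually means a stale socket file from the previous (crashed) worker process is still on disk at that path. One hedged way to clear it is sketched below as a hypothetical initContainer excerpt of the worker pod spec; the container name, image, and volume name are illustrative:

```yaml
# Hypothetical sketch: remove a stale domain socket file before the worker starts,
# so the gRPC server can bind to the same path again after a crash/restart.
initContainers:
  - name: clean-domain-socket
    image: busybox:1.36
    command: ["sh", "-c", "rm -f /opt/domain/domainsocket"]
    volumeMounts:
      - name: alluxio-domain          # same volume the worker mounts at /opt/domain
        mountPath: /opt/domain
```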

ssz1997 commented 1 year ago

Hi @lingchuanzhou, I'm seeing multiple problems here:

  1. When using the domain socket, the worker fails to start, so the client cannot connect.
  2. When short circuit is turned off, small datasets work, but large datasets don't.

For the first problem: to use a domain socket, Alluxio needs read and write permission on the path, in this case /opt/domain. /opt/domain is created by k8s and is normally not writable by Alluxio. This is why you are seeing "Alluxio worker gRPC server failed to start on /opt/domain/domainsocket: java.io.IOException: Failed to bind to address /opt/domain/domainsocket". Keep in mind that domain sockets only work well when the application (Spark in your case) and Alluxio are on the same host machine, AND Spark is configured with the Alluxio domain socket as well (a hedged client-side sketch follows). More details here: https://docs.alluxio.io/os/user/stable/en/kubernetes/Spark-On-Kubernetes.html#short-circuit-operations
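
As a hedged illustration of the "Spark needs to configure the Alluxio domain socket as well" point: the client-side properties would be passed through Spark's JVM options, roughly as below. The sparkConf layout is a hypothetical SparkApplication-style snippet; the property values mirror the worker settings already shown in this thread.

```yaml
# Hypothetical snippet: the Alluxio client inside Spark must point at the same
# domain socket path the worker advertises.
sparkConf:
  spark.driver.extraJavaOptions: >-
    -Dalluxio.user.short.circuit.enabled=true
    -Dalluxio.worker.data.server.domain.socket.address=/opt/domain
    -Dalluxio.worker.data.server.domain.socket.as.uuid=true
  spark.executor.extraJavaOptions: >-
    -Dalluxio.user.short.circuit.enabled=true
    -Dalluxio.worker.data.server.domain.socket.address=/opt/domain
    -Dalluxio.worker.data.server.domain.socket.as.uuid=true
```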

For the second problem: are you allocating 12 cores and 30G of memory to each worker? Also, I think 2 cores and 6G of memory for the master is too little. We recommend at least 4 cores and 8G of memory for the master; a values-style sketch follows. You will definitely need more resources if you have a heavy workload.
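
A hedged Helm-values-style sketch of that minimum master sizing; the exact field layout depends on the chart you deploy with, and the limits shown are illustrative:

```yaml
# Hypothetical values snippet reflecting the "at least 4 cores / 8G" guidance above.
master:
  resources:
    requests:
      cpu: "4"
      memory: 8Gi
    limits:
      cpu: "8"
      memory: 16Gi
```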

Regarding alluxio.worker.data.server.domain.socket.as.uuid, you don't have to change its value.

I have another question for you: did you try launching Spark and Alluxio in different namespaces? Did it work? If not, what was the error?

lingchuanzhou commented 1 year ago

@ssz1997 OK, on the first problem:

1. When using the domain socket, the worker fails to start, so the client cannot connect. But I checked the permissions and there is no problem (drwxrwxrwx). It happens on restart, not on first start.

[root@aip-worker01 ~]# ll /tmp/
total 1240
drwxrwxrwx 2 root root 114 Feb 18 03:22 alluxio-domain

I have 5 workers; only 2 workers don't work, 3 workers are running. For the two that don't work, the permissions are also fine (drwxrwxrwx):

[root@aip-master01 singleMaster-localJournal]# sh get-alluxio.sh
NAME                   READY   STATUS             RESTARTS   AGE   IP              NODE           NOMINATED NODE   READINESS GATES
alluxio-master-0       2/2     Running            0          25m   10.42.186.209   aip-worker05   <none>           <none>
alluxio-worker-bwz7r   1/2     CrashLoopBackOff   7          14m   10.42.19.233    aip-worker01   <none>           <none>
alluxio-worker-pm558   2/2     Running            6          15m   10.42.208.90    aip-worker03   <none>           <none>
alluxio-worker-qnx66   1/2     CrashLoopBackOff   7          16m   10.42.186.195   aip-worker05   <none>           <none>
alluxio-worker-rxgxn   2/2     Running            2          15m   10.42.165.188   aip-worker04   <none>           <none>
alluxio-worker-wc6mn   2/2     Running            2          15m   10.42.194.76    aip-worker02   <none>           <none>

I have another question for you: Did you try launch spark and Alluxio in different namespaces? Did it work? If not, what was the error?

Yes, I have launched them in different namespaces. It worked with a small dataset, for example 20G.

ssz1997 commented 1 year ago

All workers share the exact same configuration, so it shouldn't be the case that some work and some don't. You may need to clean up the environment before redeploying Alluxio.

To unblock you (to at least make it run), I would suggest turning off short circuit; a sketch of the config change follows. That avoids both the issue of the worker failing to start and the issue of the domain socket path changing on restart. Then we can focus on why it doesn't work for larger datasets. Please paste the complete worker log if there is any error.
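
Concretely, a hedged excerpt of the alluxio-config ConfigMap shown earlier, flipping the existing flag (only the relevant property is shown; keep the rest of ALLUXIO_JAVA_OPTS as is):

```yaml
# Hypothetical ConfigMap excerpt: disable short circuit so reads/writes go over
# the worker's gRPC port instead of the domain socket.
data:
  ALLUXIO_JAVA_OPTS: |-
    -Dalluxio.user.short.circuit.enabled=false
```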

How are you configuring Alluxio in your Spark job? In particular, I'm curious how you are setting alluxio.master.hostname for the Alluxio client in Spark.

lingchuanzhou commented 1 year ago

I expose the master via NodePort:

[root@aip-master01 singleMaster-localJournal]# kubectl get svc -n alluxio
NAME               TYPE       CLUSTER-IP    EXTERNAL-IP   PORT(S)                                                                                            AGE
alluxio-master-0   NodePort   10.43.96.56   <none>        19998:32700/TCP,19999:32701/TCP,20001:32702/TCP,20002:32703/TCP,19200:32704/TCP,20003:32705/TCP   84s

and I access Alluxio data via alluxio://IP:32700.
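
For reference, a hedged sketch of how the client side would be wired to that NodePort. The fs.alluxio.impl key and the alluxio.hadoop.FileSystem client class are standard Alluxio client settings; the sparkConf layout is a hypothetical SparkApplication-style snippet and <node-ip> is a placeholder for any k8s node IP:

```yaml
# Hypothetical snippet: register the Alluxio Hadoop client and point it at the
# NodePort-exposed master (32700 maps to the master RPC port 19998 above).
sparkConf:
  spark.hadoop.fs.alluxio.impl: alluxio.hadoop.FileSystem
  spark.driver.extraJavaOptions: >-
    -Dalluxio.master.hostname=<node-ip>
    -Dalluxio.master.rpc.port=32700
  spark.executor.extraJavaOptions: >-
    -Dalluxio.master.hostname=<node-ip>
    -Dalluxio.master.rpc.port=32700
```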

@ssz1997 how about coming to our Alluxio k8s environment to take a look?

lingchuanzhou commented 1 year ago

I have a suggestion: there should be a complete example of how to use Alluxio from a different namespace with the alluxio:// protocol, rather than just the LICENSE; that's too simple. The docs should give details or some best practices for Spark on Kubernetes.

lingchuanzhou commented 1 year ago

This is my worker log:

2023-02-20 03:50:50,145 INFO GrpcDataServer - Alluxio worker gRPC server started, listening on /0.0.0.0:29999
2023-02-20 03:50:50,145 INFO AlluxioWorkerProcess - Domain socket data server is enabled at /opt/domain/domainsocket.
2023-02-20 03:50:50,147 WARN InstancedConfiguration - alluxio.user.ufs.block.read.location.policy.cache.size is deprecated. Please avoid using this key in the future. CapacityBaseRandomPolicy no longer caches block locations. To make sure a block is always assigned to the same worker, use DeterministicHashPolicy.
2023-02-20 03:50:50,147 WARN InstancedConfiguration - alluxio.user.ufs.block.read.location.policy.cache.expiration.time is deprecated. Please avoid using this key in the future. CapacityBaseRandomPolicy no longer caches block locations. To make sure a block is always assigned to the same worker, use DeterministicHashPolicy.
2023-02-20 03:50:50,150 WARN RetryUtils - Failed to Starting gRPC server (attempt 1): java.io.IOException: Failed to bind to address /opt/domain/domainsocket
2023-02-20 03:50:50,177 WARN InstancedConfiguration - alluxio.user.ufs.block.read.location.policy.cache.size is deprecated. Please avoid using this key in the future. CapacityBaseRandomPolicy no longer caches block locations. To make sure a block is always assigned to the same worker, use DeterministicHashPolicy.
2023-02-20 03:50:50,177 WARN InstancedConfiguration - alluxio.user.ufs.block.read.location.policy.cache.expiration.time is deprecated. Please avoid using this key in the future. CapacityBaseRandomPolicy no longer caches block locations. To make sure a block is always assigned to the same worker, use DeterministicHashPolicy.
2023-02-20 03:50:50,453 WARN RetryUtils - Failed to Starting gRPC server (attempt 2): java.io.IOException: Failed to bind to address /opt/domain/domainsocket
2023-02-20 03:50:50,954 WARN RetryUtils - Failed to Starting gRPC server (attempt 3): java.io.IOException: Failed to bind to address /opt/domain/domainsocket
2023-02-20 03:50:51,455 WARN RetryUtils - Failed to Starting gRPC server (attempt 4): java.io.IOException: Failed to bind to address /opt/domain/domainsocket
2023-02-20 03:50:51,956 WARN RetryUtils - Failed to Starting gRPC server (attempt 5): java.io.IOException: Failed to bind to address /opt/domain/domainsocket
2023-02-20 03:50:52,457 WARN RetryUtils - Failed to Starting gRPC server (attempt 6): java.io.IOException: Failed to bind to address /opt/domain/domainsocket
2023-02-20 03:50:52,458 ERROR GrpcDataServer - Alluxio worker gRPC server failed to start on /opt/domain/domainsocket
java.io.IOException: Failed to bind to address /opt/domain/domainsocket
        at io.grpc.netty.NettyServer.start(NettyServer.java:328)
        at io.grpc.internal.ServerImpl.start(ServerImpl.java:179)
        at io.grpc.internal.ServerImpl.start(ServerImpl.java:90)
        at alluxio.retry.RetryUtils.retry(RetryUtils.java:44)
        at alluxio.grpc.GrpcServer.start(GrpcServer.java:77)
        at alluxio.worker.grpc.GrpcDataServer.<init>(GrpcDataServer.java:112)
        at alluxio.worker.AlluxioWorkerProcess.<init>(AlluxioWorkerProcess.java:148)
        at alluxio.worker.WorkerProcess$Factory.create(WorkerProcess.java:45)
        at alluxio.worker.WorkerProcess$Factory.create(WorkerProcess.java:37)
        at alluxio.worker.AlluxioWorker.main(AlluxioWorker.java:70)
Caused by: io.netty.channel.unix.Errors$NativeIoException: bind(..) failed: Address already in use
2023-02-20 03:50:52,461 ERROR AlluxioWorker - Fatal error: Failed to create worker process
java.lang.RuntimeException: java.lang.RuntimeException: Alluxio worker gRPC server failed to start on /opt/domain/domainsocket
        at alluxio.worker.AlluxioWorkerProcess.<init>(AlluxioWorkerProcess.java:154)
        at alluxio.worker.WorkerProcess$Factory.create(WorkerProcess.java:45)
        at alluxio.worker.WorkerProcess$Factory.create(WorkerProcess.java:37)
        at alluxio.worker.AlluxioWorker.main(AlluxioWorker.java:70)
Caused by: java.lang.RuntimeException: Alluxio worker gRPC server failed to start on /opt/domain/domainsocket
        at alluxio.worker.grpc.GrpcDataServer.<init>(GrpcDataServer.java:117)
        at alluxio.worker.AlluxioWorkerProcess.<init>(AlluxioWorkerProcess.java:148)
        ... 3 more
Caused by: java.io.IOException: Failed to bind to address /opt/domain/domainsocket
        at io.grpc.netty.NettyServer.start(NettyServer.java:328)
        at io.grpc.internal.ServerImpl.start(ServerImpl.java:179)
        at io.grpc.internal.ServerImpl.start(ServerImpl.java:90)
        at alluxio.retry.RetryUtils.retry(RetryUtils.java:44)
        at alluxio.grpc.GrpcServer.start(GrpcServer.java:77)
        at alluxio.worker.grpc.GrpcDataServer.<init>(GrpcDataServer.java:112)
        ... 4 more
Caused by: io.netty.channel.unix.Errors$NativeIoException: bind(..) failed: Address already in use
2023-02-20 03:50:52,483 WARN InstancedConfiguration - alluxio.user.ufs.block.read.location.policy.cache.size is deprecated. Please avoid using this key in the future. CapacityBaseRandomPolicy no longer caches block locations. To make sure a block is always assigned to the same worker, use DeterministicHashPolicy.
2023-02-20 03:50:52,483 WARN InstancedConfiguration - alluxio.user.ufs.block.read.location.policy.cache.expiration.time is deprecated. Please avoid using this key in the future. CapacityBaseRandomPolicy no longer caches block locations. To make sure a block is always assigned to the same worker, use DeterministicHashPolicy.
Heap
 garbage-first heap   total 262144K, used 78553K [0x00000003c0000000, 0x00000003c2000040, 0x00000007c0000000)
  region size 32768K, 4 young (131072K), 1 survivors (32768K)
 Metaspace       used 29918K, capacity 30694K, committed 30972K, reserved 1077248K
  class space    used 3539K, capacity 3721K, committed 3836K, reserved 1048576K

ssz1997 commented 1 year ago

Thanks for the suggestion. The log shows that you have not turned off the domain socket. To do so, set alluxio.user.short.circuit.enabled to false.

lingchuanzhou commented 1 year ago

OK, I have set alluxio.user.short.circuit.enabled to false, and it reports another error. @ssz1997

23/02/22 10:52:14 WARN TaskSetManager: Lost task 273.0 in stage 4.0 (TID 1826) (10.42.165.137 executor 1): java.io.IOException: Failed to cache: io exception (GrpcDataWriter{request=type: ALLUXIO_BLOCK
id: 796112453632
tier: 0
medium_type: ""
pin_on_create: true
space_to_reserve: 67108864
, address=WorkerNetAddress{host=172.23.152.146, containerHost=10.42.19.240, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=/opt/domain/domainsocket, tieredIdentity=TieredIdentity(node=172.23.152.146, rack=null)}})
        at alluxio.client.file.AlluxioFileOutStream.handleCacheWriteException(AlluxioFileOutStream.java:323)
        at alluxio.client.file.AlluxioFileOutStream.writeInternal(AlluxioFileOutStream.java:286)
        at alluxio.client.file.AlluxioFileOutStream.write(AlluxioFileOutStream.java:240)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
        at java.base/java.io.DataOutputStream.write(Unknown Source)
        at java.base/java.io.FilterOutputStream.write(Unknown Source)
        at org.apache.parquet.hadoop.util.HadoopPositionOutputStream.write(HadoopPositionOutputStream.java:45)
        at org.apache.parquet.hadoop.ParquetFileWriter.start(ParquetFileWriter.java:394)
        at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:483)
        at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:420)
        at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:409)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:36)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:150)
        at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:161)
        at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:146)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:290)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$16(FileFormatWriter.scala:229)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
Caused by: alluxio.exception.status.UnavailableException: io exception (GrpcDataWriter{request=type: ALLUXIO_BLOCK
id: 796112453632
tier: 0
medium_type: ""
pin_on_create: true
space_to_reserve: 67108864
, address=WorkerNetAddress{host=172.23.152.146, containerHost=10.42.19.240, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=/opt/domain/domainsocket, tieredIdentity=TieredIdentity(node=172.23.152.146, rack=null)}})
        at alluxio.exception.status.AlluxioStatusException.from(AlluxioStatusException.java:161)
        at alluxio.client.block.stream.GrpcBlockingStream.toAlluxioStatusException(GrpcBlockingStream.java:303)
        at alluxio.client.block.stream.GrpcBlockingStream.checkError(GrpcBlockingStream.java:284)
        at alluxio.client.block.stream.GrpcBlockingStream.send(GrpcBlockingStream.java:105)
        at alluxio.client.block.stream.GrpcDataWriter.<init>(GrpcDataWriter.java:171)
        at alluxio.client.block.stream.GrpcDataWriter.create(GrpcDataWriter.java:92)
        at alluxio.client.block.stream.DataWriter$Factory.create(DataWriter.java:94)
        at alluxio.client.block.BlockStoreClient.getOutStream(BlockStoreClient.java:288)
        at alluxio.client.block.BlockStoreClient.getOutStream(BlockStoreClient.java:332)
        at alluxio.client.file.AlluxioFileOutStream.getNextBlock(AlluxioFileOutStream.java:307)
        at alluxio.client.file.AlluxioFileOutStream.writeInternal(AlluxioFileOutStream.java:273)
        ... 23 more
Caused by: alluxio.shaded.client.io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
        at alluxio.shaded.client.io.grpc.Status.asRuntimeException(Status.java:535)
        at alluxio.shaded.client.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:478)
        at alluxio.shaded.client.io.grpc.internal.DelayedClientCall$DelayedListener$3.run(DelayedClientCall.java:463)
        at alluxio.shaded.client.io.grpc.internal.DelayedClientCall$DelayedListener.delayOrExecute(DelayedClientCall.java:427)
        at alluxio.shaded.client.io.grpc.internal.DelayedClientCall$DelayedListener.onClose(DelayedClientCall.java:460)
        at alluxio.shaded.client.io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:553)
        at alluxio.shaded.client.io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:68)
        at alluxio.shaded.client.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:739)
        at alluxio.shaded.client.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:718)
        at alluxio.shaded.client.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at alluxio.shaded.client.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        ... 3 more
Caused by: alluxio.shaded.client.io.netty.channel.ConnectTimeoutException: connection timed out: /10.42.19.240:29999
        at alluxio.shaded.client.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$2.run(AbstractEpollChannel.java:575)
        at alluxio.shaded.client.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
        at alluxio.shaded.client.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
        at alluxio.shaded.client.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
        at alluxio.shaded.client.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
        at alluxio.shaded.client.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
        at alluxio.shaded.client.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at alluxio.shaded.client.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        ... 1 more

ssz1997 commented 1 year ago

@lingchuanzhou Could you show the worker log as well? The one referenced inside the client log, whose pod IP is 10.42.19.240. The reason the block was not cached is not passed back to Spark.

lingchuanzhou commented 1 year ago

The pod whose IP is 10.42.19.240 is dead, and the newly started pod reports this error: @ssz1997

2023-02-23 06:32:32,983 WARN BlockReadHandler - Exception occurred while reading data for read request BlockReadRequest{chunkSize=1048576, end=50399806, id=7600078848, openUfsBlockOptions=ufs_path: "s3://alluxio/s3b/http_log/testsmall/dt=2022-07-21/part-00015-9f38d065-aa26-46a6-b2bd-c1b650d5f67a.c000.snappy.parquet"
offset_in_file: 0
block_size: 50399806
maxUfsReadConcurrency: 2147483647
mountId: 1
no_cache: false
, promote=false, sessionId=2586241558463012532, start=50399798, positionShort=false}. session 2586241558463012532: alluxio.underfs.s3a.AlluxioS3Exception: Failed to open key: s3b/http_log/testsmall/dt=2022-07-21/part-00015-9f38d065-aa26-46a6-b2bd-c1b650d5f67a.c000.snappy.parquet bucket: alluxio attempts: 1 error: key??? (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: tx00000000000000104ed6b-0063f70786-429f2f9-default; S3 Extended Request ID: 429f2f9-default-default; Proxy: null)
2023-02-23 06:32:59,907 WARN BlockReadHandler - Exception occurred while reading data for read request BlockReadRequest{chunkSize=1048576, end=6853983, id=5486149635, openUfsBlockOptions=ufs_path: "s3://alluxio/s3b/http_log/testsmall/dt=2022-07-21/part-00014-c78c081a-f88e-49c9-a78e-5a24e0bba902.c000.snappy.parquet"
offset_in_file: 201326592
block_size: 6853983
maxUfsReadConcurrency: 2147483647
mountId: 1
no_cache: false
, promote=false, sessionId=6666877754772567979, start=6853975, positionShort=false}. session 6666877754772567979: alluxio.underfs.s3a.AlluxioS3Exception: Failed to open key: s3b/http_log/testsmall/dt=2022-07-21/part-00014-c78c081a-f88e-49c9-a78e-5a24e0bba902.c000.snappy.parquet bucket: alluxio attempts: 1 error: key??? (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: tx000000000000001050d00-0063f707a1-429f2f9-default; S3 Extended Request ID: 429f2f9-default-default; Proxy: null)

ssz1997 commented 1 year ago

Please don't set any value for alluxio.worker.data.server.domain.socket.address, on either Alluxio or the Alluxio client, and try again. Thanks. Previously we hadn't completely turned the domain socket off.

Meanwhile, please join the Alluxio community Slack and ping Jasmine Wang (Alluxio community) and Shouwei Chen. We would love to learn more about your use case and see how we can help you better.

lingchuanzhou commented 1 year ago

Yes, I haven't set any value for alluxio.worker.data.server.domain.socket.address, on either Alluxio or the Alluxio client. This is my Alluxio config: -Dalluxio.master.hostname=alluxio-master-0 -Dalluxio.master.journal.type=UFS -Dalluxio.master.journal.folder=/journal -Dalluxio.security.stale.channel.purge.interval=365d -Dalluxio.master.mount.table.root.ufs=s3://alluxio/s3b -Dalluxio.underfs.s3.endpoint=http://10.72.1.226:12001 -Dalluxio.underfs.s3.disable.dns.buckets=true -Dalluxio.underfs.s3.inherit.acl=false -Ds3a.accessKeyId=xxxxx -Ds3a.secretKey=xxxx -Dalluxio.security.authentication.type=NOSASL -Dalluxio.security.authorization.permission.enabled=false -Dalluxio.master.jvm.monitor.enabled=true -Dalluxio.user.short.circuit.enabled=false -Dalluxio.worker.data.server.domain.socket.as.uuid=false -Dalluxio.worker.network.async.cache.manager.threads.max=256 -Dalluxio.worker.network.block.reader.threads.max=64 -Dalluxio.worker.network.async.cache.manager.queue.max=1024 -Dalluxio.job.worker.threadpool.size=48 -Dalluxio.worker.jvm.monitor.enabled=true -Dalluxio.user.metrics.collection.enabled=true -Dalluxio.user.streaming.data.read.timeout=500s -Dalluxio.user.streaming.data.write.timeout=500s -Dalluxio.user.streaming.reader.close.timeout=500s -Dalluxio.user.streaming.writer.close.timeout=500s -Dalluxio.user.streaming.writer.flush.timeout=600s -Dalluxio.user.file.writetype.default=ASYNC_THROUGH -Dalluxio.master.metadata.sync.executor.pool.size=12 -Dalluxio.worker.session.timeout=5min -Dalluxio.debug=true -Dalluxio.network.connection.health.check.timeout=60s -XX:+PrintGCDetails -XX:+PrintTenuringDistribution -XX:+PrintGCTimeStamps -Dalluxio.underfs.s3.request.timeout=5min -Dalluxio.underfs.s3.socket.timeout=500sec -Dalluxio.underfs.s3.upload.threads.max=64 -Dalluxio.master.metadata.sync.concurrency.level=12 -Dalluxio.master.metadata.sync.executor.pool.size=24 -Dalluxio.proxy.s3.header.metadata.max.size=1024KB -Dalluxio.worker.tieredstore.levels=2 -Dalluxio.worker.tieredstore.level0.alias=MEM -Dalluxio.worker.tieredstore.level0.dirs.mediumtype=MEM -Dalluxio.worker.tieredstore.level0.dirs.path=/dev/shm -Dalluxio.worker.tieredstore.level0.dirs.quota=20GB -Dalluxio.worker.tieredstore.level0.watermark.high.ratio=0.95 -Dalluxio.worker.tieredstore.level0.watermark.low.ratio=0.7 -Dalluxio.worker.tieredstore.level1.alias=HDD -Dalluxio.worker.tieredstore.level1.dirs.mediumtype=HDD -Dalluxio.worker.tieredstore.level1.dirs.path=/mnt/alluxio-data -Dalluxio.worker.tieredstore.level1.dirs.quota=500GB -Dalluxio.worker.tieredstore.level1.watermark.high.ratio=0.95 -Dalluxio.worker.tieredstore.level1.watermark.low.ratio=0.7

And this is my client config:

sparkConf:
  spark.kubernetes.authenticate.driver.serviceAccountName: spark-alluxio
  spark.eventLog.enabled: "true"
  spark.eventLog.dir: s3a://aip/spark-events/
  spark.history.fs.logDirectory: s3a://aip/spark-events/
  spark.hadoop.fs.s3a.access.key: xxxx
  spark.hadoop.fs.s3a.secret.key: xxxx
  spark.hadoop.fs.s3a.endpoint: http://xxxxx:12001
  spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
  spark.hadoop.fs.s3a.path.style.access: "true"
  spark.hadoop.fs.s3a.connection.ssl.enabled: "false"
  spark.hadoop.fs.s3a.connection.establish.timeout: "50000"
  spark.hadoop.fs.s3a.connection.timeout: "2000000"
  spark.hadoop.fs.s3a.threads.max: "100"
  spark.hadoop.fs.s3a.max.total.tasks: "5000"
  spark.hadoop.fs.s3a.paging.maximum: "2000"
  spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version: "2"
  spark.kryoserializer.buffer.max: 1024m
  spark.sql.caseSensitive: "false"
  spark.sql.cbo.enabled: "true"
  spark.sql.cbo.starSchemaDetection: "true"
  spark.sql.datetime.java8API.enabled: "false"
  spark.sql.sources.partitionOverwriteMode: dynamic
  spark.sql.adaptive.enabled: "true"
  spark.worker.timeout: "1000000"
  spark.dynamicAllocation.enabled: "false"
  spark.shuffle.service.enabled: "false"
  spark.shuffle.push.enabled: "false"
  spark.speculation: "true"
  spark.speculation.quantile: "0.9"
  spark.kubernetes.memoryOverheadFactor: "0.2"
  spark.kubernetes.trust.certificates: "true"
  spark.network.timeout: 1200s
  spark.shuffle.mapStatus.compression.codec: lz4
  spark.shuffle.io.maxRetries: "30"
  spark.shuffle.io.retryWait: 30s
  spark.sql.hive.convertMetastoreParquet: "false"
  spark.sql.legacy.allowNonEmptyLocationInCTAS: "true"

I have contacted Jasmine Wang (Alluxio community) and Shouwei Chen on Slack, thank you! @ssz1997

ssz1997 commented 1 year ago

I checked the code; theoretically, if you are not setting alluxio.worker.data.server.domain.socket.address, the worker should not use the domain socket. Can you show the worker log at startup?

lingchuanzhou commented 1 year ago

2023-03-03 02:55:21,034 INFO NettyUtils - EPOLL_MODE is available
2023-03-03 02:55:22,295 WARN RetryUtils - Failed to load cluster default configuration with master (attempt 1): alluxio.exception.status.UnavailableException: Failed to handshake with master alluxio-master-0:19998 to load cluster default configuration values: UNAVAILABLE: io exception
2023-03-03 02:55:25,420 WARN RetryUtils - Failed to load cluster default configuration with master (attempt 2): alluxio.exception.status.UnavailableException: Failed to handshake with master alluxio-master-0:19998 to load cluster default configuration values: UNAVAILABLE: io exception
2023-03-03 02:55:26,665 WARN RetryUtils - Failed to load cluster default configuration with master (attempt 3): alluxio.exception.status.UnavailableException: Failed to handshake with master alluxio-master-0:19998 to load cluster default configuration values: UNAVAILABLE: io exception
2023-03-03 02:55:28,087 WARN RetryUtils - Failed to load cluster default configuration with master (attempt 4): alluxio.exception.status.UnavailableException: Failed to handshake with master alluxio-master-0:19998 to load cluster default configuration values: UNAVAILABLE: io exception
2023-03-03 02:55:29,952 WARN RetryUtils - Failed to load cluster default configuration with master (attempt 5): alluxio.exception.status.UnavailableException: Failed to handshake with master alluxio-master-0:19998 to load cluster default configuration values: UNAVAILABLE: io exception
2023-03-03 02:55:32,646 WARN RetryUtils - Failed to load cluster default configuration with master (attempt 6): alluxio.exception.status.UnavailableException: Failed to handshake with master alluxio-master-0:19998 to load cluster default configuration values: UNAVAILABLE: io exception
2023-03-03 02:55:36,243 INFO TieredIdentityFactory - Initialized tiered identity TieredIdentity(node=172.23.152.149, rack=null)
2023-03-03 02:55:36,273 INFO BlockWorkerFactory - Creating DefaultBlockWorker
2023-03-03 02:55:36,327 INFO DefaultStorageDir - Folder /mnt/alluxio-data/alluxioworker was created!
2023-03-03 02:55:36,327 INFO DefaultStorageDir - StorageDir initialized: path=/mnt/alluxio-data/alluxioworker, tier=alluxio.worker.block.meta.DefaultStorageTier@202ace9f, dirIndex=0, medium=HDD, capacityBytes=536870912000, reservedBytes=1073741824, availableBytes=535797170176
2023-03-03 02:55:36,327 INFO DefaultStorageDir - Folder /dev/shm/alluxioworker was created!
2023-03-03 02:55:36,332 INFO DefaultStorageDir - StorageDir initialized: path=/dev/shm/alluxioworker, tier=alluxio.worker.block.meta.DefaultStorageTier@56853c71, dirIndex=0, medium=MEM, capacityBytes=21474836480, reservedBytes=1073741824, availableBytes=20401094656
2023-03-03 02:55:36,346 WARN DefaultStorageTier - Failed to verify memory capacity
2023-03-03 02:55:36,405 INFO MetricsSystem - Starting sinks with config: {}.
2023-03-03 02:55:36,407 INFO MetricsHeartbeatContext - Created metrics heartbeat with ID app-4789739802544735599. This ID will be used for identifying info from the client. It can be set manually through the alluxio.user.app.id property
2023-03-03 02:55:36,470 WARN InstancedConfiguration - alluxio.user.ufs.block.read.location.policy.cache.size is deprecated. Please avoid using this key in the future. CapacityBaseRandomPolicy no longer caches block locations. To make sure a block is always assigned to the same worker, use DeterministicHashPolicy.
2023-03-03 02:55:36,470 WARN InstancedConfiguration - alluxio.user.ufs.block.read.location.policy.cache.expiration.time is deprecated. Please avoid using this key in the future. CapacityBaseRandomPolicy no longer caches block locations. To make sure a block is always assigned to the same worker, use DeterministicHashPolicy.
2023-03-03 02:55:36,480 WARN InstancedConfiguration - alluxio.user.ufs.block.read.location.policy.cache.size is deprecated. Please avoid using this key in the future. CapacityBaseRandomPolicy no longer caches block locations. To make sure a block is always assigned to the same worker, use DeterministicHashPolicy.
2023-03-03 02:55:36,858 INFO BlockMapIterator - Worker register stream batchSize=1000000
2023-03-03 02:55:36,892 INFO RegisterStreamer - Worker 496881516526372173 - All requests have been sent. Completing the client side.
2023-03-03 02:55:36,892 INFO RegisterStreamer - Worker 496881516526372173 - Waiting on the master side to complete
2023-03-03 02:55:36,898 INFO RegisterStreamer - 496881516526372173 - Complete message received from the server. Closing stream
2023-03-03 02:55:36,899 INFO RegisterStreamer - Worker 496881516526372173 - Finished registration with a stream
2023-03-03 02:55:36,903 INFO WebServer - Alluxio worker web service starting @ /0.0.0.0:30000
2023-03-03 02:55:36,906 INFO Server - jetty-9.4.46.v20220331; built: 2022-03-31T16:38:08.030Z; git: bc17a0369a11ecf40bb92c839b9ef0a8ac50ea18; jvm 1.8.0_352-b08
2023-03-03 02:55:36,935 INFO ContextHandler - Started o.e.j.s.ServletContextHandler@4504d271{/metrics/json,null,AVAILABLE}
2023-03-03 02:55:36,935 INFO ContextHandler - Started o.e.j.s.ServletContextHandler@207b8649{/metrics/prometheus,null,AVAILABLE}
2023-03-03 02:55:36,937 WARN SecurityHandler - ServletContext@o.e.j.s.ServletContextHandler@7e928e2f{/,file:///opt/alluxio-2.9.1/webui/worker/build/,STARTING} has uncovered http methods for path: /
2023-03-03 02:55:46,221 INFO ContextHandler - Started o.e.j.s.ServletContextHandler@7e928e2f{/,file:///opt/alluxio-2.9.1/webui/worker/build/,AVAILABLE}
2023-03-03 02:55:46,231 INFO AbstractConnector - Started ServerConnector@54eb2b70{HTTP/1.1, (http/1.1)}{0.0.0.0:30000}
2023-03-03 02:55:46,231 INFO Server - Started @25731ms
2023-03-03 02:55:46,231 INFO WebServer - Alluxio worker web service started @ /0.0.0.0:30000
2023-03-03 02:55:46,241 INFO AlluxioWorkerProcess - Alluxio worker started. id=496881516526372173, bindHost=0.0.0.0, connectHost=172.23.152.149, rpcPort=29999, webPort=30000
2023-03-03 02:55:46,515 WARN InstancedConfiguration - alluxio.user.ufs.block.read.location.policy.cache.size is deprecated. Please avoid using this key in the future. CapacityBaseRandomPolicy no longer caches block locations. To make sure a block is always assigned to the same worker, use DeterministicHashPolicy.
2023-03-03 02:55:46,515 WARN InstancedConfiguration - alluxio.user.ufs.block.read.location.policy.cache.expiration.time is deprecated. Please avoid using this key in the future. CapacityBaseRandomPolicy no longer caches block locations. To make sure a block is always assigned to the same worker, use DeterministicHashPolicy.

Thanks! @ssz1997

ssz1997 commented 1 year ago

@lingchuanzhou Could you send the full log? I want to see the gRPC server started log line. In the future, please just upload the whole log file if possible. Thanks!

lingchuanzhou commented 1 year ago

The above is already the full log. @ssz1997

It includes this line, in the log above: Alluxio worker web service started @ /0.0.0.0:30000

lingchuanzhou commented 1 year ago

@ssz1997 here is the complete log from our diagnosis:

```
2023-03-06 06:42:59,426 INFO NettyUtils - EPOLL_MODE is available
2023-03-06 06:42:59,866 INFO TieredIdentityFactory - Initialized tiered identity TieredIdentity(node=172.23.152.150, rack=null)
2023-03-06 06:42:59,899 INFO BlockWorkerFactory - Creating DefaultBlockWorker
2023-03-06 06:42:59,925 INFO DefaultStorageDir - Folder /mnt/alluxio-data/alluxioworker was created!
2023-03-06 06:42:59,925 INFO DefaultStorageDir - StorageDir initialized: path=/mnt/alluxio-data/alluxioworker, tier=alluxio.worker.block.meta.DefaultStorageTier@7d280e75, dirIndex=0, medium=HDD, capacityBytes=536870912000, reservedBytes=1073741824, availableBytes=535797170176
2023-03-06 06:42:59,925 INFO DefaultStorageDir - Folder /dev/shm/alluxioworker was created!
2023-03-06 06:42:59,926 INFO DefaultStorageDir - StorageDir initialized: path=/dev/shm/alluxioworker, tier=alluxio.worker.block.meta.DefaultStorageTier@7f7cdbee, dirIndex=0, medium=MEM, capacityBytes=21474836480, reservedBytes=1073741824, availableBytes=20401094656
2023-03-06 06:42:59,945 WARN DefaultStorageTier - Failed to verify memory capacity
2023-03-06 06:43:00,023 INFO MetricsSystem - Starting sinks with config: {}.
2023-03-06 06:43:00,025 INFO MetricsHeartbeatContext - Created metrics heartbeat with ID app-6614478783748504711. This ID will be used for identifying info from the client. It can be set manually through the alluxio.user.app.id property
2023-03-06 06:43:00,101 WARN InstancedConfiguration - alluxio.user.ufs.block.read.location.policy.cache.size is deprecated. Please avoid using this key in the future. CapacityBaseRandomPolicy no longer caches block locations. To make sure a block is always assigned to the same worker, use DeterministicHashPolicy.
2023-03-06 06:43:00,102 WARN InstancedConfiguration - alluxio.user.ufs.block.read.location.policy.cache.expiration.time is deprecated. Please avoid using this key in the future. CapacityBaseRandomPolicy no longer caches block locations. To make sure a block is always assigned to the same worker, use DeterministicHashPolicy.
[... the same pair of deprecation warnings repeats at 06:43:00,111, 06:43:00,371, 06:43:00,417 and 06:43:10,141 ...]
2023-03-06 06:43:00,133 INFO MetricsHeartbeatContext - Created metrics heartbeat with ID app-5244687175361727158. This ID will be used for identifying info from the client. It can be set manually through the alluxio.user.app.id property
2023-03-06 06:43:00,206 INFO log - Logging initialized @1349ms to org.eclipse.jetty.util.log.Slf4jLog
2023-03-06 06:43:00,421 INFO GrpcDataServer - Alluxio worker gRPC server started, listening on /0.0.0.0:29999
2023-03-06 06:43:00,422 INFO ProcessUtils - Starting Alluxio worker @172.23.152.150:29999.
2023-03-06 06:43:00,422 INFO ProcessUtils - Alluxio version: 2.9.1-69e9d112c377ef35c6cd77b110d7e26a81a34c7c
2023-03-06 06:43:00,422 INFO ProcessUtils - Java version: 1.8.0_352
2023-03-06 06:43:00,477 INFO BlockMapIterator - Worker register stream batchSize=1000000
2023-03-06 06:43:00,514 INFO RegisterStreamer - Worker 5633848582743796772 - All requests have been sent. Completing the client side.
2023-03-06 06:43:00,514 INFO RegisterStreamer - Worker 5633848582743796772 - Waiting on the master side to complete
2023-03-06 06:43:00,519 INFO RegisterStreamer - 5633848582743796772 - Complete message received from the server. Closing stream
2023-03-06 06:43:00,520 INFO RegisterStreamer - Worker 5633848582743796772 - Finished registration with a stream
2023-03-06 06:43:00,523 INFO WebServer - Alluxio worker web service starting @ /0.0.0.0:30000
2023-03-06 06:43:00,525 INFO Server - jetty-9.4.46.v20220331; built: 2022-03-31T16:38:08.030Z; git: bc17a0369a11ecf40bb92c839b9ef0a8ac50ea18; jvm 1.8.0_352-b08
2023-03-06 06:43:00,554 INFO ContextHandler - Started o.e.j.s.ServletContextHandler@14d14731{/metrics/json,null,AVAILABLE}
2023-03-06 06:43:00,554 INFO ContextHandler - Started o.e.j.s.ServletContextHandler@2eced48b{/metrics/prometheus,null,AVAILABLE}
2023-03-06 06:43:00,556 WARN SecurityHandler - ServletContext@o.e.j.s.ServletContextHandler@3c989952{/,file:///opt/alluxio-2.9.1/webui/worker/build/,STARTING} has uncovered http methods for path: /
2023-03-06 06:43:10,011 INFO ContextHandler - Started o.e.j.s.ServletContextHandler@3c989952{/,file:///opt/alluxio-2.9.1/webui/worker/build/,AVAILABLE}
2023-03-06 06:43:10,021 INFO AbstractConnector - Started ServerConnector@404bbcbd{HTTP/1.1, (http/1.1)}{0.0.0.0:30000}
2023-03-06 06:43:10,021 INFO Server - Started @11165ms
2023-03-06 06:43:10,021 INFO WebServer - Alluxio worker web service started @ /0.0.0.0:30000
2023-03-06 06:43:10,031 INFO AlluxioWorkerProcess - Alluxio worker started. id=5633848582743796772, bindHost=0.0.0.0, connectHost=172.23.152.150, rpcPort=29999, webPort=30000
2023-03-06 06:43:59,082 INFO ExtensionFactoryRegistry - Loading core jars from /opt/alluxio-2.9.1/lib
2023-03-06 06:43:59,200 INFO ExtensionFactoryRegistry - Loading extension jars from /opt/alluxio-2.9.1/extensions
2023-03-06 06:43:59,428 WARN S3AUnderFileSystem - Configured s3 max threads (40) is less than # admin threads (20) plus transfer threads (64). Using admin threads + transfer threads as max threads instead.
Mar 06, 2023 6:45:38 AM io.grpc.netty.NettyServerHandler onStreamError
WARNING: Stream Error
io.netty.handler.codec.http2.Http2Exception$StreamException: Stream closed before write could take place
        at io.netty.handler.codec.http2.Http2Exception.streamError(Http2Exception.java:167)
        at io.netty.handler.codec.http2.DefaultHttp2RemoteFlowController$FlowState.cancel(DefaultHttp2RemoteFlowController.java:481)
        at io.netty.handler.codec.http2.DefaultHttp2RemoteFlowController$1.onStreamClosed(DefaultHttp2RemoteFlowController.java:105)
        at io.netty.handler.codec.http2.DefaultHttp2Connection.notifyClosed(DefaultHttp2Connection.java:356)
        at io.netty.handler.codec.http2.DefaultHttp2Connection$ActiveStreams.removeFromActiveStreams(DefaultHttp2Connection.java:1000)
        at io.netty.handler.codec.http2.DefaultHttp2Connection$ActiveStreams.deactivate(DefaultHttp2Connection.java:956)
        at io.netty.handler.codec.http2.DefaultHttp2Connection$DefaultStream.close(DefaultHttp2Connection.java:512)
        at io.netty.handler.codec.http2.DefaultHttp2Connection.close(DefaultHttp2Connection.java:152)
        at io.netty.handler.codec.http2.Http2ConnectionHandler$BaseDecoder.channelInactive(Http2ConnectionHandler.java:209)
        at io.netty.handler.codec.http2.Http2ConnectionHandler.channelInactive(Http2ConnectionHandler.java:417)
        at io.grpc.netty.NettyServerHandler.channelInactive(NettyServerHandler.java:617)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
        at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
        at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:819)
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at java.lang.Thread.run(Thread.java:750)
2023-03-06 06:45:38,995 WARN BlockReadHandler - Exception occurred while processing read request onError sessionId: BlockReadRequest{chunkSize=1048576, end=67108864, id=948734787584, openUfsBlockOptions=ufs_path: "s3://alluxio/s3b/http_log/dt=2022-07-21/part-00006-ecbb29ef-d04f-4347-8b05-6f62b7270de0.c000.snappy.parquet" offset_in_file: 0 block_size: 67108864 maxUfsReadConcurrency: 2147483647 mountId: 1 no_cache: false , promote=false, sessionId=7541417773768661396, start=29599957, positionShort=false}, 7541417773768661396: io.grpc.StatusRuntimeException: CANCELLED: client cancelled
2023-03-06 06:48:39,633 WARN BlockReadHandler - Exception occurred while reading data for read request BlockReadRequest{chunkSize=1048576, end=67108864, id=939876417536, openUfsBlockOptions=ufs_path: "s3://alluxio/s3b/http_log/dt=2022-07-21/part-00004-a3aa2bc3-915a-4881-8d5d-f3cdca0d5bec.c000.snappy.parquet" offset_in_file: 0 block_size: 67108864 maxUfsReadConcurrency: 2147483647 mountId: 1 no_cache: false , promote=false, sessionId=8220853001500527454, start=27693041, positionShort=false}. session 8220853001500527454: com.amazonaws.SdkClientException: Unable to execute HTTP request: Connection reset
[... three more "Unable to execute HTTP request: Connection reset" warnings follow at 06:48:40,515, 06:48:41,468 and 06:49:09,189 for blocks 942913093634, 961871347714 and 943248637952 ...]
2023-03-06 06:49:59,666 WARN BlockReadHandler - Exception occurred while reading data for read request BlockReadRequest{chunkSize=1048576, end=802769, id=1009619304448, openUfsBlockOptions=offset_in_file: 0 block_size: 802769 maxUfsReadConcurrency: 2147483647 mountId: 1 no_cache: true block_in_ufs_tier: true , promote=false, sessionId=8113388701888691427, start=0, positionShort=false}. session 8113388701888691427: alluxio.underfs.s3a.AlluxioS3Exception: Failed to open key: s3b/.alluxio_ufs_blocks.alluxio.0x1D91AC0E01AB0165.tmp/1009619304448 bucket: alluxio attempts: 1 error: key??? (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: tx0000000000000005a56a7-0064058c11-477a696-default; S3 Extended Request ID: 477a696-default-default; Proxy: null)
2023-03-06 06:51:41,517 WARN BlockReadHandler - Exception occurred while processing read request onError sessionId: BlockReadRequest{chunkSize=1048576, end=67108864, id=939742199808, openUfsBlockOptions=ufs_path: "s3://alluxio/s3b/http_log/dt=2022-07-21/part-00002-f8330fae-2ac4-46fc-8980-b1415d933d3e.c000.snappy.parquet" offset_in_file: 0 block_size: 67108864 maxUfsReadConcurrency: 2147483647 mountId: 1 no_cache: false , promote=false, sessionId=6036236531291559736, start=29114659, positionShort=false}, 6036236531291559736: io.grpc.StatusRuntimeException: CANCELLED: client cancelled
Mar 06, 2023 6:51:41 AM io.grpc.netty.NettyServerHandler onStreamError
WARNING: Stream Error
[... identical io.netty.handler.codec.http2.Http2Exception$StreamException "Stream closed before write could take place" stack trace as above ...]
[... the "Failed to open key ... (Status Code: 404; Error Code: NoSuchKey)" warning then repeats every few seconds from 06:50:56 through 06:58:58 for blocks 1009619304448, 1009636081664 and 1009652858880, all under s3b/.alluxio_ufs_blocks.alluxio.0x1D91AC0E01AB0165.tmp/ in bucket alluxio ...]
```
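The recurring NoSuchKey entries above all reference temporary keys under `s3b/.alluxio_ufs_blocks.alluxio.0x1D91AC0E01AB0165.tmp/` in bucket `alluxio`. One way to narrow down whether these keys ever exist on the UFS side is to query the S3 endpoint directly while the errors are recurring, for example with the AWS CLI; this is only a sketch — the endpoint URL is a placeholder, while the bucket and key are copied from the log above:

```
# Does the temp key exist at all? (expect 200 metadata vs. a 404/NoSuchKey error)
aws s3api head-object --endpoint-url http://<ufs-endpoint>:12001 \
  --bucket alluxio \
  --key "s3b/.alluxio_ufs_blocks.alluxio.0x1D91AC0E01AB0165.tmp/1009619304448"

# List everything currently under the temp prefix
aws s3api list-objects-v2 --endpoint-url http://<ufs-endpoint>:12001 \
  --bucket alluxio --prefix "s3b/.alluxio_ufs_blocks"
```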

ssz1997 commented 1 year ago

@lingchuanzhou We couldn't spot anything conclusive in the log. The Netty exceptions occur because the client (Spark) cancels the connection, and the "NoSuchKey" error comes back from the UFS. Since your UFS is self-developed, we are not sure where the problem is. I have two suggestions here:

  1. Mount a hostPath or PV into /opt/alluxio/logs in the worker containers, so that even after a worker restarts we still have the old logs (see the sketch after this list).
  2. Try using MinIO with the updated configuration. If MinIO works, then the problem is likely in the UFS.
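
For suggestion 1, a minimal sketch of what the log mount could look like in the worker pod template; the volume name and host path below are illustrative, not values from the Helm chart:

```
# Sketch: merge into the alluxio-worker pod spec so /opt/alluxio/logs survives restarts.
volumes:
  - name: worker-logs
    hostPath:
      path: /var/log/alluxio-worker    # illustrative node-local directory
      type: DirectoryOrCreate
containers:
  - name: alluxio-worker
    volumeMounts:
      - name: worker-logs
        mountPath: /opt/alluxio/logs   # worker writes worker.log here
```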
lingchuanzhou commented 1 year ago

@ssz1997 This is the complete log of the dead worker:

```
[root@aip-master01 ~]# kubectl logs -f alluxio-worker-gxckl -n alluxio
Defaulted container "alluxio-worker" out of: alluxio-worker, alluxio-job-worker
Listening for transport dt_socket at address: 60002
2023-03-08 06:13:39,956 INFO NettyUtils - EPOLL_MODE is available
2023-03-08 06:13:40,607 INFO TieredIdentityFactory - Initialized tiered identity TieredIdentity(node=10.65.220.13, rack=null)
2023-03-08 06:13:40,665 INFO BlockWorkerFactory - Creating DefaultBlockWorker
2023-03-08 06:13:40,702 INFO DefaultStorageDir - Folder /mnt/alluxio-data/alluxioworker was created!
2023-03-08 06:13:40,703 INFO DefaultStorageDir - Folder /dev/shm/alluxioworker was created!
2023-03-08 06:13:40,703 INFO DefaultStorageDir - StorageDir initialized: path=/mnt/alluxio-data/alluxioworker, tier=alluxio.worker.block.meta.DefaultStorageTier@1d2c434a, dirIndex=0, medium=HDD, capacityBytes=536870912000, reservedBytes=1073741824, availableBytes=535797170176
2023-03-08 06:13:40,703 INFO DefaultStorageDir - StorageDir initialized: path=/dev/shm/alluxioworker, tier=alluxio.worker.block.meta.DefaultStorageTier@2cf22610, dirIndex=0, medium=MEM, capacityBytes=21474836480, reservedBytes=1073741824, availableBytes=20401094656
2023-03-08 06:13:40,750 WARN DefaultStorageTier - Failed to verify memory capacity
2023-03-08 06:13:40,854 INFO MetricsSystem - Starting sinks with config: {}.
2023-03-08 06:13:40,856 INFO MetricsHeartbeatContext - Created metrics heartbeat with ID app-913187246426567714. This ID will be used for identifying info from the client. It can be set manually through the alluxio.user.app.id property
2023-03-08 06:13:40,941 WARN InstancedConfiguration - alluxio.user.ufs.block.read.location.policy.cache.size is deprecated. Please avoid using this key in the future. CapacityBaseRandomPolicy no longer caches block locations. To make sure a block is always assigned to the same worker, use DeterministicHashPolicy.
2023-03-08 06:13:40,942 WARN InstancedConfiguration - alluxio.user.ufs.block.read.location.policy.cache.expiration.time is deprecated. Please avoid using this key in the future. CapacityBaseRandomPolicy no longer caches block locations. To make sure a block is always assigned to the same worker, use DeterministicHashPolicy.
2023-03-08 06:13:40,952 WARN InstancedConfiguration - alluxio.user.ufs.block.read.location.policy.cache.size is deprecated. Please avoid using this key in the future. CapacityBaseRandomPolicy no longer caches block locations. To make sure a block is always assigned to the same worker, use DeterministicHashPolicy.
2023-03-08 06:13:40,953 WARN InstancedConfiguration - alluxio.user.ufs.block.read.location.policy.cache.expiration.time is deprecated. Please avoid using this key in the future. CapacityBaseRandomPolicy no longer caches block locations. To make sure a block is always assigned to the same worker, use DeterministicHashPolicy.
2023-03-08 06:13:40,984 INFO MetricsHeartbeatContext - Created metrics heartbeat with ID app-7992126057128849897. This ID will be used for identifying info from the client. It can be set manually through the alluxio.user.app.id property
2023-03-08 06:13:41,108 INFO log - Logging initialized @1903ms to org.eclipse.jetty.util.log.Slf4jLog
2023-03-08 06:13:41,315 WARN InstancedConfiguration - alluxio.user.ufs.block.read.location.policy.cache.size is deprecated. Please avoid using this key in the future. CapacityBaseRandomPolicy no longer caches block locations. To make sure a block is always assigned to the same worker, use DeterministicHashPolicy.
2023-03-08 06:13:41,315 WARN InstancedConfiguration - alluxio.user.ufs.block.read.location.policy.cache.expiration.time is deprecated. Please avoid using this key in the future. CapacityBaseRandomPolicy no longer caches block locations. To make sure a block is always assigned to the same worker, use DeterministicHashPolicy.
2023-03-08 06:13:41,368 WARN InstancedConfiguration - alluxio.user.ufs.block.read.location.policy.cache.size is deprecated. Please avoid using this key in the future. CapacityBaseRandomPolicy no longer caches block locations. To make sure a block is always assigned to the same worker, use DeterministicHashPolicy.
2023-03-08 06:13:41,369 WARN InstancedConfiguration - alluxio.user.ufs.block.read.location.policy.cache.expiration.time is deprecated. Please avoid using this key in the future. CapacityBaseRandomPolicy no longer caches block locations. To make sure a block is always assigned to the same worker, use DeterministicHashPolicy.
2023-03-08 06:13:41,380 INFO GrpcDataServer - Alluxio worker gRPC server started, listening on /0.0.0.0:29999
2023-03-08 06:13:41,381 INFO ProcessUtils - Starting Alluxio worker @10.65.220.13:29999.
2023-03-08 06:13:41,381 INFO ProcessUtils - Alluxio version: 2.9.1-69e9d112c377ef35c6cd77b110d7e26a81a34c7c
2023-03-08 06:13:41,381 INFO ProcessUtils - Java version: 1.8.0_352
2023-03-08 06:13:41,458 INFO BlockMapIterator - Worker register stream batchSize=1000000
2023-03-08 06:13:41,502 INFO RegisterStreamer - Worker 4780360511759527387 - All requests have been sent. Completing the client side.
2023-03-08 06:13:41,502 INFO RegisterStreamer - Worker 4780360511759527387 - Waiting on the master side to complete
2023-03-08 06:13:41,509 INFO RegisterStreamer - 4780360511759527387 - Complete message received from the server. Closing stream
2023-03-08 06:13:41,509 INFO RegisterStreamer - Worker 4780360511759527387 - Finished registration with a stream
2023-03-08 06:13:41,513 INFO WebServer - Alluxio worker web service starting @ /0.0.0.0:30000
2023-03-08 06:13:41,515 INFO Server - jetty-9.4.46.v20220331; built: 2022-03-31T16:38:08.030Z; git: bc17a0369a11ecf40bb92c839b9ef0a8ac50ea18; jvm 1.8.0_352-b08
2023-03-08 06:13:41,549 INFO ContextHandler - Started o.e.j.s.ServletContextHandler@4a11eb84{/metrics/json,null,AVAILABLE}
2023-03-08 06:13:41,550 INFO ContextHandler - Started o.e.j.s.ServletContextHandler@4e858e0a{/metrics/prometheus,null,AVAILABLE}
2023-03-08 06:13:41,552 WARN SecurityHandler - ServletContext@o.e.j.s.ServletContextHandler@19553973{/,file:///opt/alluxio-2.9.1/webui/worker/build/,STARTING} has uncovered http methods for path: /
2023-03-08 06:13:50,996 WARN InstancedConfiguration - alluxio.user.ufs.block.read.location.policy.cache.size is deprecated. Please avoid using this key in the future. CapacityBaseRandomPolicy no longer caches block locations. To make sure a block is always assigned to the same worker, use DeterministicHashPolicy.
2023-03-08 06:13:50,996 WARN InstancedConfiguration - alluxio.user.ufs.block.read.location.policy.cache.expiration.time is deprecated. Please avoid using this key in the future. CapacityBaseRandomPolicy no longer caches block locations. To make sure a block is always assigned to the same worker, use DeterministicHashPolicy.
2023-03-08 06:13:52,233 INFO ContextHandler - Started o.e.j.s.ServletContextHandler@19553973{/,file:///opt/alluxio-2.9.1/webui/worker/build/,AVAILABLE}
2023-03-08 06:13:52,247 INFO AbstractConnector - Started ServerConnector@4f704591{HTTP/1.1, (http/1.1)}{0.0.0.0:30000}
2023-03-08 06:13:52,247 INFO Server - Started @13042ms
2023-03-08 06:13:52,247 INFO WebServer - Alluxio worker web service started @ /0.0.0.0:30000
2023-03-08 06:13:52,262 INFO AlluxioWorkerProcess - Alluxio worker started. id=4780360511759527387, bindHost=0.0.0.0, connectHost=10.65.220.13, rpcPort=29999, webPort=30000
2023-03-08 06:16:08,814 INFO ExtensionFactoryRegistry - Loading core jars from /opt/alluxio-2.9.1/lib
2023-03-08 06:16:08,926 INFO ExtensionFactoryRegistry - Loading extension jars from /opt/alluxio-2.9.1/extensions
2023-03-08 06:16:09,219 WARN S3AUnderFileSystem - Configured s3 max threads (40) is less than # admin threads (20) plus transfer threads (64). Using admin threads + transfer threads as max threads instead.
2023-03-08 06:21:30,005 INFO GrpcDataServer - Shutting down Alluxio worker gRPC server at 0.0.0.0/0.0.0.0:29999.
2023-03-08 06:21:30,016 INFO AlluxioWorkerProcess - Alluxio worker ended
2023-03-08 06:21:30,016 INFO ProcessUtils - Stopping Alluxio worker @10.65.220.13:29999.
```

lingchuanzhou commented 1 year ago

@ssz1997 And this is the master log:

```
2023-03-08 06:10:41,901 INFO AlluxioMasterProcess - All masters started.
2023-03-08 06:10:41,902 INFO RpcServerService - Promoting RpcServerService
2023-03-08 06:10:41,941 INFO RpcServerService - registered service METRICS_MASTER_CLIENT_SERVICE
2023-03-08 06:10:41,989 INFO RpcServerService - registered service BLOCK_MASTER_CLIENT_SERVICE
2023-03-08 06:10:41,989 INFO RpcServerService - registered service BLOCK_MASTER_WORKER_SERVICE
2023-03-08 06:10:42,095 INFO RpcServerService - registered service FILE_SYSTEM_MASTER_WORKER_SERVICE
2023-03-08 06:10:42,095 INFO RpcServerService - registered service FILE_SYSTEM_MASTER_CLIENT_SERVICE
2023-03-08 06:10:42,095 INFO RpcServerService - registered service FILE_SYSTEM_MASTER_JOB_SERVICE
2023-03-08 06:10:42,117 INFO RpcServerService - registered service JOURNAL_MASTER_CLIENT_SERVICE
2023-03-08 06:10:42,166 INFO RpcServerService - registered service META_MASTER_CONFIG_SERVICE
2023-03-08 06:10:42,166 INFO RpcServerService - registered service META_MASTER_MASTER_SERVICE
2023-03-08 06:10:42,166 INFO RpcServerService - registered service META_MASTER_BACKUP_MESSAGING_SERVICE
2023-03-08 06:10:42,166 INFO RpcServerService - registered service META_MASTER_CLIENT_SERVICE
2023-03-08 06:10:42,195 INFO DefaultSafeModeManager - Rpc server started, waiting 5000ms for workers to register
2023-03-08 06:10:42,196 INFO WebServerService - Promoting PrimaryOnlyWebServerService
2023-03-08 06:10:42,196 INFO WebServerService - Starting web server.
2023-03-08 06:10:42,382 INFO log - Logging initialized @9948ms to org.eclipse.jetty.util.log.Slf4jLog
2023-03-08 06:10:42,554 INFO TieredIdentityFactory - Initialized tiered identity TieredIdentity(node=10.42.186.205, rack=null)
2023-03-08 06:10:42,560 INFO WebServer - Alluxio Master Web service starting @ /0.0.0.0:19999
2023-03-08 06:10:42,562 INFO Server - jetty-9.4.46.v20220331; built: 2022-03-31T16:38:08.030Z; git: bc17a0369a11ecf40bb92c839b9ef0a8ac50ea18; jvm 1.8.0_352-b08
2023-03-08 06:10:42,598 INFO ContextHandler - Started o.e.j.s.ServletContextHandler@3f270e0a{/metrics/json,null,AVAILABLE}
2023-03-08 06:10:42,599 INFO ContextHandler - Started o.e.j.s.ServletContextHandler@1a760689{/metrics/prometheus,null,AVAILABLE}
2023-03-08 06:10:42,601 WARN SecurityHandler - ServletContext@o.e.j.s.ServletContextHandler@41d426b5{/,file:///opt/alluxio-2.9.1/webui/master/build/,STARTING} has uncovered http methods for path: /
2023-03-08 06:10:44,645 WARN InstancedConfiguration - alluxio.user.ufs.block.read.location.policy.cache.size is deprecated. Please avoid using this key in the future. CapacityBaseRandomPolicy no longer caches block locations. To make sure a block is always assigned to the same worker, use DeterministicHashPolicy.
2023-03-08 06:10:44,645 WARN InstancedConfiguration - alluxio.user.ufs.block.read.location.policy.cache.expiration.time is deprecated. Please avoid using this key in the future. CapacityBaseRandomPolicy no longer caches block locations. To make sure a block is always assigned to the same worker, use DeterministicHashPolicy.
2023-03-08 06:10:48,987 WARN DefaultBlockMaster - Could not find worker id: 3993344183018013900 for heartbeat.
2023-03-08 06:10:49,026 INFO DefaultBlockMaster - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.65.220.12, containerHost=10.42.208.127, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.12, rack=null)} id: 7994801662990253001
2023-03-08 06:10:49,062 INFO RegisterStreamObserver - Register stream completed on the client side
2023-03-08 06:10:49,063 INFO DefaultBlockMaster - Found 0 blocks to remove from the workerInfo
2023-03-08 06:10:49,064 INFO DefaultBlockMaster - Worker successfully registered: MasterWorkerInfo{id=7994801662990253001, workerAddress=WorkerNetAddress{host=10.65.220.12, containerHost=10.42.208.127, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.12, rack=null)}, capacityBytes=558345748480, usedBytes=2147483648, lastUpdatedTimeMs=1678255849064, blocks=LongOpenHashSet{0 entries}, lostStorage={}, version=2.9.1, revision=69e9d112c377ef35c6cd77b110d7e26a81a34c7c}
2023-03-08 06:10:49,064 INFO RegisterLeaseManager - Worker 7994801662990253001's lease is not found. Most likely the lease has already been recycled.
2023-03-08 06:10:49,375 WARN DefaultBlockMaster - Could not find worker id: 8429217722589384680 for heartbeat.
2023-03-08 06:10:49,393 INFO DefaultBlockMaster - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.65.220.13, containerHost=10.42.165.167, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.13, rack=null)} id: 3362917880699501162
2023-03-08 06:10:49,406 INFO RegisterStreamObserver - Register stream completed on the client side
2023-03-08 06:10:49,407 INFO DefaultBlockMaster - Found 0 blocks to remove from the workerInfo
2023-03-08 06:10:49,409 INFO DefaultBlockMaster - Worker successfully registered: MasterWorkerInfo{id=3362917880699501162, workerAddress=WorkerNetAddress{host=10.65.220.13, containerHost=10.42.165.167, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.13, rack=null)}, capacityBytes=558345748480, usedBytes=2147483648, lastUpdatedTimeMs=1678255849409, blocks=LongOpenHashSet{0 entries}, lostStorage={}, version=2.9.1, revision=69e9d112c377ef35c6cd77b110d7e26a81a34c7c}
2023-03-08 06:10:49,409 INFO RegisterLeaseManager - Worker 3362917880699501162's lease is not found. Most likely the lease has already been recycled.
2023-03-08 06:10:49,669 WARN DefaultBlockMaster - Could not find worker id: 4906771194060096599 for heartbeat.
2023-03-08 06:10:49,686 INFO DefaultBlockMaster - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.65.220.9, containerHost=10.42.19.205, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.9, rack=null)} id: 6074888407002155813
2023-03-08 06:10:49,699 INFO RegisterStreamObserver - Register stream completed on the client side
2023-03-08 06:10:49,699 INFO DefaultBlockMaster - Found 0 blocks to remove from the workerInfo
2023-03-08 06:10:49,700 INFO DefaultBlockMaster - Worker successfully registered: MasterWorkerInfo{id=6074888407002155813, workerAddress=WorkerNetAddress{host=10.65.220.9, containerHost=10.42.19.205, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.9, rack=null)}, capacityBytes=558345748480, usedBytes=2147483648, lastUpdatedTimeMs=1678255849700, blocks=LongOpenHashSet{0 entries}, lostStorage={}, version=2.9.1, revision=69e9d112c377ef35c6cd77b110d7e26a81a34c7c}
2023-03-08 06:10:49,700 INFO RegisterLeaseManager - Worker 6074888407002155813's lease is not found. Most likely the lease has already been recycled.
2023-03-08 06:10:50,623 WARN DefaultBlockMaster - Could not find worker id: 281701809452405684 for heartbeat.
2023-03-08 06:10:50,632 INFO DefaultBlockMaster - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.65.220.14, containerHost=10.42.186.204, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.14, rack=null)} id: 2089002703084017895
2023-03-08 06:10:50,643 INFO RegisterStreamObserver - Register stream completed on the client side
2023-03-08 06:10:50,643 INFO DefaultBlockMaster - Found 0 blocks to remove from the workerInfo
2023-03-08 06:10:50,644 INFO DefaultBlockMaster - Worker successfully registered: MasterWorkerInfo{id=2089002703084017895, workerAddress=WorkerNetAddress{host=10.65.220.14, containerHost=10.42.186.204, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.14, rack=null)}, capacityBytes=558345748480, usedBytes=2147483648, lastUpdatedTimeMs=1678255850644, blocks=LongOpenHashSet{0 entries}, lostStorage={}, version=2.9.1, revision=69e9d112c377ef35c6cd77b110d7e26a81a34c7c}
2023-03-08 06:10:50,644 INFO RegisterLeaseManager - Worker 2089002703084017895's lease is not found. Most likely the lease has already been recycled.
2023-03-08 06:10:57,795 INFO ContextHandler - Started o.e.j.s.ServletContextHandler@41d426b5{/,file:///opt/alluxio-2.9.1/webui/master/build/,AVAILABLE}
2023-03-08 06:10:57,811 INFO AbstractConnector - Started ServerConnector@45d2ade3{HTTP/1.1, (http/1.1)}{0.0.0.0:19999}
2023-03-08 06:10:57,811 INFO Server - Started @25377ms
2023-03-08 06:10:57,811 INFO WebServer - Alluxio Master Web service started @ /0.0.0.0:19999
2023-03-08 06:10:57,811 INFO MetricsService - Promoting PrimaryOnlyMetricsService
2023-03-08 06:10:57,811 INFO MetricsService - Start metric sinks.
2023-03-08 06:10:57,811 INFO AlluxioMasterProcess - Primary started
2023-03-08 06:11:23,352 WARN DefaultBlockMaster - Could not find worker id: 9167432094312900204 for heartbeat.
2023-03-08 06:11:23,358 INFO DefaultBlockMaster - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.65.220.11, containerHost=10.42.194.108, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.11, rack=null)} id: 2002493756020233691
2023-03-08 06:11:23,388 INFO RegisterStreamObserver - Register stream completed on the client side
2023-03-08 06:11:23,388 INFO DefaultBlockMaster - Found 0 blocks to remove from the workerInfo
2023-03-08 06:11:23,390 INFO DefaultBlockMaster - Worker successfully registered: MasterWorkerInfo{id=2002493756020233691, workerAddress=WorkerNetAddress{host=10.65.220.11, containerHost=10.42.194.108, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.11, rack=null)}, capacityBytes=558345748480, usedBytes=2147483648, lastUpdatedTimeMs=1678255883390, blocks=LongOpenHashSet{0 entries}, lostStorage={}, version=2.9.1, revision=69e9d112c377ef35c6cd77b110d7e26a81a34c7c}
2023-03-08 06:11:23,391 INFO RegisterLeaseManager - Worker 2002493756020233691's lease is not found. Most likely the lease has already been recycled.
2023-03-08 06:14:46,373 WARN InodeSyncStream - Failed to sync metadata on root path InodeSyncStream{rootPath=LockingScheme{path=/http_log/dt=2022-07-21/*.parquet, desiredLockPattern=READ, shouldSync={Should sync: false, Last sync time: 0}}, descendantType=NONE, commonOptions=syncIntervalMs: -1 ttl: -1 ttlAction: DELETE , forceSync=true} because it does not exist on the UFS or in Alluxio
2023-03-08 06:14:46,483 INFO UfsJournalLogWriter - Created current log file: UfsJournalFile{location=/journal/BlockMaster/v1/logs/0x1-0x7fffffffffffffff, start=1, end=9223372036854775807, isCheckpoint=false}
2023-03-08 06:14:46,547 INFO UfsJournalLogWriter - Created current log file: UfsJournalFile{location=/journal/FileSystemMaster/v1/logs/0x3-0x7fffffffffffffff, start=3, end=9223372036854775807, isCheckpoint=false}
2023-03-08 06:15:55,702 INFO DefaultBlockMaster - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.65.220.13, containerHost=10.42.165.170, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.13, rack=null)} id: 4780360511759527387
2023-03-08 06:15:55,765 INFO RegisterStreamObserver - Register stream completed on the client side
2023-03-08 06:15:55,765 INFO DefaultBlockMaster - Found 0 blocks to remove from the workerInfo
2023-03-08 06:15:55,767 INFO DefaultBlockMaster - Worker successfully registered: MasterWorkerInfo{id=4780360511759527387, workerAddress=WorkerNetAddress{host=10.65.220.13, containerHost=10.42.165.170, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.13, rack=null)}, capacityBytes=558345748480, usedBytes=2147483648, lastUpdatedTimeMs=1678256155767, blocks=LongOpenHashSet{0 entries}, lostStorage={}, version=2.9.1, revision=69e9d112c377ef35c6cd77b110d7e26a81a34c7c}
2023-03-08 06:15:55,767 INFO RegisterLeaseManager - Worker 4780360511759527387's lease is not found. Most likely the lease has already been recycled.
2023-03-08 06:16:17,885 INFO DefaultBlockMaster - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.65.220.14, containerHost=10.42.186.207, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.14, rack=null)} id: 1839761610777075509
2023-03-08 06:16:17,949 INFO RegisterStreamObserver - Register stream completed on the client side
2023-03-08 06:16:17,949 INFO DefaultBlockMaster - Found 0 blocks to remove from the workerInfo
2023-03-08 06:16:17,952 INFO DefaultBlockMaster - Worker successfully registered: MasterWorkerInfo{id=1839761610777075509, workerAddress=WorkerNetAddress{host=10.65.220.14, containerHost=10.42.186.207, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.14, rack=null)}, capacityBytes=558345748480, usedBytes=2147483648, lastUpdatedTimeMs=1678256177951, blocks=LongOpenHashSet{0 entries}, lostStorage={}, version=2.9.1, revision=69e9d112c377ef35c6cd77b110d7e26a81a34c7c}
2023-03-08 06:16:17,952 INFO RegisterLeaseManager - Worker 1839761610777075509's lease is not found. Most likely the lease has already been recycled.
2023-03-08 06:16:20,241 INFO DefaultBlockMaster - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.65.220.11, containerHost=10.42.194.113, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.11, rack=null)} id: 9208260495655505713
2023-03-08 06:16:20,320 INFO RegisterStreamObserver - Register stream completed on the client side
2023-03-08 06:16:20,320 INFO DefaultBlockMaster - Found 0 blocks to remove from the workerInfo
2023-03-08 06:16:20,322 INFO DefaultBlockMaster - Worker successfully registered: MasterWorkerInfo{id=9208260495655505713, workerAddress=WorkerNetAddress{host=10.65.220.11, containerHost=10.42.194.113, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.11, rack=null)}, capacityBytes=558345748480, usedBytes=2147483648, lastUpdatedTimeMs=1678256180322, blocks=LongOpenHashSet{0 entries}, lostStorage={}, version=2.9.1, revision=69e9d112c377ef35c6cd77b110d7e26a81a34c7c}
2023-03-08 06:16:20,322 INFO RegisterLeaseManager - Worker 9208260495655505713's lease is not found. Most likely the lease has already been recycled.
2023-03-08 06:16:42,874 INFO DefaultBlockMaster - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.65.220.9, containerHost=10.42.19.248, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.9, rack=null)} id: 5265790760211380933
2023-03-08 06:16:42,944 INFO RegisterStreamObserver - Register stream completed on the client side
2023-03-08 06:16:42,944 INFO DefaultBlockMaster - Found 0 blocks to remove from the workerInfo
2023-03-08 06:16:42,947 INFO DefaultBlockMaster - Worker successfully registered: MasterWorkerInfo{id=5265790760211380933, workerAddress=WorkerNetAddress{host=10.65.220.9, containerHost=10.42.19.248, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.9, rack=null)}, capacityBytes=558345748480, usedBytes=2147483648, lastUpdatedTimeMs=1678256202947, blocks=LongOpenHashSet{0 entries}, lostStorage={}, version=2.9.1, revision=69e9d112c377ef35c6cd77b110d7e26a81a34c7c}
2023-03-08 06:16:42,947 INFO RegisterLeaseManager - Worker 5265790760211380933's lease is not found. Most likely the lease has already been recycled.
2023-03-08 06:16:54,975 INFO DefaultBlockMaster - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.65.220.12, containerHost=10.42.208.107, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.12, rack=null)} id: 9145071862210978075
2023-03-08 06:16:55,051 INFO RegisterStreamObserver - Register stream completed on the client side
2023-03-08 06:16:55,052 INFO DefaultBlockMaster - Found 0 blocks to remove from the workerInfo
2023-03-08 06:16:55,054 INFO DefaultBlockMaster - Worker successfully registered: MasterWorkerInfo{id=9145071862210978075, workerAddress=WorkerNetAddress{host=10.65.220.12, containerHost=10.42.208.107, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.12, rack=null)}, capacityBytes=558345748480, usedBytes=2147483648, lastUpdatedTimeMs=1678256215054, blocks=LongOpenHashSet{0 entries}, lostStorage={}, version=2.9.1, revision=69e9d112c377ef35c6cd77b110d7e26a81a34c7c}
2023-03-08 06:16:55,054 INFO RegisterLeaseManager - Worker 9145071862210978075's lease is not found. Most likely the lease has already been recycled.
2023-03-08 06:20:53,743 ERROR DefaultBlockMaster - The worker 3362917880699501162(WorkerNetAddress{host=10.65.220.13, containerHost=10.42.165.167, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.13, rack=null)}) timed out after 302665ms without a heartbeat!
2023-03-08 06:21:19,759 ERROR DefaultBlockMaster - The worker 2002493756020233691(WorkerNetAddress{host=10.65.220.11, containerHost=10.42.194.108, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.11, rack=null)}) timed out after 304310ms without a heartbeat!
2023-03-08 06:21:19,761 ERROR DefaultBlockMaster - The worker 2089002703084017895(WorkerNetAddress{host=10.65.220.14, containerHost=10.42.186.204, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.14, rack=null)}) timed out after 312496ms without a heartbeat!
2023-03-08 06:21:39,759 ERROR DefaultBlockMaster - The worker 6074888407002155813(WorkerNetAddress{host=10.65.220.9, containerHost=10.42.19.205, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.9, rack=null)}) timed out after 302116ms without a heartbeat!
2023-03-08 06:21:49,759 ERROR DefaultBlockMaster - The worker 7994801662990253001(WorkerNetAddress{host=10.65.220.12, containerHost=10.42.208.127, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.12, rack=null)}) timed out after 300108ms without a heartbeat!
2023-03-08 06:23:49,005 INFO DefaultBlockMaster - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.65.220.13, containerHost=10.42.165.173, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.13, rack=null)} id: 6510687035553284689
2023-03-08 06:23:49,064 INFO RegisterStreamObserver - Register stream completed on the client side
2023-03-08 06:23:49,064 INFO DefaultBlockMaster - Found 0 blocks to remove from the workerInfo
2023-03-08 06:23:49,066 INFO DefaultBlockMaster - Worker successfully registered: MasterWorkerInfo{id=6510687035553284689, workerAddress=WorkerNetAddress{host=10.65.220.13, containerHost=10.42.165.173, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.13, rack=null)}, capacityBytes=558345748480, usedBytes=2147483648, lastUpdatedTimeMs=1678256629066, blocks=LongOpenHashSet{0 entries}, lostStorage={}, version=2.9.1, revision=69e9d112c377ef35c6cd77b110d7e26a81a34c7c}
2023-03-08 06:23:49,066 INFO RegisterLeaseManager - Worker 6510687035553284689's lease is not found. Most likely the lease has already been recycled.
2023-03-08 06:24:06,825 INFO DefaultBlockMaster - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.65.220.12, containerHost=10.42.208.94, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.12, rack=null)} id: 2172701592237384959
2023-03-08 06:24:06,919 INFO RegisterStreamObserver - Register stream completed on the client side
2023-03-08 06:24:06,919 INFO DefaultBlockMaster - Found 0 blocks to remove from the workerInfo
2023-03-08 06:24:06,921 INFO DefaultBlockMaster - Worker successfully registered: MasterWorkerInfo{id=2172701592237384959, workerAddress=WorkerNetAddress{host=10.65.220.12, containerHost=10.42.208.94, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.12, rack=null)}, capacityBytes=558345748480, usedBytes=2147483648, lastUpdatedTimeMs=1678256646921, blocks=LongOpenHashSet{0 entries}, lostStorage={}, version=2.9.1, revision=69e9d112c377ef35c6cd77b110d7e26a81a34c7c}
2023-03-08 06:24:06,921 INFO RegisterLeaseManager - Worker 2172701592237384959's lease is not found. Most likely the lease has already been recycled.
2023-03-08 06:24:23,991 INFO DefaultBlockMaster - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.65.220.11, containerHost=10.42.194.77, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.11, rack=null)} id: 1921433782660492447
2023-03-08 06:24:24,079 INFO RegisterStreamObserver - Register stream completed on the client side
2023-03-08 06:24:24,080 INFO DefaultBlockMaster - Found 0 blocks to remove from the workerInfo
2023-03-08 06:24:24,082 INFO DefaultBlockMaster - Worker successfully registered: MasterWorkerInfo{id=1921433782660492447, workerAddress=WorkerNetAddress{host=10.65.220.11, containerHost=10.42.194.77, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.11, rack=null)}, capacityBytes=558345748480, usedBytes=2147483648, lastUpdatedTimeMs=1678256664082, blocks=LongOpenHashSet{0 entries}, lostStorage={}, version=2.9.1, revision=69e9d112c377ef35c6cd77b110d7e26a81a34c7c}
2023-03-08 06:24:24,082 INFO RegisterLeaseManager - Worker 1921433782660492447's lease is not found. Most likely the lease has already been recycled.
2023-03-08 06:24:38,026 INFO DefaultBlockMaster - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.65.220.9, containerHost=10.42.19.209, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.9, rack=null)} id: 3713171305280122828
2023-03-08 06:24:38,107 INFO RegisterStreamObserver - Register stream completed on the client side
2023-03-08 06:24:38,107 INFO DefaultBlockMaster - Found 0 blocks to remove from the workerInfo
2023-03-08 06:24:38,110 INFO DefaultBlockMaster - Worker successfully registered: MasterWorkerInfo{id=3713171305280122828, workerAddress=WorkerNetAddress{host=10.65.220.9, containerHost=10.42.19.209, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.9, rack=null)}, capacityBytes=558345748480, usedBytes=2147483648, lastUpdatedTimeMs=1678256678110, blocks=LongOpenHashSet{0 entries}, lostStorage={}, version=2.9.1, revision=69e9d112c377ef35c6cd77b110d7e26a81a34c7c}
2023-03-08 06:24:38,110 INFO RegisterLeaseManager - Worker 3713171305280122828's lease is not found. Most likely the lease has already been recycled.
2023-03-08 06:24:48,560 INFO DefaultBlockMaster - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.65.220.14, containerHost=10.42.186.209, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.14, rack=null)} id: 3352298987420823053
2023-03-08 06:24:48,629 INFO RegisterStreamObserver - Register stream completed on the client side
2023-03-08 06:24:48,629 INFO DefaultBlockMaster - Found 0 blocks to remove from the workerInfo
2023-03-08 06:24:48,631 INFO DefaultBlockMaster - Worker successfully registered: MasterWorkerInfo{id=3352298987420823053, workerAddress=WorkerNetAddress{host=10.65.220.14, containerHost=10.42.186.209, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.14, rack=null)}, capacityBytes=558345748480, usedBytes=2147483648, lastUpdatedTimeMs=1678256688631, blocks=LongOpenHashSet{0 entries}, lostStorage={}, version=2.9.1, revision=69e9d112c377ef35c6cd77b110d7e26a81a34c7c}
2023-03-08 06:24:48,631 INFO RegisterLeaseManager - Worker 3352298987420823053's lease is not found. Most likely the lease has already been recycled.
2023-03-08 06:28:52,801 ERROR DefaultBlockMaster - The worker 4780360511759527387(WorkerNetAddress{host=10.65.220.13, containerHost=10.42.165.170, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.13, rack=null)}) timed out after 308105ms without a heartbeat!
2023-03-08 06:29:02,801 ERROR DefaultBlockMaster - The worker 9145071862210978075(WorkerNetAddress{host=10.65.220.12, containerHost=10.42.208.107, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.12, rack=null)}) timed out after 301419ms without a heartbeat!
2023-03-08 06:29:18,929 ERROR DefaultBlockMaster - The worker 9208260495655505713(WorkerNetAddress{host=10.65.220.11, containerHost=10.42.194.113, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.11, rack=null)}) timed out after 300591ms without a heartbeat!
2023-03-08 06:29:38,929 ERROR DefaultBlockMaster - The worker 5265790760211380933(WorkerNetAddress{host=10.65.220.9, containerHost=10.42.19.248, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.9, rack=null)}) timed out after 306336ms without a heartbeat!
2023-03-08 06:29:48,929 ERROR DefaultBlockMaster - The worker 1839761610777075509(WorkerNetAddress{host=10.65.220.14, containerHost=10.42.186.207, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.65.220.14, rack=null)}) timed out after 305111ms without a heartbeat!
```

ssz1997 commented 1 year ago

@lingchuanzhou There's no error before the worker stops. Given that it only fails with a relatively large amount of data, I strongly suspect a memory issue with the container. My suggestions are:

  1. Rerun the same failed job, then run `grep -i -r 'out of memory' /var/log/` on the host machine to see whether an OOM happened.
  2. Add another JVM option, `-XX:MaxDirectMemorySize=10G`, to limit the worker's off-heap memory usage, then run the Spark job again and check the log for any direct-memory issues (see the sketch after this list).
  3. A memory-based volume always allocates half of the host machine's memory, according to this blog. This potentially poses a problem. Try not to use memory as part of the worker storage and run the job again.
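
As a minimal sketch of suggestion 2, assuming the Helm-managed ConfigMap from earlier in this thread, the flag would be appended to the worker JVM options (`<existing flags>` stands in for the options already there):

```
# Sketch only: add a direct-memory cap to the existing worker JVM options
ALLUXIO_WORKER_JAVA_OPTS: |-
  <existing flags> -XX:MaxDirectMemorySize=10G
```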

This blog may also help: https://suneeta-mall.github.io/2021/03/14/wth-who-killed-my-pod.html
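
One way to check whether Kubernetes itself killed the container (rather than the JVM exiting on its own) is to inspect the pod's last termination state; the pod name here is taken from the worker log above:

```
# Prints "OOMKilled" if the container was killed for exceeding its memory limit
kubectl get pod alluxio-worker-gxckl -n alluxio \
  -o jsonpath='{.status.containerStatuses[?(@.name=="alluxio-worker")].lastState.terminated.reason}'

# Kernel-level OOM kills also show up in dmesg on the host
dmesg -T | grep -i 'killed process'
```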

lingchuanzhou commented 1 year ago

@ssz1997 Thanks for your advice! I added the `-XX:MaxDirectMemorySize=10G` option, but it didn't help, and when the worker stopped I found no OOM log with `grep -i -r 'out of memory' /var/log/`. What is frustrating is that the worker stops even when reading a dataset of only 2GB.

ssz1997 commented 1 year ago

The issue can be mitigated by not using memory for Alluxio worker storage and not using domain sockets. @lingchuanzhou I'm closing the issue for now. Feel free to open a new issue when you have more problems.
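
For anyone hitting the same problem later, a rough sketch of that mitigation in the worker JVM options, using the property names already in this thread (the path and quota are illustrative), is a single disk tier with short-circuit reads and the domain socket turned off:

```
# Sketch: one HDD tier, no MEM tier, no domain socket
-Dalluxio.worker.tieredstore.levels=1
-Dalluxio.worker.tieredstore.level0.alias=HDD
-Dalluxio.worker.tieredstore.level0.dirs.mediumtype=HDD
-Dalluxio.worker.tieredstore.level0.dirs.path=/mnt/alluxio-data
-Dalluxio.worker.tieredstore.level0.dirs.quota=500GB
-Dalluxio.user.short.circuit.enabled=false
# ...and drop the alluxio.worker.data.server.domain.socket.* properties entirely
```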