Open sfwanyi opened 3 years ago
@beinan I remember there is a setting in Presto/Trino that we have to enable to get local reads. Could you point to the documentation here?
@sfwanyi did you disable the deterministic hash read location policy?
@yuzhu alluxio.user.ufs.block.read.location.policy: alluxio.client.block.policy.DeterministicHashPolicy is no longer set. I have removed that property, but short-circuit reads still do not work.
@beinan I remember there is a setting in Presto/Trino that we have to enable to get local reads. Could you point to the documentation here? Could the experts please confirm this?
@sfwanyi you're getting the error as below, and you're seeing multiple UUIDs in /opt/domain (I'm not sure if this is an issue or not)
Caused by: alluxio.shaded.client.io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
at alluxio.shaded.client.io.grpc.Status.asRuntimeException(Status.java:533)
at alluxio.shaded.client.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:449)
at alluxio.shaded.client.io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at alluxio.shaded.client.io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at alluxio.shaded.client.io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at alluxio.grpc.GrpcChannel$ChannelResponseTracker$1$1.onClose(GrpcChannel.java:172)
at alluxio.shaded.client.io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:426)
at alluxio.shaded.client.io.grpc.internal.ClientCallImpl.access$500(ClientCallImpl.java:66)
at alluxio.shaded.client.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:689)
at alluxio.shaded.client.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$900(ClientCallImpl.java:577)
at alluxio.shaded.client.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:751)
at alluxio.shaded.client.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:740)
at alluxio.shaded.client.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at alluxio.shaded.client.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
... 3 more
Caused by: alluxio.shaded.client.io.netty.channel.AbstractChannel.AnnotatedConnectException: connect(..) failed: Connection refused: /opt/domain/774df77a-c449-4505-8759-1c692e4fcaf1
I think something is wrong with short-circuit reads. I suggest keeping all the configs at their defaults rather than introducing new ones; let's focus on enabling short-circuit first.
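The observation about multiple UUIDs in /opt/domain can be checked mechanically. Here is a minimal Python sketch, assuming the worker's domain sockets show up as UUID-named entries directly under the directory (/opt/domain, as in the error above). The helper names `uuid_entries` and `report` are hypothetical and not part of Alluxio. It only lists entries; it does not decide which one is stale.

```python
import os
import re

# Matches names shaped like 774df77a-c449-4505-8759-1c692e4fcaf1.
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)


def uuid_entries(domain_dir):
    """Return the sorted UUID-named entries found directly under domain_dir."""
    return sorted(name for name in os.listdir(domain_dir) if UUID_RE.match(name))


def report(domain_dir):
    """Describe how many UUID-named entries exist, flagging more than one."""
    entries = uuid_entries(domain_dir)
    if len(entries) > 1:
        return f"{len(entries)} UUID entries in {domain_dir}: possible stale sockets"
    return f"{len(entries)} UUID entry(ies) in {domain_dir}"


if __name__ == "__main__":
    # /opt/domain is the path from this thread; adjust for your deployment.
    path = "/opt/domain"
    if os.path.isdir(path):
        print(report(path))
        for name in uuid_entries(path):
            print("  ", name)
```

Run it on the node (or inside the Trino pod) that mounts the domain socket directory. More than one entry matches the situation described in this thread.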
@beinan Let's have another remote meeting tomorrow morning. Dr. Beinan, would you have time?
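@beinan Dr. Beinan, we found that the error was caused by multiple UUIDs existing in /opt/domain/. Now that only one UUID remains in /opt/domain, Trino can access Alluxio normally, but Trino keeps logging warnings. Could you help us locate the problem?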
2021-10-13T11:00:42.999Z WARN 20211013_110042_00018_8jxku.1.5-87-114 alluxio.client.file.AlluxioFileInStream Failed to read block 201326609 of file /user/hive/warehouse/sf100.db/lineitem/20211008_023109_00054_6f69z_92553410-454f-426a-b52f-3cf12c4db156 from worker WorkerNetAddress{host=172.19.1.71, containerHost=, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=/opt/domain/25d94025-90e3-4977-9d3e-9a3fb7e147ea, tieredIdentity=TieredIdentity(node=172.19.1.71, rack=null)}. This worker will be skipped for future read operations, will retry: java.io.FileNotFoundException: /dev/shm/alluxioworker/201326609 (No such file or directory).
2021-10-13T11:00:43.000Z WARN 20211013_110042_00018_8jxku.1.5-88-101 alluxio.client.file.AlluxioFileInStream Failed to read block 201326600 of file /user/hive/warehouse/sf100.db/lineitem/20211008_023109_00054_6f69z_92553410-454f-426a-b52f-3cf12c4db156 from worker WorkerNetAddress{host=172.19.1.71, containerHost=, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=/opt/domain/25d94025-90e3-4977-9d3e-9a3fb7e147ea, tieredIdentity=TieredIdentity(node=172.19.1.71, rack=null)}. This worker will be skipped for future read operations, will retry: java.io.FileNotFoundException: /dev/shm/alluxioworker/201326600 (No such file or directory).
2021-10-13T11:04:13.516Z WARN 20211013_110413_00019_8jxku.1.5-46-88 alluxio.client.file.AlluxioFileInStream Failed to read block 134217750 of file /user/hive/warehouse/sf100.db/lineitem/20211008_023109_00054_6f69z_2a1b1499-f28d-4d86-b5fa-b02c4f1d54da from worker WorkerNetAddress{host=172.19.1.71, containerHost=, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=/opt/domain/25d94025-90e3-4977-9d3e-9a3fb7e147ea, tieredIdentity=TieredIdentity(node=172.19.1.71, rack=null)}. This worker will be skipped for future read operations, will retry: java.io.FileNotFoundException: /dev/shm/alluxioworker/134217750 (No such file or directory).
2021-10-13T11:04:13.516Z WARN 20211013_110413_00019_8jxku.1.5-26-86 alluxio.client.file.AlluxioFileInStream Failed to read block 218103828 of file /user/hive/warehouse/sf100.db/lineitem/20211008_023109_00054_6f69z_b1751036-f6a8-4227-adff-2c31453e9fd1 from worker WorkerNetAddress{host=172.19.1.71, containerHost=, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=/opt/domain/25d94025-90e3-4977-9d3e-9a3fb7e147ea, tieredIdentity=TieredIdentity(node=172.19.1.71, rack=null)}. This worker will be skipped for future read operations, will retry: java.io.FileNotFoundException: /dev/shm/alluxioworker/218103828 (No such file or directory).
2021-10-13T11:04:13.516Z WARN 20211013_110413_00019_8jxku.1.5-20-101 alluxio.client.file.AlluxioFileInStream Failed to read block 218103828 of file /user/hive/warehouse/sf100.db/lineitem/20211008_023109_00054_6f69z_b1751036-f6a8-4227-adff-2c31453e9fd1 from worker WorkerNetAddress{host=172.19.1.71, containerHost=, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=/opt/domain/25d94025-90e3-4977-9d3e-9a3fb7e147ea, tieredIdentity=TieredIdentity(node=172.19.1.71, rack=null)}. This worker will be skipped for future read operations, will retry: java.io.FileNotFoundException: /dev/shm/alluxioworker/218103828 (No such file or directory).
2021-10-13T11:04:13.516Z WARN 20211013_110413_00019_8jxku.1.5-51-66 alluxio.client.file.AlluxioFileInStream Failed to read block 134217750 of file /user/hive/warehouse/sf100.db/lineitem/20211008_023109_00054_6f69z_2a1b1499-f28d-4d86-b5fa-b02c4f1d54da from worker WorkerNetAddress{host=172.19.1.71, containerHost=, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=/opt/domain/25d94025-90e3-4977-9d3e-9a3fb7e147ea, tieredIdentity=TieredIdentity(node=172.19.1.71, rack=null)}. This worker will be skipped for future read operations, will retry: java.io.FileNotFoundException: /dev/shm/alluxioworker/134217750 (No such file or directory).
2021-10-13T11:04:13.516Z WARN 20211013_110413_00019_8jxku.1.5-25-64 alluxio.client.file.AlluxioFileInStream Failed to read block 218103828 of file /user/hive/warehouse/sf100.db/lineitem/20211008_023109_00054_6f69z_b1751036-f6a8-4227-adff-2c31453e9fd1 from worker WorkerNetAddress{host=172.19.1.71, containerHost=, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=/opt/domain/25d94025-90e3-4977-9d3e-9a3fb7e147ea, tieredIdentity=TieredIdentity(node=172.19.1.71, rack=null)}. This worker will be skipped for future read operations, will retry: java.io.FileNotFoundException: /dev/shm/alluxioworker/218103828 (No such file or directory).
2021-10-13T11:04:13.516Z WARN 20211013_110413_00019_8jxku.1.5-62-60 alluxio.client.file.AlluxioFileInStream Failed to read block 134217750 of file /user/hive/warehouse/sf100.db/lineitem/20211008_023109_00054_6f69z_2a1b1499-f28d-4d86-b5fa-b02c4f1d54da from worker WorkerNetAddress{host=172.19.1.71, containerHost=, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=/opt/domain/25d94025-90e3-4977-9d3e-9a3fb7e147ea, tieredIdentity=TieredIdentity(node=172.19.1.71, rack=null)}. This worker will be skipped for future read operations, will retry: java.io.FileNotFoundException: /dev/shm/alluxioworker/134217750 (No such file or directory)
@sfwanyi did the Trino query fail, or is it still returning the correct result? I have never seen this kind of error before. It looks more like something is wrong with your RAM disk (I'm just guessing).
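To narrow down the RAM disk guess, it can help to condense the repeated warnings into the distinct blocks and directories involved. The following Python sketch assumes the warning text has the same shape as the Trino log lines in this thread. `summarize`, `parent_dirs`, and `sample_line` are hypothetical helpers, and the sample lines are abbreviated (timestamps dropped, worker address shortened).

```python
import posixpath
import re
from collections import Counter

# One match per "Failed to read block ..." warning that ends in a
# FileNotFoundException, even when several warnings share one physical line.
WARN_RE = re.compile(
    r"Failed to read block (?P<block>\d+) of file (?P<file>\S+) from worker "
    r".*?java\.io\.FileNotFoundException: (?P<path>\S+) "
    r"\(No such file or directory\)"
)


def summarize(log_text):
    """Count warnings per (block id, missing local path)."""
    counts = Counter()
    for m in WARN_RE.finditer(log_text):
        counts[(int(m.group("block")), m.group("path"))] += 1
    return counts


def parent_dirs(counts):
    """Distinct directories the client tried to read block files from."""
    return sorted({posixpath.dirname(path) for (_, path) in counts})


def sample_line(block, file_name):
    # Abbreviated reconstruction of a warning from this thread.
    return (
        "WARN alluxio.client.file.AlluxioFileInStream Failed to read block "
        f"{block} of file /user/hive/warehouse/sf100.db/lineitem/{file_name} "
        "from worker WorkerNetAddress{host=172.19.1.71}. This worker will be "
        "skipped for future read operations, will retry: "
        f"java.io.FileNotFoundException: /dev/shm/alluxioworker/{block} "
        "(No such file or directory)."
    )


SAMPLE = "\n".join([
    sample_line(201326609, "20211008_023109_00054_6f69z_92553410-454f-426a-b52f-3cf12c4db156"),
    sample_line(218103828, "20211008_023109_00054_6f69z_b1751036-f6a8-4227-adff-2c31453e9fd1"),
    sample_line(218103828, "20211008_023109_00054_6f69z_b1751036-f6a8-4227-adff-2c31453e9fd1"),
])

if __name__ == "__main__":
    counts = summarize(SAMPLE)
    for (block, path), n in sorted(counts.items()):
        print(block, path, n)
    print("directories to inspect on the client side:", parent_dirs(counts))
```

The directories it prints are the ones the Trino process must be able to see for short-circuit reads. Whether they are actually mounted inside the Trino container is something to verify on the cluster; the script does not check that.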
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2915) (Zero Copy GrpcDataReader)
2021-10-18 06:34:56,766 WARN InodeSyncStream - Failed to sync metadata on root path InodeSyncStream{rootPath=LockingScheme{path=/ml-100k/u.user, desiredLockPattern=READ, shouldSync=true}, descendantType=ONE, commonOptions=syncIntervalMs: 30000 ttl: -1 ttlAction: DELETE , forceSync=false, isGetFileInfo=false}: alluxio.exception.FileDoesNotExistException: Path "/ml-100k/u.user" does not exist.
2021-10-18 06:36:08,625 WARN DefaultFileSystemMaster - The persist job (id=1634530850072) for file /ml-100k/genome-scores.csv (id=1275068415) failed: Task execution failed: /mnt/ramdisk/alluxioworker/1258291200 (No such file or directory)
2021-10-18 06:37:00,628 WARN DefaultFileSystemMaster - The persist job (id=1634530850073) for file /ml-100k/u.user (id=1291845631) failed: Task execution failed: /mnt/ramdisk/alluxioworker/1275068416 (No such file or directory)
2021-10-18 06:38:11,631 WARN DefaultFileSystemMaster - The persist job (id=1634530850074) for file /ml-100k/genome-scores.csv (id=1275068415) failed: Task execution failed: Failed to read block ID=1258291200 from tiered storage and UFS tier: java.io.IOException: Failed to read from UFS, sessionId=1985375140884839335, blockId=1258291200, offset=0, positionShort=false, options=offset_in_file: 0 block_size: 67108864 maxUfsReadConcurrency: 2147483647 mountId: 1 no_cache: true block_in_ufs_tier: true : java.io.FileNotFoundException: File does not exist: /.alluxio_ufs_blocks.alluxio.0x1D91AC0E01AB0165.tmp/1258291200
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:86)
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:76)
at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:158)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1954)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:755)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:439)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:999)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:927)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2915) (Zero Copy GrpcDataReader)
@beinan
Multiple UUIDs caused the error
Hi @sfwanyi, may I ask what you did to keep only one UUID under the domain socket folder? Thanks in advance.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.
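Alluxio runs on k8s at version 2.6.2, and Trino is version 359. Currently Trino accesses hive-metastore directly, with the parameters configured in the metastore-site.xml configuration file.
Parameters in the alluxio.yaml configuration file: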
csi: accessModes:
jvmOptions:
- -XX:+UnlockExperimentalVMOptions
- -XX:+UseCGroupMemoryLimitForHeap
- -XX:MaxRAMFraction=2
logserver: accessModes:
tieredstore: levels:
ALLUXIO_JAVA_OPTS configuration parameters:
-Dalluxio.master.hostname=alluxio-master-0
-Dalluxio.master.journal.type=UFS
-Dalluxio.master.journal.folder=/journal
-Dalluxio.user.metrics.collection.enabled=true
-Dalluxio.debug=true
-Dalluxio.user.short.circuit.enabled=true
-Dalluxio.worker.data.server.domain.socket.as.uuid=true
-Dalluxio.master.journal.ufs.option.alluxio.underfs.hdfs.configuration=/secrets/hdfsConfig/core-site.xml:/secrets/hdfsConfig/hdfs-site.xml
-Dalluxio.master.mount.table.root.option.alluxio.underfs.version=3.2
-Dalluxio.master.mount.table.root.ufs=hdfs://hdfs-namenodes:8020
-Dalluxio.security.authentication.type=NOSASL
-Dalluxio.security.authorization.permission.enabled=false
-Dalluxio.user.file.metadata.load.type=ONCE
-Dalluxio.user.file.passive.cache.enabled=true
-Dalluxio.user.file.readtype.default=CACHE
-Dalluxio.user.file.writetype.default=ASYNC_THROUGH
-Dalluxio.user.network.data.timeout=10min
-Dalluxio.worker.allocator.class=alluxio.worker.block.allocator.RoundRobinAllocator
-Dalluxio.worker.evictor.class=alluxio.worker.block.evictor.LRUEvictor
-Dalluxio.worker.ramdisk.size=64GB
-Dalluxio.worker.tieredstore.level0.alias=MEM
-Dalluxio.worker.tieredstore.level0.dirs.mediumtype=MEM
-Dalluxio.worker.tieredstore.level0.dirs.path=/dev/shm
-Dalluxio.worker.tieredstore.level0.dirs.quota=64GB
-Dalluxio.worker.tieredstore.level0.watermark.high.ratio=0.95
-Dalluxio.worker.tieredstore.level0.watermark.low.ratio=0.7
-Dalluxio.worker.tieredstore.level1.alias=SSD
-Dalluxio.worker.tieredstore.level1.dirs.mediumtype=SSD
-Dalluxio.worker.tieredstore.level1.dirs.path=/ssd-disk
-Dalluxio.worker.tieredstore.level1.dirs.quota=100GB
-Dalluxio.worker.tieredstore.level1.watermark.high.ratio=0.9
-Dalluxio.worker.tieredstore.level1.watermark.low.ratio=0.7
-Dalluxio.worker.tieredstore.levels=2
-XX:+UnlockExperimentalVMOptions
-XX:+UseCGroupMemoryLimitForHeap
-XX:MaxRAMFraction=2
Trino configuration parameters:
jvm.config:
-Xbootclasspath/a:/etc/trino/alluxio/
alluxio-site.properties:
alluxio.user.metrics.collection.enabled=true
alluxio.user.short.circuit.enabled=true
alluxio.user.metrics.heartbeat.interval=5sec
alluxio.worker.data.server.domain.socket.as.uuid=true
alluxio.worker.data.server.domain.socket.address=/opt/domain
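Created a Trino schema backed by Alluxio storage. Problem: the Alluxio local short-circuit read count is always 0.
It is currently unclear where the problem lies.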
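One low-cost way to start is to compare the short-circuit settings posted for the Alluxio side with those posted for the Trino client. The Python sketch below parses a `-D` option string and a properties text into dictionaries and lines up a few keys. The function names are hypothetical, and the two inputs are excerpts of the settings quoted in this issue, which may not be the complete effective configuration (for example, the Helm chart may set values that are not shown here).

```python
import shlex


def parse_java_opts(opts):
    """Turn '-Dkey=value ...' into a dict; other flags (e.g. -XX:...) are ignored."""
    props = {}
    for tok in shlex.split(opts):
        if tok.startswith("-D") and "=" in tok:
            key, value = tok[2:].split("=", 1)
            props[key] = value
    return props


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def compare(server, client, keys):
    """Map each key to (server value, client value); None means 'not posted'."""
    return {k: (server.get(k), client.get(k)) for k in keys}


# Excerpts copied from the configuration posted in this issue.
SERVER_OPTS = (
    "-Dalluxio.user.short.circuit.enabled=true "
    "-Dalluxio.worker.data.server.domain.socket.as.uuid=true "
    "-Dalluxio.worker.tieredstore.level0.dirs.path=/dev/shm "
    "-XX:MaxRAMFraction=2"
)
CLIENT_PROPS = """
alluxio.user.short.circuit.enabled=true
alluxio.worker.data.server.domain.socket.as.uuid=true
alluxio.worker.data.server.domain.socket.address=/opt/domain
"""
KEYS = [
    "alluxio.user.short.circuit.enabled",
    "alluxio.worker.data.server.domain.socket.as.uuid",
    "alluxio.worker.data.server.domain.socket.address",
]

if __name__ == "__main__":
    result = compare(parse_java_opts(SERVER_OPTS), parse_properties(CLIENT_PROPS), KEYS)
    for key, (server_value, client_value) in result.items():
        print(f"{key}: server={server_value} client={client_value}")
```

A key that shows `None` on one side only means the value was not in the posted excerpt. Check it against the running worker and client before drawing conclusions.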