Open CharlesBuell opened 2 months ago
Do you mind add more env info or the error message?
Thanks for the quick response. Here's some more environment information.
Alluxio is running with one master and one worker node on the same system, with the following site configuration, being attached to HDFS locally:
alluxio-site.properties
alluxio.master.mount.table.root.ufs=hdfs://127.0.0.1:9000/
alluxio.dora.client.ufs.root=hdfs://127.0.0.1:9000/
alluxio.worker.tieredstore.levels=0
alluxio-env.sh
JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
The error is coming from an environment where TPC-DS is being ran with an Apache Hadoop storage medium for data being queried by TPC-DS. Using Apache Spark for executing queries and Alluxio for caching data queried, the following is outputted when a query fails between Spark/Alluxio:
Spark error
org.apache.spark.SparkException: Multiple failures in stage materialization.
at org.apache.spark.sql.errors.QueryExecutionErrors$.multiFailuresInStageMaterializationError(QueryExecutionErrors.scala:2076)
at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.cleanUpAndThrowException(AdaptiveSparkPlanExec.scala:809)
at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:333)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:272)
at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:417)
at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:390)
at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:475)
at org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:76)
at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$2(SparkSQLDriver.scala:76)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66)
at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:76)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:501)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:619)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1$adapted(SparkSQLCLIDriver.scala:613)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:613)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:474)
at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:490)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:229)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1029)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:194)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:217)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1120)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1129)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Suppressed: org.apache.spark.SparkException: Job aborted due to stage failure: Task 65 in stage 3.0 failed 1 times, most recent failure: Lost task 65.0 in stage 3.0 (TID 96) (localhost executor driver): java.lang.IllegalArgumentException: Self-suppression not permitted
at java.base/java.lang.Throwable.addSuppressed(Throwable.java:1054)
at alluxio.client.file.DoraFileOutStream.writeInternal(DoraFileOutStream.java:263)
at alluxio.client.file.DoraFileOutStream.write(DoraFileOutStream.java:226)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:62)
at java.base/java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
at java.base/java.io.BufferedOutputStream.flush(BufferedOutputStream.java:142)
at org.apache.spark.shuffle.S3MeasureOutputStream.flush(S3MeasureOutputStream.scala:40)
at org.apache.spark.shuffle.S3ShuffleMapOutputWriter.getPartitionWriter(S3ShuffleMapOutputWriter.scala:74)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.writePartitionedData(BypassMergeSortShuffleWriter.java:211)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:180)
at org.apache.spark.shuffle.S3ShuffleWriter.write(S3ShuffleWriter.scala:8)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
at org.apache.spark.scheduler.Task.run(Task.scala:141)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Suppressed: java.lang.IllegalArgumentException: Self-suppression not permitted
at java.base/java.lang.Throwable.addSuppressed(Throwable.java:1054)
at alluxio.client.file.DoraFileOutStream.writeInternal(DoraFileOutStream.java:263)
at alluxio.client.file.DoraFileOutStream.write(DoraFileOutStream.java:226)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:62)
at java.base/java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
at java.base/java.io.BufferedOutputStream.flush(BufferedOutputStream.java:142)
at org.apache.spark.shuffle.S3MeasureOutputStream.close(S3MeasureOutputStream.scala:49)
at java.base/java.nio.channels.Channels$WritableByteChannelImpl.implCloseChannel(Channels.java:476)
at java.base/java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:112)
at org.apache.spark.shuffle.S3ShuffleMapOutputWriter.cleanUp(S3ShuffleMapOutputWriter.scala:123)
at org.apache.spark.shuffle.S3ShuffleMapOutputWriter.abort(S3ShuffleMapOutputWriter.scala:118)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:185)
... 14 more
Caused by: alluxio.exception.status.UnknownException: Channel to localhost/127.0.0.1:29997: Append failed for file c8406978a05b73fe78c85a1fe0401fadeb69696c211d276f130f209116aa76a4
at alluxio.exception.status.AlluxioStatusException.from(AlluxioStatusException.java:174)
at alluxio.util.CommonUtils.unwrapResponseFrom(CommonUtils.java:723)
at alluxio.client.file.dora.netty.NettyDataWriter$PacketWriteResponseHandler.channelRead(NettyDataWriter.java:448)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at alluxio.shaded.client.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at alluxio.shaded.client.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at alluxio.shaded.client.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346)
at alluxio.shaded.client.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at alluxio.shaded.client.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at alluxio.shaded.client.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
at alluxio.shaded.client.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
at alluxio.shaded.client.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
at alluxio.shaded.client.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
at alluxio.shaded.client.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
at alluxio.shaded.client.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
at alluxio.shaded.client.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at alluxio.shaded.client.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
... 1 more
Suppressed: alluxio.shaded.client.io.netty.channel.StacklessClosedChannelException
at alluxio.shaded.client.io.netty.channel.AbstractChannel.close(ChannelPromise)(Unknown Source)
Suppressed: [CIRCULAR REFERENCE: alluxio.exception.status.UnknownException: Channel to localhost/127.0.0.1:29997: Append failed for file c8406978a05b73fe78c85a1fe0401fadeb69696c211d276f130f209116aa76a4]
Suppressed: java.io.IOException: Channel [id: 0x17db5241, L:0.0.0.0/0.0.0.0:42622 ! R:localhost/127.0.0.1:29997] is closed when writing block 0.
at alluxio.client.file.dora.netty.NettyDataWriter$PacketWriteResponseHandler.channelUnregistered(NettyDataWriter.java:476)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:219)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:195)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:188)
at alluxio.shaded.client.io.netty.channel.DefaultChannelPipeline$HeadContext.channelUnregistered(DefaultChannelPipeline.java:1388)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:215)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:195)
at alluxio.shaded.client.io.netty.channel.DefaultChannelPipeline.fireChannelUnregistered(DefaultChannelPipeline.java:821)
at alluxio.shaded.client.io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:821)
at alluxio.shaded.client.io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
at alluxio.shaded.client.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
at alluxio.shaded.client.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
at alluxio.shaded.client.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:566)
... 3 more
Caused by: [CIRCULAR REFERENCE: alluxio.exception.status.UnknownException: Channel to localhost/127.0.0.1:29997: Append failed for file c8406978a05b73fe78c85a1fe0401fadeb69696c211d276f130f209116aa76a4]
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2856)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2792)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2791)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2791)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1247)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1247)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1247)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3060)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2994)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2983)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: [CIRCULAR REFERENCE: org.apache.spark.SparkException: Job aborted due to stage failure: Task 65 in stage 3.0 failed 1 times, most recent failure: Lost task 65.0 in stage 3.0 (TID 96) (localhost executor driver): java.lang.IllegalArgumentException: Self-suppression not permitted
at java.base/java.lang.Throwable.addSuppressed(Throwable.java:1054)
at alluxio.client.file.DoraFileOutStream.writeInternal(DoraFileOutStream.java:263)
at alluxio.client.file.DoraFileOutStream.write(DoraFileOutStream.java:226)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:62)
at java.base/java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
at java.base/java.io.BufferedOutputStream.flush(BufferedOutputStream.java:142)
at org.apache.spark.shuffle.S3MeasureOutputStream.flush(S3MeasureOutputStream.scala:40)
at org.apache.spark.shuffle.S3ShuffleMapOutputWriter.getPartitionWriter(S3ShuffleMapOutputWriter.scala:74)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.writePartitionedData(BypassMergeSortShuffleWriter.java:211)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:180)
at org.apache.spark.shuffle.S3ShuffleWriter.write(S3ShuffleWriter.scala:8)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
at org.apache.spark.scheduler.Task.run(Task.scala:141)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Suppressed: java.lang.IllegalArgumentException: Self-suppression not permitted
at java.base/java.lang.Throwable.addSuppressed(Throwable.java:1054)
at alluxio.client.file.DoraFileOutStream.writeInternal(DoraFileOutStream.java:263)
at alluxio.client.file.DoraFileOutStream.write(DoraFileOutStream.java:226)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:62)
at java.base/java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
at java.base/java.io.BufferedOutputStream.flush(BufferedOutputStream.java:142)
at org.apache.spark.shuffle.S3MeasureOutputStream.close(S3MeasureOutputStream.scala:49)
at java.base/java.nio.channels.Channels$WritableByteChannelImpl.implCloseChannel(Channels.java:476)
at java.base/java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:112)
at org.apache.spark.shuffle.S3ShuffleMapOutputWriter.cleanUp(S3ShuffleMapOutputWriter.scala:123)
at org.apache.spark.shuffle.S3ShuffleMapOutputWriter.abort(S3ShuffleMapOutputWriter.scala:118)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:185)
... 14 more
Caused by: alluxio.exception.status.UnknownException: Channel to localhost/127.0.0.1:29997: Append failed for file c8406978a05b73fe78c85a1fe0401fadeb69696c211d276f130f209116aa76a4
at alluxio.exception.status.AlluxioStatusException.from(AlluxioStatusException.java:174)
at alluxio.util.CommonUtils.unwrapResponseFrom(CommonUtils.java:723)
at alluxio.client.file.dora.netty.NettyDataWriter$PacketWriteResponseHandler.channelRead(NettyDataWriter.java:448)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at alluxio.shaded.client.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at alluxio.shaded.client.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at alluxio.shaded.client.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346)
at alluxio.shaded.client.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at alluxio.shaded.client.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at alluxio.shaded.client.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
at alluxio.shaded.client.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
at alluxio.shaded.client.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
at alluxio.shaded.client.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
at alluxio.shaded.client.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
at alluxio.shaded.client.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
at alluxio.shaded.client.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at alluxio.shaded.client.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
... 1 more
Suppressed: alluxio.shaded.client.io.netty.channel.StacklessClosedChannelException
at alluxio.shaded.client.io.netty.channel.AbstractChannel.close(ChannelPromise)(Unknown Source)
Suppressed: [CIRCULAR REFERENCE: alluxio.exception.status.UnknownException: Channel to localhost/127.0.0.1:29997: Append failed for file c8406978a05b73fe78c85a1fe0401fadeb69696c211d276f130f209116aa76a4]
Suppressed: java.io.IOException: Channel [id: 0x17db5241, L:0.0.0.0/0.0.0.0:42622 ! R:localhost/127.0.0.1:29997] is closed when writing block 0.
at alluxio.client.file.dora.netty.NettyDataWriter$PacketWriteResponseHandler.channelUnregistered(NettyDataWriter.java:476)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:219)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:195)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:188)
at alluxio.shaded.client.io.netty.channel.DefaultChannelPipeline$HeadContext.channelUnregistered(DefaultChannelPipeline.java:1388)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:215)
at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:195)
at alluxio.shaded.client.io.netty.channel.DefaultChannelPipeline.fireChannelUnregistered(DefaultChannelPipeline.java:821)
at alluxio.shaded.client.io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:821)
at alluxio.shaded.client.io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
at alluxio.shaded.client.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
at alluxio.shaded.client.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
at alluxio.shaded.client.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:566)
... 3 more
Caused by: [CIRCULAR REFERENCE: alluxio.exception.status.UnknownException: Channel to localhost/127.0.0.1:29997: Append failed for file c8406978a05b73fe78c85a1fe0401fadeb69696c211d276f130f209116aa76a4]
Driver stacktrace:]
Alluxio error (left some additional debug logs around it for context if needed)
2024-08-23 00:34:43,049 DEBUG [worker-rpc-executor-TPE-thread-2424](AuthenticatedUserInjector.java:96) - Acquiring credentials for service-method: alluxio.grpc.block.BlockWorker/GetStatus on channel: e22b1d91-1ed6-4233-a3be-b31e79320e50
2024-08-23 00:34:43,049 DEBUG [NettyReaderExecutor-12443](LocalCacheManager.java:430) - putInternal(PageId{FileId=eb9ea488437311a93e0aead7398d0892df232298485b5a814407d8250f575105, PageIndex=0},1048576 bytes) enters
2024-08-23 00:34:43,050 ERROR [NettyReaderExecutor-12817](DefaultPageMetaStore.java:223) - Invalid result returned by evictor: page PageId{FileId=f7e2d893c181d37ea09a20af015f59c381e3493ad4dbab67b8d43ba4a488c1cb, PageIndex=0} not available
2024-08-23 00:34:43,052 ERROR [NettyReaderExecutor-12817](LocalCacheManager.java:461) - Unable to find page to evict: space used 487142919, page length 1042648, cache size 488064465
2024-08-23 00:34:43,050 DEBUG [NettyReaderExecutor-12839](LocalCacheManager.java:430) - putInternal(PageId{FileId=b2f20af747c70dea2c2b3a220e5bdb28e12ce0a3e94fd7df095584cfd283a9f9, PageIndex=0},1048576 bytes) enters
2024-08-23 00:34:43,049 DEBUG [worker-rpc-executor-TPE-thread-2434](AuthenticatedUserInjector.java:96) - Acquiring credentials for service-method: alluxio.grpc.block.BlockWorker/GetStatus on channel: 10930bcf-43ac-4b39-9053-9937cd42498e
2024-08-23 00:34:43,052 ERROR [NettyReaderExecutor-12795](DefaultPageMetaStore.java:223) - Invalid result returned by evictor: page PageId{FileId=a67ab3c108d458fe607c8e24f346de082724612baef30a3d2a6969b9e47707fc, PageIndex=0} not available
2024-08-23 00:34:43,053 ERROR [NettyReaderExecutor-12795](LocalCacheManager.java:461) - Unable to find page to evict: space used 487142919, page length 1048576, cache size 488064465
2024-08-23 00:34:43,052 DEBUG [NettyReaderExecutor-12817](LocalCacheManager.java:338) - put(PageId{FileId=1d712123a45e300893de23c24703af413e98634c60fa588a73e0b3f22c9107a3, PageIndex=0},0 bytes) exits: false
2024-08-23 00:34:43,053 DEBUG [NettyReaderExecutor-12817](DebugLoggingTracer.java:42) - State transitioned from CHANNEL_IDLE to READING_DATA because of trigger REQUEST_RECEIVED
2024-08-23 00:34:43,053 DEBUG [NettyReaderExecutor-12817](DebugLoggingTracer.java:37) - Trigger fired: DATA_AVAILABLE
2024-08-23 00:34:43,053 DEBUG [NettyReaderExecutor-12817](DebugLoggingTracer.java:42) - State transitioned from READING_DATA to SENDING_DATA because of trigger DATA_AVAILABLE
2024-08-23 00:34:43,053 DEBUG [NettyReaderExecutor-12817](DebugLoggingTracer.java:37) - Trigger fired: OUTPUT_LENGTH_FULFILLED
2024-08-23 00:34:43,053 DEBUG [NettyReaderExecutor-12860](LocalCacheManager.java:504) - Page PageId{FileId=ebf731c6c100e4b2cb802f7701be3f8bfe6443f914e941f970374026e27d9e52, PageIndex=0} is unavailable to evict, likely due to a benign race
2024-08-23 00:34:43,054 DEBUG [NettyReaderExecutor-12860](LocalCacheManager.java:430) - putInternal(PageId{FileId=b2f20af747c70dea2c2b3a220e5bdb28e12ce0a3e94fd7df095584cfd283a9f9, PageIndex=0},1048576 bytes) enters
2024-08-23 00:34:43,053 DEBUG [NettyReaderExecutor-12795](LocalCacheManager.java:338) - put(PageId{FileId=eb9ea488437311a93e0aead7398d0892df232298485b5a814407d8250f575105, PageIndex=0},0 bytes) exits: false
2024-08-23 00:34:43,054 DEBUG [NettyReaderExecutor-12795](DebugLoggingTracer.java:42) - State transitioned from CHANNEL_IDLE to READING_DATA because of trigger REQUEST_RECEIVED
2024-08-23 00:34:43,055 DEBUG [NettyReaderExecutor-12795](DebugLoggingTracer.java:37) - Trigger fired: DATA_AVAILABLE
2024-08-23 00:34:43,055 DEBUG [NettyReaderExecutor-12795](DebugLoggingTracer.java:42) - State transitioned from READING_DATA to SENDING_DATA because of trigger DATA_AVAILABLE
2024-08-23 00:34:43,055 DEBUG [NettyReaderExecutor-12795](DebugLoggingTracer.java:37) - Trigger fired: OUTPUT_LENGTH_FULFILLED
2024-08-23 00:34:43,052 DEBUG [NettyReaderExecutor-12598](DebugLoggingTracer.java:42) - State transitioned from COMPLETING_REQUEST to CHANNEL_IDLE because of trigger REQUEST_COMPLETED
2024-08-23 00:34:43,055 DEBUG [NettyReaderExecutor-12598](DebugLoggingTracer.java:37) - Trigger fired: REQUEST_RECEIVED
2024-08-23 00:34:43,055 DEBUG [NettyReaderExecutor-12795](DebugLoggingTracer.java:42) - State transitioned from SENDING_DATA to EOF because of trigger OUTPUT_LENGTH_FULFILLED
2024-08-23 00:34:43,055 DEBUG [NettyReaderExecutor-12795](DebugLoggingTracer.java:37) - Trigger fired: COMPLETE_REQUES
@jja725 Hello, do you need any more information for this issue? Could it be possible to fix this through alluxio-site.properties
? Thanks.
how about set a bigger cache size in alluxio-site.properties?
try to allocate more cache. modify alluxio.worker.page.store.sizes=<a bigger size>
, the default size is 512MB
Increasing it does help a little in terms of how many queries are able to complete, but it doesn't completely solve the issue. Once I get to around the size of the dataset, then all the queries are able to complete.
Alluxio Version
v313
Describe the bug
Two problems: Multiple write requests can try to evict the same page in cache and only one can succeed, the other is the evict page size is not checked against the requested size so the cache eviction can not free enough space for write request.
A TPC benchmark to do with big data and hadoop, using alluxio as a caching layer.
Multiple queries coming at the same time during the throughput query stage of a benchmark will end up failing due to cache eviction issues described earlier.
Please let me know if additional information is required to get this fixed. Thanks!