Alluxio / alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud
https://www.alluxio.io
Apache License 2.0

[v313] Cache eviction issue with multiple requests and unequal page/requested size #18674

Open CharlesBuell opened 3 months ago

CharlesBuell commented 3 months ago

Alluxio Version

v313

Describe the bug

Two problems: (1) multiple write requests can try to evict the same page in the cache, and only one can succeed; (2) the evicted page's size is not checked against the requested size, so cache eviction may not free enough space for the write request.

The workload is a TPC benchmark over big data on Hadoop, using Alluxio as a caching layer.

Multiple queries arriving at the same time during the throughput query stage of the benchmark end up failing due to the cache eviction issues described above.
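
To make the two failure modes concrete, here is a minimal sketch. The types and method names are hypothetical simplifications, not Alluxio's actual LocalCacheManager/evictor code; the byte counts are taken from the worker log further down in this thread.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified model of the page cache -- not Alluxio's real code.
public class EvictionRaceSketch {
  static final long CACHE_SIZE = 488_064_465L;   // cache size, from the worker log
  static final Map<String, Long> pages = new ConcurrentHashMap<>();
  static long spaceUsed = 487_142_919L;          // space used, from the worker log

  // Problem 1: two writers can be handed the same eviction candidate;
  // only the first remove() succeeds, the second sees the page as gone.
  static boolean evict(String victim) {
    return pages.remove(victim) != null;
  }

  // Problem 2: the victim's size is never checked against the incoming
  // page's size, so even a successful eviction may free too little space.
  static boolean put(String pageId, long pageLen, String victim, long victimLen) {
    if (spaceUsed + pageLen > CACHE_SIZE) {
      if (!evict(victim)) {
        System.out.println("put " + pageId + " fails: victim already evicted");
        return false;                            // the concurrency race
      }
      spaceUsed -= victimLen;
      if (spaceUsed + pageLen > CACHE_SIZE) {
        System.out.println("put " + pageId + " fails: eviction freed too little");
        return false;                            // the size-mismatch problem
      }
    }
    pages.put(pageId, pageLen);
    spaceUsed += pageLen;
    return true;
  }

  public static void main(String[] args) {
    pages.put("victim", 1_042_648L);
    // Two "concurrent" writers handed the same victim (simulated sequentially):
    put("pageA", 1_048_576L, "victim", 1_042_648L); // wins the race and succeeds
    put("pageB", 1_048_576L, "victim", 1_042_648L); // loses: "victim already evicted"
  }
}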

Please let me know if additional information is required to get this fixed. Thanks!

jja725 commented 3 months ago

Do you mind adding more environment info or the error message?

CharlesBuell commented 2 months ago

Thanks for the quick response. Here's some more environment information.

Alluxio is running with one master and one worker node on the same system, attached to a local HDFS, with the following site configuration:

alluxio-site.properties

alluxio.master.mount.table.root.ufs=hdfs://127.0.0.1:9000/
alluxio.dora.client.ufs.root=hdfs://127.0.0.1:9000/

alluxio.worker.tieredstore.levels=0

alluxio-env.sh

JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

The error comes from an environment where TPC-DS is run against data stored in Apache Hadoop, with Apache Spark executing the queries and Alluxio caching the queried data. The following is output when a query fails between Spark and Alluxio:

Spark error

org.apache.spark.SparkException: Multiple failures in stage materialization.
    at org.apache.spark.sql.errors.QueryExecutionErrors$.multiFailuresInStageMaterializationError(QueryExecutionErrors.scala:2076)
    at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.cleanUpAndThrowException(AdaptiveSparkPlanExec.scala:809)
    at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:333)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
    at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:272)
    at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:417)
    at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:390)
    at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:475)
    at org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:76)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$2(SparkSQLDriver.scala:76)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:76)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:501)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:619)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1$adapted(SparkSQLCLIDriver.scala:613)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:613)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
    at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:474)
    at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:490)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:229)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1029)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:194)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:217)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1120)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1129)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    Suppressed: org.apache.spark.SparkException: Job aborted due to stage failure: Task 65 in stage 3.0 failed 1 times, most recent failure: Lost task 65.0 in stage 3.0 (TID 96) (localhost executor driver): java.lang.IllegalArgumentException: Self-suppression not permitted
    at java.base/java.lang.Throwable.addSuppressed(Throwable.java:1054)
    at alluxio.client.file.DoraFileOutStream.writeInternal(DoraFileOutStream.java:263)
    at alluxio.client.file.DoraFileOutStream.write(DoraFileOutStream.java:226)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:62)
    at java.base/java.io.DataOutputStream.write(DataOutputStream.java:107)
    at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
    at java.base/java.io.BufferedOutputStream.flush(BufferedOutputStream.java:142)
    at org.apache.spark.shuffle.S3MeasureOutputStream.flush(S3MeasureOutputStream.scala:40)
    at org.apache.spark.shuffle.S3ShuffleMapOutputWriter.getPartitionWriter(S3ShuffleMapOutputWriter.scala:74)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.writePartitionedData(BypassMergeSortShuffleWriter.java:211)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:180)
    at org.apache.spark.shuffle.S3ShuffleWriter.write(S3ShuffleWriter.scala:8)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
    at org.apache.spark.scheduler.Task.run(Task.scala:141)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
    Suppressed: java.lang.IllegalArgumentException: Self-suppression not permitted
        at java.base/java.lang.Throwable.addSuppressed(Throwable.java:1054)
        at alluxio.client.file.DoraFileOutStream.writeInternal(DoraFileOutStream.java:263)
        at alluxio.client.file.DoraFileOutStream.write(DoraFileOutStream.java:226)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:62)
        at java.base/java.io.DataOutputStream.write(DataOutputStream.java:107)
        at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
        at java.base/java.io.BufferedOutputStream.flush(BufferedOutputStream.java:142)
        at org.apache.spark.shuffle.S3MeasureOutputStream.close(S3MeasureOutputStream.scala:49)
        at java.base/java.nio.channels.Channels$WritableByteChannelImpl.implCloseChannel(Channels.java:476)
        at java.base/java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:112)
        at org.apache.spark.shuffle.S3ShuffleMapOutputWriter.cleanUp(S3ShuffleMapOutputWriter.scala:123)
        at org.apache.spark.shuffle.S3ShuffleMapOutputWriter.abort(S3ShuffleMapOutputWriter.scala:118)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:185)
        ... 14 more
    Caused by: alluxio.exception.status.UnknownException: Channel to localhost/127.0.0.1:29997: Append failed for file c8406978a05b73fe78c85a1fe0401fadeb69696c211d276f130f209116aa76a4
        at alluxio.exception.status.AlluxioStatusException.from(AlluxioStatusException.java:174)
        at alluxio.util.CommonUtils.unwrapResponseFrom(CommonUtils.java:723)
        at alluxio.client.file.dora.netty.NettyDataWriter$PacketWriteResponseHandler.channelRead(NettyDataWriter.java:448)
        at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
        at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
        at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
        at alluxio.shaded.client.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
        at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
        at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
        at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
        at alluxio.shaded.client.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
        at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
        at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
        at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
        at alluxio.shaded.client.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346)
        at alluxio.shaded.client.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318)
        at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
        at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
        at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
        at alluxio.shaded.client.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
        at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
        at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
        at alluxio.shaded.client.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
        at alluxio.shaded.client.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
        at alluxio.shaded.client.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
        at alluxio.shaded.client.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
        at alluxio.shaded.client.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
        at alluxio.shaded.client.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
        at alluxio.shaded.client.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
        at alluxio.shaded.client.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        ... 1 more
        Suppressed: alluxio.shaded.client.io.netty.channel.StacklessClosedChannelException
            at alluxio.shaded.client.io.netty.channel.AbstractChannel.close(ChannelPromise)(Unknown Source)
            Suppressed: [CIRCULAR REFERENCE: alluxio.exception.status.UnknownException: Channel to localhost/127.0.0.1:29997: Append failed for file c8406978a05b73fe78c85a1fe0401fadeb69696c211d276f130f209116aa76a4]
        Suppressed: java.io.IOException: Channel [id: 0x17db5241, L:0.0.0.0/0.0.0.0:42622 ! R:localhost/127.0.0.1:29997] is closed when writing block 0.
            at alluxio.client.file.dora.netty.NettyDataWriter$PacketWriteResponseHandler.channelUnregistered(NettyDataWriter.java:476)
            at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:219)
            at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:195)
            at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:188)
            at alluxio.shaded.client.io.netty.channel.DefaultChannelPipeline$HeadContext.channelUnregistered(DefaultChannelPipeline.java:1388)
            at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:215)
            at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:195)
            at alluxio.shaded.client.io.netty.channel.DefaultChannelPipeline.fireChannelUnregistered(DefaultChannelPipeline.java:821)
            at alluxio.shaded.client.io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:821)
            at alluxio.shaded.client.io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
            at alluxio.shaded.client.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
            at alluxio.shaded.client.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
            at alluxio.shaded.client.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:566)
            ... 3 more
Caused by: [CIRCULAR REFERENCE: alluxio.exception.status.UnknownException: Channel to localhost/127.0.0.1:29997: Append failed for file c8406978a05b73fe78c85a1fe0401fadeb69696c211d276f130f209116aa76a4]

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2856)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2792)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2791)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2791)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1247)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1247)
        at scala.Option.foreach(Option.scala:407)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1247)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3060)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2994)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2983)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: [CIRCULAR REFERENCE: org.apache.spark.SparkException: Job aborted due to stage failure: Task 65 in stage 3.0 failed 1 times, most recent failure: Lost task 65.0 in stage 3.0 (TID 96) (localhost executor driver): java.lang.IllegalArgumentException: Self-suppression not permitted
    at java.base/java.lang.Throwable.addSuppressed(Throwable.java:1054)
    at alluxio.client.file.DoraFileOutStream.writeInternal(DoraFileOutStream.java:263)
    at alluxio.client.file.DoraFileOutStream.write(DoraFileOutStream.java:226)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:62)
    at java.base/java.io.DataOutputStream.write(DataOutputStream.java:107)
    at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
    at java.base/java.io.BufferedOutputStream.flush(BufferedOutputStream.java:142)
    at org.apache.spark.shuffle.S3MeasureOutputStream.flush(S3MeasureOutputStream.scala:40)
    at org.apache.spark.shuffle.S3ShuffleMapOutputWriter.getPartitionWriter(S3ShuffleMapOutputWriter.scala:74)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.writePartitionedData(BypassMergeSortShuffleWriter.java:211)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:180)
    at org.apache.spark.shuffle.S3ShuffleWriter.write(S3ShuffleWriter.scala:8)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
    at org.apache.spark.scheduler.Task.run(Task.scala:141)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
    Suppressed: java.lang.IllegalArgumentException: Self-suppression not permitted
        at java.base/java.lang.Throwable.addSuppressed(Throwable.java:1054)
        at alluxio.client.file.DoraFileOutStream.writeInternal(DoraFileOutStream.java:263)
        at alluxio.client.file.DoraFileOutStream.write(DoraFileOutStream.java:226)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:62)
        at java.base/java.io.DataOutputStream.write(DataOutputStream.java:107)
        at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
        at java.base/java.io.BufferedOutputStream.flush(BufferedOutputStream.java:142)
        at org.apache.spark.shuffle.S3MeasureOutputStream.close(S3MeasureOutputStream.scala:49)
        at java.base/java.nio.channels.Channels$WritableByteChannelImpl.implCloseChannel(Channels.java:476)
        at java.base/java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:112)
        at org.apache.spark.shuffle.S3ShuffleMapOutputWriter.cleanUp(S3ShuffleMapOutputWriter.scala:123)
        at org.apache.spark.shuffle.S3ShuffleMapOutputWriter.abort(S3ShuffleMapOutputWriter.scala:118)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:185)
        ... 14 more
    Caused by: alluxio.exception.status.UnknownException: Channel to localhost/127.0.0.1:29997: Append failed for file c8406978a05b73fe78c85a1fe0401fadeb69696c211d276f130f209116aa76a4
        at alluxio.exception.status.AlluxioStatusException.from(AlluxioStatusException.java:174)
        at alluxio.util.CommonUtils.unwrapResponseFrom(CommonUtils.java:723)
        at alluxio.client.file.dora.netty.NettyDataWriter$PacketWriteResponseHandler.channelRead(NettyDataWriter.java:448)
        at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
        at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
        at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
        at alluxio.shaded.client.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
        at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
        at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
        at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
        at alluxio.shaded.client.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
        at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
        at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
        at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
        at alluxio.shaded.client.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346)
        at alluxio.shaded.client.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318)
        at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
        at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
        at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
        at alluxio.shaded.client.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
        at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
        at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
        at alluxio.shaded.client.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
        at alluxio.shaded.client.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
        at alluxio.shaded.client.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
        at alluxio.shaded.client.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
        at alluxio.shaded.client.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
        at alluxio.shaded.client.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
        at alluxio.shaded.client.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
        at alluxio.shaded.client.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        ... 1 more
        Suppressed: alluxio.shaded.client.io.netty.channel.StacklessClosedChannelException
            at alluxio.shaded.client.io.netty.channel.AbstractChannel.close(ChannelPromise)(Unknown Source)
            Suppressed: [CIRCULAR REFERENCE: alluxio.exception.status.UnknownException: Channel to localhost/127.0.0.1:29997: Append failed for file c8406978a05b73fe78c85a1fe0401fadeb69696c211d276f130f209116aa76a4]
        Suppressed: java.io.IOException: Channel [id: 0x17db5241, L:0.0.0.0/0.0.0.0:42622 ! R:localhost/127.0.0.1:29997] is closed when writing block 0.
            at alluxio.client.file.dora.netty.NettyDataWriter$PacketWriteResponseHandler.channelUnregistered(NettyDataWriter.java:476)
            at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:219)
            at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:195)
            at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:188)
            at alluxio.shaded.client.io.netty.channel.DefaultChannelPipeline$HeadContext.channelUnregistered(DefaultChannelPipeline.java:1388)
            at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:215)
            at alluxio.shaded.client.io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:195)
            at alluxio.shaded.client.io.netty.channel.DefaultChannelPipeline.fireChannelUnregistered(DefaultChannelPipeline.java:821)
            at alluxio.shaded.client.io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:821)
            at alluxio.shaded.client.io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
            at alluxio.shaded.client.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
            at alluxio.shaded.client.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
            at alluxio.shaded.client.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:566)
            ... 3 more
Caused by: [CIRCULAR REFERENCE: alluxio.exception.status.UnknownException: Channel to localhost/127.0.0.1:29997: Append failed for file c8406978a05b73fe78c85a1fe0401fadeb69696c211d276f130f209116aa76a4]

Driver stacktrace:]

Alluxio error (I left some additional debug logs around it for context, if needed)

2024-08-23 00:34:43,049 DEBUG [worker-rpc-executor-TPE-thread-2424](AuthenticatedUserInjector.java:96) - Acquiring credentials for service-method: alluxio.grpc.block.BlockWorker/GetStatus on channel: e22b1d91-1ed6-4233-a3be-b31e79320e50
2024-08-23 00:34:43,049 DEBUG [NettyReaderExecutor-12443](LocalCacheManager.java:430) - putInternal(PageId{FileId=eb9ea488437311a93e0aead7398d0892df232298485b5a814407d8250f575105, PageIndex=0},1048576 bytes) enters
2024-08-23 00:34:43,050 ERROR [NettyReaderExecutor-12817](DefaultPageMetaStore.java:223) - Invalid result returned by evictor: page PageId{FileId=f7e2d893c181d37ea09a20af015f59c381e3493ad4dbab67b8d43ba4a488c1cb, PageIndex=0} not available
2024-08-23 00:34:43,052 ERROR [NettyReaderExecutor-12817](LocalCacheManager.java:461) - Unable to find page to evict: space used 487142919, page length 1042648, cache size 488064465
2024-08-23 00:34:43,050 DEBUG [NettyReaderExecutor-12839](LocalCacheManager.java:430) - putInternal(PageId{FileId=b2f20af747c70dea2c2b3a220e5bdb28e12ce0a3e94fd7df095584cfd283a9f9, PageIndex=0},1048576 bytes) enters
2024-08-23 00:34:43,049 DEBUG [worker-rpc-executor-TPE-thread-2434](AuthenticatedUserInjector.java:96) - Acquiring credentials for service-method: alluxio.grpc.block.BlockWorker/GetStatus on channel: 10930bcf-43ac-4b39-9053-9937cd42498e
2024-08-23 00:34:43,052 ERROR [NettyReaderExecutor-12795](DefaultPageMetaStore.java:223) - Invalid result returned by evictor: page PageId{FileId=a67ab3c108d458fe607c8e24f346de082724612baef30a3d2a6969b9e47707fc, PageIndex=0} not available
2024-08-23 00:34:43,053 ERROR [NettyReaderExecutor-12795](LocalCacheManager.java:461) - Unable to find page to evict: space used 487142919, page length 1048576, cache size 488064465
2024-08-23 00:34:43,052 DEBUG [NettyReaderExecutor-12817](LocalCacheManager.java:338) - put(PageId{FileId=1d712123a45e300893de23c24703af413e98634c60fa588a73e0b3f22c9107a3, PageIndex=0},0 bytes) exits: false
2024-08-23 00:34:43,053 DEBUG [NettyReaderExecutor-12817](DebugLoggingTracer.java:42) - State transitioned from CHANNEL_IDLE to READING_DATA because of trigger REQUEST_RECEIVED
2024-08-23 00:34:43,053 DEBUG [NettyReaderExecutor-12817](DebugLoggingTracer.java:37) - Trigger fired: DATA_AVAILABLE
2024-08-23 00:34:43,053 DEBUG [NettyReaderExecutor-12817](DebugLoggingTracer.java:42) - State transitioned from READING_DATA to SENDING_DATA because of trigger DATA_AVAILABLE
2024-08-23 00:34:43,053 DEBUG [NettyReaderExecutor-12817](DebugLoggingTracer.java:37) - Trigger fired: OUTPUT_LENGTH_FULFILLED
2024-08-23 00:34:43,053 DEBUG [NettyReaderExecutor-12860](LocalCacheManager.java:504) - Page PageId{FileId=ebf731c6c100e4b2cb802f7701be3f8bfe6443f914e941f970374026e27d9e52, PageIndex=0} is unavailable to evict, likely due to a benign race
2024-08-23 00:34:43,054 DEBUG [NettyReaderExecutor-12860](LocalCacheManager.java:430) - putInternal(PageId{FileId=b2f20af747c70dea2c2b3a220e5bdb28e12ce0a3e94fd7df095584cfd283a9f9, PageIndex=0},1048576 bytes) enters
2024-08-23 00:34:43,053 DEBUG [NettyReaderExecutor-12795](LocalCacheManager.java:338) - put(PageId{FileId=eb9ea488437311a93e0aead7398d0892df232298485b5a814407d8250f575105, PageIndex=0},0 bytes) exits: false
2024-08-23 00:34:43,054 DEBUG [NettyReaderExecutor-12795](DebugLoggingTracer.java:42) - State transitioned from CHANNEL_IDLE to READING_DATA because of trigger REQUEST_RECEIVED
2024-08-23 00:34:43,055 DEBUG [NettyReaderExecutor-12795](DebugLoggingTracer.java:37) - Trigger fired: DATA_AVAILABLE
2024-08-23 00:34:43,055 DEBUG [NettyReaderExecutor-12795](DebugLoggingTracer.java:42) - State transitioned from READING_DATA to SENDING_DATA because of trigger DATA_AVAILABLE
2024-08-23 00:34:43,055 DEBUG [NettyReaderExecutor-12795](DebugLoggingTracer.java:37) - Trigger fired: OUTPUT_LENGTH_FULFILLED
2024-08-23 00:34:43,052 DEBUG [NettyReaderExecutor-12598](DebugLoggingTracer.java:42) - State transitioned from COMPLETING_REQUEST to CHANNEL_IDLE because of trigger REQUEST_COMPLETED
2024-08-23 00:34:43,055 DEBUG [NettyReaderExecutor-12598](DebugLoggingTracer.java:37) - Trigger fired: REQUEST_RECEIVED
2024-08-23 00:34:43,055 DEBUG [NettyReaderExecutor-12795](DebugLoggingTracer.java:42) - State transitioned from SENDING_DATA to EOF because of trigger OUTPUT_LENGTH_FULFILLED
2024-08-23 00:34:43,055 DEBUG [NettyReaderExecutor-12795](DebugLoggingTracer.java:37) - Trigger fired: COMPLETE_REQUES
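
Reading the numbers in those ERROR lines: the cache has 488064465 - 487142919 = 921546 bytes free, which is less than either page being written (1042648 and 1048576 bytes), so eviction is required; but the evictor hands back pages that a concurrent request has already evicted, so the puts fail.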
CharlesBuell commented 2 months ago

@jja725 Hello, do you need any more information for this issue? Would it be possible to fix this through alluxio-site.properties? Thanks.

YichuanSun commented 2 months ago

How about setting a bigger cache size in alluxio-site.properties?

YichuanSun commented 2 months ago

Try allocating more cache: modify alluxio.worker.page.store.sizes=<a bigger size>; the default size is 512MB.
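
For example, in alluxio-site.properties (the 4GB value is only illustrative; size it to your dataset and workload):

alluxio.worker.page.store.sizes=4GB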

CharlesBuell commented 1 month ago

Increasing it does help a little with how many queries are able to complete, but it doesn't completely solve the issue. Only once I increase the cache to around the size of the dataset are all the queries able to complete.