apache / incubator-uniffle

Uniffle is a high performance, general purpose Remote Shuffle Service.
https://uniffle.apache.org/
Apache License 2.0

[Bug] ShuffleTaskManager.commitShuffle will get stuck forever if an exception occurs during the flush process #1863

Open rickyma opened 3 months ago

rickyma commented 3 months ago

Describe the bug

[Screenshot attached to the original issue; image not preserved]

Affects Version(s)

master

Uniffle Server Log Output

jstack:

"Grpc-1788" #2073 daemon prio=5 os_prio=0 cpu=1723.11ms elapsed=88729.16s tid=0x00007f3d3c0f1000 nid=0x968 waiting for monitor entry [0x00007f3cf97fe000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.uniffle.server.ShuffleTaskManager.commitShuffle(ShuffleTaskManager.java:338)
        - waiting to lock <0x00007f4fbf708e00> (a java.lang.Object)
        at org.apache.uniffle.server.ShuffleServerGrpcService.finishShuffle(ShuffleServerGrpcService.java:468)
        at org.apache.uniffle.proto.ShuffleServerGrpc$MethodHandlers.invoke(ShuffleServerGrpc.java:1060)
        at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
        at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
        at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
        at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:356)
        at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:861)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

"Grpc-1359" #1629 daemon prio=5 os_prio=0 cpu=5536.44ms elapsed=88733.96s tid=0x00007f4380185800 nid=0x7ac waiting on condition [0x00007f41156fe000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.uniffle.server.ShuffleTaskManager.commitShuffle(ShuffleTaskManager.java:360)
        - locked <0x00007f4fbf708e00> (a java.lang.Object)
        at org.apache.uniffle.server.ShuffleServerGrpcService.finishShuffle(ShuffleServerGrpcService.java:468)
        at org.apache.uniffle.proto.ShuffleServerGrpc$MethodHandlers.invoke(ShuffleServerGrpc.java:1060)
        at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
        at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
        at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
        at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:356)
        at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:861)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

exception log:

[2024-07-03 08:54:32.973] [HadoopFlushEventThreadPool-1] [WARN] SingleStorageManager.write - Exception happened when write data for ShuffleDataFlushEvent: eventId=252896, appId=application_1716779728283_6825960_1719966578466, shuffleId=0, startPartition=315, endPartition=315, retryTimes=0, underStorage=HadoopStorage, isPended=false, ownedByHugePartition=false, try again
org.apache.uniffle.common.exception.RssException: java.io.IOException: All datanodes [DatanodeInfoWithStorage[127.0.0.1:9003,DS-3ad04d12-7d78-405f-ba33-d2bb706f073d,DISK]] are bad. Aborting...
        at org.apache.uniffle.storage.handler.impl.HadoopShuffleWriteHandler.write(HadoopShuffleWriteHandler.java:157)
        at org.apache.uniffle.storage.handler.impl.PooledHadoopShuffleWriteHandler.write(PooledHadoopShuffleWriteHandler.java:122)
        at org.apache.uniffle.server.storage.SingleStorageManager.write(SingleStorageManager.java:59)
        at org.apache.uniffle.server.storage.HybridStorageManager.write(HybridStorageManager.java:130)
        at org.apache.uniffle.server.ShuffleFlushManager.processFlushEvent(ShuffleFlushManager.java:165)
        at org.apache.uniffle.server.DefaultFlushEventHandler.handleEventAndUpdateMetrics(DefaultFlushEventHandler.java:97)
        at org.apache.uniffle.server.DefaultFlushEventHandler.lambda$dispatchEvent$0(DefaultFlushEventHandler.java:219)
        at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.IOException: All datanodes [DatanodeInfoWithStorage[127.0.0.1:9003,DS-3ad04d12-7d78-405f-ba33-d2bb706f073d,DISK]] are bad. Aborting...
        at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1567)
        at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1501)
        at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1487)
        at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1262)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:673)

Uniffle Engine Log Output

No response

Uniffle Server Configurations

No response

Uniffle Engine Configurations

No response

Additional context

No response

sahibamatta commented 3 months ago

Hi @rickyma, I'm willing to contribute to this. May I raise a PR if that's OK with you? Thanks!

rickyma commented 3 months ago

Sure. I'll assign this to you. @sahibamatta

sahibamatta commented 3 months ago

Hi @rickyma, I've raised a PR for this based on my understanding of the issue 😅. For now, it only handles the exception thrown from the write method, as shown in the screenshot above. Please let me know whether we need to handle other parts of the processFlushEvent method as well. Also, feel free to point out any gaps in my understanding. Thanks!
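
To make the failure mode concrete, here is a minimal, self-contained sketch of the pattern the jstack and the exception log suggest: a commit thread polls a flushed-block counter while a flush thread hits a write error. This is not Uniffle's actual code. `FlushState`, `Writer`, the failure flag, and the timeout are hypothetical names introduced for illustration, and the sketch omits the lock that the real `commitShuffle` holds while sleeping. The idea is that the flush side catches the write exception and records it, so the waiting side can abort instead of sleeping indefinitely.

```java
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

public class CommitWaitSketch {

  /** Hypothetical per-shuffle state; not Uniffle's real data structure. */
  static final class FlushState {
    final AtomicInteger flushedBlocks = new AtomicInteger();
    final AtomicReference<Throwable> failure = new AtomicReference<>();
  }

  /** Stand-in for the storage write; it may throw, like the HDFS write in the log. */
  interface Writer {
    void write() throws Exception;
  }

  /** Flush side: catch the write exception and record it instead of losing it. */
  static void processFlushEvent(FlushState state, Writer writer) {
    try {
      writer.write();
      state.flushedBlocks.incrementAndGet();
    } catch (Exception e) {
      // Keep only the first failure so the waiter can report the root cause.
      state.failure.compareAndSet(null, e);
    }
  }

  /** Commit side: poll until everything is flushed, a failure is recorded, or time runs out. */
  static int commitShuffle(FlushState state, int expectedBlocks, long timeoutMs)
      throws InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (state.flushedBlocks.get() < expectedBlocks) {
      Throwable t = state.failure.get();
      if (t != null) {
        throw new IllegalStateException("flush failed, aborting commit", t);
      }
      if (System.nanoTime() - deadline > 0) {
        throw new IllegalStateException("commit timed out");
      }
      Thread.sleep(10);
    }
    return state.flushedBlocks.get();
  }

  public static void main(String[] args) throws Exception {
    FlushState state = new FlushState();
    ExecutorService pool = Executors.newSingleThreadExecutor();
    // Simulate a flush whose write fails, as in the exception log above.
    pool.submit(() -> processFlushEvent(state, () -> {
      throw new IOException("All datanodes are bad");
    }));
    try {
      commitShuffle(state, 1, 5_000);
      System.out.println("committed");
    } catch (IllegalStateException e) {
      System.out.println("commit aborted: " + e.getMessage());
    } finally {
      pool.shutdown();
    }
  }
}
```

Without the failure check and the deadline, the loop in this sketch would keep sleeping once the write fails, because the counter never reaches the expected value. Whether the real fix should fail the commit, retry the flush, or both is a design decision for the PR.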