apple / swift-distributed-actors

Peer-to-peer cluster implementation for Swift Distributed Actors
https://apple.github.io/swift-distributed-actors/
Apache License 2.0

Crash when using linux+asan+static-swift-stdlib+swift 5.8 #1118

Closed · mannuch closed this 1 year ago

mannuch commented 1 year ago

Hello!

Ran into an issue when running the service in a release configuration on Linux via Docker.

After some digging, I believe I've isolated it to the point where the cluster system is initialized. I have a reproduction of the issue with a simple main.swift:

import DistributedCluster

// Initializing the cluster system alone is enough to trigger the crash.
let clusterSystem = await ClusterSystem("TestRunCluster")

// Keep the process alive long enough for the crash to surface.
try await Task.sleep(for: .seconds(5))

When running with Backtrace installed, I get the following:

Received signal 11. Backtrace:
2023-04-23T07:46:26+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] ClusterSystem [TestRunCluster] initialized, listening on: sact://TestRunCluster@127.0.0.1:7337: _ActorRef<ClusterShell.Message>(/system/cluster)
2023-04-23T07:46:26+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Setting in effect: .autoLeaderElection: LeadershipSelectionSettings(underlying: DistributedCluster.ClusterSystemSettings.LeadershipSelectionSettings.(unknown context at $aaaad5a3b1dc)._LeadershipSelectionSettings.lowestReachable(minNumberOfMembers: 2))
2023-04-23T07:46:26+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Setting in effect: .downingStrategy: DowningStrategySettings(underlying: DistributedCluster.DowningStrategySettings.(unknown context at $aaaad5a3979c)._DowningStrategySettings.timeout(DistributedCluster.TimeoutBasedDowningStrategySettings(downUnreachableMembersAfter: 1.0 seconds)))
2023-04-23T07:46:26+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Setting in effect: .onDownAction: OnDownActionStrategySettings(underlying: DistributedCluster.OnDownActionStrategySettings.(unknown context at $aaaad5a3971c)._OnDownActionStrategySettings.gracefulShutdown(delay: 3.0 seconds))
2023-04-23T07:46:26+0000 info TestRunCluster : actor/path=/system/cluster cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Binding to: [sact://TestRunCluster@127.0.0.1:7337]
2023-04-23T07:46:26+0000 info TestRunCluster : actor/path=/system/cluster/leadership cluster/node=sact://TestRunCluster@127.0.0.1:7337 leadership/election=DistributedCluster.Leadership.LowestReachableMember [DistributedCluster] Not enough members [1/2] to run election, members: [Member(sact://TestRunCluster:2481186327279040895@127.0.0.1:7337, status: joining, reachability: reachable)]
2023-04-23T07:46:26+0000 info TestRunCluster : actor/path=/system/cluster cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Bound to [IPv4]127.0.0.1/127.0.0.1:7337
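
For context, the Backtrace mentioned above is presumably the swift-server/swift-backtrace package; a minimal sketch of how it is typically hooked up in a reproducer like this (assuming that dependency is declared in Package.swift) looks like:

import Backtrace

// Install swift-backtrace's signal handlers at the very top of main.swift,
// before the ClusterSystem is created, so a crash prints the trace above.
Backtrace.install()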

Since the backtrace only showed the process receiving signal 11, I tried building with AddressSanitizer to see if I could get more information, which gave me:

AddressSanitizer:DEADLYSIGNAL
=================================================================
==1==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x000000000000 bp 0xffff819de570 sp 0xffff819de560 T3)
==1==Hint: pc points to the zero page.
==1==The signal is caused by a READ memory access.
==1==Hint: address points to the zero page.
2023-04-23T18:49:29+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] ClusterSystem [TestRunCluster] initialized, listening on: sact://TestRunCluster@127.0.0.1:7337: _ActorRef<ClusterShell.Message>(/system/cluster)
    #0 0x0  (<unknown module>)
    #1 0xaaaac2c62014  (/CrashingCluster+0x1e82014)
    #2 0xaaaac2c62754  (/CrashingCluster+0x1e82754)
    #3 0xaaaac2c2008c  (/CrashingCluster+0x1e4008c)
    #4 0xaaaac2c1fdf4  (/CrashingCluster+0x1e3fdf4)
    #5 0xaaaac2c2c098  (/CrashingCluster+0x1e4c098)
    #6 0xffff85f7d5c4  (/lib/aarch64-linux-gnu/libc.so.6+0x7d5c4) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)
    #7 0xffff85fe5d18  (/lib/aarch64-linux-gnu/libc.so.6+0xe5d18) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)

AddressSanitizer can not provide additional info.
2023-04-23T18:49:29+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Setting in effect: .autoLeaderElection: LeadershipSelectionSettings(underlying: DistributedCluster.ClusterSystemSettings.LeadershipSelectionSettings.(unknown context at $aaaac374b1dc)._LeadershipSelectionSettings.lowestReachable(minNumberOfMembers: 2))
SUMMARY: AddressSanitizer: SEGV (<unknown module>)
Thread T3 created by T1 here:
    #0 0xaaaac149fb68  (/CrashingCluster+0x6bfb68)
    #1 0xaaaac2c28478  (/CrashingCluster+0x1e48478)
    #2 0xaaaac2c2b694  (/CrashingCluster+0x1e4b694)
    #3 0xaaaac2c24c04  (/CrashingCluster+0x1e44c04)
    #4 0xaaaac2c2c098  (/CrashingCluster+0x1e4c098)
    #5 0xffff85f7d5c4  (/lib/aarch64-linux-gnu/libc.so.6+0x7d5c4) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)
    #6 0xffff85fe5d18  (/lib/aarch64-linux-gnu/libc.so.6+0xe5d18) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)

Thread T1 created by T0 here:
2023-04-23T18:49:29+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Setting in effect: .downingStrategy: DowningStrategySettings(underlying: DistributedCluster.DowningStrategySettings.(unknown context at $aaaac374979c)._DowningStrategySettings.timeout(DistributedCluster.TimeoutBasedDowningStrategySettings(downUnreachableMembersAfter: 1.0 seconds)))
    #0 0xaaaac149fb68  (/CrashingCluster+0x6bfb68)
    #1 0xaaaac2c28478  (/CrashingCluster+0x1e48478)
    #2 0xaaaac2c634cc  (/CrashingCluster+0x1e834cc)
    #3 0xaaaac2c6293c  (/CrashingCluster+0x1e8293c)
    #4 0xaaaac2c62014  (/CrashingCluster+0x1e82014)
    #5 0xaaaac2c62754  (/CrashingCluster+0x1e82754)
    #6 0xaaaac18ce5b4  (/CrashingCluster+0xaee5b4)
    #7 0xffff85f273f8  (/lib/aarch64-linux-gnu/libc.so.6+0x273f8) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)
    #8 0xffff85f274c8  (/lib/aarch64-linux-gnu/libc.so.6+0x274c8) (BuildId: f37f3aa07c797e333fd106472898d361f71798f5)
    #9 0xaaaac143efac  (/CrashingCluster+0x65efac)

2023-04-23T18:49:29+0000 info TestRunCluster : cluster/node=sact://TestRunCluster@127.0.0.1:7337 [DistributedCluster] Setting in effect: .onDownAction: OnDownActionStrategySettings(underlying: DistributedCluster.OnDownActionStrategySettings.(unknown context at $aaaac374971c)._OnDownActionStrategySettings.gracefulShutdown(delay: 3.0 seconds))
==1==ABORTING

As far as I can tell, the problem only seems to arise when running on Linux with this Dockerfile:

# ================================
# Build image
# ================================
FROM swift:5.8-jammy as builder

RUN mkdir /workspace
WORKDIR /workspace

COPY . /workspace

RUN swift build --sanitize=address -c release -Xswiftc -g --static-swift-stdlib

# ================================
# Run image
# ================================
FROM ubuntu:jammy

COPY --from=builder /workspace/.build/release/CrashingCluster /

EXPOSE 7337

ENTRYPOINT ["./CrashingCluster"]

This reproduction, along with the Dockerfile, can be found in this repo, if it helps.

Thanks for all the work on this!

mannuch commented 1 year ago

However, when running with bind mounts to the local filesystem via

docker run -v "$PWD:/code" -w /code swift:latest swift run -c release

both my original application and the reproducer linked above appear to work.

ktoso commented 1 year ago

Thanks for the bug report!

We continued looking into this and strongly suspect that this is a bug with Swift 5.8 and --static-swift-stdlib together with asan (address sanitizer).

I'll quadruple check some more but that's our strong suspicion so far.

It also does not reproduce on Swift 5.9, and we suspect this PR might be the fix for it: https://github.com/apple/swift/pull/65254

mannuch commented 1 year ago

Ahh okay, got it. Running the Swift 5.8 container without --static-swift-stdlib seems to avoid the issue, so I'll go with that for now!
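
For reference, if the multi-stage Dockerfile above is kept, the equivalent change is just dropping that flag from the build step (a sketch; without --static-swift-stdlib the binary links the Swift runtime dynamically, so the run stage also needs the Swift runtime libraries available, e.g. from a Swift slim base image rather than plain ubuntu:jammy):

RUN swift build --sanitize=address -c release -Xswiftc -g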

Thanks for the timely help with this!

ktoso commented 1 year ago

Thanks for confirming. I'll close this, as I believe it is a static linking issue with the concurrency library in general.