SwiftPackageIndex / SwiftPackageIndex-Server

The Swift Package Index is the place to find Swift packages!
https://swiftpackageindex.com
Apache License 2.0
526 stars 40 forks source link

Occasional prod crashes #3115

Closed finestructure closed 3 weeks ago

finestructure commented 2 months ago

We do see server restarts occasionally (a couple of times per day perhaps, across 3 prod nodes).

Example backtrace below, there's nothing I can glean from it.

Backtrace settings are:

enable=yes,images=all,interactive=no,output-to=stderr,registers=all,sanitize=no,swift-backtrace=./swift-backtrace-static,threads=all,timeout=0s,unwind=fast
2024-06-06 01:33:46.128 *** Signal 4: Backtracing from 0x55b0ef753c00... done ***

2024-06-06 01:33:46.128 

2024-06-06 01:33:46.128 *** Program crashed: Illegal instruction at 0x000055b0ef753c00 ***

2024-06-06 01:33:46.128 

2024-06-06 01:33:46.128 Thread 11 "NIO-SGLTN-0-#3" crashed:

2024-06-06 01:33:46.128 

2024-06-06 01:33:46.129 rax 0x000055b0f0770c88  d0 3b 75 ef b0 55 00 00 00 3c 75 ef b0 55 00 00  Ð;uï°U···<uï°U··

2024-06-06 01:33:46.129 rdx 0x00007f125d411000  68 b7 76 f0 b0 55 00 00 01 00 00 01 00 00 00 00  h·vð°U··········

2024-06-06 01:33:46.129 rcx 0x0000000000000000  0

2024-06-06 01:33:46.129 rbx 0x0000000000000001  1

2024-06-06 01:33:46.129 rsi 0x00007f12655ddd60  18 f4 76 f0 b0 55 00 00 90 09 78 f0 b0 55 00 00  ·ôvð°U····xð°U··

2024-06-06 01:33:46.129 rdi 0x00007f125d375000  18 f4 76 f0 b0 55 00 00 90 09 78 f0 b0 55 00 00  ·ôvð°U····xð°U··

2024-06-06 01:33:46.129 rbp 0x00007f125cc31400  08 09 77 f0 b0 55 00 00 00 6e 00 00 00 00 00 00  ··wð°U···n······

2024-06-06 01:33:46.129 rsp 0x00007f1249ff30b8  31 9b 68 ef b0 55 00 00 08 09 77 f0 b0 55 00 00  1·hï°U····wð°U··

2024-06-06 01:33:46.129  r8 0x00007f12656008c0  c8 00 00 00 c8 00 00 00 c8 00 00 00 c8 00 00 00  È···È···È···È···

2024-06-06 01:33:46.129  r9 0x0000000000000000  0

2024-06-06 01:33:46.129 r10 0x0000000000000040  64

2024-06-06 01:33:46.130 r11 0x00007f1249ff7640  40 76 ff 49 12 7f 00 00 90 61 27 64 12 7f 00 00  @vÿI·····a'd····

2024-06-06 01:33:46.130 r12 0x00007f1249ff4db4  00 00 00 00 80 df 50 5d 12 7f 00 00 00 00 00 00  ·····ßP]········

2024-06-06 01:33:46.130 r13 0x00007f1249ff3890  65 6e 5f 55 53 5f 50 4f 53 49 58 00 00 00 00 00  en_US_POSIX·····

2024-06-06 01:33:46.130 r14 0x00007f125d375000  18 f4 76 f0 b0 55 00 00 90 09 78 f0 b0 55 00 00  ·ôvð°U····xð°U··

2024-06-06 01:33:46.130 r15 0x00007f1249ff3168  90 09 78 f0 b0 55 00 00 e8 00 00 00 00 00 00 00  ··xð°U··è·······

2024-06-06 01:33:46.130 rip 0x000055b0ef753c00  0f 0b 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00  ··fffff.········

2024-06-06 01:33:46.130 

2024-06-06 01:33:46.130 rflags 0x0000000000010206  PF

2024-06-06 01:33:46.130 

2024-06-06 01:33:46.130 cs 0x0033  fs 0x0000  gs 0x0000

2024-06-06 01:33:46.130 

2024-06-06 01:33:46.130 0  0x000055b0ef753c00 icu_69_swift::TimeZone::~TimeZone() in Run

2024-06-06 01:33:46.130 

2024-06-06 01:33:46.130 

2024-06-06 01:33:46.130 Images:

2024-06-06 01:33:46.130 

2024-06-06 01:33:46.131 0x000055b0ec466000–0x000055b0f05af0a8 <no build ID>                            Run                      /app/Run

2024-06-06 01:33:46.131 0x00007f1265891000–0x00007f12658993a1 59c2a6b204f74f358ca7711d2dfd349d88711f6a libffi.so.8.1.0          /usr/lib/x86_64-linux-gnu/libffi.so.8.1.0

2024-06-06 01:33:46.131 0x00007f126589e000–0x00007f12658aa239 4177015ccbe670995ed128899633b71e6ca397ef libresolv.so.2           /usr/lib/x86_64-linux-gnu/libresolv.so.2

2024-06-06 01:33:46.131 0x00007f12658b2000–0x00007f12658b5411 ff27227afa5eeddccab180dd29bd7fcff94aea7c libkeyutils.so.1.9       /usr/lib/x86_64-linux-gnu/libkeyutils.so.1.9

2024-06-06 01:33:46.131 0x00007f12658b9000–0x00007f12658c9219 2fde6ecb43c586fe4077118f771077aa1298e7ea libtasn1.so.6.6.2        /usr/lib/x86_64-linux-gnu/libtasn1.so.6.6.2

2024-06-06 01:33:46.131 0x00007f12658d3000–0x00007f126599f3ad a0ffe1d002de5812dc718186172efb78604ddf2c libp11-kit.so.0.3.0      /usr/lib/x86_64-linux-gnu/libp11-kit.so.0.3.0

2024-06-06 01:33:46.131 0x00007f1265a0e000–0x00007f1265a0f5c5 43a72967cf84155914c8b3e915926733d1e57c11 libbrotlicommon.so.1.0.9 /usr/lib/x86_64-linux-gnu/libbrotlicommon.so.1.0.9

2024-06-06 01:33:46.131 0x00007f1265a31000–0x00007f1265a440f1 b8a924f277180f8743a0c6463a379b1a4ad1aae5 libsasl2.so.2.0.25       /usr/lib/x86_64-linux-gnu/libsasl2.so.2.0.25

2024-06-06 01:33:46.131 0x00007f1265a4c000–0x00007f1265a54a31 384ac9986064cbae89ac8b2e33eb3604b5ab0395 libkrb5support.so.0.1    /usr/lib/x86_64-linux-gnu/libkrb5support.so.0.1

2024-06-06 01:33:46.131 0x00007f1265a5a000–0x00007f1265a5cf69 ce0901f10854b3c9276066b98d9a72303206e0d5 libcom_err.so.2.1        /usr/lib/x86_64-linux-gnu/libcom_err.so.2.1

2024-06-06 01:33:46.131 0x00007f1265a62000–0x00007f1265a80601 f7765f61555fd50c7960ea84cbb9609e7062a667 libk5crypto.so.3.1       /usr/lib/x86_64-linux-gnu/libk5crypto.so.3.1

2024-06-06 01:33:46.131 0x00007f1265a91000–0x00007f1265b0eb85 fa7fe8222982647202746ba1275861042f966623 libkrb5.so.3.3           /usr/lib/x86_64-linux-gnu/libkrb5.so.3.3

2024-06-06 01:33:46.131 0x00007f1265b5c000–0x00007f1265bc455d f110719303ddbea25a5e89ff730fec520eed67b0 libgmp.so.10.4.1         /usr/lib/x86_64-linux-gnu/libgmp.so.10.4.1

2024-06-06 01:33:46.131 0x00007f1265bde000–0x00007f1265c0a7e9 3d9c6bf106ef53d625b7b1c8bb1300e84598a74a libnettle.so.8.4         /usr/lib/x86_64-linux-gnu/libnettle.so.8.4

2024-06-06 01:33:46.131 0x00007f1265c24000–0x00007f1265c3eb99 3cc4a3474de72db89e9dcc93bfb95fe377f48c37 libhogweed.so.6.4        /usr/lib/x86_64-linux-gnu/libhogweed.so.6.4

2024-06-06 01:33:46.131 0x00007f1265c6e000–0x00007f1265dc79cd a4f5895a25b60e67c635e4a627c0c832f408707a libgnutls.so.30.31.0     /usr/lib/x86_64-linux-gnu/libgnutls.so.30.31.0

2024-06-06 01:33:46.131 0x00007f1265e59000–0x00007f1265e9fb66 ca5149da8d5a298b8f286ffca3d6e2402ec0fe01 libunistring.so.2.2.0    /usr/lib/x86_64-linux-gnu/libunistring.so.2.2.0

2024-06-06 01:33:46.131 0x00007f1266003000–0x00007f126600b0e9 4b1f390dd6e24d49684db8b2443d082379e8e977 libbrotlidec.so.1.0.9    /usr/lib/x86_64-linux-gnu/libbrotlidec.so.1.0.9

2024-06-06 01:33:46.131 0x00007f1266011000–0x00007f12660ccf11 5d9d0d946a3154a748e87e17af9d14764519237b libzstd.so.1.4.8         /usr/lib/x86_64-linux-gnu/libzstd.so.1.4.8

2024-06-06 01:33:46.131 0x00007f12660e0000–0x00007f12660ea351 9a6051abba09b299502026bc0668696c2321f743 liblber-2.5.so.0.1.11    /usr/lib/x86_64-linux-gnu/liblber-2.5.so.0.1.11

2024-06-06 01:33:46.131 0x00007f12660f3000–0x00007f126613ae19 bd98d8e3f3b07515a88085a691678fafba88763e libldap-2.5.so.0.1.11    /usr/lib/x86_64-linux-gnu/libldap-2.5.so.0.1.11

2024-06-06 01:33:46.131 0x00007f1266152000–0x00007f1266194935 0204867906c20c8d6c55fb2a7a9950fa96c34b53 libgssapi_krb5.so.2.2    /usr/lib/x86_64-linux-gnu/libgssapi_krb5.so.2.2

2024-06-06 01:33:46.131 0x00007f12661a6000–0x00007f12664b6e31 ca84e22a798dabf117de600cb13469a59f775d2a libcrypto.so.3           /usr/lib/x86_64-linux-gnu/libcrypto.so.3

2024-06-06 01:33:46.131 0x00007f12665ea000–0x00007f1266662eed 9e2ad2d446d1e0e442154f9df9a3daf6a04bd645 libssl.so.3              /usr/lib/x86_64-linux-gnu/libssl.so.3

2024-06-06 01:33:46.131 0x00007f126668e000–0x00007f1266691eb1 2b1afc1a3bc8bdb016e432c50db058632e7895b9 libpsl.so.5.3.2          /usr/lib/x86_64-linux-gnu/libpsl.so.5.3.2

2024-06-06 01:33:46.131 0x00007f12666a2000–0x00007f12666f1f25 b7fd78c4b3609dd386755b09bba2fb6b096d906b libssh.so.4.8.7          /usr/lib/x86_64-linux-gnu/libssh.so.4.8.7

2024-06-06 01:33:46.131 0x00007f1266711000–0x00007f1266725fe9 9517ef375cd71ea3da824b4118f1599735093d66 librtmp.so.1             /usr/lib/x86_64-linux-gnu/librtmp.so.1

2024-06-06 01:33:46.131 0x00007f1266730000–0x00007f1266735589 45b73e0e1c46a76be22f572ee98c60af5768bf8f libidn2.so.0.3.7         /usr/lib/x86_64-linux-gnu/libidn2.so.0.3.7

2024-06-06 01:33:46.131 0x00007f1266751000–0x00007f126676a755 7eb36cd3bf6193f1814e364624f49de6fbbab3ac libnghttp2.so.14.20.1    /usr/lib/x86_64-linux-gnu/libnghttp2.so.14.20.1

2024-06-06 01:33:46.131 0x00007f126677b000–0x00007f1266937341 c289da5071a3399de893d2af81d6a30c62646e1e libc.so.6                /usr/lib/x86_64-linux-gnu/libc.so.6

2024-06-06 01:33:46.131 0x00007f12669a4000–0x00007f12669bd815 e3a44e0da9c6e835d293ed8fd2882b4c4a87130c libgcc_s.so.1            /usr/lib/x86_64-linux-gnu/libgcc_s.so.1

2024-06-06 01:33:46.131 0x00007f12669c4000–0x00007f1266a6663d ffdb577fa78a283f51714728e64d35d30bac41a9 libjemalloc.so.2         /usr/lib/x86_64-linux-gnu/libjemalloc.so.2

2024-06-06 01:33:46.131 0x00007f1266ca0000–0x00007f1266cb2c21 30840b79ac329ecbf1dec0bb60180eed256d319f libz.so.1.2.11           /usr/lib/x86_64-linux-gnu/libz.so.1.2.11

2024-06-06 01:33:46.131 0x00007f1266cbc000–0x00007f1266e66fd1 e37fe1a879783838de78cbc8c80621fa685d58a2 libstdc++.so.6.0.30      /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30

2024-06-06 01:33:46.131 0x00007f1266ee8000–0x00007f1266f718d5 a88ef0199bd5e742ebd0c359edf5cb2be0ec08b5 libm.so.6                /usr/lib/x86_64-linux-gnu/libm.so.6

2024-06-06 01:33:46.131 0x00007f1266fcf000–0x00007f1267052261 80efbebb09b1157e61b3ad983da1ec19e6acdfad libcurl.so.4.7.0         /usr/lib/x86_64-linux-gnu/libcurl.so.4.7.0

2024-06-06 01:33:46.131 0x00007f1267076000–0x00007f126717f4dd 87d00c07f9d8a822a4251d4f053484e5c0e4ead8 libsqlite3.so.0.8.6      /usr/lib/x86_64-linux-gnu/libsqlite3.so.0.8.6

2024-06-06 01:33:46.131 0x00007f12671c8000–0x00007f12671f3315 15921ea631d9f36502d20459c43e5c85b7d6ab76 ld-linux-x86-64.so.2     /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2

2024-06-06 01:33:46.131 

2024-06-06 01:33:46.131 Backtrace took 76.22s

2024-06-06 01:33:46.131 

2024-06-06 01:33:53.237 [ NOTICE ] Server starting on http://0.0.0.0:80 [component: server]

2024-06-06 01:37:42.851 [ NOTICE ] Server starting on http://0.0.0.0:80 [component: server]
finestructure commented 2 months ago

Another one from last night:

  |   | 2024-06-06 00:33:47.400 | *** Signal 11: Backtracing from 0x56187107d031... done *** |  
  |   | 2024-06-06 00:33:47.400 |   |  
  |   | 2024-06-06 00:33:47.400 | *** Program crashed: Bad pointer dereference at 0x0000000000000008 *** |  
  |   | 2024-06-06 00:33:47.400 |   |  
  |   | 2024-06-06 00:33:47.401 | Thread 10 "NIO-SGLTN-0-#1" crashed: |  
  |   | 2024-06-06 00:33:47.401 |   |  
  |   | 2024-06-06 00:33:47.401 | rax 0x0000000000000000  0 |  
  |   | 2024-06-06 00:33:47.401 | rdx 0x00007f4239df0900  98 ce 15 72 18 56 00 00 01 00 00 01 00 00 00 00  ·Î·r·V·········· |  
  |   | 2024-06-06 00:33:47.401 | rcx 0x0000000000000003  3 |  
  |   | 2024-06-06 00:33:47.401 | rbx 0xffffffffffffe708  18446744073709545224 |  
  |   | 2024-06-06 00:33:47.401 | rsi 0x00007f423c496f40  00 00 00 00 00 00 00 00 80 ce 10 44 42 7f 00 00  ·········Î·DB··· |  
  |   | 2024-06-06 00:33:47.401 | rdi 0x00007f42269f5f40  c9 00 00 00 e4 00 00 00 ff ff ff ff 18 00 00 00  É···ä···ÿÿÿÿ···· |  
  |   | 2024-06-06 00:33:47.401 | rbp 0x00007f42269f4e20  00 00 00 00 00 00 00 00 83 1e cf 70 18 56 00 00  ··········Ïp·V·· |  
  |   | 2024-06-06 00:33:47.401 | rsp 0x00007f42269f30a0  00 09 df 39 42 7f 00 00 59 8b 0c 71 18 56 00 00  ··ß9B···Y··q·V·· |  
  |   | 2024-06-06 00:33:47.402 | r8 0x00007f42442008c0  c8 00 00 00 c8 00 00 00 c8 00 00 00 c8 00 00 00  È···È···È···È··· |  
  |   | 2024-06-06 00:33:47.402 | r9 0x0000000000000500  1280 |  
  |   | 2024-06-06 00:33:47.402 | r10 0x0000000000000018  24 |  
  |   | 2024-06-06 00:33:47.402 | r11 0x00007f42269f7640  40 76 9f 26 42 7f 00 00 10 a5 62 3a 42 7f 00 00  @v·&B····¥b:B··· |  
  |   | 2024-06-06 00:33:47.402 | r12 0x00000000ffffffff  4294967295 |  
  |   | 2024-06-06 00:33:47.402 | r13 0x00007f42269f38d0  65 6e 5f 55 53 5f 50 4f 53 49 58 00 00 00 00 00  en_US_POSIX····· |  
  |   | 2024-06-06 00:33:47.402 | r14 0x00007f42269f31a8  00 90 16 72 18 56 00 00 65 6e 00 00 00 00 00 00  ···r·V··en······ |  
  |   | 2024-06-06 00:33:47.402 | r15 0x00000000ffffffff  4294967295 |  
  |   | 2024-06-06 00:33:47.402 | rip 0x000056187107d031  ff 50 08 48 89 df 5b e9 33 3a 1f 00 0f 1f 00 0f  ÿP·H·ß[é3:······ |  
  |   | 2024-06-06 00:33:47.402 |   |  
  |   | 2024-06-06 00:33:47.402 | rflags 0x0000000000010202 |  
  |   | 2024-06-06 00:33:47.402 |   |  
  |   | 2024-06-06 00:33:47.402 | cs 0x0033  fs 0x0000  gs 0x0000 |  
  |   | 2024-06-06 00:33:47.402 |   |  
  |   | 2024-06-06 00:33:47.402 | 0  0x000056187107d031 icu_69_swift::Calendar::~Calendar() + 33 in Run |  
  |   | 2024-06-06 00:33:47.402 |   |  
  |   | 2024-06-06 00:33:47.402 |  

I notice

0x000055b0ef753c00 icu_69_swift::TimeZone::~TimeZone() in Run

in the first and

0x000056187107d031 icu_69_swift::Calendar::~Calendar() + 33 in Run |  

in the second. Both referencing icu_69_swift 🤔

weissi commented 2 months ago

The crash is in Swift's ICU copy, almost certainly because Foundation was reaching into ICU. I reckon the reason you don't get backtraces is presumably because Foundation hasn't been compiled with frame pointers, @al45tair is this expected? @finestructure what Swift version is this and where is it from?

@finestructure As a pretty uninformed guess this is likely because some object in corelibs-foundation isn't thread safe that should be thread safe. For example could it be this? https://github.com/apple/swift-corelibs-foundation/issues/3166

I assume you're using DateFormatters somewhere, are you sure you're not sharing them across threads?

finestructure commented 2 months ago

Thanks for the quick response, @weissi !

This is swift:5.10.0-jammy on x86.

We do use DateFormatters in a couple of places, let me double check their situation.

finestructure commented 2 months ago

I don't think we're sharing formatters, unless the following pattern is problematic

private extension DateFormatter {
    static let iso8601Full: DateFormatter = {
        let formatter = DateFormatter()
        formatter.dateFormat = "yyyy-MM-dd'T'HH:mm:ss.SSSZZZZZ"
        formatter.calendar = Calendar(identifier: .iso8601)
        formatter.timeZone = TimeZone(secondsFromGMT: 0)
        return formatter
    }()
}

My understanding is that the static initialisation is guarded by something like dispatch_once and ok - or could that in fact be the culprit? We're using that pattern quite a bit.

weissi commented 2 months ago

I don't think we're sharing formatters, unless the following pattern is problematic

private extension DateFormatter {
    static let iso8601Full: DateFormatter = {
        let formatter = DateFormatter()
        formatter.dateFormat = "yyyy-MM-dd'T'HH:mm:ss.SSSZZZZZ"
        formatter.calendar = Calendar(identifier: .iso8601)
        formatter.timeZone = TimeZone(secondsFromGMT: 0)
        return formatter
    }()
}

My understanding is that the static initialisation is guarded by something like dispatch_once and ok - or could that in fact be the culprit? We're using that pattern quite a bit.

Yes, this is problematic, particularly on Linux. On macOS this is better but still difficult because DateFormatter is a class (https://developer.apple.com/documentation/foundation/dateformatter).

You're caching one instance of a DateFormatter and then you're sharing it across threads. So if it's ever mutated anywhere, that'd be a problem.

You might want to change it to

static var iso8601Full: DateFormatter {
        let formatter = DateFormatter()
        formatter.dateFormat = "yyyy-MM-dd'T'HH:mm:ss.SSSZZZZZ"
        formatter.calendar = Calendar(identifier: .iso8601)
        formatter.timeZone = TimeZone(secondsFromGMT: 0)
        return formatter
    }

and see if that improves things. This will create a fresh one every time.

finestructure commented 2 months ago

Ah right, of course. I don't think we mutate them but I better just make sure we don't use this pattern to rule it out. Thanks, @weissi !

finestructure commented 2 months ago

FWIW, the DateFormatter changes on our end didn't have any noticeable effect on the crash rate.

weissi commented 2 months ago

FWIW, the DateFormatter changes on our end didn't have any noticeable effect on the crash rate.

Right. I appreciate that we're looking for needle in the haystack here but it might be worth looking into other places that share objects from corelibs-foundation that are related to date/time. Just because we know it's in the bundled ICU libraries in the TimeZone::~Timezone destructor. I believe corelibs-foundation is the only client of icu_69_swift...

finestructure commented 1 month ago

There have been no backtraces in our logs over at least the last 7 days, if not longer (our logs don't go back further and I didn't check in a while). I do see a number of server restarts, many of which will be deployments and maintenance restarts.

In order to be sure this is fixed, I'm going to enable backtraces in prod where we have fewer deployments, making it easier to line them up with restarts.

finestructure commented 1 month ago

Ok, definitely still there. Two crashes this morning: https://gist.github.com/finestructure/e0e8dedd9cc15648e2f0b3bdba054938

One of them is inside our dependency Plot, which is also using the problematic static dateFormatter closure:

internal extension RSS {
    static let dateFormatter: DateFormatter = {
        let formatter = DateFormatter()
        formatter.dateFormat = "E, d MMM yyyy HH:mm:ss Z"
        formatter.locale = Locale(identifier: "en_US_POSIX")
        return formatter
    }()
}

(We're already using a fork of Plot, so that's something we can easily address.)

al45tair commented 1 month ago

I reckon the reason you don't get backtraces is presumably because Foundation hasn't been compiled with frame pointers, @al45tair is this expected?

Hmmm. Yes, probably. I think I turned them on for the Swift runtime itself, but I don't think I've done anything for Foundation. (rdar://132390751 tracks turning them on for Foundation.)

finestructure commented 3 weeks ago

I think we've squashed them all now - there have been no crashes on either env since we deployed the latest fix. I'll close this for now, will reopen in case this pops up again.

Thanks again @weissi for your help!

weissi commented 3 weeks ago

I think we've squashed them all now - there have been no crashes on either env since we deployed the latest fix. I'll close this for now, will reopen in case this pops up again.

Thanks again @weissi for your help!

Great work!! And no worries, honestly I can mostly be replaced with a bot that says "threading bug" when people report somewhat non-deterministic or prod-only crashes in Swift programs ;)