Hold on, that segfault is from 10:01 this morning, so it's certainly not the issue.
Again, it seems like the segfault/signal 11 is a red herring and the hangs are due to something else, but who knows what's going on.
This is on prod with Vapor 4.81.0, compiled with Swift 5.9 from Sep 1.
################################################################
#                                                              #
# Swift Nightly Docker Image                                   #
# Tag: swift-5.9-DEVELOPMENT-SNAPSHOT-2023-09-01-a             #
#                                                              #
################################################################
@finestructure If the corefile (indicated by "(core dumped)") is still present in the container that runs the command, it can be used to get a backtrace of the crash.
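For example, something along these lines from inside the container (just a sketch - it assumes lldb is available in the image and that the core file was written next to the binary, neither of which is a given):

lldb /app/Run -c /app/core
(lldb) bt all

bt all prints a backtrace for every thread, which should show where the signal 11 is coming from.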
prod, 16:33 CET, Sep 14 2023
I missed the first alert and only checked on this now (21:37 CET (UTC+2)). I've pulled the logs and the closest seg fault is at 13:28 UTC:
2023-09-14T13:28:13.410953838Z /bin/bash: line 1: 29712 Segmentation fault (core dumped) ./Run analyze --env prod --limit 25
with processing continuing until 14:18 UTC, when the connection timeout messages leading up to the hang start appearing:
2023-09-14T14:18:09.519182656Z [ ERROR ] Connection request (ID 6) timed out. This might indicate a connection deadlock in your application. If you have long-running requests, consider increasing your connection timeout. [component: server, database-id: psql]
2023-09-14T14:18:09.529500376Z [ ERROR ] Connection request (ID 7) timed out. This might indicate a connection deadlock in your application. If you have long-running requests, consider increasing your connection timeout. [component: server, database-id: psql]
There are 15 seg faults in the log file I pulled (ranging from Sep 11 to Sep 14). I think it's safe to say that the seg faults aren't the cause of the hangs.
I looked for a core file in the running container but there was none in the executable's directory, nor in a few other places I checked (/var/cache/abrt, /var/spool/abrt, /var/crash). The core dump size is unlimited:
root@fce97f658c3d:/app# ulimit -a
real-time non-blocking time (microseconds, -R) unlimited
core file size (blocks, -c) unlimited
It's late, so I've restarted the container for now. I can look into where the core file ends up some other time - these crashes seem to happen frequently enough (15 times in 3 days).
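A pointer for later (general Linux behaviour, not specific to this setup): the kernel's core_pattern setting decides where core files are written, and it is shared with the host rather than being per-container:

cat /proc/sys/kernel/core_pattern

If the pattern starts with a pipe (|), cores are handed to a handler such as systemd-coredump or apport on the host, so they will never show up inside the container's filesystem even though ulimit allows them.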
BTW, the logs do not contain any stack trace info, despite the latest Vapor and Swift 5.9 🤔 (more on that after the excerpt below):
2023-09-14T13:28:08.682435359Z [ INFO ] pulling https://github.com/pointfreeco/swift-dependencies.git in /checkouts/github.com-pointfreeco-swift-dependencies [component: analyze]
2023-09-14T13:28:08.684063885Z [ INFO ] pulling https://github.com/Alamofire/AlamofireImage.git in /checkouts/github.com-alamofire-alamofireimage [component: analyze]
2023-09-14T13:28:09.417655788Z [ WARNING ] stderr: From https://github.com/team-telnyx/telnyx-webrtc-ios
2023-09-14T13:28:09.417694189Z * [new tag] 0.1.10 -> 0.1.10 [component: analyze]
2023-09-14T13:28:10.842271007Z [ INFO ] throttled 1 incoming revisions [component: analyze]
2023-09-14T13:28:10.937562626Z [ INFO ] throttled 1 incoming revisions [component: analyze]
2023-09-14T13:28:11.961565948Z [ INFO ] Updating 25 packages for stage 'analysis' (errors: 0) [component: analyze]
2023-09-14T13:28:13.410953838Z /bin/bash: line 1: 29712 Segmentation fault (core dumped) ./Run analyze --env prod --limit 25
2023-09-14T13:28:33.536090365Z [ INFO ] Analyzing (limit: 25) ... [component: analyze]
2023-09-14T13:28:33.592712560Z [ INFO ] Checkout directory: /checkouts [component: analyze]
2023-09-14T13:28:33.593526773Z [ INFO ] Updating 0 packages for stage 'analysis' (errors: 0) [component: analyze]
2023-09-14T13:28:55.544915191Z [ INFO ] Analyzing (limit: 25) ... [component: analyze]
2023-09-14T13:28:55.619826266Z [ INFO ] Checkout directory: /checkouts [component: analyze]
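On the missing stack traces: if I recall the Swift 5.9 backtracer correctly, it only kicks in for interactive (tty) sessions by default, so a non-interactive process in a container may need it enabled explicitly via the SWIFT_BACKTRACE environment variable - worth verifying against the Swift backtracing docs rather than taking this sketch at face value:

SWIFT_BACKTRACE=enable=yes ./Run analyze --env prod --limit 25

If that is what's happening, a crash should then print a per-thread backtrace to stderr instead of just bash's "Segmentation fault (core dumped)" line.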
@finestructure For next time, I'd suggest just searching the entire filesystem, e.g. find / -name core (if that finds nothing, I'd give it one more shot with a more permissive search, like find / -iname '*core*').
Closing this as fixed now - we haven't had a hang since we removed the TaskGroup in analysis in #2656 (released as 2.91.9).
Finally 🙂🎉
Huge thanks again to Gwynne for all the help!
I finally caught a glimpse as to why analysis sometimes hangs:
The backtrace itself isn't in the logs, but hopefully the crash is reproducible with one of the packages in question.