Open russelltg opened 2 weeks ago
We recently updated our aptly to nightly to get #1271, which we ran into a bunch, but woke up today to two builds hung during publish. We use the REST API.
I did save a core file; are there binaries with symbols available? Happy to update this with backtraces if I can get symbols.
We merged a bunch of PRs lately, but have not observed this so far. Are you using async mode for the REST API (i.e. getting a task.ID and waiting for it)?
I would suspect the queue somehow misses a task or does not finish here: https://github.com/aptly-dev/aptly/commit/45035802be4124e1b57acee335fa8ae8c035c90c
What distribution are you on?
goxz is still used for building (https://github.com/aptly-dev/aptly/actions/runs/9545612734/job/26307439641); not sure how to get the symbols there.
I'm on Ubuntu 22.04, and no, we do not use async mode. Would this be preferable?
I would not use async mode if your code does not need to do other things in parallel.
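For reference, async mode just means adding `_async=true` to the request and then watching the returned task; roughly like this (host, port, repo, directory, and task ID are placeholders, and the exact task endpoints should be checked against the API docs for your version):
```
# kick off the import asynchronously; the JSON response includes a task ID
curl -s -X POST "http://localhost:8080/api/repos/myrepo/file/upload-dir?_async=true"

# then poll the task state, or block until it finishes
curl -s "http://localhost:8080/api/tasks/42"
curl -s "http://localhost:8080/api/tasks/42/wait"
```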
The nightly builds should come with debug symbols in the aptly binary:
```
file usr/bin/aptly
usr/bin/aptly: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, Go BuildID=ZT3f2ryYxkeZgzWX4wzJ/qqXbXU1wLzvIZnQpg7Aw/qXraYWmrhjFAhqoaFH6U/GWQWRDivHfWPRXQKJvVH, with debug_info, not stripped
```
So you should be able to analyze the core dump; maybe you need to have the source code in the working directory.
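Something along these lines, with the core file and source paths being placeholders:
```
# load the nightly binary together with the core dump, pointing gdb at the sources
gdb -d /path/to/aptly-sources ./aptly /path/to/core
(gdb) thread apply all bt
```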
It seems there are only debug symbols for the runtime itself:
```
(gdb) thread apply all bt
Thread 20 (LWP 22091):
#0 runtime.futex () at /__t/go/1.21.11/x64/src/runtime/sys_linux_amd64.s:558
#1 0x0000000000436cf0 in runtime.futexsleep (addr=0xfffffffffffffe00, val=0, ns=4650275) at /__t/go/1.21.11/x64/src/runtime/os_linux.go:69
#2 0x000000000040ecc7 in runtime.notesleep (n=0xc000075548) at /__t/go/1.21.11/x64/src/runtime/lock_futex.go:160
#3 0x00000000004414ec in runtime.mPark () at /__t/go/1.21.11/x64/src/runtime/proc.go:1634
#4 runtime.stopm () at /__t/go/1.21.11/x64/src/runtime/proc.go:2531
#5 0x00000000004458b6 in runtime.exitsyscall0 (gp=0xc0001be9c0) at /__t/go/1.21.11/x64/src/runtime/proc.go:4353
#6 0x000000000046b88e in runtime.mcall () at /__t/go/1.21.11/x64/src/runtime/asm_amd64.s:458
#7 0x0000000000000000 in ?? ()
Thread 19 (LWP 22090):
#0 runtime.futex () at /__t/go/1.21.11/x64/src/runtime/sys_linux_amd64.s:558
#1 0x0000000000436cf0 in runtime.futexsleep (addr=0xfffffffffffffe00, val=0, ns=4650275) at /__t/go/1.21.11/x64/src/runtime/os_linux.go:69
#2 0x000000000040ecc7 in runtime.notesleep (n=0xc000100948) at /__t/go/1.21.11/x64/src/runtime/lock_futex.go:160
#3 0x00000000004414ec in runtime.mPark () at /__t/go/1.21.11/x64/src/runtime/proc.go:1634
#4 runtime.stopm () at /__t/go/1.21.11/x64/src/runtime/proc.go:2531
#5 0x0000000000442e1c in runtime.findRunnable (gp=
```
How did you obtain the core dump? Did aptly crash, or did you trigger it?
How are you invoking gdb?
I see the source code when I pass the -d argument and provide the git repo (checked out at b5bf2cbc, "Fix functional tests' '--capture' on Python 3"):
```
gdb ./aptly -d ~/devel/aptly
```
which shows the source info:
```
GNU gdb (Debian 13.1-3) 13.1
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./aptly...
warning: File "/usr/share/go-1.19/src/runtime/runtime-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
add-auto-load-safe-path /usr/share/go-1.19/src/runtime/runtime-gdb.py
line to your configuration file "/home/lynx/.config/gdb/gdbinit".
To completely disable this security protection add
set auto-load safe-path /
line to your configuration file "/home/lynx/.config/gdb/gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual. E.g., run from the shell:
info "(gdb)Auto-loading safe path"
(gdb) l
2
3 import (
4 "os"
5
6 "github.com/aptly-dev/aptly/aptly"
7 "github.com/aptly-dev/aptly/cmd"
8
9 _ "embed"
10 )
11
```
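Side note: if you add the safe-path line from that warning, gdb also picks up Go's runtime-gdb.py helpers (the Go version in the path depends on your install):
```
# append to ~/.config/gdb/gdbinit, path taken from the warning above
add-auto-load-safe-path /usr/share/go-1.19/src/runtime/runtime-gdb.py
```
after which `info goroutines` and `goroutine 1 bt` work inside gdb.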
I attached to it and generated the dump with gdb. I actually think the stack I sent is complete and correct; it's just in the runtime on all threads. I downloaded Delve, which can print goroutine stacks:
```
(dlv) goroutines -t
Goroutine 1 - User: /__t/go/1.21.11/x64/src/net/fd_unix.go:172 net.(*netFD).accept (0x590709) [IO wait 6204700077917805]
0 0x000000000043d84e in runtime.gopark
at /__t/go/1.21.11/x64/src/runtime/proc.go:399
1 0x00000000004360b7 in runtime.netpollblock
at /__t/go/1.21.11/x64/src/runtime/netpoll.go:564
2 0x0000000000467e25 in internal/poll.runtime_pollWait
at /__t/go/1.21.11/x64/src/runtime/netpoll.go:343
3 0x00000000004a8747 in internal/poll.(*pollDesc).wait
at /__t/go/1.21.11/x64/src/internal/poll/fd_poll_runtime.go:84
4 0x00000000004adc2c in internal/poll.(*pollDesc).waitRead
at /__t/go/1.21.11/x64/src/internal/poll/fd_poll_runtime.go:89
5 0x00000000004adc2c in internal/poll.(*FD).Accept
at /__t/go/1.21.11/x64/src/internal/poll/fd_unix.go:611
6 0x0000000000590709 in net.(*netFD).accept
at /__t/go/1.21.11/x64/src/net/fd_unix.go:172
7 0x00000000005a8a3e in net.(*TCPListener).accept
at /__t/go/1.21.11/x64/src/net/tcpsock_posix.go:152
8 0x00000000005a7bf0 in net.(*TCPListener).Accept
at /__t/go/1.21.11/x64/src/net/tcpsock.go:315
9 0x0000000000784bc4 in net/http.(*onceCloseListener).Accept
at
```
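For reference, the Delve invocation was roughly the following (the pid lookup and core filename are illustrative):
```
# attach to the running aptly daemon
dlv attach $(pidof aptly)
# or inspect a saved core against the matching binary
dlv core ./aptly ./core
(dlv) goroutines -t
```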
Thanks for the backtrace!
Does it look like aptly was shutting down because of a signal (goroutine 15)?
How did the builds "hang"? Did the REST API not return, or time out? Was aptly still responding to other APIs? It would be interesting to see the tasks, in case the queue lost one in a race condition.
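Dumping the task queue over the API should show that, e.g. (port is whatever your instance listens on):
```
# list all tasks the queue knows about, with their states
curl -s http://localhost:8080/api/tasks
```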
Right, the REST API isn't returning. Both are stuck on the /api/repos/{repo}/file/{dir} endpoint. It happened again last night; here's another stack for you:
```
Goroutine 1 - User: /__t/go/1.21.11/x64/src/net/fd_unix.go:172 net.(*netFD).accept (0x590709) [IO wait 6721491349644173]
0 0x000000000043d84e in runtime.gopark
at /__t/go/1.21.11/x64/src/runtime/proc.go:399
1 0x00000000004360b7 in runtime.netpollblock
at /__t/go/1.21.11/x64/src/runtime/netpoll.go:564
2 0x0000000000467e25 in internal/poll.runtime_pollWait
at /__t/go/1.21.11/x64/src/runtime/netpoll.go:343
3 0x00000000004a8747 in internal/poll.(*pollDesc).wait
at /__t/go/1.21.11/x64/src/internal/poll/fd_poll_runtime.go:84
4 0x00000000004adc2c in internal/poll.(*pollDesc).waitRead
at /__t/go/1.21.11/x64/src/internal/poll/fd_poll_runtime.go:89
5 0x00000000004adc2c in internal/poll.(*FD).Accept
at /__t/go/1.21.11/x64/src/internal/poll/fd_unix.go:611
6 0x0000000000590709 in net.(*netFD).accept
at /__t/go/1.21.11/x64/src/net/fd_unix.go:172
7 0x00000000005a8a3e in net.(*TCPListener).accept
at /__t/go/1.21.11/x64/src/net/tcpsock_posix.go:152
8 0x00000000005a7bf0 in net.(*TCPListener).Accept
at /__t/go/1.21.11/x64/src/net/tcpsock.go:315
9 0x0000000000784bc4 in net/http.(*onceCloseListener).Accept
at
```
Re signals, I don't think so; I think that's just the goroutine responsible for waiting for signals.
It seems like the goroutines with `apiReposPackageFromDir` in the stack are potentially problematic.
I'll leave the process running today; let me know if there is any info I can get for you by attaching. I did a little digging and didn't see anything obvious, but I'm not terribly familiar with the codebase.
Looking at the previous backtrace, I think aptly is running with -no-lock, probably as configured in the systemd service. This disables database locking, but for concurrency, locking the database would make sense.
Could you try modifying the service and removing the -no-lock flag (sketch below)?
The service then needs to be stopped before running aptly commands on the command line.
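A minimal sketch of the change via a systemd override, assuming the unit is named aptly and currently passes -no-lock (keep whatever -listen flag you already use):
```
sudo systemctl edit aptly

# in the override that opens, clear and redefine ExecStart without -no-lock:
[Service]
ExecStart=
ExecStart=/usr/bin/aptly api serve -listen=:8080
```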
I tested with /api/publish but could not reproduce it. I will try with /api/repos/{repo}/file/{dir} ...
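In case it helps, the client-side sequence I am testing is roughly this (repo name, filename, and port are placeholders):
```
# upload a package into an upload directory
curl -s -X POST -F file=@hello_1.0_amd64.deb http://localhost:8080/api/files/upload-dir
# then import it into the repo (the endpoint that hangs for you)
curl -s -X POST http://localhost:8080/api/repos/myrepo/file/upload-dir
```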
I see. I thought the flag meant aptly would just lock during each transaction, which would allow using aptly on the command line while the API server runs.
I'll remove it and let you know if I hit more hangs.