coreos / rpm-ostree

⚛📦 Hybrid image/package system with atomic upgrades and package layering
https://coreos.github.io/rpm-ostree
Other
857 stars 193 forks source link

thread 'tokio-runtime-worker' panicked at 'assertion failed: !(self.ptype == ProgressType::Task)' #4284

Closed YaLTeR closed 1 year ago

YaLTeR commented 1 year ago

Host system details

● fedora:fedora/rawhide/x86_64/silverblue
                  Version: Rawhide.20230128.n.0 (2023-01-28T06:06:38Z)
               BaseCommit: 50f6d64e26e190a512a14cffbf7a7137771b56f2ef46d77a8ea1a61ef835755e
             GPGSignature: Valid signature by 6A51BBABBA3D5467B6171221809A8D7CEB10B464
      RemovedBasePackages: adwaita-qt5 1.4.2-2.fc38 firefox firefox-langpacks 109.0-1.fc38
                           qgnomeplatform-qt5 0.9.0-8.fc38
          LayeredPackages: alacritty dnf exa fish flatpak-builder gdb gjs-debuginfo glib2-debuginfo
                           gnome-shell-debuginfo hotspot langpacks-ru mesa-dri-drivers-debuginfo mozjs102-debuginfo
                           mutter-debuginfo perf sysprof
                Initramfs: --include /etc/initramfs-overlay /

Expected vs actual behavior

$ rpm-ostree update
⠒ Scanning metadata: 3
Enabled rpm-md repositories: fedora-cisco-openh264 rawhide-modular rawhide rawhide-debuginfo rawhide-source fedora-rawhide-nodebug
⠴ Scanning metadata: 3
thread 'tokio-runtime-worker' panicked at 'assertion failed: !(self.ptype == ProgressType::Task)', rust/src/console_progress.rs:124:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fish: Job 1, 'rpm-ostree update' terminated by signal SIGABRT (Abort)

Expected:

No panic.

Steps to reproduce it

Not reproducible.

Would you like to work on the issue?

Should be assigned to someone else.

cgwalters commented 1 year ago

Do you have a coredump from this? Try using coredumpctl and coredumpctl info to try to extract a stack trace.

YaLTeR commented 1 year ago

I think there's some issue with how rpm-ostree is built because there's nothing in the stack trace:

                #0  0x00007f87f7c6cb94 __pthread_kill_implementation (libc.so.6 + 0x8eb94)
                #1  0x00007f87f7c1baee raise (libc.so.6 + 0x3daee)
                #2  0x00007f87f7c0487f abort (libc.so.6 + 0x2687f)
                #3  0x0000558a579cc927 _ZN11panic_abort18__rust_start_panic5abort17h6ac263e9d2a6eb7eE (rpm-ostree + 0x985927)
                #4  0x00007f87e4021b30 n/a (n/a + 0x0)

I also tried reproducing with RUST_BACKTRACE=1 and that's empty too. Looks like the debug symbols are stripped but not available in any other package.

YaLTeR commented 1 year ago
┌ ~
└─ env RUST_BACKTRACE=full rpm-ostree update
⠤ Receiving metadata objects: 1/(estimating) 49 bytes/s 196 bytes                                                       2 metadata, 0 content objects fetched; 788 B transferred in 5 seconds; 0 bytes content written
Receiving metadata objects: 1/(estimating) 49 bytes/s 196 bytes... done
⠒ Checking out tree 0e96fdf...                                                                                          Enabled rpm-md repositories: fedora-cisco-openh264 rawhide-modular rawhide rawhide-debuginfo rawhide-source fedora-rawhide-nodebug
⠐ Checking out tree 0e96fdf...                                                                                          thread 'tokio-runtime-worker' panicked at 'assertion failed: !(self.ptype == ProgressType::Task)', rust/src/console_progress.rs:124:9
stack backtrace:
   0:     0x55adf3155b4a - <unknown>
   1:     0x55adf317ef8e - <unknown>
   2:     0x55adf314e485 - <unknown>
   3:     0x55adf3155915 - <unknown>
   4:     0x55adf315705f - <unknown>
   5:     0x55adf3156d9b - <unknown>
   6:     0x55adf315776c - <unknown>
   7:     0x55adf31574c2 - <unknown>
   8:     0x55adf3155ffc - <unknown>
   9:     0x55adf3157212 - <unknown>
  10:     0x55adf2952b83 - <unknown>
  11:     0x55adf2952c5d - <unknown>
  12:     0x55adf2be0665 - <unknown>
  13:     0x55adf2a7875f - <unknown>
⠐ Checking out tree 0e96fdf...                                                                                            14:     0x7fe1084673a2 - g_closure_invoke
                               at /usr/src/debug/glib2-2.74.1-3.fc38.x86_64/redhat-linux-build/../gobject/gclosure.c:832:7
  15:     0x7fe108495f58 - signal_emit_unlocked_R.isra.0
                               at /usr/src/debug/glib2-2.74.1-3.fc38.x86_64/redhat-linux-build/../gobject/gsignal.c:3796:8
  16:     0x7fe1084855ea - g_signal_emit_valist
                               at /usr/src/debug/glib2-2.74.1-3.fc38.x86_64/redhat-linux-build/../gobject/gsignal.c:3549:5
  17:     0x7fe108485813 - g_signal_emit
                               at /usr/src/debug/glib2-2.74.1-3.fc38.x86_64/redhat-linux-build/../gobject/gsignal.c:3606:3
⠒ Checking out tree 0e96fdf...                                                                                            18:     0x7fe107f999eb - on_signal_received
                               at /usr/src/debug/glib2-2.74.1-3.fc38.x86_64/redhat-linux-build/../gio/gdbusproxy.c:897:3
  19:     0x7fe107f88070 - emit_signal_instance_in_idle_cb
                               at /usr/src/debug/glib2-2.74.1-3.fc38.x86_64/redhat-linux-build/../gio/gdbusconnection.c:3791:5
⠓ Checking out tree 0e96fdf...                                                                                            20:     0x7fe107d9b49d - g_idle_dispatch
                               at /usr/src/debug/glib2-2.74.1-3.fc38.x86_64/redhat-linux-build/../glib/gmain.c:6124:15
  21:     0x7fe107d9c540 - g_main_dispatch
                               at /usr/src/debug/glib2-2.74.1-3.fc38.x86_64/redhat-linux-build/../glib/gmain.c:3444:28
  22:     0x7fe107d9c540 - g_main_context_dispatch
                               at /usr/src/debug/glib2-2.74.1-3.fc38.x86_64/redhat-linux-build/../glib/gmain.c:4162:7
  23:     0x7fe107df44e8 - g_main_context_iterate.isra.0
                               at /usr/src/debug/glib2-2.74.1-3.fc38.x86_64/redhat-linux-build/../glib/gmain.c:4238:5
  24:     0x7fe107d9baef - g_main_loop_run
                               at /usr/src/debug/glib2-2.74.1-3.fc38.x86_64/redhat-linux-build/../glib/gmain.c:4438:5
  25:     0x55adf2a76868 - <unknown>
  26:     0x55adf2a76a64 - <unknown>
  27:     0x55adf2a69cf5 - <unknown>
  28:     0x55adf2a6dae6 - <unknown>
  29:     0x55adf2a675a4 - <unknown>
  30:     0x55adf2bf554c - <unknown>
  31:     0x55adf2a1d216 - <unknown>
  32:     0x55adf2a63892 - <unknown>
  33:     0x55adf30b1189 - <unknown>
  34:     0x55adf30a16eb - <unknown>
  35:     0x55adf30af09a - <unknown>
  36:     0x55adf315daf3 - <unknown>
  37:     0x7fe10713bc57 - start_thread
  38:     0x7fe1071c1a70 - __clone3
  39:                0x0 - <unknown>
fish: Job 1, 'env RUST_BACKTRACE=full rpm-ost…' terminated by signal SIGABRT (Abort)
istvan-derda commented 1 year ago

Random (maybe helpful) report: Happened to me too. Happened for the first time today when removing the layered gcc package (could that do something?) Will now keep happening every time I do something with the ostree (install/uninstall/rebase), but only the first time. if I run the command a second time it works.

Error happens on fedora silverblue 38 but not on 37

image

YaLTeR commented 1 year ago

It's been happening to me randomly like ⅓ of the time, I think this is just a race condition of some kind. Still happens on this version btw:

rpm-ostree:
 Version: '2023.2'
 Git: 8e81b02cfc3dabcda9af7d4275200efedbe145c7
 Features:
  - rust
  - compose
  - container
  - fedora-integration
cgwalters commented 1 year ago

Sorry about the delay on this. I put up https://github.com/coreos/rpm-ostree/pull/4348 which should help us debug. Once that PR lands a build should show up in https://copr.fedorainfracloud.org/coprs/g/CoreOS/continuous/

Then env G_MESSAGES_DEBUG=rpm-ostree,rpm-ostreed RUST_LOG=rpm_ostree=debug,rpmostree_rust=debug rpm-ostree rebase fedora:fedora/x86_64/coreos/next

should get us a bit more debugging information. Also this is a client-server setup, so we may also need to do e.g.

$ systemctl edit --runtime rpm-ostreed

Then paste in there:

[Service]
Environment=G_MESSAGES_DEBUG=rpm-ostreed
Environment=RUST_LOG=rpm_ostree=debug,rpmostree_rust=debug

And systemctl restart rpm-ostreed.

cgwalters commented 1 year ago

Ok @YaLTeR the debug patch is available in https://copr.fedorainfracloud.org/coprs/g/CoreOS/continuous/ - let me know if you need help trying out enabling that build alongside injecting the debug configuration above!

cgwalters commented 1 year ago

Also cc @istvan-derda since you hit this too

YaLTeR commented 1 year ago

Got a new one with the debug logging:

⠠ Importing rpm-md  75% [███████████████░░░░░] (0s)                                                                     (rpm-ostree update:4686): rpm-ostreed-DEBUG: 08:32:52.724: txn progress PercentProgress
⠤ Importing rpm-md  80% [████████████████░░░░] (0s)                                                                     (rpm-ostree update:4686): rpm-ostreed-DEBUG: 08:32:52.918: txn progress PercentProgress
⠦ Importing rpm-md  85% [█████████████████░░░] (0s)                                                                     (rpm-ostree update:4686): rpm-ostreed-DEBUG: 08:32:52.964: txn progress PercentProgress
⠖ Importing rpm-md  90% [██████████████████░░] (0s)                                                                     (rpm-ostree update:4686): rpm-ostreed-DEBUG: 08:32:53.041: txn progress TaskBegin
thread 'tokio-runtime-worker' panicked at 'Overwriting task: "Importing rpm-md"', rust/src/console_progress.rs:179:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fish: Job 1, 'env G_MESSAGES_DEBUG=rpm-ostree…' terminated by signal SIGABRT (Abort)
cgwalters commented 1 year ago

OK, sadly I did the wrong thing with the debug logging...just put up https://github.com/coreos/rpm-ostree/pull/4353 which should really help this time. What's going wrong is we're trying to output a new progress bar when an old one is still active, and we need to know exactly which are old and new.

YaLTeR commented 1 year ago

Would it be possible to update the COPR with RPMs including that PR? Or just send the RPMs here for F38

YaLTeR commented 1 year ago

Btw, for me the crash reproduces basically every time if I update with rpm-ostree update and at the same time do flatpak update -y.

BrainBlasted commented 1 year ago

I'm getting this with a brand-new F38 Silverblue install FWIW

osslate commented 1 year ago

Have been experiencing this intermittently with F38 also. Sometimes rerunning rpm-ostree upgrade works, but other times the core dump happens consecutively.

For me this is happening at this point:

⠙ Running posttrans scripts...                                                                                                                                                   thread 'tokio-runtime-worker' panicked at 'assertion failed: !(self.ptype == ProgressType::Task)', rust/src/console_progress.rs:124:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Aborted (core dumped)
YaLTeR commented 1 year ago

I built an rpm from main, will try to reproduce again.

YaLTeR commented 1 year ago
⠉ Receiving objects; 83% (506/605) 3,1 MB/s 146,6 MB                                                                    (rpm-ostree update:26250): rpm-ostreed-DEBUG: 22:18:41.679: txn progress DownloadProgress
⠈ Receiving objects; 84% (509/605) 3,1 MB/s 148,9 MB                                                                    (rpm-ostree update:26250): rpm-ostreed-DEBUG: 22:18:41.909: txn progress Message
Enabled rpm-md repositories: fedora-cisco-openh264 fedora-modular updates-modular updates updates-debuginfo updates-source fedora fedora-debuginfo fedora-source updates-archive
⠈ Receiving objects; 84% (509/605) 3,1 MB/s 148,9 MB                                                                    (rpm-ostree update:26250): rpm-ostreed-DEBUG: 22:18:42.219: txn progress PercentProgress
thread 'tokio-runtime-worker' panicked at 'expected non-task progress type; current message is Receiving objects; 84% (509/605) 3,1 MB/s 148,9 MB', rust/src/console_progress.rs:128:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fish: Job 1, 'env G_MESSAGES_DEBUG=rpm-ostree…' terminated by signal SIGABRT (Abort)
YaLTeR commented 1 year ago

And a second one for the bulk (flatpak update -y took a while, so I managed to get two crashes):

DEBUG Creating progress "Running post scripts" with type Task
⠖ Running post scripts...                                                                                               (rpm-ostree update:27127): rpm-ostreed-DEBUG: 22:21:11.695: txn progress TaskEnd
DEBUG Ending progress Task; suffix=Some("done")
Running post scripts... done
(rpm-ostree update:27127): rpm-ostreed-DEBUG: 22:21:11.696: txn progress TaskBegin
DEBUG Creating progress "Running posttrans scripts" with type Task
⠦ Running posttrans scripts...                                                                                          (rpm-ostree update:27127): rpm-ostreed-DEBUG: 22:21:15.202: txn progress Message
Enabled rpm-md repositories: fedora-cisco-openh264 fedora-modular updates-modular updates updates-debuginfo updates-source fedora fedora-debuginfo fedora-source updates-archive
⠒ Running posttrans scripts...                                                                                          (rpm-ostree update:27127): rpm-ostreed-DEBUG: 22:21:15.573: txn progress PercentProgress
thread 'tokio-runtime-worker' panicked at 'expected non-task progress type; current message is Running posttrans scripts', rust/src/console_progress.rs:128:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fish: Job 1, 'env G_MESSAGES_DEBUG=rpm-ostree…' terminated by signal SIGABRT (Abort)
ejgr-mtsiw commented 1 year ago

Was having the same error

[someone@somewhere ~]$ rpm-ostree update
Enabled rpm-md repositories: fedora-cisco-openh264 fedora-modular updates-modular updates fedora rpmfusion-nonfree-nvidia-driver
Importing rpm-md... done
rpm-md repo 'fedora-cisco-openh264' (cached); generated: 2023-03-14T10:56:46Z solvables: 4
rpm-md repo 'fedora-modular' (cached); generated: 2023-04-14T09:25:02Z solvables: 1082
rpm-md repo 'updates-modular' (cached); generated: 2018-02-20T19:18:14Z solvables: 0
rpm-md repo 'updates' (cached); generated: 2023-04-20T04:37:27Z solvables: 3831
rpm-md repo 'fedora' (cached); generated: 2023-04-14T09:32:40Z solvables: 69222
rpm-md repo 'rpmfusion-nonfree-nvidia-driver' (cached); generated: 2023-04-08T18:30:02Z solvables: 31
⠈ Receiving metadata objects: 0/(estimating) -/s 0 bytes...
Enabled rpm-md repositories: fedora-cisco-openh264 fedora-modular updates-modular updates fedora rpmfusion-nonfree-nvidia-driver
⠁ Receiving metadata objects: 0/(estimating) -/s 0 bytes...
thread 'tokio-runtime-worker' panicked at 'assertion failed: !(self.ptype == ProgressType::Task)', rust/src/console_progress.rs:124:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Just did a rpm-ostree cleanup -b -p -r -m and everything ran fine after that. I don't think all the options are needed, but it was the command I had in history and I just ran it on a whim... I guess the problem is related to some cached content...?

shoop commented 1 year ago

Piling on here, randomly on overlay of RPM fusion mesa-vdpau-drivers-freeworld: rpm-ostree install mesa-vdpau-drivers-freeworld. Just updated to F38 (pretty smooth apart from RPM fusion stuff, thanks!). Deployment completed on repeating the command.

coredumpctl info:

stijn@tangaloor: ~
$ coredumpctl info
           PID: 3879 (rpm-ostree)
           UID: 1000 (stijn)
           GID: 1000 (stijn)
        Signal: 6 (ABRT)
     Timestamp: Fri 2023-04-21 10:42:34 CEST (1min 38s ago)
  Command Line: rpm-ostree install mesa-vdpau-drivers-freeworld
    Executable: /usr/bin/rpm-ostree
 Control Group: /user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-3cf7a228-1d18-4007-aa7e-6363284335bb.scope
          Unit: user@1000.service
     User Unit: vte-spawn-3cf7a228-1d18-4007-aa7e-6363284335bb.scope
         Slice: user-1000.slice
     Owner UID: 1000 (stijn)
       Boot ID: 302b227720394945a6a38ccd187a901b
    Machine ID: 49cf6ab9e0d14993ba7805d3a9dd578b
      Hostname: tangaloor
       Storage: /var/lib/systemd/coredump/core.rpm-ostree.1000.302b227720394945a6a38ccd187a901b.3879.1682066554000000.zst (present)
  Size on Disk: 764.2K
       Message: Process 3879 (rpm-ostree) of user 1000 dumped core.

                Module libcrypt.so.2 from rpm libxcrypt-4.4.33-7.fc38.x86_64
                Module libbrotlicommon.so.1 from rpm brotli-1.0.9-11.fc38.x86_64
                Module libsasl2.so.3 from rpm cyrus-sasl-2.1.28-9.fc38.x86_64
                Module libevent-2.1.so.7 from rpm libevent-2.1.12-8.fc38.x86_64
                Module libkeyutils.so.1 from rpm keyutils-1.6.1-6.fc38.x86_64
                Module libkrb5support.so.0 from rpm krb5-1.20.1-8.fc38.x86_64
                Module libcom_err.so.2 from rpm e2fsprogs-1.46.5-4.fc38.x86_64
                Module libk5crypto.so.3 from rpm krb5-1.20.1-8.fc38.x86_64
                Module libkrb5.so.3 from rpm krb5-1.20.1-8.fc38.x86_64
                Module libbrotlidec.so.1 from rpm brotli-1.0.9-11.fc38.x86_64
                Module libgssapi_krb5.so.2 from rpm krb5-1.20.1-8.fc38.x86_64
                Module libpsl.so.5 from rpm libpsl-0.21.2-2.fc38.x86_64
                Module libssh.so.4 from rpm libssh-0.10.4-4.fc38.x86_64
                Module libidn2.so.0 from rpm libidn2-2.3.4-2.fc38.x86_64
                Module libnghttp2.so.14 from rpm nghttp2-1.52.0-1.fc38.x86_64
                Module libblkid.so.1 from rpm util-linux-2.38.1-4.fc38.x86_64
                Module libattr.so.1 from rpm attr-2.5.1-6.fc38.x86_64
                Module libyaml-0.so.2 from rpm libyaml-0.2.5-9.fc38.x86_64
                Module libmagic.so.1 from rpm file-5.44-3.fc38.x86_64
                Module libzck.so.1 from rpm zchunk-1.3.1-1.fc38.x86_64
                Module libcurl.so.4 from rpm curl-7.87.0-7.fc38.x86_64
                Module libaudit.so.1 from rpm audit-3.1-2.fc38.x86_64
                Module liblua-5.4.so from rpm lua-5.4.4-9.fc38.x86_64
                Module libpopt.so.0 from rpm popt-1.19-2.fc38.x86_64
                Module librpm_sequoia.so.1 from rpm rust-rpm-sequoia-1.3.0-1.fc38.x86_64
                Module libpcre2-8.so.0 from rpm pcre2-10.42-1.fc38.1.x86_64
                Module libffi.so.8 from rpm libffi-3.4.4-2.fc38.x86_64
                Module libselinux.so.1 from rpm libselinux-3.5-1.fc38.x86_64
                Module libmount.so.1 from rpm util-linux-2.38.1-4.fc38.x86_64
                Module libgmodule-2.0.so.0 from rpm glib2-2.76.1-1.fc38.x86_64
                Module libxml2.so.2 from rpm libxml2-2.10.4-1.fc38.x86_64
                Module libbz2.so.1 from rpm bzip2-1.0.8-13.fc38.x86_64
                Module liblz4.so.1 from rpm lz4-1.9.4-2.fc38.x86_64
                Module libzstd.so.1 from rpm zstd-1.5.5-1.fc38.x86_64
                Module liblzma.so.5 from rpm xz-5.4.1-1.fc38.x86_64
                Module libacl.so.1 from rpm acl-2.3.1-6.fc38.x86_64
                Module libostree-1.so.1 from rpm ostree-2023.1-2.fc38.x86_64
                Module libz.so.1 from rpm zlib-1.2.13-3.fc38.x86_64
                Module libcrypto.so.3 from rpm openssl-3.0.8-2.fc38.x86_64
                Module libssl.so.3 from rpm openssl-3.0.8-2.fc38.x86_64
                Module libmodulemd.so.2 from rpm libmodulemd-2.14.0-5.fc38.x86_64
                Module libsolv.so.1 from rpm libsolv-0.7.22-4.fc38.x86_64
                Module libsolvext.so.1 from rpm libsolv-0.7.22-4.fc38.x86_64
                Module librepo.so.0 from rpm librepo-1.15.1-2.fc38.x86_64
                Module libjson-c.so.5 from rpm json-c-0.16-4.fc38.x86_64
                Module libsystemd.so.0 from rpm systemd-253.2-1.fc38.x86_64
                Module librpmio.so.9 from rpm rpm-4.18.1-1.fc38.x86_64
                Module librpm.so.9 from rpm rpm-4.18.1-1.fc38.x86_64
                Module libglib-2.0.so.0 from rpm glib2-2.76.1-1.fc38.x86_64
                Module libgobject-2.0.so.0 from rpm glib2-2.76.1-1.fc38.x86_64
                Module libgio-2.0.so.0 from rpm glib2-2.76.1-1.fc38.x86_64
                Module libpolkit-gobject-1.so.0 from rpm polkit-122-3.fc38.1.x86_64
                Module libarchive.so.13 from rpm libarchive-3.6.1-4.fc38.x86_64
                Module libjson-glib-1.0.so.0 from rpm json-glib-1.6.6-4.fc38.x86_64
                Module libcap.so.2 from rpm libcap-2.48-6.fc38.x86_64
                Stack trace of thread 3880:
                #0  0x00007fc118468b94 __pthread_kill_implementation (libc.so.6 + 0x8eb94)
                #1  0x00007fc118417aee raise (libc.so.6 + 0x3daee)
                #2  0x00007fc11840087f abort (libc.so.6 + 0x2687f)
                #3  0x0000565195440917 _ZN11panic_abort18__rust_start_panic5abort17h3c900776fa44b50cE (rpm-ostree + 0x99e917)
                #4  0x00007fc104029720 n/a (n/a + 0x0)
                ELF object binary architecture: AMD x86-64
ghost commented 1 year ago

I also have this time to time no idea where it come, at first i've thinked it would come from some source blocked, but i never have checked more since i've seen some other had also the problem.

I say i have it to see any potential fix, and let know that one more user have the problem.

garrett commented 1 year ago

FWIW, this happens on every one of my systems for the past few months; I thought it would be fixed quickly (as these things usually are), and I've tried all the workarounds listed here (clearing cache and such), but it still happens. I have to try to upgrade 2 to 5 times before it finishes with "no changes" (even though there are changes according to rpm-ostree status and reboots); it's likely because the last time, when it's successful, has no changes to the previous time, when it failed with the error.

I do have overrides on my machines, and I do have third party repos enabled (OBS for darktable on my personal desktop and a copr for a patched mutter on both personal and work) — but this was even happening on my work machine without any additional repos (before I added the patched triple-buffered mutter). However, I do have a local package on my work laptop (for VPN certificates) and layered Fedora packages on both.

garrett commented 1 year ago

I've asked my coworkers who also use Silveblue and they haven't seen the issue. They don't layer packages, however.

So my guess is that it's really related to layering packages.

ghost commented 1 year ago

I've asked my coworkers who also use Silveblue and they haven't seen the issue. They don't layer packages, however.

So my guess is that it's really related to layering packages.

The worst it's this bug wasn't there on 37

cgwalters commented 1 year ago

@garrett Can you try grabbing some of the debug info using env G_MESSAGES_DEBUG=rpm-ostree,rpm-ostreed RUST_LOG=rpm_ostree=debug,rpmostree_rust=debug rpm-ostree upgrade or so from https://github.com/coreos/rpm-ostree/issues/4284#issuecomment-1483911450 ?

It'd be particularly helpful to use rpm-ostree-2023.3 as a base for updating.

I've tried briefly again today to reproduce this, and took another tour through the code trying to retrofit these symptoms to code paths; but no luck so far. If we don't succeed soon I'll try just spamming debug log prints around the relevant code paths I can think of and have someone who is hitting this update to that.

garrett commented 1 year ago

@cgwalters: Sure!

rpm-ostree:
 Version: '2023.3'
 Git: 8ab6f143a0ecad8b125b47dee8bbeb2e99f1b215
 Features:
  - rust
  - compose
  - container
  - fedora-integration

On my work laptop, before I realized I should modify the service too:

rpm-ostree--rebase--debug.txt

On my personal workstation, where I did modify the service before:

rpm-ostree--debug--personal.txt

The second of these is probably more useful, as it should have more debugging info:

cgwalters commented 1 year ago

In

Writing rpmdb...done
(rpm-ostree upgrade:37328): rpm-ostreed-DEBUG: 11:37:59.031: txn progress TaskBegin
DEBUG Creating progress "Writing OSTree commit" with type Task
⠒                                                                               (rpm-ostree upgrade:37328): rpm-ostreed-DEBUG: 11:38:09.273: txn progress Message
Enabled rpm-md repositories: fedora-modular updates-modular updates fedora graphics_darktable graphics_darktable_master copr:copr.fedorainfracloud.org:calcastor:gnome-patched updates-archive
(rpm-ostree upgrade:37328): rpm-ostreed-DEBUG: 11:38:09.377: txn progress PercentProgress
thread 'tokio-runtime-worker' panicked at 'expected non-task progress type; current message is Writing OSTree commit', rust/src/console_progress.rs:128:13

It's really strange to me that we're seeing the "Enabled rpm-md repositories" message at that point...that seems like a leftover bit that should have been seen earlier.

I'm starting to build up a theory that this is somehow related to us iterating a new main context in the commit path, but we have leftover queued idle work on the mainloop or so? But not yet clear to me what that is...

@garrett Can you also attach the daemon-side logs from that run? i.e. journalctl -u rpm-ostreed around that time.

garrett commented 1 year ago

@cgwalters: Apologies for the delay! Here are the logs from ostreed on May 09 from my personal machine:

ostreed-may09.txt

I think the relevant entries should be around 11:37 - 11:38.

cgwalters commented 1 year ago

All of the repeated "loading fedora" messages looked very odd to me, and digging in this is coming from the "sysroot changed' signal. We don't have even debug logging for invocation of these messages (will add) but...that gave me a clue.

I think I'm starting to understand this bug. Basically what I think is happening right now is that there are two processes talking to the daemon (gnome-software and /usr/bin/rpm-ostree).

While the human has invoked e.g. /usr/bin/rpm-ostree upgrade, at the same time gnome-software is calling one of the other os methods that are not transactions (and hence not serialized). I'm not sure which yet; maybe e.g. get_deployments_rpm_diff?

But basically at some point, one of these other methods ends up calling something in the core that tries to refresh rpm-md, and that in turn will try to invoke the progress APIs, and that will cause this crash.

First I'm going to add a lot more useful debug logging to this path to verify.

I think the best fix here is to change our output infrastructure to only route through the dbus txn progress for the transaction thread.

cgwalters commented 1 year ago

Ahhh yes I think the core bug here is everything using os_create_dnf_context_simple (which is the "not transactional inspection" stuff) calling rpmostree_context_download_metadata which will indeed enable the progress infrastructure. That'd completely explain the race here.

cgwalters commented 1 year ago

Yep, this is really easy to reproduce now that I understand the bug; having layered packages makes it much much more likely (but I think it may be reproducible without).

Basically in one shell: while rpm-ostree upgrade >/dev/null; do :; done And in another shell, use this handy helper to do rpmostree_busctl_call_os WhatProvides as 1 provided-testing-daemon. That really quickly causes the first client to crash.

So yes, will look at reworking the transaction progress stuff to be thread-local.

cgwalters commented 1 year ago

OK, got a PR up in https://github.com/coreos/rpm-ostree/pull/4405

juhp commented 1 year ago

I also hit this several times today in the office, but when I came home 'update' worked first time.

cgwalters commented 1 year ago

Can one or more people try out https://copr.fedorainfracloud.org/coprs/g/CoreOS/continuous/build/5919148/ which has the fix merged?

YaLTeR commented 1 year ago

Well, it's just a single sample, but I just did rpm-ostree update together with flatpak update -y, which usually triggered the issue, and it went fine.

cgwalters commented 1 year ago

This is now up as https://bodhi.fedoraproject.org/updates/FEDORA-2023-ce6c9dcd69