coreos / rpm-ostree

⚛📦 Hybrid image/package system with atomic upgrades and package layering
https://coreos.github.io/rpm-ostree
Other
870 stars 195 forks source link

rpm-ostree upgrade fail with "Couldn't find file object " #1178

Open mscherer opened 6 years ago

mscherer commented 6 years ago

Host system details

# rpm-ostree  status 
State: idle
Deployments:
● fedora-atomic-27:fedora/27/x86_64/atomic-host
                   Version: 27.25 (2017-12-10 18:40:57)
                    Commit: a2b80278eea897eb1fec7d008b18ef74941ff5a54f86b447a2f4da0451c4291a
             PendingCommit: b5845ebd002b2ec829c937d68645400aa163e7265936b3e91734c6f33a510473
            PendingVersion: 27.44 (2018-01-01 21:54:31)
              GPGSignature: Valid signature by 860E19B0AFA800A1751881A6F55E7430F5282EE4

Expected vs actual behavior

# rpm-ostree  upgrade 
1 delta parts, 1 loose fetched; 1 KiB transferred in 2 seconds
error: Checking out tree: Couldn't find file object 'f74fc59945973c6c6b30cdbc78d13112f84897ae50d8660706766f0951cc8a62'

Expected:

# rpm-ostree upgrade
...
Success!

Steps to reproduce it

So I was trying to upgrade my system, since there is a new release.

Then I did had this error:

[root@osnode-3 ~]# rpm-ostree  upgrade -r 

Receiving delta parts: 3/9 92,9 ko/s 119,4 Mo/242,1 Mo                                                                                                                                                           
error: [18] Transferred a partial file

I retried again upgrade, and then the error message appeared.

Then I figured "there is a cache issue", and tried the various cleanup options (-b, -r, -m, etc), and none changed. I didn't reboot yet.

In the log, I can see

Initiated txn Upgrade for client :1.16 (uid 0): /org/projectatomic/rpmostree1/fedora_atomic
libostree HTTP error from remote fedora-atomic-27 for <https://kojipkgs.fedoraproject.org/atomic/27/objects/be/26bcb0e4ebe1ca10e5fdce21f5c76d98df08f25c1cc57b60f1930f9cba50e9.filez>: Transferred a partial file
libostree HTTP error from remote fedora-atomic-27 for <https://kojipkgs.fedoraproject.org/atomic/27/objects/f7/4fc59945973c6c6b30cdbc78d13112f84897ae50d8660706766f0951cc8a62.filez>: Transferred a partial file
libostree HTTP error from remote fedora-atomic-27 for <https://kojipkgs.fedoraproject.org/atomic/27/objects/8e/ecfce72b8a417a2cfb29b27fea66ba30a9413d9f7e690b7b6958bf0964e599.filez>: Transferred a partial file
libostree HTTP error from remote fedora-atomic-27 for <https://kojipkgs.fedoraproject.org/atomic/27/deltas/or/gCeO6ol+sf7H0AixjvdJQf9aVPhrRHovTaBFHEKRo-tYRevQArLsgpyTfWhkVACqFj5yZZNrPpFzTG8zpRBHM/6>: Error in the HTTP2 framing layer
Txn /org/projectatomic/rpmostree1/fedora_atomic failed: [18] Transferred a partial file

So while HTTP2 error are well know, I suspect this did result into a crappy state on disk.

Would you like to work on the issue?

I do not know where to start, but will diagnose as much as I can.

cgwalters commented 6 years ago

Hmm. This isn't (likely) related to the HTTP2 issues. It sounds like the commit was written, but an object was missing.

ostree fsck should have a bit more information.

You should be able to work around this by doing:

# touch /ostree/repo/state/b5845ebd002b2ec829c937d68645400aa163e7265936b3e91734c6f33a510473.commitpartial

or so. See also: https://github.com/ostreedev/ostree/pull/345#issuecomment-262263824

cgwalters commented 6 years ago

(It'd all work a lot more nicely if we had fsck automatically drop commitpartial for missing objects, will do that at some point)

mscherer commented 6 years ago

So the problem seems to have fixed by itself, as I was trying to diagnose things, and it just started to work while I was using strace on it. At least it seems now to download, I will report if that's truly fixed or not.

mscherer commented 6 years ago

Now, I get:

[root@osnode-3 ~]# rpm-ostree  upgrade 

Receiving delta parts: 0/4 1,2 Mo/s 11,2 Mo/216,8 Mo                                                  ^[[Receiving delta parts: 3/4 126,8 ko/s 96,7 Mo/216,8 Mo                                                
error: opcode set-read-source: Couldn't find file object '2c9d171e5c0ff46027953f045b8fb9f5cec383ef95c95bc9400fd9d8b9f2c81e'
cgwalters commented 6 years ago

What does ostree fsck say? Are there a lot of missing objects?

mscherer commented 6 years ago

Yes, and after reboot, rpm-ostree do not even work anymore. I suspect my attempt at cleanup didn't help.

mscherer commented 6 years ago

So, after some manual fiddling, I now have this:

janv. 04 18:58:23 osnode-3.example.org systemd-coredump[1300]: Process 1296 (rpm-ostree) of user 0 dumped core.

                                                               Stack trace of thread 1296:
                                                               #0  0x00007f396d3942a0 g_str_hash (libglib-2.0.so.0)
                                                               #1  0x00007f396d392f5b g_hash_table_insert_internal (libglib-2.0.so.0)
                                                               #2  0x00007f396d3df871 g_variant_dict_init (libglib-2.0.so.0)
                                                               #3  0x00007f396d3df8d1 g_variant_dict_new (libglib-2.0.so.0)
                                                               #4  0x00005584cbe151ac rpmostree_deployment_get_layered_info (rpm-ostree)
                                                               #5  0x00005584cbe34e9f rpmostreed_deployment_generate_variant (rpm-ostree)
                                                               #6  0x00005584cbe332c6 sysroot_populate_deployments_unlocked (rpm-ostree)
                                                               #7  0x00005584cbe3437d rpmostreed_sysroot_populate (rpm-ostree)
                                                               #8  0x00005584cbe31613 rpmostreed_daemon_initable_init (rpm-ostree)
                                                               #9  0x00007f396d922367 g_initable_new_valist (libgio-2.0.so.0)
                                                               #10 0x00007f396d922419 g_initable_new (libgio-2.0.so.0)
                                                               #11 0x00005584cbe052b3 start_daemon (rpm-ostree)
                                                               #12 0x00005584cbe05747 rpmostree_builtin_start_daemon (rpm-ostree)
                                                               #13 0x00005584cbdf2b59 main (rpm-ostree)
                                                               #14 0x00007f396c95b03a __libc_start_main (libc.so.6)
                                                               #15 0x00005584cbdf2d8a _start (rpm-ostree)

                                                               Stack trace of thread 1297:
                                                               #0  0x00007f396ca478bb __poll (libc.so.6)
                                                               #1  0x00007f396d3a4ed9 g_main_context_iterate.isra.25 (libglib-2.0.so.0)
                                                               #2  0x00007f396d3a4fec g_main_context_iteration (libglib-2.0.so.0)
                                                               #3  0x00007f396d3a5031 glib_worker_main (libglib-2.0.so.0)
                                                               #4  0x00007f396d3cc4c6 g_thread_proxy (libglib-2.0.so.0)
                                                               #5  0x00007f396cd26609 start_thread (libpthread.so.0)
                                                               #6  0x00007f396ca53e6f __clone (libc.so.6)

                                                               Stack trace of thread 1298:
                                                               #0  0x00007f396ca478bb __poll (libc.so.6)
                                                               #1  0x00007f396d3a4ed9 g_main_context_iterate.isra.25 (libglib-2.0.so.0)
                                                               #2  0x00007f396d3a5272 g_main_loop_run (libglib-2.0.so.0)
                                                               #3  0x00007f396d98db36 gdbus_shared_thread_func (libgio-2.0.so.0)
                                                               #4  0x00007f396d3cc4c6 g_thread_proxy (libglib-2.0.so.0)
                                                               #5  0x00007f396cd26609 start_thread (libpthread.so.0)
                                                               #6  0x00007f396ca53e6f __clone (libc.so.6)

Manual fiddling, I reserectued object, and then marked them as "partial" with touch

# ls /ostree/repo/state/
a2b80278eea897eb1fec7d008b18ef74941ff5a54f86b447a2f4da0451c4291a.commitpartial  b5845ebd002b2ec829c937d68645400aa163e7265936b3e91734c6f33a510473.commitpartial
mscherer commented 6 years ago

And fsck is good:

# ostree fsck 
Validating refs...
Enumerating objects...
Verifying content integrity of 1 commit objects...
1/29227 objects
2923/29227 objects
5845/29227 objects
8767/29227 objects
11689/29227 objects
14611/29227 objects
17533/29227 objects
20455/29227 objects
23377/29227 objects
26299/29227 objects
29221/29227 objects
2 partial commits not verified
cgwalters commented 6 years ago

fsck won't error on missing objects if the commits are marked as partial.

cgwalters commented 6 years ago

Weird, not sure what's going on with that stack trace...the code definitely handles missing commits (though doing so does cause the daemon to error out here): Jan 05 02:54:21 localhost.localdomain rpm-ostree[4318]: error: Couldn't start daemon: Error setting up sysroot: Reading deployment 1: No such metadata object b5845e...

Hmm. At this point if you have the ability to capture the state of this system and post it it'd be useful. You probably don't need the whole disk image, just e.g. tar czf ostree-repo.tar.gz /ostree/repo/ or so. Plus an ls -al /ostree/deploy/fedora-atomic/deploy/ and the contents of the .origin files there.

mscherer commented 6 years ago

I think I did butcher more the repo by trying to fix, and I have now a different errors. And Meltdown/spectre did happen, so I kinda forgot what I did :/

But there is no a2b80278eea897eb1fec7d008b18ef74941ff5a54f86b447a2f4da0451c4291a.0.origin in /ostree/deploy/fedora-atomic/deploy/ so I guess that's the issue.

mscherer commented 6 years ago

Ok so no, the lack of .origin is not the cause. I did fix by adding one, that's changing nothing to my new error message. I think I should maybe close this issue, cause the more I try to diagnose, the more I break, and that's kinda not useful.

Once that's working, I can try again to run ostree upgrade and kill it until I get a corrupted enough repo for debugging pleasure.