TritonDataCenter / pkgsrc

NetBSD/pkgsrc fork for our binary package repositories
https://pkgsrc.smartos.org/

Frequent problems with `pkgin upgrade` or `pkgin full-upgrade` and a few packages, especially sqlite3 #363

Open drboone opened 1 year ago

drboone commented 1 year ago

Per brief IRC discussion:

it's a real issue that, the same as the "pkg conflicts with ", I've not had a situation where I can reproduce and fix it so yeh, please raise an issue and include any information you can, e.g. a tarball of pkgdb would be handy

Raising this issue about conflicts of a package with itself in the hope that I have data that might help track down the problem.

Extracts related to sqlite3 from pkg_install-err.log:

---Jan 09 14:41:37: upgrading sqlite3-3.40.0nb1...
---Feb 04 08:14:27: upgrading sqlite3-3.40.1...
pkg_delete: couldn't entirely delete package `sqlite3-3.40.0nb1'
---Mar 01 08:14:32: upgrading sqlite3-3.41.0...
---Mar 12 08:16:32: refreshing sqlite3-3.41.0...
pkg_add: Conflicting PLIST with sqlite3-3.39.0: bin/sqlite3
---Mar 13 08:12:27: upgrading sqlite3-3.41.1...
pkg_add: Conflicting PLIST with sqlite3-3.39.0: bin/sqlite3
---Mar 14 08:09:57: upgrading sqlite3-3.41.1...
pkg_add: Conflicting PLIST with sqlite3-3.39.0: bin/sqlite3
---Mar 15 08:11:58: upgrading sqlite3-3.41.1...
pkg_add: Conflicting PLIST with sqlite3-3.39.0: bin/sqlite3
---Mar 17 08:12:08: upgrading sqlite3-3.41.1...
pkg_add: Conflicting PLIST with sqlite3-3.39.0: bin/sqlite3

pkgdb.byfile.db is attached, gzipped because %^&*( github.

pkgdb.byfile.db.gz

drboone commented 1 year ago

Here's a longer log extract for a machine that's currently exhibiting the sqlite3 issue.

bigriver.txt

drboone commented 10 months ago

I have several machines today that are having trouble upgrading openssl. It's an easy workaround -- pkg_delete -f, pkg_add. Log from one:

---Oct 18 14:33:05: [3/4] upgrading openssl-1.1.1w...
pkg_add: Conflicting PLIST with openssl-1.1.1pnb1: bin/c_rehash
pkg_add: 1 package addition failed

There's further weirdness - some packages claim to install, but another full-upgrade will do it over and over. And I've seen e.g. mozilla-rootcerts listed twice in the upgrade list.

drboone commented 5 months ago

Seeing this today on routine pkgin full-upgrade:

---Mar 05 14:52:07: [1/1] upgrading pkg_install-20240126...
pkg_add: Conflicting PLIST with pkg_install-20210410: man/man1/pkg_add.1.gz
pkg_add: 1 package addition failed

jperkin commented 5 months ago

Ugh, yeh, lemme see if I can carve out time tomorrow to try and nail this down once and for all.

drboone commented 5 months ago

If it'd help to have ssh access to an affected machine, I can arrange.

drboone commented 4 months ago

The corrupted database warning you mentioned in IRC the other day has appeared on several of our systems:

beautiful 5 $ pkg_admin rebuild
pkg_admin: corrupt pkgdb, duplicate PKGBASE entries:
        pkgsrc-gnupg-keys-20190423
        pkgsrc-gnupg-keys-20201014

So what's the proper cleanup process here? I'm pretty sure I've removed specific package versions in the past, possibly using pkg_add to get key packages back.

drboone commented 4 months ago

Digging deeper, I'll add that the most recent gz where I've had pkgin full-upgrade problems does not exhibit the corrupt pkgdb errors, but does still have conflicting file problems this morning:

---Apr 04 12:13:29: [1/19] refreshing ncurses-6.4...
---Apr 04 12:13:29: [2/19] refreshing readline-8.2nb2...
---Apr 04 12:13:29: [3/19] refreshing sqlite3-3.45.2...
---Apr 04 12:13:29: [4/19] refreshing xz-5.4.6...
---Apr 04 12:13:30: [5/19] refreshing ncursesw-6.4...
---Apr 04 12:13:30: [6/19] upgrading pkg_install-20240307...
pkg_add: Conflicting PLIST with pkg_install-20211115: man/man1/pkg_add.1.gz
pkg_add: 1 package addition failed
---Apr 04 12:13:30: [7/19] refreshing python311-3.11.8...
---Apr 04 12:13:32: [8/19] refreshing python312-3.12.2...
---Apr 04 12:13:35: [9/19] upgrading pkg_install-20240307...
pkg_add: Conflicting PLIST with pkg_install-20211115: man/man1/pkg_add.1.gz
pkg_add: 1 package addition failed
---Apr 04 12:13:35: [10/19] refreshing libarchive-3.7.2...
---Apr 04 12:13:35: [11/19] refreshing pkgsrc-gnupg-keys-20231210...
pkg_add: Conflicting PLIST with pkgsrc-gnupg-keys-20201014: share/gnupg/pkgsrc-security.gpg
pkg_add: 1 package addition failed
---Apr 04 12:13:35: [12/19] upgrading pkgin-23.8.1nb3...
---Apr 04 12:13:35: [13/19] refreshing pkgin-23.8.1nb3...
---Apr 04 12:13:36: [14/19] refreshing py312-pip-24.0...
---Apr 04 12:13:36: [15/19] upgrading bsdinstall-20160108nb1...
---Apr 04 12:13:36: [16/19] refreshing py312-wheel-0.43.0...
---Apr 04 12:13:36: [17/19] refreshing py312-setuptools-69.2.0...
---Apr 04 12:13:36: [18/19] refreshing py311-pip-24.0...
---Apr 04 12:13:37: [19/19] refreshing py311-wheel-0.43.0...

This is a machine I'm quite convinced has never had an improper tools or bootstrap kit applied -- it got the tools during the new-machine install, and hasn't been messed with.

drboone commented 3 months ago

Another round of this, still with no errors from pkg_admin rebuild:

---May 13 12:14:09: [1/5] upgrading pkg_install-20240307...
pkg_add: Conflicting PLIST with pkg_install-20211115: man/man1/pkg_add.1.gz
pkg_add: 1 package addition failed
---May 13 12:14:10: [2/5] upgrading pkg_install-20240307...
pkg_add: Conflicting PLIST with pkg_install-20211115: man/man1/pkg_add.1.gz
pkg_add: 1 package addition failed
---May 13 12:14:10: [3/5] refreshing pkgsrc-gnupg-keys-20231210...
pkg_add: Conflicting PLIST with pkgsrc-gnupg-keys-20201014: share/gnupg/pkgsrc-security.gpg
pkg_add: 1 package addition failed
---May 13 12:14:10: [4/5] upgrading pkgin-23.8.1nb3...
---May 13 12:14:10: [5/5] upgrading bsdinstall-20160108nb1...
avenueq 6 $ pkg_admin rebuild

Stored 27252 files and 1 explicit directory from 45 packages in /opt/tools/var/db/pkg/pkgdb.byfile.db.
Done.

jperkin commented 3 months ago

Some of the discussion for this ticket has been done on IRC, so I'll just try to summarise everything here so that it's all in one place.

The core problem here is that something is corrupting the pkgdb, specifically by extracting at least one package, usually more, over the top of an existing install, so that there end up being duplicate directory entries for the same PKGBASE in the pkgdb directory.

The pkgdb directories are:

Each directory entry inside them refers to an individual installed package, and critically there must only ever be one unique entry for each package (minus the version number). There must never be e.g. foo-1.0 and foo-1.1. For example, taking one of the failures from output in the comment above:

---May 13 12:14:10: [3/5] refreshing pkgsrc-gnupg-keys-20231210...
pkg_add: Conflicting PLIST with pkgsrc-gnupg-keys-20201014: share/gnupg/pkgsrc-security.gpg

This shows that there are both pkgsrc-gnupg-keys-20231210 and pkgsrc-gnupg-keys-20201014 entries inside the pkgdb, and this then results in the cascading failures.
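As a small illustration of why these two entries collide, here is a sketch of how a pkgdb directory name maps to its PKGBASE. The helper name `pkgbase` is hypothetical, and it assumes the version is everything after the last hyphen in the entry name.

```shell
# Hypothetical helper: strip the trailing "-<version>" from a pkgdb entry name.
# Assumes the version part itself never contains a hyphen.
pkgbase() {
    printf '%s\n' "${1%-*}"
}

# Both entries reduce to the same PKGBASE, which is what makes them a conflict.
pkgbase pkgsrc-gnupg-keys-20231210
pkgbase pkgsrc-gnupg-keys-20201014
```

Any two directory names that reduce to the same string indicate a corrupted pkgdb.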

The various pkgin upgrade problems here are merely symptoms, not the cause. The pkgdb was already corrupted prior to pkgin being executed.

The question is, how? Going back to the bigriver.txt log is interesting, specifically when tracing sqlite3 entries.

---Mar 01 08:14:32: upgrading sqlite3-3.41.0...

The upgrade on Mar 01 worked fine: sqlite3 was apparently upgraded to 3.41.0 with no issues.

---Mar 03 08:09:27: upgrading sudo-1.9.13p2...
---Mar 03 08:09:28: refreshing npm-8.15.1...
---Mar 03 08:09:28: refreshing nodejs-19.7.0...

These are the only entries from this date. This looks like a regular upgrade that worked fine, and only needed to touch these three packages.

---Mar 12 08:16:32: refreshing sqlite3-3.41.0...
pkg_add: Conflicting PLIST with sqlite3-3.39.0: bin/sqlite3
pkg_add: 1 package addition failed

This is where things go sideways. Pretty much every package has been selected for either refresh or upgrade. This can be normal, especially if there was a bump in a core package that resulted in a rebuild of every package.

However, where did sqlite3-3.39.0 come from? The sqlite3 package was upgraded to 3.41.0 eleven days prior to this with no errors, and there were no errors on Mar 03, when a 3.39.0 package lying around would have been selected for upgrade.

Looking over all of the logs, the packages that so far have exhibited this issue are:

openssl
pkg_install
pkgsrc-gnupg-keys
sqlite3

These packages all have one thing in common, in that they are (or at least were) all bootstrap packages that are distributed as part of the bootstrap kit tarball. I am almost certain that the underlying cause of all these problems is that a bootstrap kit is being unpacked over the top of an existing install. To my knowledge I've not yet seen any examples of this issue where the package causing the problems is outside of the bootstrap kit, which would further rule out issues with e.g. pkg_install not upgrading packages correctly.

To be more specific, here are the packages including versions that have exhibited the problems:

pkg_add: Conflicting PLIST with openssl-1.1.1pnb1: bin/c_rehash
pkg_add: Conflicting PLIST with pkg_install-20211115: man/man1/pkg_add.1.gz
pkg_add: Conflicting PLIST with pkgsrc-gnupg-keys-20201014: share/gnupg/pkgsrc-security.gpg
pkg_add: Conflicting PLIST with sqlite3-3.39.0: bin/sqlite3

These correspond exactly to the versions that were distributed as part of the bootstrap-trunk-tools-20220706.tar.gz bootstrap kit:

$ tar ztf bootstrap-trunk-tools-20220706.tar.gz | grep CONTENTS | egrep 'openssl|pkg_install-2|pkgsrc-gnupg-keys|sqlite3' | sort
./opt/tools/var/db/pkg/openssl-1.1.1pnb1/+CONTENTS
./opt/tools/var/db/pkg/pkg_install-20211115/+CONTENTS
./opt/tools/var/db/pkg/pkgsrc-gnupg-keys-20201014/+CONTENTS
./opt/tools/var/db/pkg/sqlite3-3.39.0/+CONTENTS
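The same comparison could be run against a live pkgdb. The following is only a sketch: the function names `kit_pkgnames` and `kit_entries_present` are hypothetical, and it assumes the kit listing has the `.../<pkgname>/+CONTENTS` layout shown above.

```shell
# Hypothetical helper: read a `tar ztf` listing on stdin and print the
# package name (the directory that holds each +CONTENTS file).
kit_pkgnames() {
    awk -F/ '$NF == "+CONTENTS" { print $(NF-1) }'
}

# Hypothetical helper: print every kit package name that also exists as a
# directory entry in the given pkgdb directory.
kit_entries_present() {
    pkgdb=$1
    kit_pkgnames | while read -r name; do
        if [ -d "$pkgdb/$name" ]; then
            printf '%s\n' "$name"
        fi
    done
}

# Possible usage (paths taken from this thread):
#   tar ztf bootstrap-trunk-tools-20220706.tar.gz |
#       kit_entries_present /opt/tools/var/db/pkg
```

A match on its own is not proof of a problem, since a freshly bootstrapped machine legitimately carries the kit versions. It becomes suspicious when a newer entry for the same PKGBASE sits next to it.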

One other thing to mention is that in cases where pkg_install is not upgraded, you won't see any of the new corrupt pkgdb warnings that I've added, as you'll still be running an older version that doesn't have them.

I think what I'd suggest at this point is having something like this handy (swap the pkgdb directory for normal zones as required):

$ ls /opt/tools/var/db/pkg | awk '/-/ { sub("-[^-]*$", ""); if (seen[$0]) { print "ERROR: " $0; exit 1; } else { seen[$0] = 1 }}'

If you're able to run this one-liner both before and after pkgin upgrade, it will help in two ways. (I believe you've mentioned using ansible in the past? If so, add it as a pre-requisite task that must exit 0 before pkgin is run, and then again after.) First, it will catch pkgdb corruption prior to running pkgin and stop the attempted upgrade. Second, it may help narrow down the point at which a bootstrap kit is unpacked over the top, especially if any previous runs ran that command successfully after a pkgin upgrade (thus confirming that the upgrade was clean).
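If it's useful to see every duplicate at once rather than stopping at the first, a variant along these lines might work. This is a sketch, not tested against a real pkgdb; `find_dup_pkgbase` and `check_pkgdb` are hypothetical names, and the stripping rule is the same as in the one-liner.

```shell
# Hypothetical helper: print each PKGBASE that appears more than once
# among the entries of the pkgdb directory given as $1.
find_dup_pkgbase() {
    ls "$1" | awk '/-/ { sub("-[^-]*$", ""); print }' | sort | uniq -d
}

# Hypothetical gate: return non-zero (and report on stderr) if the pkgdb
# contains duplicate PKGBASE entries.
check_pkgdb() {
    dups=$(find_dup_pkgbase "$1")
    if [ -n "$dups" ]; then
        printf 'ERROR: duplicate PKGBASE entries in %s:\n%s\n' "$1" "$dups" >&2
        return 1
    fi
    return 0
}

# Possible usage around an upgrade (swap the pkgdb directory as required):
#   check_pkgdb /opt/tools/var/db/pkg &&
#       pkgin full-upgrade &&
#       check_pkgdb /opt/tools/var/db/pkg
```

Chaining with `&&` means the upgrade is never attempted on a pkgdb that already fails the check.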

In terms of cleaning up installs that are broken, wherever possible I'd strongly recommend a wipe and reinstall of the pkgsrc areas (/opt/tools in a GZ, /opt/local and /var/db/pkgin in a zone), just to make sure there are no leftovers of corruption. Tools such as pkgin export / pkgin import can help with that. Otherwise, it's a case of manually looking in the pkgdb at the duplicate directory entries, and removing the directory entries that do not correspond to the installed binaries. After doing this, running pkg_admin rebuild; pkg_admin rebuild-tree may get things back to a consistent state, but there is always the chance that some on-disk binaries are not correct.
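For the manual route, a listing like the following might make it easier to see which directory entries need inspecting. It is only a sketch: `list_dup_entries` is a hypothetical name, and it does not decide which entry matches the installed binaries, nor does it remove anything.

```shell
# Hypothetical helper: for each PKGBASE with more than one entry in the
# pkgdb directory given as $1, print "PKGBASE: entry1 entry2 ...".
list_dup_entries() {
    ls "$1" | awk '/-/ {
        name = $0
        base = $0
        sub("-[^-]*$", "", base)      # strip the trailing -<version>
        count[base]++
        entries[base] = entries[base] " " name
    }
    END {
        for (b in count)
            if (count[b] > 1)
                print b ":" entries[b]
    }' | sort
}

# Possible usage:
#   list_dup_entries /opt/tools/var/db/pkg
```

Each line of output names one package whose entries need a human decision before running pkg_admin rebuild.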

drboone commented 3 months ago

Thanks for the detailed analysis.

I've done a bit of digging into the installer tooling. This one gz where I have conflicts has pkg_admin 20240126 (explains lack of sanity check) and was installed with the 20231113 platform image. Its name is avenueq, and it's the one I refer to in the April 4 and May 13 notes above. (I've been focusing on that one for a while because I'm quite sure that it had its tooling installed properly, as opposed to other older gz or guest systems where I may have done something stoopid.) During install, it appears that platform used bootstrap-trunk-tools-20220706.tar.gz to set up pkgsrc. This seems to track with your comment above regarding versions. So I'm still puzzled about how this one machine got here.

I'll do the export/wipe/reinstall/import thing on this one machine and see how it goes.