WhitewaterFoundry / Pengwin-Enterprise

Enterprise-compatible WSL distribution.

rpm --rebuilddb causes all future rpm functions to segfault/yum to hang #20

Closed: sirredbeard closed this issue 5 years ago

sirredbeard commented 5 years ago

Update: This is a confirmed bug in WSL, not Berkeley DB.

Reproducible on RHEL, CentOS, Oracle, and Scientific Linux.

To reproduce:

  1. Install WLinux Enterprise. A signed build you can sideload on Windows 10 for testing this bug is available here; see here for the custom DB_CONFIG set on that image.

  2. Set root password and create new default user.

  3. su - into root.

  4. Example 1: Type the following commands:

[root@t470s ~]# rpm -q rpm
rpm-4.11.3-35.el7.x86_64
[root@t470s ~]# rpm --rebuilddb
[root@t470s ~]# rpm -q rpm
Segmentation fault (core dumped)
[root@t470s ~]#

  5. Example 2: Type the following commands:
[hayden@t470s ~]$ sudo rm -rf /var/lib/rpm/__db*
[hayden@t470s ~]$ db_verify /var/lib/rpm/Packages
BDB5105 Verification of /var/lib/rpm/Packages succeeded.
[hayden@t470s ~]$ sudo rpm --rebuilddb
[hayden@t470s ~]$ sudo yum update
Loaded plugins: ovl
Segmentation fault (core dumped)
[hayden@t470s ~]$

Expected result:

rpm --rebuilddb would rebuild a working rpmdb.

Actual result:

Running rpm --rebuilddb breaks rpm and yum. Modifications to DB_CONFIG improve things somewhat, and there is a partial fix after the fact.

Theories:

What works:

Logs:

Thanks:

Thank you so far to @therealkenc @crramirez @Conan-Kudo @daviesalex and @pmatilai for working on this issue.

sirredbeard commented 5 years ago

strace for rpm --rebuilddb: https://gist.github.com/sirredbeard/b75c1d5faace290e6775a109efb1278b

strace for rpm -q rpm after rebuild: https://gist.github.com/sirredbeard/c5c0e9aefdd10d08a1e33014ddc930d3

therealkenc commented 5 years ago
  • rpm database rebuild fails, perhaps because rpm uses berkeleydb:

My money. Could in principle be worked around with local-distro patches to libdb along the lines of what the win32/Cygwin port does. For WSL it might be as simple as tweaking some mmap()-related build flags for libdb. Or it might be less trivial because of #ifdef _WIN32 blocks that make broader behavioral assumptions. Never took the time to look in detail.

sirredbeard commented 5 years ago

Do the straces above resemble other mmap failures we have seen in berkeleydb?

This workaround seems to be promising. https://github.com/Microsoft/WSL/issues/1812#issuecomment-290399704

therealkenc commented 5 years ago

Do the straces above resemble other mmap failures we have seen in berkeleydb?

No, not in the sense that I can point to a hard fail like the issues you cited. [Bear in mind that the database could be borked in subtle ways before you even did the --rebuild run, in ways that might not manifest in a read-only rpm operation, or even an rpm --install.]

Yes, that work-around is promising, or at least a promising line of inquiry. I'm not convinced that whizzter's point 2 from the incipient "mmap's problem" issue was addressed, but then I've never done a test case to prove it isn't, either. Doing that 1MB padding is basically the same as what I was talking about with openldap's code here. That particular code is a discount BDB-alike backend (not the real Berkeley/Oracle thing), but same idea. If you look at the #ifdef _WIN32 side, they're setting the file pointer to msize, which is in effect the same as the manual touch(1)/dd(1) hack you cite.

This has been the standard work-around since forever for not being able to ftruncate() (manually or implicitly) a mapped file on win32. If there is always backing store behind the mapped pages, the problem just doesn't happen. Except that you have to keep the padded file size sane in WSL because, unlike win32, WSL doesn't support sparse files (holes).
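
To make the padding idea concrete, here is a minimal C sketch (my illustration, not code from libdb or openldap; the file path is made up) of pre-extending a file before mapping it, so every mapped page has backing store and the file never needs to grow while a mapping is live:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE (1024 * 1024)  /* the 1 MB pad from the workaround */

int main(void)
{
    /* Hypothetical file path, for illustration only. */
    int fd = open("/var/tmp/region.dat", O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return 1;
    /* Pre-extend before mapping: every mapped page now has backing store,
     * so the file is never grown while mapped. On WSL, which has no
     * sparse files, this pad occupies real disk space. */
    if (ftruncate(fd, REGION_SIZE) < 0)
        return 1;
    void *p = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;
    /* ... use the region; no later extension is required ... */
    munmap(p, REGION_SIZE);
    close(fd);
    return 0;
}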

sirredbeard commented 5 years ago

Thank you for your help.

We are going to look at the work-around.

It could also be a general rpm database issue like you mention, but otherwise basic rpm, yum, and dnf work fine.

The seg fault in rpm -q rpm occurs on a write, so I also wonder if we're dealing with a permissions issue here.

therealkenc commented 5 years ago

so I also wonder if we're dealing with a permissions issue here.

Wouldn't be my first guess, but who knows. From the strace log, all the mmap() calls succeed and there isn't an ftruncate() in sight. Okay, so far so good, in the sense that there's no hard fail. But note Ben says:

I'm testing a fix for this now, but it looks like lmdb is hitting another error after this: mtest.c:113: mdb_cursor_get(cursor, &key, &data, MDB_LAST): MDB_NOTFOUND: No matching key/data pair found

You're probably hitting the cause of the "another error" mentioned (whatever it is).

Also be real cautious because errors in WSL that are related to open handles look like permission problems but really have nothing to do with permissions. Unless they have to do with permissions, mind you, in which case you disregard everything I just said.

Conan-Kudo commented 5 years ago

Has anyone tested if rebuilding libdb in WLE would fix it? My understanding is that libdb does a feature test for mmap() behavior and will use fallbacks when it is detected to be broken.

sirredbeard commented 5 years ago

Before we get to rebuilding libdb, which we can do if we have to, let's look at less invasive means, such as padding the rpm db.

This is what I have tried:

$ mkdir /tmp/rpm/
$ cp /var/lib/rpm/* /tmp/rpm/
$ rpm --rebuilddb
$ cd /var/lib/rpm
$ touch {__db.001,__db.002,__db.003}
$ find . -type f -exec dd if=/dev/zero of={} count=0 bs=1 seek=1M \;
$ rpm -qa rpm

returns:

error: db5 error(-30986) from dbcursor->c_get: BDB0075 DB_PAGE_NOTFOUND: Requested page not found
error: db5 error(-30986) from dbcursor->c_get: BDB0075 DB_PAGE_NOTFOUND: Requested page not found

sirredbeard commented 5 years ago

Here is strace on the above rpm -qa after rebuild and then padding.

I also tried padding before the rpm rebuild:

$ cd /var/lib/rpm
$ touch {__db.001,__db.002,__db.003}
$ find . -type f -exec dd if=/dev/zero of={} count=0 bs=1 seek=1M \;
$ rpm --rebuilddb

Returns:

error: db5 error(-30986) from dbcursor->c_get: BDB0075 DB_PAGE_NOTFOUND: Requested page not found

Here is strace on that rpm --rebuilddb with padding first.

therealkenc commented 5 years ago

error: db5 error(-30986) from dbcursor->c_get: BDB0075 DB_PAGE_NOTFOUND: Requested page not found

Which looks a whole lot like this, don't it:

mtest.c:113: mdb_cursor_get(cursor, &key, &data, MDB_LAST): MDB_NOTFOUND: No matching key/data pair found

Per your strace, it isn't a hard-fail kinda thing. And the behavior might be dependent on the contents of the rpm database.

You could try the same dd(1) padding trick with the same input in a Linux VM and see if you can track a diverge versus WSL. If it works on Real Linux but not WSL, then set a breakpoint at that db5 error in WSL and see how it gets there. [Contrast: if it doesn't work with the same input and padding trick on Real Linux, then this isn't much of a work-around.]

Or, if you were feeling highly motivated, attack it head on: try rebuilding libdb without HAVE_MMAP_EXTEND, and then, when that doesn't work, move on to the actual problem and fix that. libdb can be ported to every platform on the planet, including win32 (aka WSL). But effort.

sirredbeard commented 5 years ago

Rebuilding libdb on WSL per @Conan-Kudo's suggestion, though I took his word that there is an auto fallback for mmap:

su -
yum install yum-utils rpm-build make gcc-c++
yumdownloader --source libdb
yum-builddep libdb
rpmbuild --rebuild libdb-5.3.21-24.el7.src.rpm
cd rpmbuild/RPMS/x86_64/
rpm -i *.rpm --force

results in:

[root@t470s ~]# rpm -qa rpm
Segmentation fault (core dumped)

We could also, per @therealkenc's suggestion, rebuild libdb with a patch to force HAVE_MMAP_EXTEND to be undefined.

Also, @crramirez suggested the following:

sudo -s
cd /var/lib/rpm
rpm --rebuilddb
dd if=/dev/zero of=__db.001 bs=1M count=1
dd if=/dev/zero of=__db.002 bs=1M count=1
dd if=/dev/zero of=__db.003 bs=1M count=1
rpm -qa rpm
yum install vim

sirredbeard commented 5 years ago

I can confirm that @crramirez's approach works to allow yum and rpm to work again after rpm --rebuilddb.

It complains a little bit at the end.

[root@t470s rpm]# rpm --rebuilddb
[root@t470s rpm]# yum install nano
Segmentation fault (core dumped)
[root@t470s rpm]# dd if=/dev/zero of=__db.001 bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.0029386 s, 357 MB/s
[root@t470s rpm]# dd if=/dev/zero of=__db.002 bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.0028591 s, 367 MB/s
[root@t470s rpm]# dd if=/dev/zero of=__db.003 bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.0028973 s, 362 MB/s
[root@t470s rpm]# yum install nano
Loaded plugins: ovl
Resolving Dependencies
--> Running transaction check
---> Package nano.x86_64 0:2.3.1-10.el7 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

========================================================================================================================
 Package                   Arch                        Version                            Repository               Size
========================================================================================================================
Installing:
 nano                      x86_64                      2.3.1-10.el7                       sl                      439 k

Transaction Summary
========================================================================================================================
Install  1 Package

Total download size: 439 k
Installed size: 1.6 M
Is this ok [y/d/N]: y
Downloading packages:
nano-2.3.1-10.el7.x86_64.rpm                                                                     | 439 kB  00:00:03
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Installing : nano-2.3.1-10.el7.x86_64                                                                             1/1

Rpmdb checksum is invalid: dCDPT(pkg checksums): nano.x86_64 0:2.3.1-10.el7 - u

Running yum clean all does not remove the checksum message.

daviesalex commented 5 years ago

If it helps, we had exactly this problem on RHEL 7.3, but upgrading to 7.5 magically solved it:

[root@LJOIT-ADE3 wsl]# rpm -q rpm
rpm-4.11.3-32.el7.x86_64
[root@LJOIT-ADE3 wsl]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.5 (Maipo)
[root@LJOIT-ADE3 wsl]# rpm -q rpm
rpm-4.11.3-32.el7.x86_64
[root@LJOIT-ADE3 wsl]# rpm --initdb
[root@LJOIT-ADE3 wsl]# rpm -q rpm
rpm-4.11.3-32.el7.x86_64
[root@LJOIT-ADE3 wsl]#

The version of RPM you are using is not that different, so I wonder if there is something else going on - but I'm not aware of what.

sirredbeard commented 5 years ago

@daviesalex Huh. Interesting. Currently I can reproduce the bug on RHEL 7.6 and SL 7.6.

sirredbeard commented 5 years ago

I tried placing a DB_CONFIG in /var/lib/rpm containing:

set_flags DB_NOMMAP

Still no change in behavior.

Conan-Kudo commented 5 years ago

@sirredbeard rpmdb configuration is controlled via rpm macros, not via DB_CONFIG. You can see the current flags by looking at the %_dbi_config setting in /usr/lib/rpm/macros.

If you want to override the setting, write out a file /usr/lib/rpm/macros.d/macros.wsl-rpmdb with the following contents:

# Override the bdb dbi config for WSL
# Cf. https://github.com/WhitewaterFoundry/WLE/issues/20
%_dbi_config nommap %{?__dbi_other}

That should do the trick.

sirredbeard commented 5 years ago

So we have a workaround that can fix broken rpm databases. That is very good.

I still don't know if we can say for certain whether this issue is the mmap syscall issue, because the usual indicators are not there in the straces. If I am wrong, please correct me in any of this.

@daviesalex was able to avoid this issue in a docker pull from 7.5, and he is sending me that to look over and compare tomorrow. That touches on the possibility that this is still an issue with our build. I reviewed our kickstart files and build scripts, and we don't touch /var/lib/rpm or anything related. However, some of the error messages I encountered when debugging were similar to errors reported in some Docker builds, so it is still a possibility. BTW, the solution in most of those cases was to run rpm --rebuilddb.

The issue still seems to be centered around BerkeleyDB, though. There have been suggestions to patch libdb. For several reasons that is a worst-case scenario; I would much rather just make a simple change to the build image if possible.

One possible route: it turns out we can tweak BerkeleyDB quite extensively with config files. This may be a better option than rebuilding libdb. Maybe.

How DB_CONFIG works: https://docs.oracle.com/cd/E17076_05/html/programmer_reference/env_db_config.html:

Almost all of the configuration information that can be specified to DB_ENV class 
methods can also be specified using a configuration file. If a file named DB_CONFIG 
exists in the database home directory, it will be read for lines of the format NAME VALUE. 

See also the DB_ENV handle (https://docs.oracle.com/cd/E17076_05/html/api_reference/C/env.html) and DB_ENV->set_flags() (https://docs.oracle.com/cd/E17076_05/html/api_reference/C/envset_flags.html).
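
For reference, here is a minimal C sketch (my illustration against the stock Berkeley DB C API, not rpm's code) of the programmatic equivalents of the DB_CONFIG lines we are testing:

#include <db.h>      /* Berkeley DB; build with -ldb */
#include <stdio.h>

int main(void)
{
    DB_ENV *dbenv;
    int ret;

    if ((ret = db_env_create(&dbenv, 0)) != 0) {
        fprintf(stderr, "db_env_create: %s\n", db_strerror(ret));
        return 1;
    }
    /* The DB_CONFIG line "set_flags DB_NOMMAP" corresponds to: */
    dbenv->set_flags(dbenv, DB_NOMMAP, 1);
    /* And "set_cachesize 0 10 1" (gbytes, bytes, ncache) to: */
    dbenv->set_cachesize(dbenv, 0, 10, 1);
    /* A DB_CONFIG file in the environment home directory overrides
     * these values when DB_ENV->open() runs, which is what makes the
     * file useful without rebuilding anything. */
    dbenv->close(dbenv, 0);
    return 0;
}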

Conan-Kudo commented 5 years ago

I'm not sure rpm respects DB_CONFIG... @ffesti, @pmatilai?

sirredbeard commented 5 years ago

I think I might be onto something...

[screenshot]

Also, @Conan-Kudo, I am pretty sure it does, because a typo in a DB_CONFIG will throw a message even when yum is called for a yum install.

Conan-Kudo commented 5 years ago

@sirredbeard Well, you can also set it via the rpmdb flags as I described in https://github.com/WhitewaterFoundry/WLE/issues/20#issuecomment-467653898, which also automatically applies to chroots created by tools like mock in the environment.

sirredbeard commented 5 years ago

@Conan-Kudo I see. We'll test both approaches if this bears fruit.

therealkenc commented 5 years ago

DB_NOMMAP applies to read-only scenarios. Your scenario (rpm --rebuilddb) isn't.

sirredbeard commented 5 years ago

@therealkenc rpm --rebuilddb exits 0 with no evidence of anything errant in its strace, so I was not trying to manipulate its behavior. My hypothesis was that the problem might be in memory handling on read, when rpm queries the database after rpm --rebuilddb runs.

therealkenc commented 5 years ago

My hypothesis was that problem might be in memory handling on read

A sure way to find out would be to copy that database after the rpm --rebuilddb onto a Real Linux VM and see if it is okay. I'm thinking it isn't, and rpm -q is gonna fail on a known-working rpm / libdb / libc / kernel. This is basically how WSL#2852 went nowhere: the guy attached a corrupt database and I kept trying to tell him "yeah, it's corrupt alright". Maybe attach a before and after zip of rpm-4.11.3-35.el7.x86_64 just to have test vectors available.

This all said, I am coming around to the idea it isn't mmap(2). I looked at the strace log a second time for something errant (as you say), and I can't avoid the fact it's clean. Normally with mapped files you aren't going to see anything, because the "data writes" happen through buffer pointers, and those don't leave a syscall trace. But of the 176 mmap() calls, I'm not even seeing a writable mapping that matters to the database right now.
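
A tiny C sketch of why a clean strace proves little here (illustrative only; the path is made up): the actual data writes go through the mapped pointer, so nothing between mmap() and msync()/munmap() shows up in strace:

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/demo.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0 || ftruncate(fd, 4096) < 0)
        return 1;
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;
    memcpy(p, "hello", 5);    /* the "data write": no syscall in strace */
    msync(p, 4096, MS_SYNC);  /* only this flush is visible to strace */
    munmap(p, 4096);
    close(fd);
    return 0;
}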

pmatilai commented 5 years ago

DB_PRIVATE flag to Berkeley DB environment open reportedly works around WSL mmap() brokenness.
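
For context, a minimal sketch of what that flag looks like at environment open, using the stock Berkeley DB C API (illustrative; rpm's actual open path differs):

#include <db.h>   /* Berkeley DB; build with -ldb */
#include <stdio.h>

int main(void)
{
    DB_ENV *dbenv;
    int ret;

    if ((ret = db_env_create(&dbenv, 0)) != 0) {
        fprintf(stderr, "db_env_create: %s\n", db_strerror(ret));
        return 1;
    }
    /* DB_PRIVATE keeps the environment regions in per-process heap
     * memory instead of file-backed shared mappings (the __db.00*
     * files), sidestepping the shared-mmap path entirely. */
    ret = dbenv->open(dbenv, "/var/lib/rpm",
                      DB_CREATE | DB_INIT_MPOOL | DB_PRIVATE, 0644);
    if (ret != 0) {
        fprintf(stderr, "DB_ENV->open: %s\n", db_strerror(ret));
        dbenv->close(dbenv, 0);
        return 1;
    }
    dbenv->close(dbenv, 0);
    return 0;
}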

pmatilai commented 5 years ago

So AIUI the problem is that WSL claims to support something that it doesn't (at least not fully), MAP_SHARED semantics perhaps. It wouldn't be anything new; BDB is a bit of a sucker for VM bugs, it has tripped up on a few in Linux over the years as well, and it's something that won't show up in strace.

sirredbeard commented 5 years ago

@daviesalex sent over some documentation on how they built a working RHEL 7.5 image without this issue from a docker pull. We will look at that.

We will also look at the patches openSUSE has made, though the advice from Red Hat is that we don't want to go there.

In the meantime, the following DB_CONFIG in /var/lib/rpm seems to address this issue some of the time. More testing is needed:

set_open_flags DB_PRIVATE on
set_flags DB_NOMMAP
set_cachesize 0 10 1

The above still results in:

[root@t470s rpm]# yum install nano
Loaded plugins: ovl
Resolving Dependencies
--> Running transaction check
---> Package nano.x86_64 0:2.3.1-10.el7 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

========================================================================================================================
 Package                   Arch                        Version                            Repository               Size
========================================================================================================================
Installing:
 nano                      x86_64                      2.3.1-10.el7                       sl                      439 k

Transaction Summary
========================================================================================================================
Install  1 Package

Total download size: 439 k
Installed size: 1.6 M
Is this ok [y/d/N]: y
Downloading packages:
nano-2.3.1-10.el7.x86_64.rpm                                                                     | 439 kB  00:00:01
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Installing : nano-2.3.1-10.el7.x86_64                                                                             1/1
error: rpmdb: BDB3017 unable to allocate space from the buffer cache
error: db5 error(12) from dbcursor->c_put: Cannot allocate memory
error: error(12) adding header #150 record
nano-2.3.1-10.el7.x86_64 was supposed to be installed but is not!
  Verifying  : nano-2.3.1-10.el7.x86_64                                                                             1/1
  Verifying  : nano-2.3.1-10.el7.x86_64                                                                             2/1

Failed:
  nano.x86_64 0:2.3.1-10.el7

Complete!
[root@t470s rpm]#

sirredbeard commented 5 years ago

Following up on @therealkenc's idea here I have generated a CentOS-based build of WLinux Enterprise to compare with a known good CentOS server install. You can download .appx of my build here: https://1drv.ms/u/s!AspPK83V8Sf2hvEoloVoF9rwgpcq5A.

/var/lib/rpm in this build is set to:

set_flags DB_NOMMAP
set_cachesize 0 10 1

Here are zips of /var/lib/rpm, both default and after rpm --rebuilddb:

I overwrote /var/lib/rpm on the known good CentOS server with both sets of /var/lib/rpm from above and both worked fine, no errors.

therealkenc commented 5 years ago

CentOS-based build of WLinux Enterprise

Well, I did say rpm / libdb / libc / kernel. But alright, it worked regardless using the WSL kernel. The test vector ("it") is different, assuming defaultinstallrpm != rpm-4.11.3-35.el7.x86_64. Are you saying the OP reproduction steps with the same input .rpm work on WLinux/CentOS?

If you've got a .rpm that works on WLinux/CentOS (and probably Ubuntu/Bionic using their rpm package too) but not WLE/RHEL, then this boils down to determining the difference in the other three parts (rpm / libdb / libc). Which is where I was trying to get to with WSL#2852 before it went downhill.

sirredbeard commented 5 years ago

Not sure I follow, @therealkenc.

This issue is equally reproducible across CentOS, Scientific Linux, Oracle, and current RHEL 7.6. They all carry identical package versions; RHEL's may be slightly newer.

I used a CentOS WLE for this test so it would match my known good CentOS VPS.

Only openSUSE has figured this out. Red Hat folks say they did it by patching extensively. On my list of plans is to diff the rpm and libdb code from the two distros and see if I can find anything.

I have also reached out to some known Berkeley DB experts and the folks at Oracle.

If we get an answer that it's a syscall issue, then we can talk to Craig and Ben.

sirredbeard commented 5 years ago

Here are some procmon logs taken at various points, as described in the file names.

procmon.zip

Conan-Kudo commented 5 years ago

openSUSE's rpm package is built very differently from pretty much all other distributions.

For one, they have a vendored copy of Berkeley DB 4.8, with a few patches for it to be built inside the RPM tree.

The other changes they make to the rpm package related to the rpmdb are the following:

The major difference is that openSUSE rpm uses global locking with DB_PRIVATE, which only works if you statically link bdb into librpm. It's a bad idea to do this otherwise.

Perhaps the nofsync might be meaningful? I hope that's not the case, but if it is, that's going to suck, because it reduces the reliability of rpmdb considerably...

therealkenc commented 5 years ago

Not sure I follow, @therealkenc....This issue is equally reproducible across CentOS

Sorry, we crossed wires. Works on your CentOS VPS. Doesn't work on your CentOS WSL. Your quote "both worked fine, no errors" threw me. Your "known good" scenario being a VPS was lost in translation. [This was compounded by a comment (not yours) that implied there exist working rpm-based distros on WSL.]

therealkenc commented 5 years ago

openSUSE's rpm package is built very differently from pretty much all other distributions.

@Conan-Kudo - Did they (SUSE) go to the trouble of statically linking a vendored copy of Berkeley DB 4.8 etc etc specifically to make WSL happy, or for other reasons?

Conan-Kudo commented 5 years ago

@therealkenc Nah, they've been doing that for many, many years. The last time that stuff was touched was in 2011, I believe. Predates WSL by quite a bit. :)

sirredbeard commented 5 years ago

@Conan-Kudo

it reduces the reliability of rpmdb considerably...

That is precisely what Red Hat said too.

It just so happens to avoid this issue as well.

Even if we don't fix this completely, rpm still works for 99% of use. If the rpmdb becomes corrupted, users will be limited to simply resetting WLE, with the option of rebuilding and using @crramirez's workaround to restore enough functionality to move data off if required.

@therealkenc

Does anything jump out at you in the procmon logs? I am adding .etl files for each of the 4 captured events too. Would you mind taking a look? I appreciate it.

etls_rpmbuild_withbdconfig.zip etls_rpmrebuild_nobdconfig.zip etls_yumafterrpmrebuild-noBDconfig-segfault.zip etls_yumafterrpmrebuild-withBDconfig.zip

therealkenc commented 5 years ago

Does anything jump out at you in the procmon logs?

Give me a couple of days; maybe the weekend. I have a soft spot for the particular problem. I didn't take a run at this previously because I don't actually use any rpm-based distros and we're probably going to end up at WSL dupe#something for the effort. Anyway right now I am ass deep in mouse problems on a CentOS VM in prep for tracking the diverge. It's a start.

sirredbeard commented 5 years ago

@therealkenc

Thank you so much, I really appreciate it.

pmatilai commented 5 years ago

The major difference is that openSUSE rpm uses global locking with DB_PRIVATE, which only works if you statically link bdb into librpm. It's a bad idea to do this otherwise.

Bollocks. There's a lot of misinformation and misunderstanding going on here.

DB_PRIVATE has nothing to do with static linking. The problem with it is that it effectively disables all BDB-level locking on concurrent access, because the locking data lives in the environment (those __db.* files) shared by all clients accessing the database, and DB_PRIVATE makes it use an in-memory environment instead. Which is fine, IFF you take care of the locking by other means. Which Suse does in their rpm: their patches effectively replace the finer-grained locking of BDB with an rpm-level fcntl() lock that permits multiple readers but only a single writer to the database. Which is still absolutely fine, but in order to preserve the ability to perform rpm queries from within rpm scriptlets, they now need to "suspend" the writer lock during scriptlets. Which is where I think it gets icky.

Like I said in an earlier comment, AIUI the only "magic" with Suse's rpm working inside WSL is that DB_PRIVATE mode, because it avoids the much trickier to implement (for a kernel, thus WSL here) shared memory map mode.
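
To make that locking model concrete, here is a hypothetical C sketch (not Suse's actual patch; the lock-file path is made up) of a whole-file fcntl() lock that admits many readers but only a single writer:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Take a whole-file fcntl() lock: F_RDLCK admits many readers,
 * F_WRLCK admits a single writer. Closing the fd releases it. */
static int acquire_db_lock(const char *lockfile, int writer)
{
    struct flock fl;
    int fd = open(lockfile, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    memset(&fl, 0, sizeof fl);
    fl.l_type = writer ? F_WRLCK : F_RDLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;                        /* length 0 = whole file */
    if (fcntl(fd, F_SETLKW, &fl) < 0) {  /* block until granted */
        close(fd);
        return -1;
    }
    return fd;
}

int main(void)
{
    int fd = acquire_db_lock("/tmp/rpmdb.lock", 1); /* hypothetical path */
    if (fd < 0)
        return 1;
    /* ... database writes go here; temporarily releasing this is
     * roughly the "suspend the writer lock" dance described above ... */
    close(fd);
    return 0;
}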

daviesalex commented 5 years ago

It's probably worth updating this issue with our findings (some of which we have shared privately with @sirredbeard).

We had this issue with our RHEL 7.3 image and effectively gave up on WSL, as I fear many have. We also spoke to RH and got the same support response: SUSE have hacked up their yum, RH are not willing to do so, and it was impossible that RHEL would ever work on WSL. They pointed us to https://bugzilla.redhat.com/show_bug.cgi?id=1668380 which ends with "Yeah, it "works" because they carry a patch to the shared environment of Berkeley DB (essentially disabling BDB level locking on concurrent access, the same as we do for unprivileged users) and then a bunch of other patches to try and deal with the consequences.". RH have stated that they have no intention of making changes to RHEL to make this work, and no engineering resources to work on it, which is a shame but good to have clarity on.

Last week I tried to re-deploy WSL using our RHEL 7.5 image (and a clean non-corp install of Windows), and the problem is not happening. We then realized that it works perfectly well on corporate Windows installs too; somehow our RHEL 7.5 install is now working reliably with yum. We now have tens of WSL users and this problem has not happened to any of them in nearly a week. The question is why, when others are having this problem on RHEL/SL 7.6.

One thing we changed between the 7.3 and 7.5 images was adding the yum ovl plugin. We did this not because of WSL; we were having some bdb issues after running yum in all 7.5 docker images (on Linux) when the docker storage driver was overlay2, the current default. Either running rpm --rebuilddb after any yum operations or using the ovl plugin made those bdb issues "go away". On WSL, removing yum-plugins-ovl does not break it, so this could be a red herring, but I thought I'd mention it.

We have a lightly modified version of an older version of this script to generate our RHEL7 base images: https://github.com/moby/moby/blob/master/contrib/mkimage-yum.sh

Somehow, though, something about what we have done is working, and some sort of bisect should be able to figure out what it is. I'm about to travel for a few weeks so won't have any time to look at this, but hopefully these details help others.

sirredbeard commented 5 years ago

@daviesalex I forwarded what you sent over to @nunix to see if he could duplicate your success. He does a lot of work making WSL images with Docker. He is still hitting the segfault issues. That's not to say this isn't a valuable line of inquiry that could yield some results, but we are going to need to work on this more.

Here is what he reported back to me via Twitter DM:

I did the following test:
1. perform a rebuilddb inside the docker instance itself first

[screenshot]

then export the container (docker export) and import it in wsl (wsl --import)
redo the same test

[screenshot]

all 3 distros end in segmentation fault
as a further check, I compared with a container and I can see that "only" __db.001 is actually impacted

[screenshot]

I then asked him:

Can you tell me a little bit more about why you think the issue is just 001?

He replied:

if you look at the sizes of the __db* only the 001 is not the same

then I used your workaround of `dd` only on 001 and everything works fine again

been doing some try and fails with a lot of things, and I'm really wondering if
it's not a problem of FS actually ... because I could redo everything without issues
on the docker instance

Note: dd'ing 001 works, as documented here, but still creates rpmdb checksum issues after a yum install.

sirredbeard commented 5 years ago

At the suggestion of an Oracle dev on Berkeley DB, I have opened a thread on the Oracle forums here that summarizes this issue and links back here. The post is still pending moderator approval.

daviesalex commented 5 years ago

Our RH TAM and I did a little more digging here. We came to this issue originally because we simply could not get yum to work (yum install, etc.) on a new install - at all. This made WSL totally unusable for our users (on RHEL).

I have now realized that we have not in fact fixed the "rpm --rebuilddb" issue, BUT the way we built our image means that yum just works and no user ever really needs to run that. What is odd is that things like rpm --initdb exit 0 (as does --rebuilddb) BUT do not in fact work; shell output:

[root@LJOIT-ADE3 JumpIT]# rpm -qa | tail -1 
openssh-clients-7.4p1-16.el7.x86_64 
[root@LJOIT-ADE3 JumpIT]# 
[root@LJOIT-ADE3 JumpIT]# rpm --rebuilddb 
[root@LJOIT-ADE3 JumpIT]# echo $? 
0
[root@LJOIT-ADE3 JumpIT]#  
[root@LJOIT-ADE3 JumpIT]# rpm -qa | tail -1                                                                                                                                           
[root@LJOIT-ADE3 JumpIT]# 

Notice that the final rpm returns nothing - the DB is at this point broken. I did not find the workaround dd if=/dev/zero of=__db.00X bs=1M count=1 effective, either.

It's unclear to me why we had this problem even before running rebuilddb on 7.3 and don't have it now, but most likely that is a legitimate "how we built the image / how the RPM DB ended up at the end of it" type issue. For the purposes of this/the berkeleydb issue, our experience is noise.

However, for the purpose of "I want to make this work", the way we did this is probably effective for others: clone a working image with a working RPM DB, and to all intents and purposes it works to the point that your users just won't notice (how often do you run an RPM rebuild in normal operation on a desktop?). We might even do something horrific like alias rpm --rebuilddb to print "Don't do this" in our image to really ensure it, because, as noted above, the workaround to dd the files does not actually seem to work for us, so if you run it once you are back to starting with a clean image ;-)

Thanks for writing up the berkeleydb issue - I'll keep an eye on that too. We would obviously like this fixed properly long term too!

hartsjc commented 5 years ago

Regarding the work-around: I have found that it will not work if you have a yum/rpm command running in that WSL instance. Before doing the work-around, be sure to kill all of these off, or, maybe safer, go to the Windows Task Manager and kill off all the init processes. Then start a new WSL instance and do the dd work-around.

nunix commented 5 years ago

just been trying something and it seems to work, but please test it on your end:

rpm -q rpm # works
rpm --rebuilddb # works
rpm -q rpm # fails
dd if=/dev/zero of=/var/lib/rpm/__db.001 bs=1M count=1 # works
rpm -q rpm # works
yum install wget # fails <-- actually it installs but gets an error message still, so I set it as fail
rpm --setperms wget # fails

ok, until here everything seems broken; however, I continued as follows:

yum reinstall --downloadonly --downloaddir=/tmp wget # works
rpm -i --replacepkgs /tmp/wget*.rpm # works
rpm --setperms wget # works

so re-installing the package "locally" seems to at least move things forward a bit. Still far from perfect, but one more baby step I guess

sirredbeard commented 5 years ago

Here is an additional workaround that has been submitted:

put this in DB_CONFIG

add_data_dir /var/lib/rpm
set_create_dir /var/lib/rpm
set_flags DB_NOMMAP on
set_flags DB_CDB_ALLDB 
set_flags DB_DSYNC_DB on
set_flags DB_REGION_INIT off
set_cachesize 0 42949672 2
set_flags DB_TXN_NOSYNC off

then try:

cd /var/lib/rpm
db_recover -e 
db_verify Packages
db_stat -m
db_hotbackup -h /var/lib/rpm -b /tmp

sirredbeard commented 5 years ago

We have been working with partners on the Berkeley DB team at Oracle on this issue (sadly not with the rpm team at Red Hat). The Oracle team thinks they have nailed the issue down to mmap handling in WSL after all:

"[WSL] has a bug in mmap when the underlying file is extended, the extended part of the mapping is actually mapped back to the beginning of the file. This is why BDB would crash when it extended the size of the file that backed the in-memory cache (one of the __db.### files), and why setting the cache size to a small value works as a work around" - Dr. Lauren Foutz, Oracle

See attached C code which replicates the issue: mmap_extend.c.txt
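
For readers who skip the attachment, here is a minimal C sketch in the same spirit (my illustration, not the attached mmap_extend.c): map past EOF, extend the file while mapped, write through the extended page, and check whether the write aliased back to offset 0 as described above:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define PAGE 4096

int main(void)
{
    char buf[PAGE];
    int fd = open("/tmp/extend.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0 || ftruncate(fd, PAGE) < 0)
        return 1;
    /* Map two pages over a one-page file (mapping past EOF is legal). */
    char *p = mmap(NULL, 2 * PAGE, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;
    if (ftruncate(fd, 2 * PAGE) < 0)   /* extend the file while mapped */
        return 1;
    memcpy(p + PAGE, "MARK", 4);       /* write into the extended part */
    msync(p, 2 * PAGE, MS_SYNC);
    if (pread(fd, buf, 4, 0) != 4)     /* read back the *first* page */
        return 1;
    if (memcmp(buf, "MARK", 4) == 0)
        puts("BUG: extended mapping aliases the start of the file");
    else
        puts("OK: write landed at the extended offset");
    munmap(p, 2 * PAGE);
    close(fd);
    return 0;
}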

I have e-mailed Craig and Ben and asked if they would like us to open another bug on WSL or to go under one of the existing mmap-related issues.

In the meantime we are working on a wrapper for rpm that will back up and restore good rpmdb components around commands that are known to break the rpmdb.

therealkenc commented 5 years ago

That linked repro is more or less a variation on this post from WSL#658 ("mmap's problem"), August 2016. Which is pretty much what I expected, and why I've been low energy, only to end up showing rpm is the same. There isn't an open issue that I know of (WSL#658 was closed) that is exactly on task. WSL#902 is at best "related" but hardly duplicative. Anyway, the bar for submitting an issue is super low. It is no shame even if it were a dupe. Actually, anything with coherent repro steps is a good day. :)

Thanks for pursuing this so diligently.

sirredbeard commented 5 years ago

Opened an issue in the main WSL repo per direction of the Microsoft WSL team.

sirredbeard commented 5 years ago

Per e-mail from the Microsoft WSL team, a fix is in the works for the underlying issue.