ecdye / zram-config

A complete zram-config utility for swap, directories, and logs to reduce SD, NAND and eMMC block wear.
MIT License

Try to avoid `zramctl --find` race condition #91

Closed ThomasKaiser closed 2 years ago

ThomasKaiser commented 2 years ago

As reported earlier, but at the wrong location, I experienced strange behaviour when testing the most recent zram-config version with the freshly released Ubuntu 22.04/armhf on a Raspberry Pi 4.

It seems there's some sort of a race condition when using zramctl --find and it clearly doesn't help when there's no delay between the two zramctl --find calls. Quick check with a sleep 0.2 in between seemed to fix it.
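
A different way to approach the same symptom is to retry a failed call after a short pause instead of always sleeping. The following is only a sketch: retry_with_delay is a hypothetical helper that is not part of zram-config, the attempt count and delay are arbitrary, and whether a retry actually avoids the failure reported here has not been tested.

```shell
#!/bin/sh
# Hypothetical helper (not part of zram-config): run a command up to
# MAX_TRIES times, sleeping DELAY seconds after each failed attempt.
# Usage: retry_with_delay MAX_TRIES DELAY command [args...]
retry_with_delay() {
    rwd_max_tries="$1"
    rwd_delay="$2"
    shift 2
    rwd_try=1
    while :; do
        # Run the command; stop retrying as soon as it succeeds.
        if "$@"; then
            return 0
        fi
        # Give up once the allowed number of attempts is used up.
        if [ "$rwd_try" -ge "$rwd_max_tries" ]; then
            return 1
        fi
        rwd_try=$((rwd_try + 1))
        sleep "$rwd_delay"
    done
}

# Example use (assumes zramctl from util-linux and root privileges):
#   dev="$(retry_with_delay 3 0.2 zramctl --find --size 150M --algorithm lzo-rle)"
```

A fractional delay such as 0.2 relies on a sleep implementation that accepts it, which the sleep 0.2 test in this report suggests is the case on the system used here.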

To get more reliable data I added an automatic reboot to the install to check zram-config behaviour:

root@ubuntu:/home/ubuntu# cat /etc/rc.local 
#!/bin/bash

PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

echo >>/var/log/zram-status.log
date >>/var/log/zram-status.log
sleep 60
zramctl | grep -v NAME >>/var/log/zram-status.log
systemctl status zram-config.service | grep busy >>/var/log/zram-status.log
reboot

Test with an unaltered /usr/local/sbin/zram-config showed the following behaviour walking through 10 boot attempts:

/var/log/zram-status.log:

Fri Apr 22 19:15:03 CEST 2022
/dev/zram1 lzo-rle       150M 16.5M  2.6M    5M       4 /opt/zram/zram1
/dev/zram0 lzo-rle       750M    4K   87B    4K       4 [SWAP]

Fri Apr 22 19:17:00 CEST 2022
/dev/zram0 lzo-rle       150M 16.6M  2.6M  5.1M       4 /opt/zram/zram0
Apr 22 19:17:00 ubuntu zram-config[719]: zramctl: /dev/zram0: failed to reset: Device or resource busy
Apr 22 19:17:00 ubuntu zram-config[728]: zramctl: /dev/zram0: failed to reset: Device or resource busy

Fri Apr 22 19:19:02 CEST 2022
/dev/zram1 lzo-rle       150M 16.8M  2.7M  5.2M       4 /opt/zram/zram1
/dev/zram0 lzo-rle       750M    4K   87B    4K       4 [SWAP]
Apr 22 19:19:02 ubuntu zram-config[714]: zramctl: /dev/zram0: failed to reset: Device or resource busy

Fri Apr 22 19:21:00 CEST 2022
/dev/zram1 lzo-rle       150M 14.9M  2.7M  5.1M       4 /opt/zram/zram1
/dev/zram0 lzo-rle       750M    4K   87B    4K       4 [SWAP]

Fri Apr 22 19:22:59 CEST 2022
/dev/zram0 lzo-rle       150M 15.1M  2.8M  5.2M       4 /opt/zram/zram0
Apr 22 19:22:59 ubuntu zram-config[709]: zramctl: /dev/zram0: failed to reset: Device or resource busy
Apr 22 19:22:59 ubuntu zram-config[721]: zramctl: /dev/zram0: failed to reset: Device or resource busy

Fri Apr 22 19:24:57 CEST 2022
/dev/zram1 lzo-rle       150M 15.9M  2.9M  5.5M       4 /opt/zram/zram1
/dev/zram0 lzo-rle       750M    4K   87B    4K       4 [SWAP]

Fri Apr 22 19:26:56 CEST 2022
/dev/zram0 lzo-rle       150M 15.5M  2.9M  5.4M       4 /opt/zram/zram0
Apr 22 19:26:56 ubuntu zram-config[712]: zramctl: /dev/zram0: failed to reset: Device or resource busy
Apr 22 19:26:56 ubuntu zram-config[726]: zramctl: /dev/zram0: failed to reset: Device or resource busy

Fri Apr 22 19:28:53 CEST 2022
/dev/zram1 lzo-rle       150M 15.6M    3M  5.5M       4 /opt/zram/zram1
/dev/zram0 lzo-rle       750M    4K   87B    4K       4 [SWAP]

Fri Apr 22 19:30:53 CEST 2022
/dev/zram1 lzo-rle       150M 16.4M  3.1M  5.7M       4 /opt/zram/zram1
/dev/zram0 lzo-rle       750M    4K   87B    4K       4 [SWAP]

Fri Apr 22 19:32:50 CEST 2022
/dev/zram1 lzo-rle       150M  16M  3.1M  5.7M       4 /opt/zram/zram1
/dev/zram0 lzo-rle       750M   4K   87B    4K       4 [SWAP]
Apr 22 19:32:50 ubuntu zram-config[716]: zramctl: /dev/zram0: failed to reset: Device or resource busy

Adding a simple sleep 0.1 between the two zramctl --find calls improves things: 6 errors still occur, but in all 11 tests both zram devices were created:

Fri Apr 22 19:34:47 CEST 2022
/dev/zram1 lzo-rle       150M 16.9M  3.3M    6M       4 /opt/zram/zram1
/dev/zram0 lzo-rle       750M    4K   87B    4K       4 [SWAP]

Fri Apr 22 19:36:46 CEST 2022
/dev/zram1 lzo-rle       150M 18.5M  3.3M    6M       4 /opt/zram/zram1
/dev/zram0 lzo-rle       750M    4K   87B    4K       4 [SWAP]
Apr 22 19:36:46 ubuntu zram-config[705]: zramctl: /dev/zram0: failed to reset: Device or resource busy

Fri Apr 22 19:38:46 CEST 2022
/dev/zram1 lzo-rle       150M 16.6M  3.4M    6M       4 /opt/zram/zram1
/dev/zram0 lzo-rle       750M    4K   87B    4K       4 [SWAP]

Fri Apr 22 19:40:44 CEST 2022
/dev/zram1 lzo-rle       150M 16.8M  3.4M  6.1M       4 /opt/zram/zram1
/dev/zram0 lzo-rle       750M    4K   87B    4K       4 [SWAP]

Fri Apr 22 19:42:42 CEST 2022
/dev/zram1 lzo-rle       150M  17M  3.5M  6.1M       4 /opt/zram/zram1
/dev/zram0 lzo-rle       750M   4K   87B    4K       4 [SWAP]
Apr 22 19:42:41 ubuntu zram-config[709]: zramctl: /dev/zram0: failed to reset: Device or resource busy

Fri Apr 22 19:44:40 CEST 2022
/dev/zram1 lzo-rle       150M 17.2M  3.6M  6.3M       4 /opt/zram/zram1
/dev/zram0 lzo-rle       750M    4K   87B    4K       4 [SWAP]
Apr 22 19:44:40 ubuntu zram-config[714]: zramctl: /dev/zram0: failed to reset: Device or resource busy

Fri Apr 22 19:46:38 CEST 2022
/dev/zram1 lzo-rle       150M 17.3M  3.6M  6.3M       4 /opt/zram/zram1
/dev/zram0 lzo-rle       750M    4K   87B    4K       4 [SWAP]
Apr 22 19:46:38 ubuntu zram-config[712]: zramctl: /dev/zram0: failed to reset: Device or resource busy

Fri Apr 22 19:48:38 CEST 2022
/dev/zram1 lzo-rle       150M 17.6M  3.7M  6.4M       4 /opt/zram/zram1
/dev/zram0 lzo-rle       750M    4K   87B    4K       4 [SWAP]
Apr 22 19:48:38 ubuntu zram-config[703]: zramctl: /dev/zram0: failed to reset: Device or resource busy

Fri Apr 22 19:50:37 CEST 2022
/dev/zram1 lzo-rle       150M 17.8M  3.8M  6.6M       4 /opt/zram/zram1
/dev/zram0 lzo-rle       750M    4K   87B    4K       4 [SWAP]

Fri Apr 22 19:52:36 CEST 2022
/dev/zram1 lzo-rle       150M  18M  3.9M  6.7M       4 /opt/zram/zram1
/dev/zram0 lzo-rle       750M   4K   87B    4K       4 [SWAP]

Fri Apr 22 19:54:35 CEST 2022
/dev/zram1 lzo-rle       150M 18.7M    4M  6.9M       4 /opt/zram/zram1
/dev/zram0 lzo-rle       750M    4K   87B    4K       4 [SWAP]
Apr 22 19:54:35 ubuntu zram-config[721]: zramctl: /dev/zram0: failed to reset: Device or resource busy

The 0.1 second delay between both zramctl calls seems to do the job. Now for some extra safety headroom testing with sleep 0.2:

Fri Apr 22 19:56:32 CEST 2022
/dev/zram1 lzo-rle       150M 20.5M  4.1M    7M       4 /opt/zram/zram1
/dev/zram0 lzo-rle       750M    4K   87B    4K       4 [SWAP]
Apr 22 19:56:32 ubuntu zram-config[699]: zramctl: /dev/zram0: failed to reset: Device or resource busy

Fri Apr 22 19:58:30 CEST 2022
/dev/zram1 lzo-rle       150M 18.5M  4.1M  6.9M       4 /opt/zram/zram1
/dev/zram0 lzo-rle       750M    4K   87B    4K       4 [SWAP]
Apr 22 19:58:30 ubuntu zram-config[713]: zramctl: /dev/zram0: failed to reset: Device or resource busy

Fri Apr 22 20:00:27 CEST 2022
/dev/zram1 lzo-rle       150M 18.7M  4.1M    7M       4 /opt/zram/zram1
/dev/zram0 lzo-rle       750M    4K   87B    4K       4 [SWAP]

Fri Apr 22 20:02:26 CEST 2022
/dev/zram1 lzo-rle       150M 18.9M  4.2M  7.1M       4 /opt/zram/zram1
/dev/zram0 lzo-rle       750M    4K   87B    4K       4 [SWAP]
Apr 22 20:02:26 ubuntu zram-config[709]: zramctl: /dev/zram0: failed to reset: Device or resource busy

Fri Apr 22 20:04:25 CEST 2022
/dev/zram1 lzo-rle       150M 19.1M  4.3M  7.2M       4 /opt/zram/zram1
/dev/zram0 lzo-rle       750M    4K   87B    4K       4 [SWAP]
Apr 22 20:04:25 ubuntu zram-config[714]: zramctl: /dev/zram0: failed to reset: Device or resource busy

Fri Apr 22 20:06:23 CEST 2022
/dev/zram1 lzo-rle       150M 19.3M  4.3M  7.3M       4 /opt/zram/zram1
/dev/zram0 lzo-rle       750M    4K   87B    4K       4 [SWAP]

Fri Apr 22 20:08:22 CEST 2022
/dev/zram1 lzo-rle       150M 19.4M  4.4M  7.4M       4 /opt/zram/zram1
/dev/zram0 lzo-rle       750M    4K   87B    4K       4 [SWAP]
Apr 22 20:08:22 ubuntu zram-config[710]: zramctl: /dev/zram0: failed to reset: Device or resource busy

Fri Apr 22 20:10:20 CEST 2022
/dev/zram1 lzo-rle       150M 19.6M  4.5M  7.5M       4 /opt/zram/zram1
/dev/zram0 lzo-rle       750M    4K   87B    4K       4 [SWAP]
Apr 22 20:10:20 ubuntu zram-config[710]: zramctl: /dev/zram0: failed to reset: Device or resource busy

Fri Apr 22 20:12:18 CEST 2022
/dev/zram1 lzo-rle       150M 20.5M  4.6M  7.8M       4 /opt/zram/zram1
/dev/zram0 lzo-rle       750M    4K   87B    4K       4 [SWAP]

Fri Apr 22 20:14:15 CEST 2022
/dev/zram1 lzo-rle       150M 20.1M  4.6M  7.7M       4 /opt/zram/zram1
/dev/zram0 lzo-rle       750M    4K   87B    4K       4 [SWAP]
Apr 22 20:14:15 ubuntu zram-config[708]: zramctl: /dev/zram0: failed to reset: Device or resource busy

Fri Apr 22 20:16:12 CEST 2022
/dev/zram1 lzo-rle       150M 20.9M  4.8M    8M       4 /opt/zram/zram1
/dev/zram0 lzo-rle       750M    4K   87B    4K       4 [SWAP]
Apr 22 20:16:11 ubuntu zram-config[705]: zramctl: /dev/zram0: failed to reset: Device or resource busy

Again both zram devices were created all 11 times, and 8 times an error occurred at the first zramctl --find attempt (most probably not related to the delay but just 'result variation', since it most likely happens at the 1st zramctl --find call).

I'll run the test for the next few hours (~30 reboots per hour) and report back.
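
Counting results by hand gets tedious over hundreds of reboots. A log in the format shown above could be summarized with a small awk wrapper. This is a sketch under assumptions: summarize_zram_log is a hypothetical name, each boot record is assumed to start with a date line beginning with an English weekday abbreviation, and a boot counts as having both devices when at least two /dev/zramN lines follow that date line.

```shell
#!/bin/sh
# Hypothetical helper: summarize a zram-status.log written by the
# rc.local script shown earlier. $1 = path to the log file.
summarize_zram_log() {
    awk '
        # A line produced by date starts a new boot record.
        /^(Mon|Tue|Wed|Thu|Fri|Sat|Sun) / {
            # Close the previous record before starting the next one.
            if (boots > 0 && devices >= 2) both++
            boots++
            devices = 0
            next
        }
        # zramctl output: one line per active zram device.
        /^\/dev\/zram[0-9]+ / { devices++ }
        # Error lines taken from systemctl status.
        /failed to reset/ { errors++ }
        END {
            # Close the last record.
            if (boots > 0 && devices >= 2) both++
            printf "boots=%d both_devices=%d reset_errors=%d\n", boots, both, errors
        }
    ' "$1"
}

# Example use:
#   summarize_zram_log /var/log/zram-status.log
```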

ecdye commented 2 years ago

I'm not sure how the reset is related to the so-called race condition you think you are spotting. The real issue with the reset is that if any programs are accessing a folder a zram device was mounted to, they must be stopped before the zram service is stopped. If they are not, or if a phantom process is still cleaning up its files while the zram service is stopping, the reset of the device will fail because it will still think that a program is trying to access it even though the folder was already unmounted.

Secondly, when rebooting the zram module is not always unloaded and then reloaded. This means that if the operating system did not allow of proper cleanup of zram devices that they will still be left over on startup.

In the end I'm not really sure what you're getting at here. Adding a 0.2 second delay between trying to find devices is unlikely to accomplish anything and I don't believe that the issue is actually in finding the devices as the only real issue would be if all your RAM was already used.

ThomasKaiser commented 2 years ago

Whole zram-status.log.

I ran the test with sleep 0.2 between both zramctl --find calls between 'Fri Apr 22 20:18:09 CEST 2022' and 'Sat Apr 23 07:22:03 CEST 2022'. This resulted in 320 boot attempts, and every time both /dev/zram0 and /dev/zram1 were successfully created. As per /etc/ztab, all 320 times /dev/zram0 was a 750MB swap device and /dev/zram1 a 150MB log partition:

NAME       ALGORITHM DISKSIZE  DATA COMPR TOTAL STREAMS MOUNTPOINT
/dev/zram1 lzo-rle       150M 95.5M 32.1M 44.8M       4 /opt/zram/zram1
/dev/zram0 lzo-rle       750M    4K   87B    4K       4 [SWAP]

Then I removed the sleep statement from /usr/local/sbin/zram-config and let another 64 boot attempts run (starting from 'Sat Apr 23 07:24:09 CEST 2022' in the log). /dev/zram0 has been created all 64 times but /dev/zram1 was created only 38 times (as such missing 26 times).

In these 64 attempts, creation of the swap device failed 21 times, and as such the first successfully created zram device, /dev/zram0, ended up being the 150MB log partition.

That's what I'm actually reporting: a freshly installed Ubuntu 22.04 armhf on a Raspberry Pi 4 running the latest zram-config version sporadically fails to create zram devices (successful creation of both zram devices in only 60% of tests). A slight delay between the 1st and 2nd zramctl --find fixes this behaviour. Confirmed with 300+ boot tests.

This is neither about shutdown behaviour nor 'resetting' zram devices and also not about zram devices surviving reboots (Huh?). Just zramctl --find failing sporadically when /usr/local/sbin/zram-config start is called at booting. Which then looks like this for example:

root@ubuntu:/home/ubuntu# systemctl status zram-config.service 
● zram-config.service - zram-config
     Loaded: loaded (/etc/systemd/system/zram-config.service; enabled; vendor preset: enabled)
     Active: active (exited) since Sat 2022-04-23 12:02:05 CEST; 3min 9s ago
       Docs: https://github.com/ecdye/zram-config/blob/main/README.md
    Process: 657 ExecStart=/usr/local/sbin/zram-config start (code=exited, status=0/SUCCESS)
   Main PID: 657 (code=exited, status=0/SUCCESS)
        CPU: 339ms

Apr 23 12:02:05 ubuntu systemd[1]: Starting zram-config...
Apr 23 12:02:05 ubuntu systemd[1]: Started zram-config.
Apr 23 12:02:05 ubuntu zram-config[670]: zram-config start 2022-04-23-12:02:05-CEST
Apr 23 12:02:05 ubuntu zram-config[657]: createZdevice: Beginning creation of zDevice.
Apr 23 12:02:05 ubuntu zram-config[723]: zramctl: /dev/zram0: failed to reset: Device or resource busy
Apr 23 12:02:05 ubuntu zram-config[732]: createZdevice: Failed to find an open zram device. Exiting!

The zramctl: /dev/zram0: failed to reset: Device or resource busy output is not me doing a shutdown or manually 'resetting' something but just the result of your script using RAM_DEV="$(zramctl --find --size "$DISK_SIZE" --algorithm "$ALG" | tr -dc '0-9')" in the createZdevice function in line 9 (so actually the very first zramctl --find call already generates this failed to reset: Device or resource busy message).
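
The quoted line keeps only the digits of whatever zramctl prints on standard output, so a failed call that prints nothing there presumably leaves RAM_DEV empty, which would match the 'Failed to find an open zram device' message in the log above. A minimal sketch of that extraction step with an explicit empty check follows; zram_index is a hypothetical name and this is not the code used in zram-config.

```shell
#!/bin/sh
# Hypothetical helper: extract the numeric index from a device path such
# as /dev/zram1, mirroring the tr -dc '0-9' step quoted above.
# $1 = text printed by the device lookup (may be empty on failure).
zram_index() {
    # Delete every character that is not a digit.
    zi_index="$(printf '%s' "$1" | tr -dc '0-9')"
    # An empty result means no device path was returned at all.
    if [ -z "$zi_index" ]; then
        echo "no zram device found" >&2
        return 1
    fi
    printf '%s\n' "$zi_index"
}

# Example use:
#   zram_index /dev/zram1    # prints 1
```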

ThomasKaiser commented 2 years ago

At 12:30 I deleted /usr/local/share/zram-config/log/zram-config.log (too much clutter in there) and started another run with an unmodified /usr/local/sbin/zram-config.

Stopping at 14:00 this resulted in 43 boot attempts: 43 times /dev/zram0 successfully created but only 18 times /dev/zram1 (or in other words: 60% of times one zram device creation failed).

Complete contents of /usr/local/share/zram-config/log/zram-config.log: 1st failure recorded at 3rd boot: zram-config start 2022-04-23-12:36:29-CEST.

ecdye commented 2 years ago

I now understand what you are getting at; however, this is still very odd to me as it only appears to affect Jammy. I have never seen this on any other distro. I wonder if this is a bug in the particular version of zramctl or the kernel that they use. I also wonder if perhaps it would be better to just switch back to configuring it manually without using zramctl if that is what is causing the issue.

ecdye commented 2 years ago

@ThomasKaiser I have switched to using the sysfs attributes to configure zram devices. Would you please check and see if that resolves your issue?
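
For readers unfamiliar with the sysfs interface: the kernel's zram documentation describes per-device attributes such as comp_algorithm and disksize under /sys/block/zramN, plus /sys/class/zram-control/hot_add for requesting a new device. The sketch below illustrates that general approach only. It is not the actual zram-config code, configure_zram is a hypothetical name, and the sysfs root is passed as a parameter purely so the function can be tried against a scratch directory.

```shell
#!/bin/sh
# Hypothetical sketch, not the actual zram-config implementation.
# $1 = sysfs root (normally /sys), $2 = device index,
# $3 = compression algorithm, $4 = disk size (for example 150M)
configure_zram() {
    cz_dev="$1/block/zram$2"
    [ -d "$cz_dev" ] || { echo "zram$2 not present" >&2; return 1; }
    # The algorithm has to be chosen before the disk size is set,
    # because setting disksize initializes the device.
    printf '%s\n' "$3" > "$cz_dev/comp_algorithm" || return 1
    printf '%s\n' "$4" > "$cz_dev/disksize" || return 1
}

# On a real system (as root, with the zram module loaded), a new device
# index can be requested from the kernel and then configured:
#   index="$(cat /sys/class/zram-control/hot_add)"
#   configure_zram /sys "$index" lzo-rle 150M
```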

ThomasKaiser commented 2 years ago

Giving it a try now and will report back in an hour...

ThomasKaiser commented 2 years ago

I tested in an automated way as before and stopped after 17 successful zram device creations.

It looked like this all the time:

root@ubuntu:/home/ubuntu# zramctl 
NAME       ALGORITHM DISKSIZE  DATA COMPR TOTAL STREAMS MOUNTPOINT
/dev/zram2 lzo-rle       150M 83.3M 27.3M 39.4M       4 /opt/zram/zram2
/dev/zram1 lzo-rle       750M    4K   87B    4K       4 [SWAP]

In other words: all the time an additional /dev/zram0 was also created but not used:

root@ubuntu:/home/ubuntu# ll /dev/zram*
brw-rw---- 1 root disk 252, 0 Apr 25 21:43 /dev/zram0
brw-rw---- 1 root disk 252, 1 Apr 25 21:43 /dev/zram1
brw-rw---- 1 root disk 252, 2 Apr 25 21:43 /dev/zram2

ecdye commented 2 years ago

Ok, that's an easy problem to solve.

ThomasKaiser commented 2 years ago

Thank you! The modprobe parameter works and zram device creation starts with /dev/zram0 now. Timing behaviour comparing zramctl vs. sysfs also remained the same.

Now that this PR has become kinda useless, I'm simply suggesting once more an adjustment of function names and log messages, since 'zswap' references are misleading when we solely deal with 'zram'. Both are entirely different things and IMO shouldn't be confused: http://ix.io/3Wh0/diff

ecdye commented 2 years ago

The reason for the function naming was not mine, that is how it has been since I took over the project. I left it that way because it helps to distinguish between the different types of zram that we set up. I personally would prefer to leave it because those referenced functions are only used to set up a zram swap. Unless you have a compelling argument as to why we shouldn't I would prefer to leave it.

ThomasKaiser commented 2 years ago

Quoting @StuartIanNaylor: zram & zswap are apples & pears. But a few sentences away it's confusing: 'StuartIanNaylor/zram-config does far more than just zswap'.

The only persons affected by internal function names are those interested in how stuff works and looking inside. And I don't think it's a good service to them creating the impression zram and zswap would be the same since they really aren't. But an awful lot of people do already confuse both and this should stop.

ecdye commented 2 years ago

They are not the same. I agree with that, but quite frankly the difference is that zram is the underlying Linux kernel name for a compressed RAM block device. Whereas zswap is built upon the underlying zram to setup a swap space that uses a zram block device. I don't see a problem in this case because the function name is correct, as it is setting up a swap that uses zram. The lines are a little blurry, but really zswap is just referring to zram being used as a swap space, which often means the zram block device is configured a little differently because of the nature of the swap usage. And yes, I do realize that zswap is a separate kernel feature, but there is AFAIK very little performance difference in the RPi type application that this program targets.

As an aside, the saying that 'StuartIanNaylor/zram-config does far more than just zswap' is really just another way of saying that zram-config allows you to configure all types of zram devices whereas his zswap program only configures swap.

EDIT: Also I think that Stuart's reasoning when naming the function was just shortening the function name because it made it easier than zramSwap to clarify.

ThomasKaiser commented 2 years ago

Whereas zswap is built upon the underlying zram to setup a swap space that uses a zram block device

Ok, so you're just one of those persons I'm talking all the time about...

This is from a system where I switched from zram to zswap over a year ago (some commercial crap application featuring a memory leak made this necessary):

root@athene:/sys/kernel/debug/zswap# cat /proc/swaps 
Filename                Type        Size    Used    Priority
/dev/sdb1                               partition   16775164    2092744 -2

root@athene:/sys/kernel/debug/zswap# zramctl 

root@athene:/sys/kernel/debug/zswap# grep -R .
same_filled_pages:51224
stored_pages:373418
pool_total_size:699469824
duplicate_entry:0
written_back_pages:87560
reject_compress_poor:0
reject_kmemcache_fail:0
reject_alloc_fail:0
reject_reclaim_fail:7
pool_limit_hit:44768

I even had to write my own monitoring plugin since nothing for zswap was existing.

[Screenshot, 2022-04-28 07:53:10]

With zswap of course no zram (device) used since it's something entirely different. Even the basic strategies differ completely: zram is avoiding 'swap to disk' (nobody uses a backing file) and zswap is 'making swap on disk more efficient'.

This VM with zram needed 8GB and locked up from time to time. Now with zswap (and restarting the crappy application once a week) we're fine with half the assigned RAM.

BTW... this sentence of yours already really scared me:

when rebooting the zram module is not always unloaded and then reloaded. This means that if the operating system did not allow of proper cleanup of zram devices that they will still be left over on startup.

ThomasKaiser commented 2 years ago

And another BTW: Ubuntu Desktop 22.04 for RPi was said to ship with enabled zswap as default: https://www.cnx-software.com/2022/01/13/ubuntu-22-04-zswap-raspberry-pi-4-2gb-ram/ (you should read this also if you want to get a brief start on how zram and zswap differ).

Using zswap instead of zram on the RPi 4 might (now) make some sense as long as the device where the swap files reside is not an SD card (but a USB3 SSD or an NVMe SSD with Compute Module 4).

At least my fresh Jammy install doesn't use zswap (the zswap parameters are missing in cmdline.txt) but good luck from now on if you will still insist that zram and zswap are more or less the same thing. They are not, they're even mutually exclusive.
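
To see which of the two mechanisms a given system is using, one can look at the zswap module parameter and at the swap table. The following sketch assumes the usual locations /sys/module/zswap/parameters/enabled and /proc/swaps; swap_compression_mode is a hypothetical name, and both paths are passed as arguments so the function can be exercised with sample files.

```shell
#!/bin/sh
# Hypothetical helper: report which compressed-swap mechanism looks active.
# $1 = zswap "enabled" parameter file
#      (normally /sys/module/zswap/parameters/enabled)
# $2 = swap table (normally /proc/swaps)
swap_compression_mode() {
    scm_zswap=no
    scm_zram=no
    # The zswap parameter file contains Y when zswap is enabled.
    if [ -r "$1" ] && [ "$(cat "$1")" = "Y" ]; then
        scm_zswap=yes
    fi
    # A zram swap device shows up in the swap table as /dev/zramN.
    if [ -r "$2" ] && grep -q '^/dev/zram[0-9]' "$2"; then
        scm_zram=yes
    fi
    echo "zswap=$scm_zswap zram_swap=$scm_zram"
}

# Example use:
#   swap_compression_mode /sys/module/zswap/parameters/enabled /proc/swaps
```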

ecdye commented 2 years ago

Whereas zswap is built upon the underlying zram to setup a swap space that uses a zram block device

Ok, so you're just one of those persons I'm talking all the time about...

No, I understand how my comment made it seem like that, but that was just poor wording on my part. That is why I tried to edit it to clarify that I believe Stuart originally named it that because it was shorter and simpler.

With zswap of course no zram (device) used since it's something entirely different. Even the basic strategies differ completely: zram is avoiding 'swap to disk' (nobody uses a backing file) and zswap is 'making swap on disk more efficient'.

This VM with zram needed 8GB and locked up from time to time. Now with zswap (and restarting the crappy application once a week) we're fine with half the assigned RAM.

I get it. I haven't tried zswap on a RPi in some time because of the issues you have mentioned and SD card wear. You are correct that it is probably a little confusing for the enterprising user. However, at the moment it seems simpler to keep it the same instead of changing the function name, as what you suggested in this PR would be only partially correct.

BTW... this sentence of yours already really scared me:

when rebooting the zram module is not always unloaded and then reloaded. This means that if the operating system did not allow of proper cleanup of zram devices that they will still be left over on startup.

I understand why that might scare you. I haven't seen this behavior in a while and I think it was fixed, but I have observed it before.