Feh / nocache

minimize caching effects
BSD 2-Clause "Simplified" License

unexpected nocache overhead on trivial commands #50

Closed idallen closed 2 years ago

idallen commented 2 years ago

I made the mistake of applying nocache to a shell script, and things did not go well. Every command in the script had an extra CPU second added to its run time. I wish the documentation had warned me about this:

$ time /bin/true
real    0m0.013s
user    0m0.000s
sys     0m0.013s

$ time nocache /bin/true
real    0m1.017s
user    0m0.647s
sys     0m0.365s

$ time mkdir -p /tmp/foo
real    0m0.013s
user    0m0.001s
sys     0m0.012s

$ time nocache mkdir -p /tmp/foo
real    0m1.002s
user    0m0.645s
sys     0m0.356s

$ time rm -rf /tmp/foo
real    0m0.010s
user    0m0.000s
sys     0m0.011s

$ time nocache rm -rf /tmp/foo
real    0m1.130s
user    0m0.738s
sys     0m0.390s

$ time date
Fri Apr  1 03:46:44 EDT 2022
real    0m0.008s
user    0m0.000s
sys     0m0.008s

$ time nocache date
Fri Apr  1 03:46:47 EDT 2022
real    0m1.093s
user    0m0.779s
sys     0m0.309s

$ time /bin/echo hi
hi
real    0m0.009s
user    0m0.000s
sys     0m0.008s

$ time nocache /bin/echo hi
hi
real    0m1.022s
user    0m0.691s
sys     0m0.328s

pavlinux commented 2 years ago

export NOCACHE_MAX_FDS=16

The overhead is here: https://github.com/Feh/nocache/blob/2b6ea1f6b46dabd08db6c6b8be78874b90ecfd22/nocache.c#L193

idallen commented 2 years ago

Setting NOCACHE_MAX_FDS is a small improvement but not good enough to remove most of the overhead:

$ time NOCACHE_MAX_FDS=16 nocache /bin/true

real    0m0.980s
user    0m0.668s
sys     0m0.310s

Feh commented 2 years ago

Why it’s slow: free_unclaimed_pages gets called on all file descriptors that were potentially opened in the past but the library forgot to keep track of. Each call to that function will have to issue two syscalls for signal safety. That adds a constant time of overhead at shutdown. Not noticeable for long-running binaries, but very noticeable for trivial binaries where you expect sub-millisecond execution latency.
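
As a rough illustration of the shutdown cost described above, the sketch below walks every descriptor slot and brackets each cleanup with a block/restore pair of sigprocmask() calls. This is not the code from nocache.c: cleanup_one_fd and cleanup_all_fds are hypothetical names, and the assumption that the two syscalls per descriptor are a block and a restore of the signal mask is mine.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-descriptor cleanup. The real
 * free_unclaimed_pages() in nocache.c does the actual page-cache work. */
static void cleanup_one_fd(int fd) { (void)fd; }

/* Visit every descriptor slot, blocking all signals around each cleanup.
 * Returns the number of sigprocmask() calls issued. */
static unsigned long cleanup_all_fds(int max_fds)
{
    sigset_t all, saved;
    unsigned long calls = 0;

    sigfillset(&all);
    for (int fd = 0; fd < max_fds; fd++) {
        sigprocmask(SIG_BLOCK, &all, &saved);    /* syscall 1: block signals */
        calls++;
        cleanup_one_fd(fd);
        sigprocmask(SIG_SETMASK, &saved, NULL);  /* syscall 2: restore mask */
        calls++;
    }
    return calls;
}
```

With a table of 16 slots the function issues 32 calls. The cost grows linearly with the table size, regardless of how many descriptors the wrapped program actually opened.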

I agree the documentation should warn about not just this slowdown, but more generally about using the tool in the first place :) The year is 2022, use cgroups already.

idallen commented 2 years ago

Thanks for updating the README to hint at the problems with nocache. The README is now awkwardly self-contradictory, saying both that the program should and should not be used to control the cache.

Because nocache has existed for a decade, all the Internet searches find articles about it first. Any examples of using cgroups to do the same thing are impossible to find. The first three hits are all for nocache:

https://duckduckgo.com/?q=linux+minimize+backup+cache

Until Internet searches start to show cgroups examples more often than nocache examples, please help us out and make your Alternate Approaches cgroup examples for backups much more prominent in your nocache documentation. Make it super clear that this program is not the right way to do this in a modern system.

Please copy all the great README stuff into the man page.

Because the Internet is full of nocache examples, your new README advice will be invisible to people who simply search for a way to minimize cache impacts, read an article telling them to use nocache, install the program from a distribution repository, and only ever see the man page. They will never see the README.

I would also like to see an explicit statement that nocache adds approximately a second of CPU use to every single program it touches, no matter how trivial (e.g. to /bin/true and /bin/echo). You only hint at this, and telling people just how big the overhead is would be helpful.

pavlinux commented 2 years ago

I see no point in deleting the cache and freeing spinlocks in the destructor when exiting the program, since this will not affect the operation of the program.

My destroy() function (https://github.com/Feh/nocache/blob/7ffa504a8b5db029155f24e63b86186ebec12533/nocache.c#L193) looks like this:

static void destroy(void)
{
    /* Only release the tracking arrays; no per-fd cleanup at exit. */
    free(fds);
    free(fds_lock);
}

Feh commented 2 years ago

> Because nocache has existed for a decade, all the Internet searches find articles about it first.

@idallen Fair point, I was not aware of that. I’ll reopen this issue and think about how to present this best (in both the readme and the man page), but it’ll take a few days until I find the time.

Feh commented 2 years ago

OK, finally finding some time to look at this. Updating the documentation first, then trying to figure out why nocache is slow.

Not sure how common systemd is on Linux distros today, but it makes it easy to do all this.

Feh commented 2 years ago

OK, now for why the timing overhead is so high. First, this is easy to reproduce, as in the top comment:

$ time /bin/true
/bin/true  0.00s user 0.00s system 0% cpu 0.002 total
$ time ./nocache /bin/true
./nocache /bin/true  0.30s user 0.47s system 99% cpu 0.772 total

So a significant share of the overhead seems to come from system calls, which is worth digging into first.

Feh commented 2 years ago

$ strace -c -- /bin/true
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0         1           read
  0.00    0.000000           0         2           open
  0.00    0.000000           0         2           close
  0.00    0.000000           0         2           fstat
  0.00    0.000000           0         5           mmap
  0.00    0.000000           0         4           mprotect
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         1           brk
  0.00    0.000000           0         3         3 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                    23         3 total

With nocache, it is very obvious what is expensive:

$ strace -c -- ./nocache /bin/true 
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.009020           0   2097164           rt_sigprocmask
  0.00    0.000000           0        12           read
  0.00    0.000000           0        13           open
  0.00    0.000000           0        15           close
  0.00    0.000000           0         8           stat
…
  0.00    0.000000           0         1           futex
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00    0.009020               2097369        14 total

sigprocmask is called when storing info for a new FD, and again when cleaning up. It’s probably the cleanup path mentioned in https://github.com/Feh/nocache/issues/51#issuecomment-1093994553.
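
A quick sanity check of the count in the strace summary, assuming (this is not stated in the thread) that the descriptor table had 1,048,576 slots (2^20) and that each slot costs two sigprocmask calls at cleanup, as described earlier in the thread. The helper names are hypothetical.

```c
/* Calls expected if every descriptor slot costs a fixed number of
 * sigprocmask() calls at shutdown. */
static unsigned long expected_mask_calls(unsigned long fd_slots,
                                         unsigned long calls_per_slot)
{
    return fd_slots * calls_per_slot;
}

/* Calls in the strace summary that the per-slot model does not explain. */
static unsigned long unexplained_calls(unsigned long observed,
                                       unsigned long fd_slots,
                                       unsigned long calls_per_slot)
{
    return observed - expected_mask_calls(fd_slots, calls_per_slot);
}
```

Under those assumptions, expected_mask_calls(1048576, 2) gives 2,097,152, which leaves 12 of the 2,097,164 observed calls to other code paths.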

Feh commented 2 years ago

Looks like I’m on the right track.

Before:

$ sudo perf stat -r 20 -- ./nocache /bin/true

 Performance counter stats for './nocache /bin/true' (20 runs):

        579.631886      task-clock (msec)         #    0.765 CPUs utilized            ( +-  8.93% )
                 9      context-switches          #    0.015 K/sec                    ( +-  9.88% )
                 3      cpu-migrations            #    0.005 K/sec                    ( +-  4.18% )
            20,795      page-faults               #    0.036 M/sec                    ( +-  0.00% )
     1,991,987,812      cycles                    #    3.437 GHz                      ( +-  0.11% )
     1,159,651,884      instructions              #    0.58  insn per cycle           ( +-  0.00% )
       260,550,653      branches                  #  449.511 M/sec                    ( +-  0.19% )
         2,265,922      branch-misses             #    0.87% of all branches          ( +-  0.88% )

       0.757983617 seconds time elapsed                                          ( +-  0.21% )

With a hacky change to make destroy() faster:

$ sudo perf stat -r 20 -- ./nocache /bin/true

 Performance counter stats for './nocache /bin/true' (20 runs):

         76.477428      task-clock (msec)         #    0.572 CPUs utilized            ( +- 16.75% )
                 5      context-switches          #    0.065 K/sec                    ( +-  4.10% )
                 3      cpu-migrations            #    0.037 K/sec                    ( +-  3.28% )
            20,797      page-faults               #    0.272 M/sec                    ( +-  0.00% )
       351,317,150      cycles                    #    4.594 GHz                      ( +-  0.49% )
       460,161,228      instructions              #    1.31  insn per cycle           ( +-  0.01% )
       108,516,760      branches                  # 1418.938 M/sec                    ( +-  0.02% )
            71,482      branch-misses             #    0.07% of all branches          ( +-  0.22% )

       0.133648228 seconds time elapsed                                          ( +-  0.65% )

Feh commented 2 years ago

The other bottleneck is probably the mutex game we’re playing at destroy() time. Keeping track of the max FD observed makes that better.

Before:

$ sudo perf stat -r 20 -- ./nocache /bin/true

 Performance counter stats for './nocache /bin/true' (20 runs):

         70.151545      task-clock (msec)         #    0.523 CPUs utilized            ( +- 17.27% )
                 5      context-switches          #    0.073 K/sec                    ( +-  4.70% )
                 3      cpu-migrations            #    0.041 K/sec                    ( +-  3.84% )
            20,796      page-faults               #    0.296 M/sec                    ( +-  0.00% )
       353,510,541      cycles                    #    5.039 GHz                      ( +-  0.15% )
       459,361,507      instructions              #    1.30  insn per cycle           ( +-  0.01% )
       108,344,797      branches                  # 1544.439 M/sec                    ( +-  0.02% )
            74,239      branch-misses             #    0.07% of all branches          ( +-  2.17% )

       0.134152432 seconds time elapsed                                          ( +-  0.19% )

After:

$ sudo perf stat -r 20 -- ./nocache /bin/true

 Performance counter stats for './nocache /bin/true' (20 runs):

         34.362894      task-clock (msec)         #    0.422 CPUs utilized            ( +- 18.54% )
                 5      context-switches          #    0.146 K/sec                    ( +-  4.10% )
                 3      cpu-migrations            #    0.077 K/sec                    ( +-  4.13% )
            20,796      page-faults               #    0.605 M/sec                    ( +-  0.00% )
       214,371,054      cycles                    #    6.238 GHz                      ( +-  1.35% )
       252,683,994      instructions              #    1.18  insn per cycle           ( +-  0.02% )
        53,811,764      branches                  # 1565.985 M/sec                    ( +-  0.04% )
            71,759      branch-misses             #    0.13% of all branches          ( +-  0.25% )

       0.081439200 seconds time elapsed                                          ( +-  1.30% )
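
The idea of tracking the highest FD observed, mentioned before these measurements, can be sketched roughly as follows. This is a minimal single-threaded illustration, not the actual commit: struct fd_tracker and the tracker_* functions are hypothetical names, and the real library would also have to protect the counter against concurrent updates.

```c
/* Hypothetical tracker: remembers the highest descriptor ever recorded so
 * that shutdown can stop there instead of scanning every slot. */
struct fd_tracker {
    int max_seen;   /* -1 until the first descriptor is recorded */
};

static void tracker_init(struct fd_tracker *t)
{
    t->max_seen = -1;
}

/* Call whenever info is stored for a descriptor. */
static void tracker_note(struct fd_tracker *t, int fd)
{
    if (fd > t->max_seen)
        t->max_seen = fd;
}

/* Number of slots the shutdown loop has to visit: 0..max_seen inclusive,
 * capped at the size of the descriptor table. */
static int tracker_slots_to_scan(const struct fd_tracker *t, int table_size)
{
    int n = t->max_seen + 1;
    return n < table_size ? n : table_size;
}
```

For a trivial command that only ever touches a handful of descriptors, the shutdown loop then visits a handful of slots rather than the whole table.
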
Feh commented 2 years ago

The commits above address the obvious low hanging fruit… I don’t think it’s particularly worth optimizing further.

@idallen let me know if that helps and in particular if the updated README is clear now, thanks!

idallen commented 2 years ago

On Sun, May 01, 2022 at 07:14:17AM -0400, Julius Plenz wrote:

> @idallen let me know if that helps and in particular if the updated README is clear now, thanks!

Thank you for your excellent update to both the code and the documentation.

Some minor suggestions:

  1. In the man page:

Add the tag "(obsolete)" to the end of the NAME:

nocache - don't use Linux page cache on given command (obsolete)

Change the first line "The nocache tool tries to minimize" to "The nocache tool was an early (2012) and now obsolete attempt to minimize".

Be more bold about telling people not to use this command in the man page: Rather than the gentle "For more info, see the README" say explicitly "Do not use this command until you have read the README".

  2. In the README file:

The new cgroup examples are really helpful, especially the systemd ones. Can you add systemd documentation references for further reading?

Make the same changes as in the man page: Add "(obsolete)" to the end of the title "nocache - minimize filesystem caching effects" and change the line "The nocache tool tries to minimize" to "The nocache tool was an early (2012) and now obsolete attempt to minimize".

Thank you for taking the time to do all this.

-- | Ian! D. Allen, BA-Psych, MMath-CompSci @.*** Ottawa CANADA | Home: www.idallen.com Contact Improvisation Dance: www.contactimprov.ca | Former college professor of Free/Libre GNU+Linux @ teaching.idallen.com | Improve democracy www.fairvote.ca and defend digital freedom www.eff.org

idallen commented 2 years ago

Using your example systemd-run --scope --property=MemoryLimit=500M, my backup rsync job died when the OOM killer said it used too much memory.

My Ubuntu 20.04LTS man page for systemd.resource-control(5) says that MemoryLimit is deprecated in favour of MemoryMax, and that MemoryHigh is really the one to use because it doesn't kill your job if it goes over the memory limit:

MemoryHigh=bytes
   Specify the throttling limit on memory usage of the executed
   processes in this unit. Memory usage may go above the limit
   if unavoidable, but the processes are heavily slowed down and
   memory is taken away aggressively in such cases. This is the
   main mechanism to control memory usage of a unit.

MemoryMax=bytes
   Specify the absolute limit on memory usage of the executed
   processes in this unit. If memory usage cannot be contained
   under the limit, out-of-memory killer is invoked inside the
   unit. It is recommended to use MemoryHigh= as the main control
   mechanism and use MemoryMax= as the last line of defense.

Are these really a replacement for what nocache used to do? I don't want to limit the program memory needed by my process, I want to limit the file system cache buffers used.
