DynamoRIO / dynamorio

Dynamic Instrumentation Tool Platform

Add snappy compression to drmemtrace output #5427

Open derekbruening opened 2 years ago

derekbruening commented 2 years ago

Xref https://github.com/DynamoRIO/dynamorio/issues/2001#issue-175552118 where zlib with Z_BEST_SPEED was tried on the memtrace_x86 sample and was faster than writing uncompressed to an HDD but slower than an SSD (which was why it was not committed, though maybe it should have been under an option).

Xref #790 where general write-compressed-file support for all clients is requested.

This issue is about adding snappy compression for the direct output of drmemtrace. Snappy should out-perform zlib Z_BEST_SPEED (though with worse compression) and should be a net win for an SSD as well.

This issue can also cover generalizing this snappy compression into a library and using it by default in the tracing samples.

In addition to improving overall performance, this will also shrink the disk space needed to store raw files for offline traces.

For online traces, this should still be a net win, shrinking the data sent through the pipe. We already have an offline file reader for snappy; we'd need to use that for the online reader as well.
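As a rough sketch of what the direct-output hook could look like (the names here, such as write_compressed_buffer and the fd parameter, are hypothetical and not the actual drmemtrace code), each output buffer would be run through snappy before hitting the raw file; wrapping the bytes in the framed format that the existing snappy file reader expects is a separate step:

```cpp
#include <snappy.h>   // snappy::RawCompress, snappy::MaxCompressedLength

#include <unistd.h>   // write()
#include <cstddef>
#include <vector>

// Hypothetical sketch: compress one drmemtrace output buffer with snappy
// before it goes to the raw file.  A real implementation would reuse a
// persistent per-thread destination buffer rather than allocating each time.
static ssize_t
write_compressed_buffer(int fd, const char *buf, size_t size)
{
    std::vector<char> dst(snappy::MaxCompressedLength(size));
    size_t compressed_size = 0;
    snappy::RawCompress(buf, size, dst.data(), &compressed_size);
    return write(fd, dst.data(), compressed_size);
}
```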

derekbruening commented 2 years ago

I implemented snappy compression on drmemtrace raw files and did a quick perf test on spec2k6 mcf test on my laptop which has an SSD:

$ /usr/bin/time bin64/drrun -t drcachesim -offline -raw_compress none -- ~/spec2006/mcf-test/mcf_base.gcc-64bit ~/spec2006/mcf-test/inp.in 
5.85user 7.31system 0:18.31elapsed 71%CPU (0avgtext+0avgdata 167772maxresident)k
1656inputs+28450072outputs (11major+40110minor)pagefaults 0swaps
$ ls -sh ~/dr/git/build_x64_rel_tests/drmemtrace.mcf_base.gcc-64bit.528525.5144.dir/raw
total 14G
14G drmemtrace.mcf_base.gcc-64bit.528525.7489.raw    0 funclist.log  12K modules.log

$ /usr/bin/time bin64/drrun -t drcachesim -offline -- ~/spec2006/mcf-test/mcf_base.gcc-64bit ~/spec2006/mcf-test/inp.in 
53.32user 2.89system 0:56.77elapsed 99%CPU (0avgtext+0avgdata 167848maxresident)k
0inputs+7528296outputs (0major+40156minor)pagefaults 0swaps
$ ls -sh ~/dr/git/build_x64_rel_tests/drmemtrace.mcf_base.gcc-64bit.527083.0507.dir/raw/
total 3.6G
3.6G drmemtrace.mcf_base.gcc-64bit.527083.2964.raw.sz     0 funclist.log   12K modules.log

Wow, snappy is 3x slower than no compression!

Profiling shows a lot of time computing the CRC checksums:

  Children      Self  Command          Shared Object        Symbol
+   59.88%     0.00%  mcf_base.gcc-64  [unknown]            [k] 0xc203000000000000
+   58.22%    58.20%  mcf_base.gcc-64  libdrmemtrace.so     [.] crc32c            
+   26.37%    26.36%  mcf_base.gcc-64  libsnappy.so.1.1.8   [.] snappy::internal::CompressFragment
+    6.21%     0.00%  mcf_base.gcc-64  [unknown]            [.] 0xc203000000000005
+    5.14%     0.04%  mcf_base.gcc-64  libdynamorio.so      [.] syscall_ready
+    4.88%     0.01%  mcf_base.gcc-64  [unknown]            [.] 0xffffffffbac0008c

Disabling the checksums and omitting them completely from the snappy chunks:

19.83user 2.75system 0:22.60elapsed 99%CPU (0avgtext+0avgdata 168752maxresident)k

That's 2.5x faster than with checksums! Though still 23% slower than no compression, for this SSD.

I also ran on a desktop machine with an SSD and it had similar results.

Since storage destinations vary in their characteristics, I think we should commit this snappy compression as an option.

We should also propose new snappy chunk types with data but no checksum.
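For context on where those checksums live, here is a sketch of the framing-format chunk layout alongside a hypothetical no-checksum chunk type; the crc32c helper signature and the 0x02 type value are assumptions, not part of the spec or the existing code:

```cpp
#include <cstddef>
#include <cstdint>

// Chunk types from the snappy framing format spec (framing_format.txt).
constexpr uint8_t CHUNK_COMPRESSED = 0x00;
// Hypothetical value for a proposed "compressed data, no checksum" chunk;
// a real proposal would need to claim a value from the reserved
// unskippable range (0x02-0x7f).
constexpr uint8_t CHUNK_COMPRESSED_NOCRC = 0x02;

// Assumed helper; the profile above shows drmemtrace already carries one.
uint32_t crc32c(const uint8_t *data, size_t len);

// CRC masking required by the framing format for the checksum field.
static inline uint32_t
mask_crc32c(uint32_t crc)
{
    return ((crc >> 15) | (crc << 17)) + 0xa282ead8u;
}

// Fills a scratch header: 1 byte type + 3-byte little-endian body length
// and, for the standard chunk type, the 4-byte masked CRC-32C of the
// *uncompressed* data.  Returns the number of header bytes to emit before
// the compressed payload; the proposed no-CRC chunk drops the 4 checksum
// bytes and, more importantly, the crc32c work seen in the profile.
static size_t
make_chunk_header(uint8_t hdr[8], uint8_t type, const uint8_t *uncompressed,
                  size_t uncompressed_len, size_t compressed_len)
{
    size_t body_len = compressed_len + (type == CHUNK_COMPRESSED ? 4 : 0);
    hdr[0] = type;
    hdr[1] = body_len & 0xff;
    hdr[2] = (body_len >> 8) & 0xff;
    hdr[3] = (body_len >> 16) & 0xff;
    if (type != CHUNK_COMPRESSED)
        return 4;
    uint32_t crc = mask_crc32c(crc32c(uncompressed, uncompressed_len));
    hdr[4] = crc & 0xff;
    hdr[5] = (crc >> 8) & 0xff;
    hdr[6] = (crc >> 16) & 0xff;
    hdr[7] = (crc >> 24) & 0xff;
    return 8;
}
```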

derekbruening commented 2 years ago

I also implemented zlib compression with Z_BEST_SPEED and ran the same mcf test:

94.48user 1.96system 1:36.81elapsed 99%CPU (0avgtext+0avgdata 169932maxresident)k
0inputs+4852984outputs (0major+3324minor)pagefaults 0swaps
$ ls -sh drmemtrace.mcf_base.gcc-64bit.548605.1183.dir/raw
total 2.4G
2.4G drmemtrace.mcf_base.gcc-64bit.548605.1785.raw.zlib     0 funclist.log   12K modules.log

That's 5.3x slower than no compression, though the output is 5.8x smaller.
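The zlib path boils down to an ordinary deflate stream opened at Z_BEST_SPEED; a minimal single-shot sketch (error handling and buffer sizing simplified, helper name hypothetical):

```cpp
#include <zlib.h>
#include <cstddef>
#include <cstring>

// Sketch: compress one buffer with zlib at Z_BEST_SPEED.
// Returns the number of bytes produced in 'out', or 0 on error
// (including an 'out' buffer that is too small for a single-shot finish).
static size_t
zlib_compress_buffer(const char *in, size_t in_size, char *out, size_t out_capacity)
{
    z_stream strm;
    std::memset(&strm, 0, sizeof(strm));   // default zalloc/zfree/opaque
    if (deflateInit(&strm, Z_BEST_SPEED) != Z_OK)
        return 0;
    strm.next_in = reinterpret_cast<Bytef *>(const_cast<char *>(in));
    strm.avail_in = static_cast<uInt>(in_size);
    strm.next_out = reinterpret_cast<Bytef *>(out);
    strm.avail_out = static_cast<uInt>(out_capacity);
    int res = deflate(&strm, Z_FINISH);    // single-shot: emit everything now
    size_t produced = out_capacity - strm.avail_out;
    deflateEnd(&strm);
    return (res == Z_STREAM_END) ? produced : 0;
}
```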

derekbruening commented 2 years ago

A complication with snappy is that it allocates heap but does not parameterize its allocator. This makes it infeasible to use when statically linking drmemtrace into the application. Replacing the global default operator new could possibly solve this for some situations, but not for uses where drmemtrace is linked into large C++ applications. Plus, it is generally considered unwise as it impacts other uses of the heap: in fact, when I tried it, the debug build worked, but release crashed due to what looks like droption calls to delete where the corresponding call to new was not replaced (despite my trying all the array and nothrow varieties; I did not try the new aligned ones).
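For reference, the global-replacement experiment looks roughly like the sketch below; backing_alloc/backing_free are hypothetical stand-ins for whatever private heap the statically linked client would use (which typically needs the size at free time, hence the header):

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical backing heap; a statically linked drmemtrace would forward
// these to a private allocator rather than the application's malloc/free.
static void *backing_alloc(std::size_t sz) { return std::malloc(sz); }
static void backing_free(void *p, std::size_t /*sz*/) { std::free(p); }

static constexpr std::size_t HEADER = alignof(std::max_align_t);

void *
operator new(std::size_t sz)
{
    std::size_t total = sz + HEADER;
    void *block = backing_alloc(total);
    if (block == nullptr)
        throw std::bad_alloc();
    *static_cast<std::size_t *>(block) = total;   // remember size for delete
    return static_cast<char *>(block) + HEADER;
}

void
operator delete(void *p) noexcept
{
    if (p == nullptr)
        return;
    void *block = static_cast<char *>(p) - HEADER;
    backing_free(block, *static_cast<std::size_t *>(block));
}

// The array, nothrow, sized, and aligned variants all need matching
// replacements; missing one of them is a plausible cause of the
// mismatched new/delete crash described above.
```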

derekbruening commented 2 years ago

That could be another proposed change to libsnappy: parameterize the allocator.

derekbruening commented 2 years ago

I did more substantial evaluations of snappy, snappy_nocrc, zlib, gzip, and lz4 on drmemtrace offline raw files during live tracing on SPEC2017 CPUINT. I ran the test datasets, since anything larger, for most of the benchmarks, won't fit on my local disk. Some (gcc, xalancbmk) have test workloads that are too small and probably should have been run with train. I was unable to run perlbench (looking into it), so it is omitted here. I couldn't fit x264's uncompressed trace, so I don't have that number.

Generally, my findings are:

From this, I think we should enable -raw_compress lz4 by default as it is a win nearly all the time even on an SSD, and the size reduction in the output makes a big difference, even when only 5x as opposed to more like 10x for zlib.

We can still lobby for snappy_nocrc to be official, and it seems worth keeping multiple compression options in the code as there are different use cases that prioritize each.
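For illustration, a per-buffer lz4 compression step using the simple block API might look like the sketch below; whether the actual -raw_compress lz4 implementation uses the block or the frame API is not shown here, and the helper name is hypothetical:

```cpp
#include <lz4.h>
#include <vector>

// Sketch: compress one raw-trace buffer with lz4's block API.
// Returns the compressed bytes, or an empty vector on failure.
static std::vector<char>
lz4_compress_buffer(const char *in, int in_size)
{
    std::vector<char> out(LZ4_compressBound(in_size));
    int written =
        LZ4_compress_default(in, out.data(), in_size, static_cast<int>(out.size()));
    if (written <= 0)
        return {};
    out.resize(written);
    return out;
}
```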

Some of the raw data -- here is the time of lz4 / snappy_nocrc:

      502.gcc_r  1.00 =    2.6 /   2.6
      505.mcf_r  1.11 =  202.6 / 181.9
  520.omnetpp_r  0.84 =   69.5 /  82.6
523.xalancbmk_r  0.95 =    5.7 /   6.0
     525.x264_r  0.95 =  606.1 / 640.0
531.deepsjeng_r  0.94 =  225.4 / 240.6
    541.leela_r  0.87 =  136.1 / 156.7
548.exchange2_r  0.70 =   89.7 / 128.3
       557.xz_r  0.90 =  137.6 / 152.4
Ave 0.917778

So slower on mcf but faster on everything else, by 8% on average. Size ratios:

      502.gcc_r  0.68 =    8.8 /  13.0
      505.mcf_r  1.00 =   28.0 /  28.0
  520.omnetpp_r  0.68 =    8.9 /  13.0
523.xalancbmk_r  0.88 =  516.0 / 587.0
     525.x264_r  0.75 =   79.0 / 105.0
531.deepsjeng_r  0.76 =   29.0 /  38.0
    541.leela_r  0.81 =   21.0 /  26.0
548.exchange2_r  0.46 =    8.8 /  19.0
       557.xz_r  0.86 =  846.0 / 989.0  <showing just largest xz>
Ave 0.758889

Time of lz4 vs none:

      502.gcc_r  1.02 =    2.6 /   2.5
      505.mcf_r  0.57 =  202.6 / 356.6
  520.omnetpp_r  0.94 =   69.5 /  73.6
523.xalancbmk_r  1.37 =    5.7 /   4.2
531.deepsjeng_r  1.24 =  225.4 / 181.3
    541.leela_r  0.81 =  136.1 / 168.9
548.exchange2_r  0.53 =   89.7 / 168.8
       557.xz_r  0.67 =  137.6 / 206.7
Ave 0.89375

Time of zlib vs snappy_nocrc:

$ paste /tmp/y /tmp/x | sed 's/=/ /g;s/,//g;' | awk '{printf "%15s  %3.2f = %6.1f /%6.1f\n", $3, $9/$29, $9, $29}'
      502.gcc_r  1.10 =    2.8 /   2.6
      505.mcf_r  4.35 =  791.4 / 181.9
  520.omnetpp_r  3.79 =  313.2 /  82.6
523.xalancbmk_r  2.87 =   17.2 /   6.0
     525.x264_r  4.33 = 2773.2 / 640.0
531.deepsjeng_r  4.15 =  998.8 / 240.6
    541.leela_r  3.91 =  612.6 / 156.7
548.exchange2_r  4.46 =  572.7 / 128.3
       557.xz_r  4.18 =  637.1 / 152.4
Ave 3.68222

Time of gzip vs snappy_nocrc:

      502.gcc_r  1.12 =    2.9 /   2.6
      505.mcf_r  5.19 =  943.7 / 181.9
  520.omnetpp_r  4.41 =  364.6 /  82.6
523.xalancbmk_r  3.20 =   19.2 /   6.0
     525.x264_r  5.06 = 3239.2 / 640.0
531.deepsjeng_r  4.80 = 1154.9 / 240.6
    541.leela_r  4.66 =  729.5 / 156.7
548.exchange2_r  5.63 =  721.7 / 128.3
       557.xz_r  5.21 =  793.1 / 152.4
Ave 4.36444

Size of gzip vs snappy_nocrc:

      502.gcc_r  0.50 =    6.5 /  13.0
      505.mcf_r  0.64 =   18.0 /  28.0
  520.omnetpp_r  0.52 =    6.8 /  13.0
523.xalancbmk_r  0.61 =  361.0 / 587.0
     525.x264_r  0.61 =   64.0 / 105.0
531.deepsjeng_r  0.58 =   22.0 /  38.0
    541.leela_r  0.54 =   14.0 /  26.0
548.exchange2_r  0.51 =    9.7 /  19.0
       557.xz_r  0.57 =  906.0 /1600.0  <showing just one of xz>
Ave 0.564444

derekbruening commented 2 years ago

lz4, like snappy, uses malloc without parameterizing its allocator. zlib does parameterize its allocator.
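To spell out what "parameterize" means here: z_stream lets the caller supply its own allocation callbacks, so zlib's internal allocations can be redirected to a private heap. A sketch, with the callback bodies being placeholders:

```cpp
#include <zlib.h>
#include <cstdlib>
#include <cstring>

// Hypothetical allocation callbacks; a statically linked drmemtrace would
// forward these to a private heap instead of the application's malloc.
static voidpf
client_zalloc(voidpf opaque, uInt items, uInt size)
{
    (void)opaque;
    return std::calloc(items, size);
}

static void
client_zfree(voidpf opaque, voidpf address)
{
    (void)opaque;
    std::free(address);
}

// Every internal zlib allocation for this stream now goes through the
// caller-supplied callbacks -- the hook that snappy and lz4 do not offer.
static bool
init_deflate_with_private_heap(z_stream *strm)
{
    std::memset(strm, 0, sizeof(*strm));
    strm->zalloc = client_zalloc;
    strm->zfree = client_zfree;
    strm->opaque = nullptr;
    return deflateInit(strm, Z_BEST_SPEED) == Z_OK;
}
```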

derekbruening commented 2 years ago

Using Hadoop Snappy is an alternative to using the regular framed Snappy with a no-checksum addition: https://github.com/kubo/snzip#hadoop-snappy-format