Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent
GNU General Public License v3.0

Just curious, why CRC64? #33

Open nefelim4ag opened 6 years ago

nefelim4ag commented 6 years ago

Just curious, why CRC64? AFAIK, CRC without HW acceleration is not the fastest checksum, and as far as I can see in the code, the CRC64 is implemented entirely in software.

Thanks.

Zygo commented 6 years ago

Why CRC64? It was a familiar algorithm that produced 64 bits of output.

Not the fastest, but perf says that other parts of bees take significantly more CPU time, so I'm working on those instead.

The hash must only be large enough that the collision rate (identical hashes for distinct data) is <=0.1% with 32TB of 4K blocks, i.e. a filesystem with 32TB of unique data will spend no more than 0.1% of the time comparing data from hash collisions. If the hash is any larger than the minimum size, it requires more hash table space, but the speed gain is at most 0.1%. Spinning disks require a few more bits to offset the high cost of random seeking.

A collision-free 50-bit hash (the sum of 33 bits for unique data blocks, 10 more bits to have a 0.1% collision rate, and 7 more bits assuming one disk seek for a false positive match is worth 128 linear reads) meets these requirements; however, odd-sized hashes don't align nicely on 64-bit platforms, and there are fewer than 16 bits unused from the original CRC64, so I just use CRC64 instead of doing all the messy bitshifting to pack 50-bit hashes into the hash table.
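A quick back-of-envelope check of that 50-bit figure (my own sketch, not bees code; the 32TB filesystem size, the 0.1% collision rate, and the one-seek-per-128-linear-reads trade-off are the assumptions stated above):

    // Hash-size estimate: bits to index unique blocks + bits for the target
    // collision rate + bits to offset the cost of a seek on a false positive.
    #include <cmath>
    #include <cstdio>

    int main() {
        const double unique_blocks  = 32.0 * (1ULL << 40) / 4096;   // 32TB of 4K blocks = 2^33
        const double bits_blocks    = std::log2(unique_blocks);     // 33
        const double bits_rate      = std::log2(1.0 / 0.001);       // ~10 for a 0.1% collision rate
        const double bits_seek_cost = std::log2(128.0);             // 7: one seek ~ 128 linear reads
        std::printf("%.0f + %.0f + %.0f ~= %.0f bits\n", bits_blocks, bits_rate,
                    bits_seek_cost, bits_blocks + bits_rate + bits_seek_cost);
        return 0;
    }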

Other than the above, there are no requirements on the hash at all. Any algorithm could be used (though any change in algorithm requires building a new hash table).

Do you have a better candidate?

nefelim4ag commented 6 years ago

As btrfs makes a bit-for-bit comparison, I don't really care about collisions themselves, only about raw performance and CPU usage.

I use valgrind/callgrind to profile bees (output attached):

valgrind --tool=callgrind --dump-instr=yes  ./bin/bees /mnt/

As I see it, CRC64 uses 18% of the run time; the difference from your numbers may be because this was the first run on that FS.

As for suggestions (I use pycrc to generate a table-driven CRC64): AFAIK the best software hash algorithms in terms of speed for 4K pages are xxhash and spooky; I used some of my old code to generate these numbers:

PAGE_SIZE: 4096, loop count: 1048576
jhash2:         0xb76eea9b              time: 1893 ms,  th: 2268.461853 MiB/s
crc32hw:        0x80673f86              time: 933 ms,   th: 4600.436264 MiB/s
xxhash32:       0xcc7cd5a9              time: 714 ms,   th: 6008.241350 MiB/s
crc64:          0x5863dbf1e3ac9dec      time: 2586 ms,  th: 1660.480494 MiB/s
xxhash64:       0x697a77c7e08f65ae      time: 363 ms,   th: 11821.086822 MiB/s
Spooky64:       0xd2e9745b2eb2f7a6      time: 342 ms,   th: 12552.768194 MiB/s
Spooky128:      0xd2e9745b2eb2f7a69807d8f800a0c490      time: 342 ms,   th: 12535.365749 MiB/s
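For reference, the 64-bit xxhash call is a one-liner (a sketch assuming the upstream xxHash library and its XXH64() function, not bees code):

    #include <xxhash.h>   // upstream xxHash library
    #include <cstddef>
    #include <cstdint>

    // Hash one 4K block; the 64-bit result is what would go into the hash table.
    uint64_t hash_block(const void *block, size_t len = 4096) {
        return XXH64(block, len, 0 /* seed */);
    }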

Thanks.

bees_callgrind.gz

kakra commented 6 years ago

See also https://github.com/Zygo/bees/issues/31

It was discussed before that bees could sacrifice granularity, going from 4k to 8k, by using the btrfs CRC values directly from the csum tree; previous measurements showed that blocks below 12k don't add much to the space savings while adding a lot to the time needed for deduplication. Since csum tree hashes are 32-bit, the hashes of two consecutive blocks could be concatenated to still obtain 64-bit hashes. Such an algorithm would keep the 4k offset granularity by using overlapping 8k blocks. But there is also a plan to support nodatasum files, which obviously have no entries in the csum tree. This is where crc32hw or xxhash32 could jump in, applying the same granularity logic.
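A minimal sketch of that overlapping-window idea (my illustration, not an agreed design; it assumes csums[] already holds the 32-bit csum-tree values of consecutive 4K blocks):

    #include <cstdint>
    #include <vector>

    // Build one 64-bit hash per 8K window by concatenating the csums of two
    // consecutive 4K blocks; windows overlap, so 4K offset granularity is kept.
    std::vector<uint64_t> hash_8k_windows(const std::vector<uint32_t> &csums) {
        std::vector<uint64_t> out;
        for (size_t i = 0; i + 1 < csums.size(); ++i) {
            out.push_back((uint64_t(csums[i]) << 32) | csums[i + 1]);
        }
        return out;
    }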

I'm not sure if crc32hw and the csum hashes from btrfs are compatible. It's not a requirement, since nodatasum and datasum files cannot be deduplicated with each other, but having compatible hashes could still be nice.

I was asked to implement usage of the csum tree hashes, and I'm still planning to do so. But first I'd like to get comfortable with the code a little more. And I'd like to prepare for nodatasum deduplication also before going to csum tree hashes.

With that in mind, maybe the switch to 8k blocks composed of two 32bit hashes should be implemented first?

I'm not sure if it would be wise to use two different crawler implementations for datasum and nodatasum, i.e. use the 8k granularity csum tree approach for datasum files and a 4k granularity approach for nodatasum files. Both have their benefits: the csum tree will clearly provide much better runtime and use less storage bandwidth, while the 4k granularity can get you higher deduplication rates. But the resulting code may be more difficult to maintain. I think @Zygo has better insight into these trade-offs.

Zygo commented 6 years ago

Now that we have an option parser, there should really be an option for this.

Using the csum tree has a severe negative impact on the ability to find several classes of duplicates, but it could be orders of magnitude faster on everything else. The csum tree stores checksums of data after it has been compressed, so it will never match duplicate data that is uncompressed, compressed with a different compression algorithm, part of a larger or smaller extent, differently positioned within an extent, or adjacent to different data. 4K extents and the 4K blocks at both ends of an extent are special cases that match only other similar special cases.

On the other hand, the total portion of data affected by all the above restrictions may only be around 10% of a typical user's filesystem (the other 90% being small single-extent files that match exactly and large uncompressed files where csum-tree scans work without all the above issues).

I don't see any way forward for bees that doesn't have both the brute-force and csum-tree scanners. The nodatasum case requires the brute-force hash implementation, so we can't get rid of it, and the csum-tree scanner offers a massive speedup, so we'd like to add it.

The crawler itself is the same in either case. The crawler just makes a list of "new" extents and submits them to scan. Different crawler implementations do that in different orders, sequentially or in parallel. The scanner function determines how to process each extent it is given, and it can easily decide whether to use csum-tree or brute-force algorithms based on the extent ref and inode data already available during the crawl.

The option comes in to choose whether to fall back to brute-force scan (using the same CRC32 algorithm that btrfs uses in the csum tree) in the cases where csum-tree scan is not usable, i.e. the compression, extent-layout, and nodatasum cases described above.

The option(s) would need to allow users to select those cases individually, e.g. a "fastest" mode disables all of them at the expense of size, a "smallest" mode enables all of them at the expense of speed, "compressed && !nodatasum" is the common case when performance of nodatasum workloads is important, and "compressed && nodatasum" is the common case when size is more important than performance.

kakra commented 6 years ago

@Zygo wrote:

The csum tree stores checksums of data after it has been compressed, so it will never match duplicate data that is uncompressed, compressed with a different compression algorithm, part of a larger or smaller extent, differently positioned within an extent, or adjacent to different data. 4K extents and the 4K blocks at both ends of an extent are special cases that match only other similar special cases.

This makes me wonder if btrfs csums are per block or per extent?

Zygo commented 6 years ago

csums are per on-disk block (which is not the same thing as a block read through the POSIX filesystem interface). The items in the csum tree are not extent-aligned, i.e. there will be a single item in the csum tree covering blocks from multiple adjacent extents.

This makes all the data within a compressed extent unavailable if any bit in the compressed extent is flipped, even if the data in the earliest blocks of the extent could still be correctly decoded.

It also means that there will have to be backward searches on the csum tree to line its data up with the extent tree, since an extent bytenr could refer to the middle of a csum item. Alas, TREE_SEARCH_V2 is not very good at backward iteration.
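For illustration, here is roughly what a csum-tree lookup for a given disk bytenr looks like through TREE_SEARCH_V2 (my own sketch, not bees code; it assumes crc32c csums, i.e. 4 bytes per 4K block, and a buffer big enough for the returned items). The forward-only search starting at bytenr is exactly where the problem shows up: an item whose key offset is lower than bytenr but which still covers it will not be returned.

    #include <linux/btrfs.h>
    #include <linux/btrfs_tree.h>
    #include <sys/ioctl.h>
    #include <cstdio>
    #include <vector>

    // fd is any open fd on the filesystem; TREE_SEARCH requires CAP_SYS_ADMIN.
    int dump_csum_items(int fd, __u64 bytenr) {
        std::vector<char> buf(sizeof(btrfs_ioctl_search_args_v2) + 65536, 0);
        auto *args = reinterpret_cast<btrfs_ioctl_search_args_v2 *>(buf.data());
        args->key.tree_id      = BTRFS_CSUM_TREE_OBJECTID;
        args->key.min_objectid = BTRFS_EXTENT_CSUM_OBJECTID;
        args->key.max_objectid = BTRFS_EXTENT_CSUM_OBJECTID;
        args->key.min_type     = BTRFS_EXTENT_CSUM_KEY;
        args->key.max_type     = BTRFS_EXTENT_CSUM_KEY;
        args->key.min_offset   = bytenr;   // forward search: an item starting before bytenr is missed
        args->key.max_offset   = (__u64)-1;
        args->key.max_transid  = (__u64)-1;
        args->key.nr_items     = 16;
        args->buf_size         = 65536;
        if (ioctl(fd, BTRFS_IOC_TREE_SEARCH_V2, args) < 0)
            return -1;
        // Each result is a search header followed by the item data: an array of
        // 4-byte crc32c values, one per 4K block, starting at the item's key offset.
        char *p = reinterpret_cast<char *>(args->buf);
        for (__u32 i = 0; i < args->key.nr_items; ++i) {
            auto *sh = reinterpret_cast<btrfs_ioctl_search_header *>(p);
            std::printf("csum item at bytenr %llu covers %u 4K blocks\n",
                        (unsigned long long)sh->offset, sh->len / 4);
            p += sizeof(*sh) + sh->len;
        }
        return 0;
    }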

Massimo-B commented 5 years ago

Hi Zygo, on #btrfs@freenode you mentioned your promising experiments with cityhash, could you add these results here?

Zygo commented 5 years ago

cityhash > xxhash > crc64, saving both processing time and disk space.

2xCRC32C (like the function dduper uses) is still the decisive winner for hashing speed because that function is compatible with scanning the btrfs csum tree instead of the data blocks. It is not trivial to use that function in bees though--first, bees needs to be able to handle an 8K block size, and there are issues with compression to be worked out (it looks like dduper doesn't handle compression at all).

There are also some smaller gains from reducing the number of entries in each hash bucket, with the worst value being the current 256 and the best value being...1. So all that random-drop LRU logic in BeesHashTable seems to be slow and useless. I have no idea why making the bucket size smaller (or eliminating buckets entirely) is so consistently helpful, but I'm not opposed to drastically simplifying the BeesHashTable code to take advantage of it (assuming the other test cases show similar improvements).
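For what it's worth, "bucket size 1" is just a direct-mapped table; a generic illustration of the idea (my sketch, not the actual BeesHashTable code), where a new entry unconditionally evicts whatever the slot held:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Entry { uint64_t hash = 0; uint64_t addr = 0; };

    // Direct-mapped table: each hash maps to exactly one slot, no per-bucket LRU.
    struct DirectMappedTable {
        std::vector<Entry> slots;
        explicit DirectMappedTable(size_t n) : slots(n) {}

        void insert(uint64_t hash, uint64_t addr) {
            slots[hash % slots.size()] = Entry{hash, addr};   // evict unconditionally
        }
        bool lookup(uint64_t hash, uint64_t *addr) const {
            const Entry &e = slots[hash % slots.size()];
            if (e.hash != hash) return false;
            *addr = e.addr;
            return true;
        }
    };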

I was expecting to save a little CPU time (most of the bees time is still spent in IO) but I was surprised by the space saving from the combination of these changes (77 GB for crc64 + bucket size 256, 84 GB for cityhash + bucket size 1). My guess is that changing hash algorithm helps because CRC64 hashes are less equally distributed than those of xxhash or cityhash, so some buckets overflow prematurely while others are underutilized.

All of these improvements require incompatible hash table format changes, so first I have to figure out how to manage those (probably something like https://github.com/Zygo/bees/issues/44#issuecomment-425142347, or add a separate hash table header file).

Here is a graph hot off my test machine, using 153GB of VM image files, mounted with compress-force=zstd where noted.

[graph: 00-all-df-summary]

The "b128" and similar numbers are hash table bucket sizes using cityhash and https://github.com/Zygo/bees/commit/d4b3836493213fa47c1c583102dc6fb0a86be930. The two long lines to the right are last week's HEAD revision. I also plotted duperemove and dduper (on compressed and uncompressed filesystems) for scale.