Open lemire opened 6 years ago
@lemire
did you have any thoughts about how to implement this efficiently? I haven't been able to come up with anything better than serializing and dumping to a hasher.
I played around with using iteration (including only iterating over lower bits) but that looks like it has more overhead right now than serialization. It seems like https://github.com/RoaringBitmap/roaring/issues/94 would help with this approach especially if it worked like ManyIntInterable
and provided ints in strides equal to what the hashing implementation expects.
@bpot
Maybe we should define the problem better. What do you think that the user API interface should be? Just a single function Checksum
with no parameter, or do you want to pass a hashing function?
I'm not @bpot but I think a simple function like Checksum
would be acceptable, if it returns the same content independent of the order (which always occurs because bitmaps are always "ordered") and optimizations applied. It would be even better if the returned number is something simple, like a uint64
or static byte array, so we can map roaring bitmaps by the checksum (I use this to avoid reduce the number of bitmaps on my project from ~1000000 to ~55000, which is quite significant, I think).
I tried two approaches on a project I'm working on that uses this package: Using WriteTo
to write to a customizable hasher (by default it's FNV 128a, already present on Go's standard library), or getting a iterator, serializing the numbers using binary.LittleEndian.PutUint32
and writing that byte slice to fnv
hasher. I feel the second approach is better because it's already quite independent of the roaring bitmap optimizations, and it avoids most of the internal buffering (which was already improved by #197, but I think it could be improved more) of the WriteTo
method, while suffering from some problems like those reported on #94, which today reports quite a big number of allocations on my project:
Type: alloc_objects
Time: Jan 11, 2019 at 1:04pm (-02)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 10
Showing nodes accounting for 67694584, 86.00% of 78717670 total
Dropped 224 nodes (cum <= 393588)
Showing top 10 nodes out of 80
flat flat% sum% cum cum%
18368351 23.33% 23.33% 36704635 46.63% github.com/RoaringBitmap/roaring.newIntIterator
18154685 23.06% 46.40% 18154685 23.06% github.com/RoaringBitmap/roaring.(*arrayContainer).getShortIterator
@fjorgemota You can probably easily avoid the iterator by delegating the production of a hash value to the containers. It would be faster and generate fewer allocations.
Yeah, that would be a good idea.
Is there any plan to ship this feature?
@azurenake there is no open pull request for that feature. Feel free to submit one.
It is not very hard and we have plenty of people who could review and help. It is a good issue for someone who has some time and wants to contribute.
Is there a way to do this in two steps? It would be interesting to first hash only the keys, which should be faster, and only if that hash matches hash the containers.
@Oppen A CheckSum typically returns one value. You could imagine having several different checksums... from less expensive to more expensive ones.
Of course. What I meant is if there's a way to use a several-checksums approach for keys in maps and the like. It wouldn't help a lot now that I think of it, tho, since storing a new key always requires computing both in the end.
Based on @fjorgemota suggested approach I made the draft PR #347. I still need to write tests and benchmarks, but there's an implementation to get the idea.
Added simple test and benchmarks:
$ go test -bench Checksum -run -
goos: linux
goarch: amd64
pkg: github.com/RoaringBitmap/roaring
cpu: Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
BenchmarkChecksum/checksum-1-8 1000000000 0.0000004 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-compressed-1-8 1000000000 0.0000003 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-2-8 1000000000 0.0000003 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-compressed-2-8 1000000000 0.0000003 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-4-8 1000000000 0.0000003 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-compressed-4-8 1000000000 0.0000003 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-8-8 1000000000 0.0000004 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-compressed-8-8 1000000000 0.0000005 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-16-8 1000000000 0.0000005 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-compressed-16-8 1000000000 0.0000005 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-32-8 1000000000 0.0000006 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-compressed-32-8 1000000000 0.0000005 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-64-8 1000000000 0.0000006 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-compressed-64-8 1000000000 0.0000005 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-128-8 1000000000 0.0000010 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-compressed-128-8 1000000000 0.0000008 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-256-8 1000000000 0.0000011 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-compressed-256-8 1000000000 0.0000010 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-512-8 1000000000 0.0000016 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-compressed-512-8 1000000000 0.0000019 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-1024-8 1000000000 0.0000028 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-compressed-1024-8 1000000000 0.0000029 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-2048-8 1000000000 0.0000051 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-compressed-2048-8 1000000000 0.0000056 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-4096-8 1000000000 0.0000100 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-compressed-4096-8 1000000000 0.0000104 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-8192-8 1000000000 0.0000196 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-compressed-8192-8 1000000000 0.0000201 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-16384-8 1000000000 0.0000395 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-compressed-16384-8 1000000000 0.0000388 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-32768-8 1000000000 0.0000765 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-compressed-32768-8 1000000000 0.0000774 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-65536-8 1000000000 0.0001477 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-compressed-65536-8 1000000000 0.0001500 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-131072-8 1000000000 0.0001505 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-compressed-131072-8 1000000000 0.0001542 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-262144-8 1000000000 0.0001815 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-compressed-262144-8 1000000000 0.0001547 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-524288-8 1000000000 0.0001551 ns/op 0 B/op 0 allocs/op
BenchmarkChecksum/checksum-compressed-524288-8 1000000000 0.0001554 ns/op 0 B/op 0 allocs/op
PASS
ok github.com/RoaringBitmap/roaring 0.150s
@lemire if you think those are OK I'll squash and make the PR a ready one.
@Oppen Looks good to me!!!
We need a Checksum method that is dependent on the logical contents of the bitmap (i.e. is the same regardless of run optimization).
cc @bpot