beling / bsuccinct-rs

Rust libraries and programs focused on succinct data structures
Apache License 2.0

Performance with large datasets #2

Closed. vigna closed this issue 10 months ago.

vigna commented 1 year ago

We're trying to build an MPH for a dataset of 25B strings in the format swh:1:cnt:8cff7880b408c38231b723b4c568367912a33cb2 (about 1TB of data) from the Software Heritage graph. Previously we were using GOVMinimalPerfectHashFunction from Sux4J, which would construct the map in about 5h. However, after 15h we had to kill the ph process as it was still unfinished. Can you guide us to the best parameters/way of calling the lib to build such an MPH? Or is such a long construction time typical?

beling commented 1 year ago

Although FMPH (fmph::Function) and FMPHGO (fmph::GOFunction) can be built without holding the keys in RAM, I have never optimized or tested ph for such a scenario, so I cannot say how long it should take. The build time of FMPHGO also depends on the parameters. Therefore, to start with, I suggest trying to build an FMPH (fmph::Function), which can generally be built faster than an FMPHGO. You should certainly try caching the keys, but to hint at how best to do this, may I ask how much RAM you have?
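For reference, basic in-memory construction follows the `fmph::Function::from(&keys[..])` pattern from the ph README; a minimal sketch with two toy keys (not a recommendation for 25B keys as-is):

```rust
use ph::fmph;

fn main() {
    // With enough RAM, the whole key set can simply be held in a Vec.
    let keys: Vec<String> = vec![
        "swh:1:cnt:8cff7880b408c38231b723b4c568367912a33cb2".to_string(),
        "swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2".to_string(),
    ];
    let f = fmph::Function::from(keys.as_slice());

    // Each input key is mapped to a distinct value in 0..keys.len();
    // the result for keys outside the input set is unspecified.
    for k in &keys {
        assert!(f.get(k).unwrap() < keys.len() as u64);
    }
}
```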

PS. Congratulations on the great results of your research. I am a big fan of your papers.

vigna commented 1 year ago

> The build time of FMPHGO also depends on the parameters. Therefore, to start with, I suggest trying to build an FMPH (fmph::Function), which can generally be built faster than an FMPHGO. You should certainly try caching the keys, but to hint at how best to do this, may I ask how much RAM you have?

2T, but for several reasons we don't want to use it all. We'll try (actually, I'm not doing this, so I have to ask what they're using).

> PS. Congratulations on the great results of your research. I am a big fan of your papers.

😊

beling commented 1 year ago

With so much memory it should be possible to build in memory, even directly using my mphf_benchmark application. In that case, please try passing the keys on standard input (one key per line) and use the parameters `-s stdin -n -t multi fmph -l 100`. It is also possible to replace `-l 100` with, say, `-l 200` to speed up the build at the expense of size.

beling commented 1 year ago

When using the ph API, it is important to use a build configuration with a large `cache_threshold`, preferably `usize::MAX`. Using multiple threads can also be helpful. In addition, different trade-offs between the size and build time of FMPHGO can be achieved with different `goconf` values.

See:
- https://docs.rs/ph/latest/ph/fmph/struct.BuildConf.html
- https://docs.rs/ph/latest/ph/fmph/struct.GOBuildConf.html
- https://docs.rs/ph/latest/ph/fmph/struct.GOConf.html
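A sketch of this configuration: `cache_threshold` and `relative_level_size` are named in this thread, while the multi-threading field name and the `with_conf` constructor are assumptions to be checked against the BuildConf docs linked above.

```rust
use ph::fmph;

fn main() {
    let keys: Vec<String> = (0..1000).map(|i| format!("key{i}")).collect();

    let mut conf = fmph::BuildConf::default();
    conf.cache_threshold = usize::MAX; // always cache the 64-bit key hashes
    conf.relative_level_size = 100;    // the -l parameter; 200 trades size for build speed
    conf.use_multiple_threads = true;  // assumed name for the multi-threaded build switch
    let f = fmph::Function::with_conf(keys, conf); // constructor name assumed

    assert!(f.get(&"key0".to_string()).unwrap() < 1000);
}
```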

vigna commented 1 year ago

We tried with the non-GO version, but we're currently at 24 hours of wall time and 414 hours of CPU time, and still running. Memory usage is lower: 1.4TB (so barely more than the input loaded in memory). What construction time per key should we expect for the non-GO version?

beling commented 1 year ago

I should start with this remark: there must be no duplicates in the set of keys; otherwise the process will never end.
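A duplicate scan is cheap insurance against a build that never terminates; a minimal std-only sketch (for 25B keys, the set would not fit in RAM directly, so one would hash-partition the input first and check each partition separately):

```rust
use std::collections::HashSet;

/// Returns the first duplicated key, if any.
fn find_duplicate<'a>(keys: impl IntoIterator<Item = &'a str>) -> Option<&'a str> {
    let mut seen = HashSet::new();
    // `insert` returns false when the key was already present.
    keys.into_iter().find(|k| !seen.insert(*k))
}

fn main() {
    assert_eq!(find_duplicate(["a", "b", "a"]), Some("a"));
    assert_eq!(find_duplicate(["a", "b", "c"]), None);
}
```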

The low memory consumption (1.4TB) is in line with my expectations.

My benchmarks for long (71.5 bytes/key) strings are here: https://dl.acm.org/doi/pdf/10.1145/3596453 on page 12, and they show about 10^-7 seconds/key on a fairly slow home computer, which gives less than 2h for 50*10^9 keys (5*10^10 keys * 10^-7 s/key = 5000 s, about 1.4h). Although the construction algorithm has linear time complexity, in practice the construction time per key increases with the size of the data due to CPU cache effects. I am not able to predict the effect for such huge data. The default key access method used in mphf_benchmark is unfortunately not optimized for large data sets in terms of cache friendliness.

You can try:

vigna commented 1 year ago

Ok, we'll check this. One suggestion tho: if you can upper-bound the probability of failure by something abysmally small, it's a good idea to stop and print a failure message. For example, all our MWHC stuff dies with an error message after three hypergraphs turn out to be cyclic, because the probability of that happening is negligible: there must be a duplicated key.

vigna commented 1 year ago

The input-reading code on our side was broken; we are fixing that. Using mphf_benchmark works like a charm.

One possibility to reduce the memory footprint would be to compute and store in memory a 128-bit signature of each key, and then do the passes on the 128-bit signatures. There would be an additional rehashing round, but it shouldn't impact construction time too much if done properly. In fact, it might make lookup faster: when you find a conflict, you wouldn't rehash the whole key, just the signature.

Maybe you are already doing this internally tho.

beling commented 1 year ago

Good to hear. How long did the construction take? In the meantime, I am working on your suggestion to interrupt construction with an error when duplicates are detected with near certainty.

Yes, I am familiar with the 128-bit hash technique from your paper on RecSplit. FMPH(GO) does something similar internally, but it computes 64-bit hashes and has to do so about 2 times per key on average. These hashes are optionally cached (which takes 64 bits/key). In the absence of caching, the hashes have to be computed more than 2 times/key, but then the build overhead drops to as low as about 7 bits/key (including output).

PS. Your research seems very interesting to me. I wonder if you would like to include me in a collaboration on this or some future project on a similar topic.

vigna commented 1 year ago

- non-GO: 3 hours and 8.7GB
- default_smallest GO: 13h and 6.6GB
- default_bigger GO: 12.5h and 6.6GB
- default_biggest GO: 12.4h and 6.6GB

So construction is faster than GOV in Java for the non-GO version, but slower for the smaller variants.

But you must be doing something different from RecSplit, because with 64-bit hashes you'll have collisions after about 5 billion keys (by the birthday bound, a collision among n uniform 64-bit hashes becomes likely around n ≈ √(2^64) ≈ 4*10^9).

I guess you solve hash collisions and slot collisions in the same way: you move the colliding keys to the next block and change the hash function, so with a bit of luck the collision won't recur.

But this does not change the fact that you have to keep the input in memory, or scan the input several times. The point of the 128-bit idea is to read the input exactly once and then forget about it (the probability of a collision with 128-bit hashes is negligible) by storing only the hashes in memory. I understand that for 64-bit integer keys this is actually worse than the input itself, but for string keys it works very well.

beling commented 1 year ago

Yes, of course you are right. I even wrote about this in my paper (in Sections 5.2 and 8). In fact, I get collisions for far fewer than 5 billion keys, because I reduce the hashes to a level size that is much smaller than 2^64.

The mphf_benchmark cannot build without keeping all keys in memory, but the ph library already can. There are several possibilities:

1. It can scan the keys several times, potentially copying a fraction of them to memory when their number falls below a given threshold (the number of keys decreases at a geometric rate); see https://docs.rs/ph/latest/ph/fmph/keyset/enum.CachedKeySet.html#method.dynamic
2. (the solution you suggest) Because a hash algorithm can be specified for FMPH(GO), you can replace the keys with 128-bit hashes and build the FMPH(GO) for these hashes, using a custom hash function (optimized for assigning a 64-bit hash to seed + 128-bit hash) and disabling ph's internal hash caching. There is no high-level support for this in ph, but it is still possible by writing a wrapper made up of literally a few lines of straightforward code (sketched below, after the next paragraph).

Note that I have not yet tested any of these solutions.
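The second option might begin roughly as follows. This is an untested sketch: it uses std's DefaultHasher only as a stand-in for a proper fast seeded hash, and it builds over the signatures with ph's default hasher rather than the custom seed + 128-bit-hash function described above, which would be the further refinement.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use ph::fmph;

/// 128-bit signature built from two differently seeded 64-bit hashes.
/// With 128 bits, signature collisions are negligible even for 25*10^9 keys.
fn sig128(key: &str) -> u128 {
    let mut lo = DefaultHasher::new();
    let mut hi = DefaultHasher::new();
    0u8.hash(&mut lo); key.hash(&mut lo); // "seed" 0
    1u8.hash(&mut hi); key.hash(&mut hi); // "seed" 1
    ((hi.finish() as u128) << 64) | lo.finish() as u128
}

fn main() {
    // One pass over the input keeps only 16 bytes per key in RAM.
    let keys = ["swh:1:cnt:aaa", "swh:1:cnt:bbb", "swh:1:cnt:ccc"];
    let sigs: Vec<u128> = keys.iter().map(|k| sig128(k)).collect();

    // Build the MPHF over the signatures instead of the original keys.
    let f = fmph::Function::from(sigs.as_slice());

    // Lookup: recompute the signature and query with it.
    assert!(f.get(&sig128(keys[0])).unwrap() < keys.len() as u64);
}
```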

vigna commented 1 year ago

> Yes, of course you are right. I even wrote about this in my paper (in Sections 5.2 and 8). In fact, I get collisions for far fewer than 5 billion keys, because I reduce the hashes to a level size that is much smaller than 2^64.

That estimate (5B) is only for the hash collision, not for the slot collision, of course.

> 1. It can scan the keys several times, potentially copying a fraction of them to memory when their number falls below a given threshold (the number of keys decreases at a geometric rate); see https://docs.rs/ph/latest/ph/fmph/keyset/enum.CachedKeySet.html#method.dynamic

Ok. We'll have a look. It would be a good idea to have examples of this in the documentation. Maybe `CachedKeySet` + `DynamicKeySet` is all we need.

beling commented 1 year ago

Ok. I'll revisit and test the code related to CachedKeySet and DynamicKeySet (in fact, I haven't actually tested it). I think that if the size of the MPHF is not a priority, it is best to use large levels (e.g. relative_level_size=200), so that the number of keys is reduced quickly and you can copy the keys into memory just after the first level (this is especially true for FMPHGO, particularly the default_smallest and default configurations). There is a chance that you would then only need to scan the keys from disk twice: the first time to build the level, the second to copy the surviving keys (probably less than half of all keys) into memory. However, an additional scan may be needed to find the number of keys (if this number cannot be determined another way). Note also that FMPHGO has a default configuration which probably offers the best trade-off.
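The multi-pass route might be wired up roughly like this. An untested sketch: the `CachedKeySet::dynamic` argument list, the threshold value, and the `with_conf` call are assumptions to be checked against the docs linked earlier in this thread.

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};
use ph::fmph::{self, keyset::CachedKeySet};

fn main() {
    // Re-read the key file on every pass; once fewer than CLONE_THRESHOLD
    // keys remain to be processed, clone them into RAM and stop re-scanning.
    const CLONE_THRESHOLD: usize = 500_000_000; // assumed: keys that fit comfortably in RAM
    let key_source = || {
        BufReader::new(File::open("keys.txt").expect("open keys"))
            .lines()
            .map(|l| l.expect("read line"))
    };
    let keys = CachedKeySet::dynamic(key_source, CLONE_THRESHOLD);
    let _f = fmph::Function::with_conf(keys, fmph::BuildConf::default());
}
```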

I will reply to your email once I have deeply thought about its content.

beling commented 1 year ago

> - default_smallest GO: 13h and 6.6GB
> - default_bigger GO: 12.5h and 6.6GB
> - default_biggest GO: 12.4h and 6.6GB

I have looked at these results and find them hard to believe. The differences in time are far too small, and no difference in size at all is impossible. I believe the algorithm was run 3 times with (almost?) the same parameters. Don't the default_* designations refer to the configurations included in ph?

vigna commented 1 year ago

Yeah, they should. Let me check.

beling commented 1 year ago

Setting `relative_level_size=200` will speed up (relative to 100) the reduction in the number of keys, and therefore the build and evaluation time, at the expense of structure size. For `relative_level_size=200`, the expected fraction of keys that will collide (and be re-processed) and the final size of the structure are:

| variant | colliding keys | size |
|---|---|---|
| FMPH | 39.3% | 3.4 bits/key |
| FMPHGO biggest | 26.5% | 3.15 bits/key |
| FMPHGO bigger | 20.7% | 2.92 bits/key |
| FMPHGO default | 10.4% | 2.86 bits/key |
| FMPHGO smallest | 5.7% | 2.72 bits/key |

beling commented 1 year ago

> Ok. We'll have a look. It would be a good idea to have examples of this in the documentation.

Done. https://docs.rs/ph/latest/ph/fmph/keyset/enum.CachedKeySet.html#method.dynamic

beling commented 1 year ago

> Ok. I'll revisit and test the code related to CachedKeySet and DynamicKeySet (in fact, I haven't actually tested it).

Done. This code looks OK. I have made a few minor adjustments. Still, small improvements seem possible, but nothing critical.

vigna commented 1 year ago

> Setting `relative_level_size=200` will speed up (relative to 100) the reduction in the number of keys, and therefore the build and evaluation time, at the expense of structure size. For `relative_level_size=200`, the expected fraction of keys that will collide (and be re-processed) and the final size of the structure are: FMPH: 39.3%, 3.4 bits/key; FMPHGO biggest: 26.5%, 3.15 bits/key; FMPHGO bigger: 20.7%, 2.92 bits/key; FMPHGO default: 10.4%, 2.86 bits/key; FMPHGO smallest: 5.7%, 2.72 bits/key

I'm a bit confused. I thought that the fraction of colliding keys would be smaller for a structure with a larger footprint. I mean, more space, fewer collisions. What am I missing?

beling commented 1 year ago

> I'm a bit confused. I thought that the fraction of colliding keys would be smaller for a structure with a larger footprint. I mean, more space, fewer collisions. What am I missing?

In FMPHGO, it is possible to reduce the number of collisions and the size at the same time, but at the expense of build speed (more seeds to examine). `relative_level_size` is a parameter that allows the number of collisions to be reduced at the expense of size. It applies to both FMPH and FMPHGO, and is independent of the other FMPHGO parameters.

beling commented 10 months ago

I believe the issue has been resolved. See also the related issue #3.