gchq / Gaffer

A large-scale entity and relation database supporting aggregation of properties
Apache License 2.0

Collaboration with DataSketches.github.io #834

Closed leerho closed 7 years ago

leerho commented 7 years ago

@p013570

I can see that your Gaffer team is using a number of sketches from our library. Excellent. We also get a number of queries about use of sketches in graph databases. We can certainly point to Gaffer, but I think a closer collaboration would be beneficial. We have a number of new sketches in development, for example the VarOpt sketch for superior accuracy for weighted "reservoir" sampling. I would like to get some feedback on your interest in new sketch types as well as questions or comments on usability or documentation of our library. If you are interested, you can start a discussion on https://groups.google.com/forum/#!forum/sketches-user, or add an issue to https://github.com/DataSketches/sketches-core/issues. Thanks.

gaffer01 commented 7 years ago

Thanks for reaching out to us. We would be very interested to learn of new sketches that are in development - is there a list of these available anywhere? We would be keen to implement serialisers and aggregators for them so that they can be used as properties on nodes and edges (implementing these serialisers and aggregators is generally just a case of wrapping some methods from your library in classes that implement Gaffer interfaces).

We will also support using HLLSketch from your library, once this pull request has been merged and a version released: https://github.com/DataSketches/sketches-core/pull/142.

leerho commented 7 years ago

@gaffer01

Thanks for the quick reply. There is another sketch already in the library that we are using actively at Yahoo that you might be interested in using: The HllMap. See our website for more info. It is strictly for real-time tracking of unique IDs for many millions of keys. We didn't have the need to ser-de, but if that is important, let us know.

In addition to VarOpt, we have been considering PCA, K-means, and possibly wavelets, but generally moving into more vector-based sketching. We are a tiny team and prioritization is largely based on feedback and sound use cases (and published theoretical proofs of error bounds :) ) So your participation in our emerging community would be appreciated.

The PR DataSketches/sketches-core#142 is being actively worked on.

gaffer01 commented 7 years ago

Thanks. The HllMap looks interesting. A typical use of HLL in Gaffer is to store a HLL on each node in the graph, which allows for a quick estimate of the degree of a node. The only limit on the number of nodes is the limit of the underlying storage technology (e.g. Accumulo or HBase). The comments on HllMap say that it is optimised for many millions of keys - does it have any advantages if the number of keys is small, e.g. if there was a need to store several (<5) HLLs on each node, would HllMap offer a storage advantage over just storing a map from key to HLL?

Experimenting with UniqueCountMap, it seems that the length of the key must be fixed? Is that an absolute requirement? In practice this could be a bit limiting.

As the documentation for HllMap notes, for datasets where the number of values per key follows a power law, the storage cost is dominated by the large number of keys which have a small number of values. It is crucial for the serialisation of a HLL which is almost empty to be very efficient. HllSketch seems to serialise to 73 bytes even if nothing has been added to it. Is there any way to improve that? One option might be to have an object which accepts items but internally only creates a HLL once a certain number of items have been added, e.g. if only one string of length 10 has been added then it would be more efficient to store that string than the HLL. If more items are added later then internally a HLL could be created and the previous items added to that.
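The wrapper idea could be sketched roughly as follows. This is hypothetical code, not Gaffer's implementation: the real HllSketch is modelled by a plain Consumer so the example is self-contained, and the class and method names are made up.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical "lazy" wrapper: hold raw items in a small set while the count
// is tiny, and only replay them into a real HLL sketch (modelled here as a
// Consumer) once a threshold is crossed.
class LazyUniques {
    private final int threshold;
    private final Consumer<String> hll;   // stand-in for an HllSketch
    private Set<String> buffer = new HashSet<>();

    LazyUniques(int threshold, Consumer<String> hll) {
        this.threshold = threshold;
        this.hll = hll;
    }

    void update(String item) {
        if (buffer == null) { hll.accept(item); return; }
        buffer.add(item);
        if (buffer.size() > threshold) {
            buffer.forEach(hll);   // replay the buffered items into the sketch
            buffer = null;         // from now on, updates go straight to the HLL
        }
    }

    boolean usingHll() { return buffer == null; }
    int exactCount()   { return buffer == null ? -1 : buffer.size(); }
}
```

While the buffer is live, serialising the handful of raw items directly would be smaller than a 73-byte HLL image; once promoted, the wrapper serialises the sketch instead.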

leerho commented 7 years ago

@gaffer01

The comments on HLLMap say that it is optimized for many millions of keys - does it have any advantages if the number of keys is small, e.g. if there was a need to store several (<5) HLLs on each node, would HLLMap offer a storage advantage over just storing a map from key to HLL?

The short answer is no. Don't use the HllMap for small key counts.

I don't pretend to understand all of your use case specifics, but allow me to imagine a scenario where the HllMap might make sense: Suppose you have a potentially large graph with 100M nodes (perhaps IP addresses or Geo Where-on-earth IDs) and unique (user or edge) identifiers, which can be associated dynamically with a node in real time. You would like to be able to monitor, also in real time, the number of unique (user or edge) IDs that are associated with each node.

Instead of having a sketch at each node, you have a central server (which receives all the incoming traffic) which would contain a single HllMap. Assume, for simplicity, that an identifier is a fixed-length number (like an IP address or a geo-identifier). The HllMap is updated dynamically with each incoming node ID and user ID. Since it is a map, you can query the HllMap at any time and get the unique count of user IDs for any node.

Now it is also possible to scan the entire map (in a single pass) at any time using a Quantiles Sketch and obtain the PMF or CDF distribution of Users per node. You could also scan the map with the Frequent Items Sketch and get the list of IDs of the heavy-hitter nodes.

The economy of storage comes with the assumption that the distribution of edges (or users) per node is very skewed (as is usually the case with human traffic) and that the overhead of the machinery of the sketch is amortized because it is all in one instance. Such an HllMap sketch would typically be large (on the order of a GB or so), but when you compute the cost per node it is far smaller than having a sketch per node. In our applications with 100M nodes and our traffic distributions, after subtracting the storage required for the keys, the storage allocated for unique counting per node is about 10 bytes, and the accuracy (RSE) (using this specific HllMap) is about 2.6%, which was more than good enough for our application.
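A quick back-of-envelope check of those figures (the 4-byte key is an assumption, e.g. an IPv4 address; the ~10 bytes/node for unique counting is the figure quoted above):

```java
// Rough storage budget for a single central HllMap over many nodes.
class HllMapBudget {
    static long totalBytes(long nodes, int keyBytes, int hllBytesPerNode) {
        return nodes * (keyBytes + hllBytesPerNode);
    }
}
```

For 100M nodes this gives 100e6 * (4 + 10) = 1.4e9 bytes, i.e. roughly 1.3 GiB, consistent with "on the order of a GB".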

it seems that the length of the key must be fixed.

Yes it is true. The HllMap was designed for fixed length keys (identifiers), which was critical for efficient map storage and speed. You could use strings converted to bytes, assuming the string keys have a maximum length. Having arbitrary length keys would make the entire HllMap much much larger and much slower.

HllSketch seems to serialize to 73 bytes even if nothing has been added to it.

The HllSketch has a number of configuration options, but let me look into it. I think it can be even smaller, but let me get back here with some specifics.

leerho commented 7 years ago

@gaffer01

One more comment:

It is crucial for the serialization of a HLL which is almost empty to be very efficient.

Absolutely. And this is how all of our sketches have been designed.

All of our sketches have a "warm-up" mode where the sketch starts out very small and grows larger only as required by the data. An empty Theta Sketch is only 8 bytes stored.

Some sketches reach a maximum size that is never exceeded (the Theta, Sampling, and FrequentItems sketches), while others may not have a maximum size (Quantiles and HllMap). The Quantiles Sketch grows logarithmically and for most use cases is much smaller than a Theta Sketch, which has an upper bound in size. The HllMap grows, of course, largely based on the number of keys.

The HllSketch starts out with a tiny hash table of "coupons" and then expands to the full HLL array only when it makes sense. The HllMap also has a "warm-up" strategy that has multiple stages in order to keep the overall storage very efficient, but its start-up size is considerably bigger than that of the other sketches.

leerho commented 7 years ago

@gaffer01

I just noticed in your comment above that you may be assuming that the HllMap internally uses the same HllSketch available in the library. It does not. The form of the HLL sketch used internally has been specially designed for this mapping application. It is a bit too complex to describe here, but of all the different HLL sketches in the library, the one in the HllMap is the most efficient during warm-up (i.e., when the unique counts are very small). For example, early on, each entry adds only the key and 2 bytes for each new unique update.

gaffer01 commented 7 years ago

Thanks for all these comments.

I can see the advantage of HllMap. Being able to derive other sketches from it is really nice as well. Within Gaffer, nodes can have multiple properties, and the underlying store typically serialises these into a value, with the key being the serialised node. To use HllMap within a store we'd need to store some of the properties on the node, and all the HLLs in the HllMap. It's not impossible to do this, but it doesn't fit into the current Gaffer stores at the moment. The HllMap could of course be used in a streaming setting before data reaches the graph.

The HllMap was designed for fixed length keys (identifiers), which was critical for efficient map storage and speed.

The fixed size of the key is understandable. It makes it a little harder to use as we support arbitrary objects as our nodes, so we'd have to hash those into a fixed length byte array to generate the key, ensuring that the array is long enough to avoid collisions, but that's ok.
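The hash-to-fixed-length approach might look like this (a hypothetical helper, not Gaffer code; the class name and 16-byte key length are illustrative assumptions):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical helper: hash an arbitrary node's serialised form and truncate
// to the map's fixed key length. A 16-byte key keeps the collision
// probability negligible for realistic node counts.
class NodeKeyHasher {
    static byte[] fixedLengthKey(byte[] serialisedNode, int keyLen) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(serialisedNode);
            return Arrays.copyOf(digest, keyLen);   // truncate; keyLen <= 32
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);     // SHA-256 is always present
        }
    }
}
```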

All of our sketches have a "warm-up" mode where the sketch starts out very small and grows larger only as required by the data.

Thanks - yes, we noticed that the sketches increase in size slowly, and this is very useful for conserving space.

The HllSketch has a number of configuration options, but let me look into it. I think it can be even smaller

If there was a way of making HllSketch serialise to smaller than 73 bytes when it's close to empty, that would be useful. If not, the idea of creating a wrapper class that stores a set and only creates a HllSketch when the size gets bigger than say 2, may give us all the benefits plus smaller serialisations for sketches that are nearly empty.

of all the different HLL sketches in the library, the one in the HllMap is the most efficient during warmup

If the HllMap is more efficient in space usage for HLLs with a small number of entries, then is it possible that we should use HllMap with a single key in the use-case of estimating the degree of a node, where we expect the majority of nodes to have low degree? (I assume not based on your earlier comment of not using HllMap for small number of keys.)

We didn't have the need to ser-de, but if that is important, let us know.

It would be very useful if HllMap supported SerDe - it makes it much easier to create one from a significant volume of data, experiment with it, save it, return to it later.

leerho commented 7 years ago

@gaffer01

I realized that my references above to HllMap should actually be referring to the enclosing class UniqueCountMap, which incorporates HllMap among other maps. See the code and the Overview. Within our team we refer to the UniqueCountMap as the HllMap sketch, because the key internal algorithm is HLL and it is a map, but the actual class that pulls it all together is UniqueCountMap. Sorry about the confusion, my mistake.

leerho commented 7 years ago

I want to return to your question about space usage and introduce some refinements in our terminology with respect to sketch size and memory space utilization.

Sketch size
Most of our sketches have a configuration parameter k, or NominalEntries, that affects both the memory requirements of the sketch as well as its accuracy. k generally refers to the number of entries that the sketch needs to retain in order to guarantee the related accuracy bounds. The size of an entry, however, depends on the chosen sketch. For the Theta Sketches all update values are hashed to a 64-bit long, so an entry is 8 bytes. For the Quantiles, FrequentItems, and Sampling sketches an item can be an arbitrary object defined by the user, and since these objects are retained by the sketch, the overall memory consumed depends directly on the item size.
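A minimal check of that entry arithmetic (a rough budget for retained Theta entries only, ignoring preamble and hash-table overhead; the k value is an arbitrary example):

```java
// Each retained Theta entry is one 64-bit hash = 8 bytes.
class ThetaEntryBudget {
    static int retainedBytes(int nominalEntries) {
        return nominalEntries * 8;
    }
}
```

So a sketch configured with k = 4096 retains at most about 32 KB of entry data.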

Updatable vs Compact (immutable) Memory Size
Most of the sketches have (at least) two forms of internal data structures. The Updatable form uses either a set of lists or a hash table that expands as required by incoming data and thus requires some extra overhead in space to enable high-performance updating. The Compact form is designed for read-only operations where no more updating of the sketch is anticipated. In this form the extra space overhead required for an updating sketch can be eliminated. Depending on the sketch, and the internal state, the Compact form can be half the size of the Updatable sketch. When multiplied by millions or billions of sketches being stored, a 50% savings in space can be substantial. Some of the sketches allow the user to specify, during serialization, whether the resulting image is in the updatable or compact form.

On-heap vs Off-heap sketches
In large system environments where a single machine can have 500GB RAM and 24 cores, much of that RAM space is often dedicated to memory-mapped files for fast query performance. When the sketch image is stored off-heap it is usually in compact form, data only, which means there is no Java Object overhead included. What remains in the Java heap is a "Direct" Sketch that is just a small shell wrapper that has a pointer to the off-heap sketch image of the data that it uses to query results.

Currently only two of the sketch families, Theta and Quantiles, have off-heap (Direct) forms that actually can be updated off-heap. Some of the other sketches can support operations via a "heapify" operation, which brings the off-heap data on-heap.

Now back to the question about the serialization size of the HllSketch. If an empty HllSketch has been compacted, it will serialize to 9 bytes. If not compacted (i.e., in updatable form), it serializes to 86 bytes. This is because to be updatable it reserves space for 16 potential "coupons" (the temporary form of the HLL entries prior to the full HLL table being created) plus a few other registers required for the update process. The first 11 or 12 entries just fill the potential coupon space, so the space required does not increase. Meanwhile, if compacted, each of these entries adds 4 bytes to the compact form. So the stored, compacted images would have a size of 9, 13, 17, 21, ... bytes until the sketch converts to the full HLL array. Then the size depends on the chosen logK for the sketch and whether the compressed form of the sketch (4-bit buckets) or naive dense form (8-bit buckets) is chosen.
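That compact-size progression can be written down directly (a sketch of the arithmetic described above, not the actual serialization code):

```java
// Compact image: 9 bytes when empty, plus 4 bytes per retained coupon,
// until the sketch converts to the full HLL array.
class CompactHllSizes {
    static int compactBytes(int coupons) {
        return 9 + 4 * coupons;
    }
}
```

So for 0, 1, 2, 3 coupons the compact images are 9, 13, 17, 21 bytes, matching the sequence above.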

I hope this helps.

gaffer01 commented 7 years ago

Thanks again for these comments.

It makes sense that the updateable form of the sketches requires more storage space than the compact form. The challenge for our framework is that it supports aggregation of properties (e.g. merging of two HLL sketches), and that merging can happen a long time after the sketch was last updated. Therefore we need to be able to serialise the sketch, leave it stored in the underlying key-value store for a long period, and then at some point deserialise it, merge the new item in, and then reserialise it. Of course, this is inefficient in comparison to simply leaving the sketch in memory, but that's not always possible.

Once the HllSketch has been compacted, I understand that it is not possible to update it, i.e. add new strings into it. Would it be possible to take the union of 2 sketches though? i.e. to be able to do h3=h1.union(h2) where h1 and h2 are both compacted sketches? It seems that this is not possible now - is there any chance it could be possible in future? If it could then we would get the advantages of the compact serialisation and retain the ability to merge 2 sketches later.

leerho commented 7 years ago

@gaffer01

What you are asking for, the ability to union two compact sketches, is reasonable and is currently possible with the Theta Sketches, but is a bit more complicated with the current HLL sketches.

The HLL sketches were designed several years ago and haven't received much use or attention, primarily because they lack the set flexibility and speed of the Theta sketches, lack good documentation, and are more difficult to use. Largely based on your interest and feedback here, we have decided to rewrite the HLL code, simplify its use, make the API more parallel to the other sketches, and incorporate some more advanced estimators that will make them more accurate as well. The completion of this will probably be a few months out.

Meanwhile, you could help us understand your computational environment in order to make sure your use-case is considered as we trade off the various parameters that determine sketch performance. For example, you mention that you serialize and store your sketches in some secondary key-value store. So what are the performance bottlenecks for sketches in this key-value store: serialization time, or storage space? What about on-heap: update time, merge time, serialization time, or on-heap space? Your choice of HLL leads me to think that serialized storage space is your biggest concern and not update time, merge time or serialization time. Is this correct? I also gather that set expressions are not of much interest, as they are not practical with HLL sketches and are addressed with the Theta Sketches.

gaffer01 commented 7 years ago

@leerho Thanks again for these comments.

When using sketches to estimate the degree of a node, the primary concern is compact serialisation, as there will be one sketch per node (or one per node per time window if the properties are stored keyed on some time window). The creation, update, and serialisation speed, and on-heap space, are of secondary importance. As discussed before, in a power-law graph the majority of nodes have low degree so compact serialisation for the low cardinality case is particularly important. Ideally the sketch should provide accurate estimates for both low cardinalities and high cardinalities. It's worth noting that for large-scale graphs a small sketch size might be chosen at the expense of accuracy, i.e. prioritise small serialised size over the accuracy of the sketch.

If the HLL implementation could be extended so that it allowed merging of the compact sketches, that would make it a lot more useful for this application. Presumably by "advanced estimators" you're referring to the improvements in the HyperLogLog++ paper? Look forward to seeing the improved HLL code in due course - please raise an issue here if you want us to test it.

Set intersections using theta sketches are of interest, and are supported by our framework - theta sketches can be added to nodes or edges. But serialisers and aggregators to allow these to be used have only recently been added so it's not yet clear how widely they will be used. One use case for the theta sketches would be to understand the evolution of the graph over time. For example, maintain a theta sketch of the edges added to the graph for each day, and then perform sketch intersections to understand what proportion of edges added one day had already been added the previous day. This use case only requires one sketch to be stored per day, so serialised space is not important, but update speed would be very important as the sketch would be updated many times.
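The day-over-day question could be stated exactly with sets; at scale, a Theta intersection approximates this same quantity without storing the raw edge IDs. A minimal sketch of the semantics (the edge IDs are made up for illustration):

```java
import java.util.HashSet;
import java.util.Set;

// Exact-set version of the retention query that a Theta sketch
// intersection would approximate at scale.
class RetentionExample {
    static double retained(Set<String> prevDay, Set<String> today) {
        Set<String> overlap = new HashSet<>(today);
        overlap.retainAll(prevDay);                    // edges seen on both days
        return (double) overlap.size() / today.size(); // fraction already seen
    }
}
```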

leerho commented 7 years ago

@gaffer01

This is very helpful insight into your requirements.

compact serialization for the low cardinality case is particularly important.

We agree, and we have taken great pains to make this true.

Ideally the sketch should provide accurate estimates for both low cardinalities and high cardinalities.

For all sketches (HLL, Theta, Frequent Items, Quantiles, Sampling) the accuracy estimates for low numbers of entries are better than when the sketches transition into full estimation mode. In fact, in most cases, the estimation errors are zero up to some threshold, which is a function of the configured k. HLL is more complex as it has multiple estimators based on cardinality zones. The lowest zone (up to a cardinality of about 8 or 16) has practically zero error.

It's worth noting that for large-scale graphs a small sketch size might be chosen at the expense of accuracy, i.e. prioritize small serialized size over the accuracy of the sketch.

Two comments here.

If the HLL implementation could be extended so that it allowed merging of the compact sketches, that would make it a lot more useful for this application.

That is always the plan.

Presumably by "advanced estimators" you're referring to the improvements in the HyperLogLog++ paper?

The HLL++ papers refer to compressing the HLL encoding array down to 4 bits, which can be done several ways. But we haven't found any really good (or correct) implementations out there. We have come up with what we think is superior in many ways to the HLL++ paper, in terms of accuracy, size, and speed. We also leverage the HIP estimator, when appropriate, which was proposed by Edith Cohen a few years ago. We also have developed advanced estimators for the low and intermediate cardinality zones that have not yet been published (coming soon).

One use case for the theta sketches would be to understand the evolution of the graph over time.

This is what we call retention analysis, and we have it in production in many of our systems, where the time interval resolution is often hours, days, weeks or months, chosen by the user.

One application you didn't mention, which we have used to great success, is near-real-time use of sketching. We currently have systems deployed with real-time query results within seconds of real time.

gaffer01 commented 7 years ago

So if merging of mixed k sketches is very common, you may as well choose space or accuracy to determine k, and stick with it for all scales.

Generally, the same k is used for all the sketches in an application.

if your cardinality distribution is power-law, you might consider using the p-sampling option of the Theta Sketches

Thanks - we weren't aware of that option. We should investigate that.

We also have developed advanced estimators for the low and intermediate cardinality zones that has not yet been published (coming soon).

Sounds interesting - look forward to seeing the publication.

gaffer01 commented 7 years ago

I'm going to close this issue now. Thanks for the conversation. Please re-open it if you wish to continue the discussion.

leerho commented 7 years ago

No problem. We are working hard on the new HLL implementation and will let you know when it is ready for you to try out.

leerho commented 7 years ago

@gaffer01

The new HLL code is now in sketches-core/master in the com.yahoo.sketches.hll package. I suggest that you clone & jar it yourself. You will also need to pull in the new memory repository. This is in pre-release form but it should be good for testing.

It is a little short on documentation and we are doing characterization testing on it now.

We would like you to test it out and feed back to us what is good and bad, what works and what doesn't.

Cheers and thank you for volunteering to do this!

gaffer01 commented 7 years ago

@leerho Thanks, we'll take a look soon.

gaffer01 commented 7 years ago

Reopening so that it's easier to see the discussion is on-going.

gaffer01 commented 7 years ago

@leerho We've briefly experimented with the new version of HllSketch. Here are some simple observations - these mostly won't be a surprise to you:

leerho commented 7 years ago

@gaffer01

Thank you! This is really excellent feedback! I will respond to each of your points:

  1. & 2. Yes. Internally, the sketch starts as a simple list of "coupons", then transitions to a hash table of coupons, then at ~ 3/4 * 2^(log2(k) - 3), which is about 10% of k, transitions to the full HLL array, which thereafter is constant in size (except for the 4-bit version, which in very rare cases might need to grow a few percent). The word "coupon" comes from the famous "Coupon Collector Problem", which is very close to how an HLL sketch behaves early on. Our coupons are 4 bytes each.

  2. & 4. Not perfect, but better than anything in the literature. This is our new estimator for the HLL low range. The RSE is less than 50 ppm and is independent of K and independent of the Target HLL Type. BTW, the accuracy for all 3 types is the same for the same LgK, as they are all isomorphic to each other. The accuracy of unioning is also independent of the incoming sketch type, and you can union sketches of different lgK.

  3. No way to deserialize a union. Oops! This is an omission! (This is why this is pre-release :) )

  4. The generally accepted practice for rating a sketch is to measure its accuracy per byte of required storage. So I could argue that comparing a sketch with a lgK = 4 against a sketch of lgK = 7 is a bit misleading and difficult to make sense of. However, I think I understand why you were motivated to do it this way. We can certainly look at extending the low range down to 4, but would you actually use a sketch with an RSE of 25% error? Or more practically, 50% error with 95% confidence?

If you don't mind, the next time you run this, try the two sketches, both with LgK = 7.

  1. The large error at 100K, I can't explain. It might be a bug, but we will definitely look into it.

  2. When feeding strings into the DataSketches library, there is a feature you could leverage: first convert the string to a char[] and feed the char[] to the sketch. This avoids the expensive UTF-8 encoding in Java.

  3. We will fix the javadoc.

  4. You found a bug. Thanks!
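Two of the numbers quoted in these replies can be sanity-checked with a little arithmetic. The lgK values below are arbitrary examples, and 1.04/sqrt(k) is the classic HLL error bound (which the new estimators improve on), not the library's exact figure:

```java
// Back-of-envelope checks for the HLL figures discussed above.
class HllBackOfEnvelope {
    // Coupon-to-HLL transition: ~ 3/4 * 2^(lgK - 3), which is ~10% of k.
    static int transitionPoint(int lgK) {
        return 3 * (1 << (lgK - 3)) / 4;
    }
    // Classic HLL relative standard error, 1.04 / sqrt(2^lgK).
    static double classicRse(int lgK) {
        return 1.04 / Math.sqrt(1 << lgK);
    }
}
```

For lgK = 11, transitionPoint gives 192, about 9.4% of k = 2048. For lgK = 4, classicRse gives 0.26, i.e. the "25%" RSE mentioned above, and roughly double that (~50%) at 95% confidence.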

gaffer01 commented 7 years ago

If you don't mind, the next time you run this, try the two sketches, both with LgK = 7.

I've rerun the comparison between DataSketches and clearspring using HyperLogLogPlus(7) and HllSketch(7, TgtHllType.HLL_4):

So DataSketches is more accurate - by a very large amount for size 200000, by a small amount for size 100000 (but as discussed the 12% error seems anomalous) and by a moderate amount at size 1000000.

would you actually use a sketch with an RSE of 25% error? Or more practically, 50% error with 95% confidence?

I think there are use-cases where the sketch is used to provide an estimate of the rough order-of-magnitude of the size, i.e. it's an indication whether the size is ~100, or ~1000, or ~10000, or.... In this case a 25% error may be acceptable.

First convert the string to a char[] and feed the char[] to the sketch.

This does indeed speed it up so that it's twice as fast as clearspring.

leerho commented 7 years ago

One more question. When running your tests, are you averaging over multiple trials? For example, run 10K trials of 100K uniques, each trial with a different set of uniques.

AlexanderSaydakov commented 7 years ago

12% error seems anomalous

This is well within 95% confidence interval, which is about 18% for HLL(7).
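That bound can be checked with the classic HLL error formula (an approximation; the library's estimators may differ slightly): the RSE for lgK = 7 is 1.04/sqrt(128) ≈ 9.2%, so a two-sigma (~95%) interval is about 18.4%.

```java
// ~95% (two-sigma) confidence interval from the classic HLL RSE bound.
class Hll7Bound {
    static double ci95(int lgK) {
        return 2.0 * 1.04 / Math.sqrt(1 << lgK);
    }
}
```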

gaffer01 commented 7 years ago

@leerho OK, a slightly more intelligent test, averaging over many trials now gives approximately 5-6% error for HllSketch(7, TgtHllType.HLL_4), and this is consistent for sets of size 100000, 200000, 1000000.

@AlexanderSaydakov Thanks - the 12% error seemed anomalous in comparison to the other errors, but as you say it's well within expected bounds.

gaffer01 commented 7 years ago

@leerho For info, PR https://github.com/gchq/Gaffer/pull/986 uses another sketch from your library. The PR contains properties to store sets of timestamps on nodes or edges. One of the properties allows the user to specify a maximum number of timestamps to store; if more than that number are added it uses a ReservoirLongsUnion to maintain a uniform random sample of them.

leerho commented 7 years ago

@gaffer01

Cool, thanks! It really helps us to understand how these are being used.

In the next day or so, we will do a major release, which will include the improvements you suggested for the HLL sketches. Most significantly for HLL, it can now be configured with a lgK down to 4. The Javadocs have been improved and the bug you found fixed.

The scope of this release is large, impacting a number of areas of the library. In the following week we will be updating the web-site, so stay tuned.

leerho commented 7 years ago

@gaffer01

The new sketches-core_0.10.0 has been released to Maven Central. Please see the release notes and the web-site for more documentation (which will be rolling out over the next few days).

gaffer01 commented 7 years ago

@leerho Great, we will check out the new version over the next few days.

gaffer01 commented 7 years ago

@leerho We've looked at the new HLL, and the accuracy seems great. We're working on a serialiser and an aggregator for it so that it can be used within Gaffer graphs (issue https://github.com/gchq/Gaffer/issues/1011).

Here's a summary of the use of sketches from Data Sketches in Gaffer - there are a couple of questions in there to check we're using it in the optimal way.

For any sketch that we use we need to be able to serialise it to a byte array for storage. Later we need to be able to deserialise it (a) back to the sketch to obtain whatever estimates the sketch provides, (b) to merge it with another deserialised sketch. Of course this serialisation and deserialisation can be relatively expensive, but as some of our stores (e.g. the Accumulo and HBase ones) are designed for large-scale storage, it is not always viable to leave the sketches in memory. Streaming jobs can be used to aggregate sketches over a time window before they are serialised and stored, but the requirement to deserialise them later and possibly merge with other sketches of the same type remains.

Sketch: HyperLogLog Typically used to obtain quick estimates of the degree of a vertex (by putting a HyperLogLog sketch on each vertex - the HLL contains all the vertices at other ends of edges from the vertex). The com.yahoo.sketches.hll.Union class is used for this. It is serialised with the toCompactByteArray() method, deserialised with Union.heapify(bytes), and two Unions a and b are merged using a.update(b.getResult()). Question: Would it be more efficient to use the static method unionImpl on Union to merge the two Unions (but this method is not public so presumably it is not intended to be used by external code)?

Sketch: ReservoirItemsUnion Example use-cases: Maintain a random sample of some property on a node or edge. This is used in the BoundedTimestampSet class to maintain a random sample of the timestamps a node or edge was seen active. It could also be used to store a random sample of the neighbours of a vertex on the vertex. Classes ReservoirLongsUnion and ReservoirItemsUnion are used for this. They are serialised with the toByteArray method, deserialised with heapify(WritableMemory.wrap(bytes), SERIALISER) method, and two Unions a and b are merged using a.update(b.getResult()). Question: is this the best way? Should it be done with a.update(Memory.wrap(b.toByteArray()))?

Sketch: Quantiles Example use-cases: Estimate the percentiles of some numerical property that is associated to a vertex or edge. Classes ItemsUnion and DoublesUnion are used for this. They are serialised with union.getResult().toByteArray() method (it seems that there is no toByteArray directly on Union, unlike the reservoir sampling package?), deserialised with DoublesUnion union = DoublesUnion.builder().build(); union.update(WritableMemory.wrap(bytes)); and two Unions a and b are merged using a.update(b.getResult()).

Sketch: Frequencies Example use-case: Estimate the frequency with which a long property is seen on a vertex or edge. Classes ItemsSketch and LongsSketch are used for this. They are serialised with toByteArray(), deserialised with getInstance(WritableMemory.wrap(bytes)) and two sketches a and b are merged using a.merge(b).

Sketch: Theta Example use-case: Estimate the number of distinct edges seen on Jan 1st, the number seen on Jan 2nd and the number in common between those two days. Class com.yahoo.sketches.theta.Union is used for this. It is serialised with union.getResult().toByteArray(), and deserialised with Union union = Sketches.setOperationBuilder().buildUnion(); union.update(WritableMemory.wrap(bytes));, and two sketches a and b are merged using a.update(b.getResult()).
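The Jan 1st / Jan 2nd merge might look roughly like this (assuming sketches-core 0.10.x; the edge ids and class name are invented for the example):

```java
import com.yahoo.memory.WritableMemory;
import com.yahoo.sketches.theta.Sketches;
import com.yahoo.sketches.theta.Union;

public class ThetaDistinctEdges {
    public static void main(String[] args) {
        // Distinct edges seen on each day, keyed by an edge id string.
        Union jan1 = Sketches.setOperationBuilder().buildUnion();
        jan1.update("e1");
        jan1.update("e2");

        Union jan2 = Sketches.setOperationBuilder().buildUnion();
        jan2.update("e2");
        jan2.update("e3");

        // Round-trip jan1 through its stored form, then merge in jan2.
        byte[] bytes = jan1.getResult().toByteArray();
        Union merged = Sketches.setOperationBuilder().buildUnion();
        merged.update(WritableMemory.wrap(bytes));
        merged.update(jan2.getResult());

        // Union of {e1,e2} and {e2,e3}: 3 distinct edges (exact below k entries).
        System.out.println(Math.round(merged.getResult().getEstimate()));
    }
}
```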

leerho commented 7 years ago

@gaffer01

This is great feedback, thank you! I will make some comments and try to answer the questions you have:

For any sketch that we use we need to be able to serialise it to a byte array for storage. Later we need to be able to deserialise it (a) back to the sketch to obtain whatever estimates the sketch provides, (b) to merge it with another deserialised sketch. Of course this serialisation and deserialisation can be relatively expensive, but as some of our stores (e.g. the Accumulo and HBase ones) are designed for large-scale storage, it is not always viable to leave the sketches in memory. Streaming jobs can be used to aggregate sketches over a time window before they are serialised and stored, but the requirement to deserialise them later and possibly merge with other sketches of the same type remains.

Sketch uses after deserialization. You mentioned (a) get estimate and (b) merging, but you didn't mention (c) continue updating the sketch. Do you ever have this 3rd use case? If so, and depending on the sketch, one must pay attention to the serialized form. More on this later if you need it.

Serialization & deserialization costs. What I can't tell from your comments is your requirements on query latency. We also have massive data that has been stored in sketch form (on disk) that goes back several years. Based on the breadth and depth of the query, of course, even in sketch form it can sometimes take a minute (from a Hadoop environment) to bring the relevant sketches (sometimes millions) into memory so they can be merged with more recent sketches. Still, careful profiling of the queries and staging of history allows us to keep more than 95% of the query latencies under a second.

This query latency is also very sensitive to the design of your back-end store. While Hadoop (HBase) offers large distributed storage, for some of our most critical latency requirements we find HBase too slow. For our real-time systems, we use a combination of Storm and Druid. A typical Druid cluster might have 100 nodes, each node with 256–512 GB of RAM and 48 hyperthreads. All of this RAM is used for data, and the JVM, which acts as a supervisor, may only be 8GB. The data in RAM is paged in as required from SSDs. This is where the Memory package from our library comes into play. So far, two of our sketches, Theta and Quantiles, take full advantage of this by being able to operate directly from off-heap memory. (We will be adding a Direct, or off-heap, extension to the HLL sometime soon.) This largely eliminates the need to be constantly serializing and deserializing the sketches. It takes quite a bit of care and engineering at the system level to make this all work, but the real-time query results are truly amazing. In this environment we routinely merge millions of sketches and provide sub-second query latency.

Question: Would it be more efficient to use the static method unionImpl on Union to merge the two Unions (but this method is not public so presumably it is not intended to be used by external code)?

There is no UnionImpl method in the HLL Union. Perhaps you captured a pre-release version that had that method (perhaps I called it that at one point, but now it is called 'gadget'). Nonetheless, the non-public members behind the HLL Union must not be directly accessed by users for good reasons, one of which is that these members can be swapped out entirely during the unioning process, so attempting to keep a link to these private members would be futile. And no, it would not be faster, and even if you could make it work, it could be catastrophically slower.

It seems that there is no toByteArray directly on Union, unlike the reservoir sampling package?

I will talk to my colleague who worked on this and get back to you. We had a good reason, I just don't recall at the moment the details.

The rest of your use-cases seem well thought out.

AlexanderSaydakov commented 7 years ago

Could you help us understand why you are associating union objects instead of sketch objects with your vertices or edges? Unions are typically larger than sketches. In the case of the HLL family, unions are always in 8-bit mode, but sketches can be in 4-bit mode, which is the most compact mode and the default. In the case of the Theta family, a Union contains an UpdateSketch as its internal state. Do you update your vertices and edges? If not, the associated Theta sketch can be in the compact form.

By the way, the Theta family has PairwiseSetOperations if you just want to produce a union of two sketches as opposed to performing a massive union operation on thousands or millions of sketches, for which Unions are designed.

gaffer01 commented 7 years ago

@leerho Thanks.

Sketch uses after deserialization: You mentioned (a) get estimate and (b) merging, but you didn't mention (c) continue updating the sketch. Do you ever have this 3rd use case?

No we don't generally have this 3rd use case - once a sketch has been serialised, it only changes when it is merged with another sketch that has been deserialised (normally as new data for a vertex or edge arrives, it will be serialised and later merged with older versions of that vertex/edge).

In this environment we routinely merge millions of sketches and provide sub-second query latency.

Those figures are impressive. We generally merge a much smaller number of sketches at query time, e.g. merging sketches associated with the same node for different time windows. Here it's only a few tens of sketches that are being merged.

There is no UnionImpl method in the HLL Union.

I was looking at the unionImpl static method on line 199 of https://github.com/DataSketches/sketches-core/blob/sketches-core-0.10.0/src/main/java/com/yahoo/sketches/hll/Union.java, but you've answered my question anyway thanks.

gaffer01 commented 7 years ago

@AlexanderSaydakov Thanks for asking that question (i.e. why do we use Unions not sketches). It's a fundamental question and probably worth revisiting.

Our main requirement is to be able to aggregate two serialised properties together, where the property is either a sketch or a union. As we use a union, to aggregate two of them a and b, we can just do

a.update(b.getResult());

If we were using sketches we would have to do something like:

ReservoirLongsUnion union = ReservoirLongsUnion.newInstance(10);
union.update(a);
union.update(b);
ReservoirLongsSketch merged = union.getResult();

In the first approach, we need to call getResult() once and update once. In the second we need to create a new Union, call update twice and then call getResult().

Although we will often be merging more than two of these together (generally only a few per vertex or edge, not millions), the aggregation API requires the result of the aggregation to be of the same type as the things being aggregated. So when using sketches (2nd code snippet) we would have to repeatedly get the result from the union, and then create another union when the next item for the same vertex/edge needed to be merged.

There was also some issue, which I can't remember in detail, about some of the sketches not being updateable once they have been serialised to a compact form.

If this doesn't make sense, or you think that using Unions is the wrong approach, please let us know.
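To make the trade-off concrete, here is a rough side-by-side of the two aggregation shapes using the reservoir longs classes (assuming sketches-core 0.10.x; the reservoir size and values are illustrative):

```java
import com.yahoo.sketches.sampling.ReservoirLongsSketch;
import com.yahoo.sketches.sampling.ReservoirLongsUnion;

public class AggregationStyles {
    public static void main(String[] args) {
        // Union-as-property: a single update() folds b into a.
        ReservoirLongsUnion a = ReservoirLongsUnion.newInstance(10);
        a.update(1L);
        ReservoirLongsUnion b = ReservoirLongsUnion.newInstance(10);
        b.update(2L);
        a.update(b.getResult());

        // Sketch-as-property: a throwaway union per aggregation step,
        // plus a getResult() call to return to the property's type.
        ReservoirLongsSketch s1 = ReservoirLongsSketch.newInstance(10);
        s1.update(3L);
        ReservoirLongsSketch s2 = ReservoirLongsSketch.newInstance(10);
        s2.update(4L);
        ReservoirLongsUnion tmp = ReservoirLongsUnion.newInstance(10);
        tmp.update(s1);
        tmp.update(s2);
        ReservoirLongsSketch merged = tmp.getResult();

        // Both shapes end up having seen two items.
        System.out.println(a.getResult().getN() + " " + merged.getN());
    }
}
```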

AlexanderSaydakov commented 7 years ago

In the first approach, we need to call getResult() once and update once. In the second we need to create a new Union, call update twice and then call getResult().

The first update of an empty union is simple. Most probably there is no performance penalty compared to the second approach. It seems to me that this perceived simplicity of the first approach doesn't justify the larger serialized blob. Serialized unions are larger (sometimes much larger) than sketches, especially if sketches can be stored in a compact form.

There was also some issue, which I can't remember in detail, about some of the sketches not being updateable once they have been serialised to a compact form.

As I understood from your other comment "once a sketch has been serialised, it only changes when it is merged with another sketch". So converting sketches to a compact form must be desirable.

jmalkin commented 7 years ago

@gaffer01 We discussed adding toByteArray() to quantiles unions and decided we can support that. It'll ultimately just serialize as a sketch, but I should be able to avoid an extra copy due to calling getResult() (not 100% sure for the ItemsSketch, but I think so).

leerho commented 7 years ago

@gaffer01

I was looking at the unionImpl static method on line 199 of https://github.com/DataSketches/sketches-core/blob/sketches-core-0.10.0/src/main/java/com/yahoo/sketches/hll/Union.java, but you've answered my question anyway thanks.

I am red-faced on this one, and I wrote it!

This static method examines the state of the current internal gadget and the incoming sketch and determines the optimum way to perform the union. This may involve swapping, down-sampling, transforming, and / or copying one of the arguments and may completely replace the internals of the union.

The result is always in HLL_8 form, which may not be what the user wants. In general, it is not safe for a user to have a reference to the internal data structures, which could be changed by the user, or the reference may get completely replaced by later union operations.

gaffer01 commented 7 years ago

@AlexanderSaydakov Thanks. We'll write serialisers and aggregators for both the unions and the sketches. Then users can decide which to use, and in slower time we'll do some investigation to see if there are any significant performance differences between the two approaches.

gaffer01 commented 7 years ago

@leerho I've started experimenting with storing the sketches off-heap and am beginning to appreciate the power that offers. We will be interested when the HLLs support off-heap storage.

jmalkin commented 7 years ago

@gaffer01 Not sure when we'll do an official minor release, but in master we've merged the PR so quantiles unions can support toByteArray().

Also, we recently added VarOpt sampling as a new method for weighted sampling. Still working on documentation, but the name is short for "variance optimal" since it computes subset-sum estimates for some predicate over the sampled points with optimal variance. The contrived example we were testing was using a sample over county populations in the US to estimate state populations, so the predicate was the state for which we wanted an estimate.

I'm not sure if it'll be of use, but I imagine things like the number of incoming edges to a node or the sum of incoming edge weights would be fairly obvious values to use for weights.
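The county/state example above might look something like this (a hedged sketch assuming the VarOptItemsSketch API in recent sketches-core releases; the item keys, weights and sample size are made up):

```java
import com.yahoo.sketches.sampling.VarOptItemsSketch;

public class VarOptSubsetSum {
    public static void main(String[] args) {
        // Items weighted by (made-up) county population, keyed "state:county".
        VarOptItemsSketch<String> sketch = VarOptItemsSketch.newInstance(32);
        sketch.update("CA:LosAngeles", 10.0);
        sketch.update("CA:SanDiego", 3.0);
        sketch.update("NY:Kings", 2.0);

        // Subset-sum estimate for one state; exact here since n <= k.
        double caEstimate = sketch
            .estimateSubsetSum(item -> item.startsWith("CA:"))
            .getEstimate();
        System.out.println(Math.round(caEstimate));
    }
}
```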

gaffer01 commented 7 years ago

I'll close this now. Serialisers and aggregators for the new HLL sketches have been written and merged into develop. For most of the sketches from data-sketches that we support, there are serialisers and aggregators for both the sketch form and the union form. We'll update this issue in future if we decide on general guidelines for which to use.

We will continue to watch the development of data-sketches, and will look out for updates that increase the sketches that support off-heap storage.

Thanks for the guidance.

leerho commented 6 years ago

@gaffer01

I want to let you know that our HLL sketch has now been extended to enable full off-heap operation. The javadocs are pretty much complete, but I will be adding more documentation to the DataSketches.github.io website over the next few weeks.

BTW, I will be presenting a paper at the ACM IMC-2017 conference in London, November 1-3.

If you would like, I can come and do a talk for your team on sketching technology while I'm in London. It can be completely customized to your interests. (It doesn't have to be the same talk I will be doing for the IMC.) Please let me know if this would be of interest, and how we can communicate to exchange planning and logistical information that really is not relevant for this issues forum.

Cheers.

gaffer01 commented 6 years ago

@leerho Thanks for the update on the HLL sketch. Look forward to seeing more documentation - more examples of how to use the off-heap approach would be very useful.

Thanks for the offer of the talk. We'll get back to you on that.

leerho commented 6 years ago

@gaffer01

I will be in London and have time available on Tuesday, Oct 31st. I will be at the ACM IMC 2017 conference Wed thru Friday.

I would be interested in doing a talk or just an informal feedback session on features of interest to you. It will be easier to talk about the off-heap stuff in front of a white-board.

You might be interested in a new paper by my colleague Kevin Lang: Back to the Future: an Even More Nearly Optimal Cardinality Estimation Algorithm.

And a recent comparison study I did of our HLL vs the "popular" HLL++ implemented by ClearSpring: HllSketch vs HyperLogLogPlus Sketch.

Thanks, Lee. leerho@gmail.com

leerho commented 6 years ago

@p013570 @gaffer01

Hello, I will be leaving for London in a few days. If it is at all possible to respond to my request above it would be greatly appreciated. Thanks.

gaffer01 commented 6 years ago

@leerho Apologies for the delay in replying - because this issue is closed I hadn't spotted your comments.

Thanks for the link to the paper of Kevin Lang and to the comparison of the Datasketches HLL with the Clearspring HLL++ - very useful.