apache / datasketches-java

A software library of stochastic streaming algorithms, a.k.a. sketches.
https://datasketches.apache.org
Apache License 2.0

Add rough memory size tracking with KllItemsSketch #554

Closed. ZacBlanco closed this 1 month ago.

ZacBlanco commented 2 months ago

When generating KllSketches, systems may need an idea of how much memory a particular sketch is using. For sketches with fixed-width types, the answer can be computed efficiently.

With the KllItemsSketch this is more difficult, because the sketch can hold variable-width types such as String. This commit adds support for a getTotalItemsNumBytes method so that external systems can roughly track the memory utilization of a particular sketch.

The change accomplishes this by intercepting the code where a new item is added to the items array, or where a new array is generated entirely. This adds a slight overhead because the sketch must now compute the length of each input. For fixed-width types the overhead is low; for strings it requires encoding the string as UTF-8/16 before adding it to the array.

For fixed-width types, the calculation has little effective overhead: it is a single array-length lookup multiplied by the type width.
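To illustrate the idea from outside the library, here is a hypothetical external wrapper (not the actual code in this PR, which hooks the internal items array instead and can therefore also account for compactions):

import java.util.Comparator;
import org.apache.datasketches.common.ArrayOfStringsSerDe;
import org.apache.datasketches.kll.KllItemsSketch;

public final class SizeTrackedStringSketch {
    private final ArrayOfStringsSerDe serDe = new ArrayOfStringsSerDe();
    private final KllItemsSketch<String> sketch =
        KllItemsSketch.newHeapInstance(Comparator.naturalOrder(), serDe);
    private long totalItemsNumBytes = 0;

    public void update(final String item) {
        sketch.update(item);
        totalItemsNumBytes += serDe.sizeOf(item); // one UTF-8 encode per update
    }

    // O(1) query; over-counts, because items discarded by compaction are never subtracted
    public long getTotalItemsNumBytes() { return totalItemsNumBytes; }
}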

jmalkin commented 2 months ago

Can you elaborate more on what problem this is trying to solve? Why would getSerializedSizeBytes() not work, for instance, which would have zero speed impact on updates, especially for non-generics? And that would provide a more accurate representation of the space used. This approach slows down all instances and is guaranteed to under-estimate the memory use.

I clearly forgot to bring it up again before the 6.0 release, but we actually removed the serde from the C++ constructors, since there's no real reason it needs to be a class member rather than an input parameter at serialization time, and I wanted to suggest doing the same for Java. Examples where that would be more appropriate: a sketch of some complex item type that is never serialized, in which case no serde is needed at all; or a sketch that gets written in different formats, where serializing with different serdes (say, an older format for backwards compatibility as well as a newer one going forward) is more efficient than creating two sketches with different serdes.

ZacBlanco commented 2 months ago

Hi Jon,

Thanks for your quick response. I think this change adds little overhead except for the ItemsSketch with variable-width types, specifically Strings. I'm willing to iterate on this to find a solution that the community would be willing to accept; I don't expect the PR to be accepted in its current state.

First, I'll try to paint a picture of why I think this change is needed. Next, I'll provide microbenchmark comparisons between the master branch and this PR for a number of operations, to show the PR's performance impact.


Background

The system we are trying to use KLL sketches in is a large distributed database engine. Tables can easily have 100TB+ of data. We're implementing a KLL sketch aggregation function for a variety of different data types. Fixed-width types work great, but we're also trying to implement the function for strings.

Imagine a table with 1T rows, and you run a GROUP BY query on that table where the cardinality of the GROUP BY is somewhere in the range of 100M. The execution engine splits up the work and tries to schedule tasks in a way that satisfies the memory constraints of the cluster. If you have 100M KLL sketches then, depending on the parameters, they can amount to a significant chunk of memory, and the engine may need to move execution to another node based on that information. Hence, the system needs to track roughly how much memory an aggregation is using.

The design of the database engine is such that querying the current estimated size needs to be very fast, which means we need to incrementally track the size of KLL sketches as items are added. In the JVM, exact memory tracking is very expensive, but even a good ballpark number is useful for scheduling these tasks.

In the current iteration of the sketch function implementation, we tried something like this:

memoryUsage -= sketch.getSerializedSizeBytes();
sketch.update(item);
memoryUsage += sketch.getSerializedSizeBytes();

Unfortunately, getSerializedSizeBytes() is so expensive that it slows queries to a crawl for all types in the ItemsSketch; this pattern is over ~100X slower than simply calling sketch.update. One of the main culprits is that for the ItemsSketch we end up doing a potentially large array copy on every call. The call chain is:

https://github.com/apache/datasketches-java/blob/0bff44b13ccf2eb27d2519942016d9d83327af5f/src/main/java/org/apache/datasketches/kll/KllSketch.java#L320

which calls:

https://github.com/apache/datasketches-java/blob/0bff44b13ccf2eb27d2519942016d9d83327af5f/src/main/java/org/apache/datasketches/kll/KllHeapItemsSketch.java#L235

which calls:

https://github.com/apache/datasketches-java/blob/0bff44b13ccf2eb27d2519942016d9d83327af5f/src/main/java/org/apache/datasketches/kll/KllHeapItemsSketch.java#L230

getRetainedItemsByteArr does a very large, unnecessary allocation, which makes querying the current size far too expensive.

I hope the above explanation clears up the motivation behind this PR. Please feel free to ask any questions.

Performance

I wrote some basic microbenchmarks with JMH to test the performance of these methods. The patch with benchmarks can be found here: https://github.com/ZacBlanco/datasketches-java/commit/102e383d72edd044e161c96e3031e28a303b14f5. The benchmarks test the performance of KLL sketches for doubles, floats, items (strings and doubles) for the methods: update, getSerializedSizeBytes, and getTotalItemsNumBytes. The results are below.

[two images: JMH benchmark results comparing master and this PR]

You can see that, aside from update for KllItemsSketch<String>, the performance of virtually every other method is unchanged. Additionally, for the items sketch, the cost of getTotalItemsNumBytes is multiple orders of magnitude lower than the cost of getSerializedSizeBytes() (Doubles: ~615 ns vs ~10 ns; Strings: ~200k ns vs ~9 ns), which solves my initial problem of having a quick way to get the rough memory usage of the ItemsSketch.

Additional Optimizations

I think there are two related, but slightly different optimizations that can be made.

First, in the ArrayOfStringsSerDe, the sizeOf method calculates the size of the byte array using String#getBytes, which copies the String into a newly allocated byte array whose only use is to have its length queried. This is quite slow and causes many more allocations than necessary. I would highly recommend the library adopt a method that calculates string lengths without these extra allocations.

There is an implementation in the Guava library which can probably be copied directly and dropped into the datasketches library without pulling in any dependencies: https://github.com/google/guava/blob/9596c996098272f0ea69cfdf45c7c1fe1ef8406d/guava/src/com/google/common/base/Utf8.java#L49
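For illustration, a minimal allocation-free counter in the same spirit (a simplified sketch that assumes well-formed strings; Guava's Utf8.encodedLength additionally validates unpaired surrogates):

import java.nio.charset.StandardCharsets;

public final class Utf8Length {
    public static int encodedLength(final CharSequence s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            final char c = s.charAt(i);
            if (c < 0x80) { bytes += 1; }                                // ASCII
            else if (c < 0x800) { bytes += 2; }                          // 2-byte sequence
            else if (Character.isHighSurrogate(c)) { bytes += 4; i++; }  // supplementary code point
            else { bytes += 3; }                                         // other BMP chars
        }
        return bytes;
    }

    public static void main(final String[] args) {
        final String s = "héllo, wörld";
        // agrees with the allocating approach, without the byte[] copy
        System.out.println(encodedLength(s));                            // 14
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);   // 14
    }
}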

While I think this improvement has no downside, it alone doesn't make getSerializedSizeBytes fast enough for me to use.

Second, I tested tracking getTotalItemsNumBytes using String#length (the char count) rather than doing a UTF-8 encoding inside ArrayOfStringsSerDe#sizeOf. With that change, the update performance of the Strings sketch was virtually indistinguishable from no size tracking at all. Unfortunately, it means the byte counts would be wrong for strings containing multi-byte UTF-8 characters. Do you think there would be a good way to use String#length for the incremental size tracking?

Thanks for taking the time to read. Please let me know your thoughts.

leerho commented 2 months ago

Hi Zac, Thank you very much for your suggestion! It really helps us understand the issues you face in using our sketches and how we can improve the library!

The first concern that jumps out at me is that your specific problem and the solution you propose is focused on items as Strings. Although I recognize that Strings may be a dominant use case for the items sketches, it is not the only application, as items sketches can be used for arbitrary objects. So I am hesitant to put into the sketch a lot of special code that is only useful for Strings.

That being said, let me ask you some questions and explore some other ideas with you. In your write-up above you said:

...the system needs to track roughly how much memory an aggregation is using.

But first:

--IDEA-- Sketches are statistical animals, so let's try to take advantage of the statistics of your data to get approximate, yet useful, estimates of the quantities you need for your memory planning.

The KLL sketches have a very well defined growth plan that looks something like the following. The red curve is for KllDoublesSketch and the blue curve is for KllFloatsSketch.

[graph: KLL sketch size in bytes vs. n, K=200; red curve: KllDoublesSketch, blue curve: KllFloatsSketch]

Normalizing these two curves by the size of a double (8 bytes) and a float (4 bytes) produces this curve:

[graph: the same two curves normalized by item size (bytes per item) vs. n, K=200]

The message from these graphs is that once KLL sketches reach a certain size they grow very, very slowly. As you can see from the red KllDoublesSketch curve, going from 1M input doubles to 1B input doubles the sketch only grows from 5KB to about 5.7KB, i.e., about 700 bytes. And going from 1B input doubles to 1T input doubles will increase the size by only another 700 bytes, because that final slope remains constant.

Now with the second normalized curve above, and a distribution of item sizes extracted from your actual data, we can create a statistical model that pretty closely estimates the actual memory usage of a KllItemsSketch being fed the same data. This approach could be used no matter what the item is, and it can be done while the KllItemsSketch is being fed data. This might mean having two sketches side by side: A floats sketch tracking the distribution of item sizes, and the actual items sketch tracking the distribution of the items themselves.
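Here is a minimal sketch of that side-by-side pairing, assuming String items and using UTF-8 byte length as the size measure (KllFloatsSketch, KllItemsSketch, and ArrayOfStringsSerDe are our real classes):

import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import org.apache.datasketches.common.ArrayOfStringsSerDe;
import org.apache.datasketches.kll.KllFloatsSketch;
import org.apache.datasketches.kll.KllItemsSketch;

public final class SideBySideSketches {
    final KllItemsSketch<String> items =
        KllItemsSketch.newHeapInstance(Comparator.naturalOrder(), new ArrayOfStringsSerDe());
    final KllFloatsSketch itemSizes = KllFloatsSketch.newHeapInstance();

    void update(final String s) {
        items.update(s);                                               // the data itself
        itemSizes.update(s.getBytes(StandardCharsets.UTF_8).length);  // the item-size distribution
    }

    // e.g., a conservative per-item size for the memory model
    float itemSizeAtRank(final double rank) {
        return itemSizes.getQuantile(rank); // rank = 0.9 gives the 90th-percentile size
    }
}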

Would you be willing to try something like this?

Cheers, Lee.

ZacBlanco commented 2 months ago

Hi Lee, I really appreciate your response. Below are some answers to your questions.

The first concern that jumps out at me is that your specific problem and the solution you propose is focused on items as Strings. Although I recognize that Strings may be a dominant use case for the items sketches, it is not the only application, as items sketches can be used for arbitrary objects. So I am hesitant to put into the sketch a lot of special code that is only useful for Strings.

I realize my initial post mentions mostly strings because they are the harder problem to solve, but you can see from the performance benchmarks that getSerializedSizeBytes is also more than an order of magnitude slower for KllItemsSketch<Double> than for the primitive version, KllDoublesSketch. My initial implementation using KLLs in our database used the items sketch because of the API convenience of generics; however, I didn't expect such a large performance discrepancy. The performance issue was initially identified due to slow aggregations which used KllItemsSketch<Double> and tracked their size through getSerializedSizeBytes.

Replacing the KllItemsSketch<Double> implementation for just double types with KllDoublesSketch is an option for us (and one I am currently working on), but I think there are optimizations to be made for the items sketch.

I would also like to clarify the use case here. I'm coming from an engine-author perspective. I currently work with SQL query optimizers, and we're using KLL sketches as the backing data structure for histogram column statistics on large tables. The main objective of using these sketches is to estimate the selectivity of predicates (=, !=, <, <=, >, >=, etc.) applied to columns on a table, which enables the cost-based optimizer (CBO) to create better query plans.

... how "rough" is this need? +/- 1%, 10%, 50%? Because I think we can come up with a good approximation of the memory usage that may be much simpler than wedging a bunch of code for a specific use case into the sketch itself.

The penalty for wrong estimates is queries that fail to execute due to excessive untracked memory usage hitting heap limits. I'm not a database user, so I can't speak exactly to the variety of workloads, and data distributions depend on customers/users. I would argue an absolute error of <= 5% is acceptable; > 10% would be unacceptable. Overestimates are better than underestimates, as an overestimate would simply shift scheduling around in the cluster. That could cause query starvation under high load, but I think it would be preferable to failing queries.

Do these extrapolations sound about right? If you are not sure, would it be possible for you to characterize your string length, row size, and group size distributions of your raw data?

As I mentioned, I am the engine author, not a consumer, so I can't speak to the data distribution. In my original comment, I meant 1T as 1T rows. The table itself could be 3 columns or 300. For the sake of this discussion, let's say the median string size is 64 bytes for a varchar-typed column, and we're focusing on the computation of a single column.

With a string size of 64 bytes and 1T rows, you're looking at 64TB of data. Across 100M groups that's roughly 640KB of data per group and 10K items per sketch in the group.

If you assume sketches are ~5KiB each (edit: I realized that with strings as the input, the size of the sketch would be larger than this), with 100M groups you're looking at ~500 GiB of memory to store all of the sketches. You could drop the group count by an order of magnitude to 10M and still have ~50 GiB of sketches accumulating in memory. Depending on cluster node size, this can be a good chunk of memory that needs to be accounted for in the JVM, or else we could OOM.

You could probably even scale my original scenario down to 1B rows and 10M groups and the same math above would still apply.

What value of K are you using for your sketches now? And what is your requirement for rank accuracy that led you to choose this value of K?

Currently, the engine defaults to the library default of k = 200. This could be configured by the DB user though.

I assume, although you didn't say exactly, that having an estimate of how much memory your aggregation functions are using is important for memory planning and partitioning, and more important than the fact that they are using memory. Is that correct?

Having an idea of the memory utilization is important to prevent query failures under system load and to ensure good scheduling decisions are made and that queries are not starved of resources when they may actually be available.


I do like your idea of using a separate sketch to track the distribution of string sizes. However, it's not 100% clear to me how to deduce the memory utilization of the sketch from the distribution of sizes. I presume you would need some information relating which items are retained to their sizes, which isn't clear to me because the retained items usually depend on insertion frequency. Either way, more elaboration on that idea would be great.

We are actually currently using a set of linear-regression parameters, computed offline, to estimate the true in-memory size, with a formula like true_size = linear_param * sketch.getSerializedSizeBytes() + constant_param. We calculated the parameters using our own canned test data. The parameters for the String case assume a constant length, which could be far off depending on the user's data. An ideal solution would either build the size-based sketch on the fly or not rely on pre-baked parameters at all. DB users can have vastly different data distributions, which could make the parameters inaccurate in different scenarios; this is why I attempted the solution in this PR.

leerho commented 2 months ago

Zac, Thank you! This is very helpful.

I am not discarding your suggested improvements to the ArrayOfStringsSerDe. The first one is especially important and we will definitely integrate that in the next release.

The second one is more of a demonstration of where the problem is. Of course, you are free to extend these ArrayOf*SerDe classes to suit. Substituting the char[].length for the UTF-8 byte[].length is only valid where ASCII predominates, such as in English-speaking countries, and there are lots of situations where that assumption will not hold.

As to your suggestions in the PR, I will have to study them more closely. Some I think are OK, others I will have to think about.


Back to your description of your use-case. Allow me to try to play back to you what I am gathering from your description:

It is my understanding that it is at this group level where you want to analyze the distribution of items, where you might have a KLL sketch per column, thus many sketches just for this group. And, of course, all the 100M groups would be configured with sketches as well implying a huge number of sketches for the entire query analysis -- thus the concern about memory usage.

Assuming that the above is roughly right, here is where I am a little confused (I am not a DB engine expert by any means!) Above you mentioned:

The main objective of using these sketches is to estimate the selectivity of predicates (=, !=, <, <=, >, >=, etc.) applied to columns on a table, which enables the cost-based optimizer (CBO) to create better query plans.

Note: If you are looking for Heavy-Hitters the Frequent Items sketch might be very useful.

Note: If you are converting these sketches into PMF histograms, I should caution you against using too many split-points for the given K of the sketch, because you can easily exceed the error threshold of the sketch (producing garbage). Our most recent release, 6.0.0, has some extra functions to help you with this.


Given your estimate of an average of 64B per item, this is what the KLL Growth Path might look like:

[graph: KLL growth path for 64-byte items, K=200]

This is obviously much bigger than 5KB! This is just the previous graph, which was normalized to one byte, times 64.

Cheers, Lee.

ZacBlanco commented 2 months ago

"With a string size of 64 bytes and 1T rows, you're looking at 64TB of data. Across 100M groups that's roughly 640KB of data per group and 10K items per sketch in the group." Doesn't this assume you are analyzing only one column?

Yes, that example was just for one column; it depends on the user query whether we're creating sketches for one column or for multiple. I was just using one column to demonstrate a simple memory-usage scenario.

It is my understanding that it is at this group level where you want to analyze the distribution of items, where you might have a KLL sketch per column, thus many sketches just for this group. And, of course, all the 100M groups would be configured with sketches as well implying a huge number of sketches for the entire query analysis -- thus the concern about memory usage.

Yes, this is correct. You could imagine this situation arises when a user writes something like:

SELECT kll_sketch(<column_x>), kll_sketch(<column_y>), ... FROM <table> GROUP BY <column1>, <column2>, ...

and <table> is the hypothetical very large table. I couldn't tell you exactly why a user might write a query like this, but it's something we have to support. We don't want the engine to crash if such a query were submitted.

Can you give me an example of how you use the distribution information from these sketches along with the user chosen predicates to help you with query optimization? What are you looking for in the distributions from these sketches? What decisions can you (or your query planning model ) make from these distributions?

I believe there may be some confusion with how I described uses for the KLL sketches. Let me try to clarify with some more examples.

I used the GROUP BY query above as an example of something the DB user could write. How they utilize the KLL sketch is entirely up to them. But in implementing support for generating KLL sketches for users of the database, we need memory tracking for this scenario. This is why the implementation of the KLL sketch aggregation function had code similar to the following (referencing my initial response on this PR):

memoryUsage -= sketch.getSerializedSizeBytes();
sketch.update(item);
memoryUsage += sketch.getSerializedSizeBytes();

Unfortunately, due to the engine design, we can't easily special-case function implementations for queries with and without GROUP BY statements. If a query doesn't have a GROUP BY, memory isn't as large a concern: most aggregation functions' state rarely exceeds roughly the 1-megabyte range, so poor memory-utilization estimates have a smaller probability of causing issues. However, because the memory-size tracking has to be implemented either way, it caused poor performance even for queries without a GROUP BY clause. Since the root cause of the poor performance was KllItemsSketch#getSerializedSizeBytes, I was motivated to open this PR.

The hypothetical scenario of a large GROUP BY query described above is a little different from how we (the DB engine authors) intend to use the KLL sketches. We aren't doing GROUP BY queries when we try to generate statistics for a table, so the amount of memory required for the aggregation is far lower and less of a concern, but the query runtime performance is. Essentially generating statistics for a table amounts to running a query like

SELECT kll_sketch(<column1>), kll_sketch(<column2>), ... FROM <table>

The resulting sketches from the query are stored in a location where they can be queried later by the optimizer. Suppose a user comes along and writes the query

SELECT * from <table> WHERE <column1> <= 5

The optimizer will analyze the query and generate a "Filter" operator representing <column1> <= 5. The optimizer attempts to estimate the output row count of every node in the SQL query plan. For the "Filter" operator, the optimizer gets the sketch stored for <column1> and uses it to compute the proportion of rows returned by the table scan that satisfy the filter <= 5.
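As a concrete illustration (assuming the stored per-column sketch is a KllDoublesSketch), that proportion is essentially a single rank query:

import org.apache.datasketches.kll.KllDoublesSketch;
import org.apache.datasketches.quantilescommon.QuantileSearchCriteria;

final class FilterEstimate {
    // estimated output rows for the predicate <column1> <= 5
    static double estimateRows(final KllDoublesSketch columnSketch, final long scanRowCount) {
        // getRank(v, INCLUSIVE) is approximately the fraction of values <= v
        final double selectivity = columnSketch.getRank(5.0, QuantileSearchCriteria.INCLUSIVE);
        return selectivity * scanRowCount;
    }
}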

To motivate why we want accurate filter estimates, take the following scenario: a user has a very complex query with multiple (>2) JOIN statements. You can improve query performance by re-arranging the order in which tables are joined, so that tables with smaller result sets are joined earlier. Thus, having an accurate estimate of the number of records that satisfy the filter on a table scan can significantly improve query runtime.

leerho commented 2 months ago

Thanks Zac, this is super helpful. I was not aware that you offer sketches for use by your DB-users. That is wonderful! Now I realize you have two use cases: one for User Queries (UQ), and one for Query Planning and Optimization (QPO) used by the DB developers.

Is it fair to assume that the QPO task is always performed prior to any UQ? In other words, you have the opportunity to learn a lot about the user's tables prior to the user using them.

If this is true, then from the QPO pass, you could have in advance of any UQ the distribution of element sizes of any column the user might select. (I would hope that this kind of table metadata could be constructed as a table is being built, but that is a different discussion.)

As I have shown you in several graphs, the average size of a KLL sketch is quite predictable from k, the configured sketch size; n, the total number of elements fed to the sketch; and s, the size of the elements fed to the sketch. If s is a random variable, then all we need is its CDF. Most importantly, we can establish upper bounds on both of those values. The most conservative approach would be to set n to the number of rows of the entire table and s to the largest element in the column of interest. From that we can pretty accurately (and very quickly) predict what the maximum size of the sketch should be. If we discover from experience that this size is too conservative (large), then from the floats sketch that computes the CDF of sizes we could choose the 90th-percentile size instead of the max size, etc.

In other words, instead of trying to track the size of the sketch before and after each update, which is hugely expensive, can we compute a memory budget for the sketch to grow into and use that for memory planning?
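For example, a rough version of that budget is possible with today's API. getMaxSerializedSizeBytes is a real static helper, but it does not support ITEMS_SKETCH, so this sketch scales the doubles-sketch bound by an assumed per-item size -- the scaling heuristic is mine, not a library guarantee:

import org.apache.datasketches.kll.KllSketch;
import org.apache.datasketches.kll.KllSketch.SketchType;

final class SketchMemoryBudget {
    // upper-bound budget for a sketch fed up to maxRows items of ~itemSizeBytes each
    static long budgetBytes(final int k, final long maxRows, final int itemSizeBytes) {
        final int doublesBytes =
            KllSketch.getMaxSerializedSizeBytes(k, maxRows, SketchType.DOUBLES_SKETCH, false);
        // scale the 8-byte-double payload to the assumed item size (this also scales
        // the small fixed preamble, so it leans conservative for items > 8 bytes)
        return (long) Math.ceil(doublesBytes * (itemSizeBytes / 8.0));
    }
}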

One more question I wanted to ask you is what Java version are you using? We are in the planning stages to move to Java 17 and 21, but we will have to draw a hard line in the sand for Java 17. In other words, at a specific DataSketches Version (perhaps version 7 or 8), Java 17+ will be required. If you will still need Java versions < 17, you will have to use earlier versions of the library. How will this impact you?

Cheers, Lee.

ZacBlanco commented 2 months ago

Is it fair to assume that the QPO task is always performed prior to any UQ? In other words, you have the opportunity to learn a lot about the user's tables prior to the user using them.

It's not necessarily the case that we always have the statistics for QPO before UQ, though we recommend most users spend the resources to do so, as it can make queries execute quicker. It depends on the use case, though. I think we would be okay with a heuristic.

I think your suggestion of calculating a bound for the memory would work for us. Collecting a sketch of data sizes alongside the actual sketch could also be a feasible solution: it would give us memory bounds even when the sketch size is unknown, with less overhead than the current approach. I appreciated you sharing the graphs of the sketch size for a given n. However, I couldn't identify a formula in the KLL paper^1 which corresponds to the types of curves you shared. Could you help point me in the right direction so I can incorporate them?

Given that I can compute a rough bound on the size of the sketch, I think parts of this PR are no longer necessary. However, I do believe the addition of isFixedWidth to the ArrayOfItemsSerDe interface would still benefit KllItemsSketch by improving the performance of getSerializedSizeBytes(). I can modify the PR to include only that part if you agree.

One more question I wanted to ask you is what Java version are you using? We are in the planning stages to move to Java 17 and 21, but we will have to draw a hard line in the sand for Java 17. In other words, at a specific DataSketches Version (perhaps version 7 or 8), Java 17+ will be required. If you will still need Java versions < 17, you will have to use earlier versions of the library. How will this impact you?

Currently our build relies on Java 8, and we are moving to Java 11 soon. Personally, I would like our project to move to more modern versions, but I don't have enough influence to make that happen; it's also a complex decision for us, as many companies use our system. If you move Java versions upward, we would either rely on patch builds of a version before 17, or simply wait until we move to a later Java version to incorporate new versions of the datasketches library.

leerho commented 2 months ago

The part of the ds-java library that is Java-version sensitive is the ds-memory component, which we use for off-heap management, and ds-memory is currently supported on Java 8 and 11. With Java 17 there is no longer a need for our Memory component, as Java 17 incorporates the new Panama FFM capability as a preview (or incubation). The FFM will be far more capable and faster, since it leverages native code and is built right into the language.

With Java 17 for the ds-java library we are also going to break up our library into Java Modules (JPMS) as well as move the build tools to Gradle. This means you will only need to download the specific sketches you need which will be in separate jars.

The Gradle build shouldn't impact you unless you wish to build from source. Also the JPMS shouldn't bother you since you can place all these modular jars on the classpath (instead of the module path) and everything should just work 😄.

WRT the PR: I am still reviewing your code, and I agree that "isFixedWidth" (or "isItemFixedSize", which might be a little clearer) is a good contribution. But you might have a few more nuggets in there.

WRT the graph: you won't find a formula for producing that graph in the paper, but it can be inferred from how the sketch grows, which is discussed in the paper. Building the graph is rather involved and uses code deep in the bowels of the sketch. I have written a utility that produces it for me, but it is pretty much for my own research. In your case, I don't think you want the graph per se; I think what you want is the ability to compute the expected size based on k, n, and item size. That could be an improvement to the current getMaxSerializedSizeBytes(...), making it more general.

leerho commented 2 months ago

The one concept that I want you to take away from our discussion about KLL sketch size is this: When n gets large enough, the internal sketch structure reaches a stable form. This is where you see the curve slope dramatically flatten. From that point on, as n continues to grow the sketch size will increase by adding only 8 items to its retained items for every factor of 2 growth in n! For example, looking at the last graph, the stable form is reached at about 100K 64B items. So if n grows from 100k to 200k items the sketch size increases by about 8 items. -- Or, growing from 1T items to 2T items the sketch increases in size by about 8 items! Thus, planning for a memory space for the sketch to grow in should be straight-forward.
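As a quick back-of-the-envelope using that figure: growing from 100K items to 1T items is log2(1e12 / 1e5) ≈ 23 doublings; at 8 items per doubling that is only about 185 additional retained items, or roughly 12KB at 64B per item.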

leerho commented 2 months ago

Oh, and one more question. Are you using (or would like to use) the Direct mode KLL sketches (only for fixed-item-size sketches for now)?

ZacBlanco commented 2 months ago

When you say "direct mode", I'm interpreting that as referring to KllDoublesSketch rather than KllItemsSketch<Double>. If that assumption is correct, then yes, I think we will be using them. After benchmarking, I'm convinced the performance benefit is worth the additional implementation complexity on our side, so I would like to use them; hence why I opened #556. The longs version of the sketch should be the only other implementation we need, as most of the numeric data that benefits from these sketches in our initial implementation is either double or long.

ZacBlanco commented 2 months ago

re:

... I think what you want is the ability to compute the expected size based on K, n, and item size. That could be an improvement in the current getMaxSerializedSizeBytes(...) and make it more general.

Yes, that's what I'm looking for; I was hoping it was referred to in the paper. I did see https://github.com/apache/datasketches-java/blob/923b30f1741c478dd843d3b00da121061f520799/src/main/java/org/apache/datasketches/kll/KllSketch.java#L161 while digging around in the code, but it doesn't currently support the items sketch. If I am interpreting the function name correctly, this might be what I need?

leerho commented 2 months ago

Yes. When I wrote that function, I couldn't figure out an easy way to handle variable-sized items, and that pretty much ruled out generic items sketches. Since then, I have developed a more sophisticated growth model for the sketch, with which I can predict any point on the growth curve assuming an item size -- this is what you see in those graphs. For your case, coupled with a distribution of item sizes and a new version of getMaxSerializedSizeBytes(k, n, itemSize), we can pick a reasonable quantile from the size distribution and plug it into the new function. It could be the max item size, but if your distribution is more power-law, that would be wasteful. So we could pick the median, or whatever.

Once we have the full item-size distribution, it allows us to ask questions like: "If I choose a quantile from the size distribution and my n turns out in practice to be double what I planned, what would the size of the sketch be?" Easy-peezy.

leerho commented 1 month ago

Through our discussion, Zac has decided he is willing to try another approach. I will close this PR now, but it can always be reopened.