deephaven / deephaven-docs-community

Source code for Community docs on the deephaven.io website.
Apache License 2.0

Memory usage #199

Closed: margaretkennedy closed this 2 months ago

margaretkennedy commented 3 months ago

I have a question regarding resource utilization and Parquet files. I understand that when we load a Parquet file into a table, we do not read the entire file into memory. Is this still true if we are merging that loaded data with Kafka streaming data? What effect does performing an aggregation on this merged table, such as last_by, have on memory utilization?
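Concretely, the pipeline in question looks roughly like this (a minimal sketch; the file path, topic name, column names, and Kafka settings are placeholders):

```python
from deephaven import parquet, merge
from deephaven import dtypes as dht
from deephaven.stream.kafka import consumer as kc

# Historical data: this creates a table backed by the Parquet file; it does
# not pull the whole file into heap.
historical = parquet.read("/data/example.parquet")

# Live data from Kafka; the value schema is assumed to match the Parquet
# columns so the two tables can be merged.
live = kc.consume(
    {"bootstrap.servers": "kafka:9092"},
    "example_topic",
    key_spec=kc.KeyValueSpec.IGNORE,
    value_spec=kc.json_spec([("Sym", dht.string), ("Price", dht.double)]),
    table_type=kc.TableType.append(),
)

# Merge the historical and streaming tables, then aggregate.
combined = merge([historical, live])
latest = combined.last_by(by=["Sym"])
```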

Correct. For columnar sources like Parquet, Deephaven only ever reads the data it has to, and that remains true when merging, joining, or aggregating. To explain, there are a few "phases":

1. Depending on what kind of layout you're dealing with, Deephaven may have to do some work to find the Parquet files and discover a schema.
2. If you filter (with where) on the partitioning columns (if that's relevant to your layout), Deephaven prunes some files from the result table.
3. Once you perform a table operation that requires interacting with data or with the number of rows, Deephaven consumes enough of the Parquet file metadata to know which rows exist in which files and row groups.
4. Accessing column data causes Deephaven to read the data pages containing the required rows. Materialized data pages are cached in memory in a heap-usage-sensitive cache.

So, imagine you had a where on some partitioning columns and some other columns, followed by a merge with another table, followed by an aggBy. In that case, Deephaven would (see the sketch after this list):

1. First, prune some files based on the partitioning columns.
2. Next, read footer metadata for the remaining files.
3. Then read (and cache) the columns accessed by the remaining filters in the files that haven't been pruned.
4. Then read no data at all for the merge.
5. Then read all the pages that contain any rows not filtered out in (3) for the "by" columns or the aggregation columns. For example, aggregating by "A" and "B" with max("C") and last("D") reads only "A", "B", and "C"; "D" might be read in some special cases, like a ticking append-only or blink table, but only the actual "last" row for each bucket.
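Putting that walkthrough together as a query, here is a minimal sketch; the layout path, column names, and the second table are all hypothetical, and it assumes a Parquet layout partitioned by a string "Date" column:

```python
from deephaven import parquet, merge, agg, empty_table

# A partitioned Parquet layout with partitioning column "Date" (assumed).
pt = parquet.read("/data/partitioned")

# (1) The "Date" clause prunes whole files. (2)-(3) The "C" clause forces
# footer metadata reads for the surviving files, then reads (and caches)
# only column "C" data pages.
filtered = pt.where(["Date >= `2023-01-01`", "C > 0"])

# (4) The merge itself reads no Parquet data at all. The second table here
# is just a stand-in with a matching schema.
other = empty_table(5).update(
    ["Date=`2023-01-02`", "A=`x`", "B=i % 2", "C=(double) i", "D=i * 2.0"]
)
combined = merge([filtered, other])

# (5) The aggregation reads pages for "A", "B", and "C" only; "D" is read
# just for the actual "last" row of each bucket.
result = combined.agg_by([agg.max_("MaxC=C"), agg.last("LastD=D")], by=["A", "B"])
```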