daphne-eu / daphne

DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines
Apache License 2.0

Compare different string representations #858

Open pdamme opened 1 month ago


Motivation: String data is commonplace in real-world data sets, so DAPHNE supports strings as the value type of its matrices and frames. DAPHNE matrices are two-dimensional arrays of values of a homogeneous type, while DAPHNE frames represent tabular data with an individual type per column. In both cases, we are mostly interested in storing sequences of strings. There are many different ways to represent sequences of strings at the physical level, and they offer different trade-offs regarding efficiency (e.g., the memory footprint and the runtime performance of various operators on the data) and regarding the complexity of the representation and of the conversion to it. So far, DAPHNE supports only two very basic physical string representations. While these already enable DAPHNE to process string data sets, they are not ideal for performance.

Task: The goal of this project is to efficiently implement and compare additional string representations (both naive baselines and advanced approaches from the literature) as new value types in DAPHNE. Concrete examples include:

Besides representing the data, a range of typical operations/kernels should be supported on matrices and frames of these new string value types, e.g.:

These operations should be implemented efficiently by exploiting characteristics and potential auxiliary structures (e.g., a dictionary) of the respective string representation as well as of the data (e.g., sorted/unsorted, number of distinct values, min/mean/max string length).
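To illustrate what exploiting an auxiliary structure can look like, here is a minimal sketch of a dictionary-encoded string column with an equality predicate. It is not DAPHNE code: `DictStringColumn`, `encode`, and `selectEqual` are hypothetical names chosen for this example, and a real implementation would have to fit DAPHNE's value type and kernel interfaces.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical dictionary-encoded string column (illustration only, not a DAPHNE type).
struct DictStringColumn {
    std::vector<std::string> dict;     // distinct strings, in order of first appearance
    std::vector<std::uint32_t> codes;  // one dictionary position per row
};

// Build the dictionary and the code array in a single pass over the input.
DictStringColumn encode(const std::vector<std::string>& values) {
    DictStringColumn col;
    std::unordered_map<std::string, std::uint32_t> lookup;
    col.codes.reserve(values.size());
    for (const std::string& v : values) {
        // A new string gets the next free dictionary position.
        auto [it, inserted] =
            lookup.try_emplace(v, static_cast<std::uint32_t>(col.dict.size()));
        if (inserted)
            col.dict.push_back(v);
        col.codes.push_back(it->second);
    }
    return col;
}

// Equality predicate: string comparisons happen once per distinct value,
// the per-row work is a plain integer comparison.
std::vector<std::size_t> selectEqual(const DictStringColumn& col,
                                     const std::string& needle) {
    std::vector<std::size_t> rows;
    for (std::size_t code = 0; code < col.dict.size(); ++code) {
        if (col.dict[code] != needle)
            continue;
        for (std::size_t r = 0; r < col.codes.size(); ++r)
            if (col.codes[r] == code)
                rows.push_back(r);
        break;  // dictionary entries are distinct, so at most one code matches
    }
    return rows;
}
```

With few distinct values, the scan touches only small integers per row, which is the kind of effect the experiments should quantify against the existing representations.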

Conduct experiments to investigate the trade-offs involved in using the individual string representations. Think about data characteristics (see the ones above) and access characteristics (e.g., point vs range predicates, only comparisons vs string manipulation) to showcase the (dis)advantages of the string representations.
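For the experiments, it may help to record the data characteristics of each input column alongside the measurements. The following sketch computes the characteristics named above (sortedness, number of distinct values, min/mean/max string length) for a plain vector of strings. `StringColumnProfile` and `profile` are hypothetical names, not part of DAPHNE.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical summary of a string column (illustration only).
struct StringColumnProfile {
    std::size_t numValues = 0;
    std::size_t numDistinct = 0;
    std::size_t minLen = 0;
    std::size_t maxLen = 0;
    double meanLen = 0.0;
    bool sorted = true;  // non-decreasing in lexicographic (byte-wise) order
};

StringColumnProfile profile(const std::vector<std::string>& values) {
    StringColumnProfile p;
    p.numValues = values.size();
    if (values.empty())
        return p;  // all-zero profile for an empty column

    std::unordered_set<std::string> distinct;
    std::size_t totalLen = 0;
    p.minLen = values.front().size();
    for (std::size_t i = 0; i < values.size(); ++i) {
        const std::string& v = values[i];
        distinct.insert(v);
        totalLen += v.size();
        p.minLen = std::min(p.minLen, v.size());
        p.maxLen = std::max(p.maxLen, v.size());
        if (i > 0 && values[i - 1] > v)
            p.sorted = false;
    }
    p.numDistinct = distinct.size();
    p.meanLen = static_cast<double>(totalLen) / static_cast<double>(values.size());
    return p;
}
```

This version materializes all distinct values in a hash set, which is fine for benchmark inputs but may be too expensive for large data; sampling or sketch-based estimates would be an alternative.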

Based on the insights gained from your experiments, think about a strategy for selecting a suitable string representation for a given data set. The decision could be based on the data and access characteristics. Implement your decision strategy inside the DAPHNE compiler and evaluate its ability to select a suitable string representation.
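As a starting point for discussion, a rule-based decision strategy could look like the sketch below. Everything in it is an assumption made for illustration: the enum `StringRepr` and its three options, the structs `ColumnStats` and `AccessHints`, and in particular both thresholds, which are placeholders to be replaced by values derived from the experiments. It also does not show how such a rule would be wired into the DAPHNE compiler.

```cpp
#include <cstddef>

// Hypothetical candidate representations (illustration only).
enum class StringRepr {
    PerValue,     // one independently stored string per value
    Dictionary,   // distinct strings stored once, rows hold integer codes
    InlineShort   // short strings stored inside a fixed-size slot
};

// Data characteristics of a column (hypothetical struct).
struct ColumnStats {
    std::size_t numValues;
    std::size_t numDistinct;
    std::size_t maxLen;
};

// Access characteristics of the workload (hypothetical struct).
struct AccessHints {
    bool comparisonsOnly;  // true if no string manipulation is expected
};

StringRepr chooseRepr(const ColumnStats& s, const AccessHints& a) {
    // Placeholder thresholds, NOT measured values.
    constexpr double kMaxDistinctRatio = 0.1;
    constexpr std::size_t kMaxInlineLen = 12;

    if (s.numValues == 0)
        return StringRepr::PerValue;

    const double distinctRatio =
        static_cast<double>(s.numDistinct) / static_cast<double>(s.numValues);

    // Few distinct values and comparison-only access: codes can be compared directly.
    if (a.comparisonsOnly && distinctRatio <= kMaxDistinctRatio)
        return StringRepr::Dictionary;

    // All strings are short enough to fit the assumed fixed-size slot.
    if (s.maxLen <= kMaxInlineLen)
        return StringRepr::InlineShort;

    return StringRepr::PerValue;
}
```

A cost-model-based strategy is an equally valid direction; the rule-based form is shown only because it is easy to evaluate against the experimental results.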

Hints:


[1] Thomas Neumann, Michael J. Freitag: Umbra: A Disk-Based System with In-Memory Performance. CIDR 2020; Section 3.1 (only the in-memory part, not the on-disk part, for simplicity)

[2] Tim Gubner, Viktor Leis, Peter A. Boncz: Efficient Query Processing with Optimistically Compressed Hash Tables & Strings in the USSR. ICDE 2020: 301-312; Section IV