Compare different string representations

Motivation: String data is commonplace in real-world data sets. Thus, DAPHNE supports strings as the value type for its matrices and frames. DAPHNE matrices are 2-dimensional arrays of values of a homogeneous type, while DAPHNE frames represent tabular data with an individual type per column. Thus, we are mostly interested in storing sequences of strings. Interestingly, there are many different ways to represent sequences of strings at a physical level, which offer different trade-offs regarding efficiency (e.g., in terms of the memory footprint and the runtime performance of various operators on the data) and the complexity of the representation and conversion to it. So far, DAPHNE supports only two very basic physical string representations. While these already enable DAPHNE to process string data sets, they are not ideal for performance.

Task: The goal of this project is to efficiently implement and compare additional string representations (both naive baselines and advanced approaches from the literature) as new value types in DAPHNE. Concrete examples include, e.g.:

char* (as a naive baseline)
Fixed/max-length strings with a pre-defined length (generalizing DAPHNE’s existing FixedStr16 value type)
Simple dictionary-encoded strings (non-order-preserving and order-preserving
Umbra strings [1]
Unique Strings Self-aligned Region (USSR) [2]

Besides representing the data, a range of typical operations/kernels should be supported on matrices and frames of these new string value types, e.g.:

reading from a file and storing to a file (e.g., CSV format)
elementwise comparisons (==, !=, <, <=, =>, >, SQL-style LIKE)
string-specific operations (e.g., lower, upper, length, concat)
transformations to numbers (e.g., casts to and from numerical types, one-hot-encoding, recoding)
data generation (e.g, fill, matrix literals, sample)
reorganization (e.g., transposition, sorting, reversing)
left/right indexing

These operations should be implemented efficiently by exploiting characteristics and potential auxiliary structures (e.g., a dictionary) of the respective string representation as well as of the data (e.g., sorted/unsorted, number of distinct values, min/mean/max string length).

Conduct experiments to investigate the trade-offs involved in using the individual string representations. Think about data characteristics (see the ones above) and access characteristics (e.g., point vs range predicates, only comparisons vs string manipulation) to showcase the (dis)advantages of the string representations.

Based on the insights gained from your experiment. Think about a strategy for selecting a suitable string representation for a given data set. The decision could be based on the data and access characteristics. Implement your decision strategy inside the DAPHNE compiler and evaluate its ability to select a suitable string representation.

Hints:

Get started with DAPHNE
- Set up your DAPHNE development environment and run the test suite.
- Read and follow the contribution guidelines.
- Have a look at the DAPHNE documentation, especially the pages on implementing built-in kernels, testing, executing DAPHNE locally, and using DaphneDSL.
- Have a look at existing codebase, e.g., the read-kernel in src/runtime/local/kernels/Read.h, the CSV reader and file meta data in src/runtime/local/io/, the file meta data parser in src/parser/metadata/, and the main daphne executable in src/api/cli/daphne.cpp where the program execution starts; to name just a few starting points.
  1. Get familiar with DAPHNE’s run-time data types DenseMatrix (src/runtime/local/datastructures/DenseMatrix.h) and Frame (src/runtime/local/datastructures/Frame.h) including their data layout.
  2. Have a look at how the two currently supported string value types std::string (e.g., the short string optimizations it employs) and FixedStr16 (src/runtime/local/datastructures/FixedSizeStringValueType.h) are integrated.
  3. Read up on the more advanced string representations from the literature mentioned above. Feel free to suggest other/additional efficient string representations.
  4. Think about which of the string representations are particularly suitable or unsuitable for which of the operations mentioned above. Focus your implementation on the combinations of string representation and operation that make most sense.
  5. All new features should be easily maintainable, i.e., they should be (1) covered by meaningful test cases, and (2) documented (developer docs page explaining the overall design and source code comments).
  6. The contributions made in the context of this project should be split up in multiple meaning full pull requests.

[1] Thomas Neumann, Michael J. Freitag: Umbra: A Disk-Based System with In-Memory Performance. CIDR 2020; Section 3.1 (only the in-memory part, not the on-disk part, for simplicity)

[2] Tim Gubner, Viktor Leis, Peter A. Boncz: Efficient Query Processing with Optimistically Compressed Hash Tables & Strings in the USSR. ICDE 2020: 301-312; Section IV

daphne-eu / daphne

Compare different string representations #858