Motivation: Data science workflows are exploratory by nature. While developing a data analysis pipeline, variants of the same script are executed repeatedly, often differing only slightly. In many cases, these variants access the same input data sets. In data-intensive applications, reading the input data sets from secondary storage can contribute substantially to the overall script execution time. Reading a data set involves multiple steps, such as file I/O and parsing the specific file format. For instance, when reading a CSV file for the first time, the data must be read from secondary storage, the number and data types of the columns must be identified, individual rows and values must be delimited by searching for row and column separators, and the values must be parsed into the desired data type (e.g., double, integer, or string). By default, subsequent reads of the same file must repeat all these steps. However, subsequent reads could benefit from information about the file that was already obtained during the first read. For instance, the number and data types of all columns could easily be stored in a file and re-used by subsequent reads. In theory, one could imagine an entire spectrum of approaches, ranging from fully re-reading the file every time, over keeping compact meta data or more detailed auxiliary information, to keeping the entire read data in binary form. Each point in this spectrum comes with its individual trade-off w.r.t. the runtime improvement of subsequent reads vs. the runtime overhead for creating the auxiliary structure during the first read and the additional storage/memory it requires. Thus, users should be able to specify a certain storage/memory budget for facilitating repeated reads.
Task: This project explores different techniques for retaining information gathered about a data set file across independent executions of a DaphneDSL script, in order to improve the performance of repeated reads within a given storage/memory budget for auxiliary data structures. The impact of employing these techniques shall be evaluated through experiments.
At the moment, DAPHNE implements one specific point in the spectrum mentioned above: For each data set file to read, there must be a meta data file containing information on the shape, value types, and schema. Requiring such a file can be a burden to users. Thus, a first step could be to make the presence of a meta data file optional by enabling DAPHNE to automatically infer the necessary information (see also #63 and #688). The inferred information could be stored in a meta data file and re-used by subsequent reads. The runtime-storage trade-off of this re-use should be investigated with an experiment.
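To make this concrete, the following is a minimal sketch of what such an inference-and-cache step could look like; all names, the type-guessing heuristic, and the side-file format are illustrative assumptions and do not reflect DAPHNE's actual meta data format or reader API.

```cpp
// Minimal sketch (assumed names/format, not DAPHNE's actual API): infer simple
// CSV meta data on the first read and cache it in a side file, so that later
// reads can load the cached meta data instead of re-inferring it.
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

struct CsvMetaData {
    size_t numRows = 0;
    size_t numCols = 0;
    std::vector<std::string> colTypes; // e.g., "si64", "f64", "str"
};

// Very simple per-value type guess; a robust implementation would sample
// multiple rows and widen the type where rows disagree.
static std::string guessType(const std::string &val) {
    size_t pos = 0;
    try { std::stoll(val, &pos); if (pos == val.size()) return "si64"; } catch (...) {}
    pos = 0;
    try { std::stod(val, &pos); if (pos == val.size()) return "f64"; } catch (...) {}
    return "str";
}

CsvMetaData inferCsvMetaData(const std::string &csvPath, char sep = ',') {
    CsvMetaData md;
    std::ifstream in(csvPath);
    std::string line;
    while (std::getline(in, line)) {
        std::stringstream ss(line);
        std::string val;
        size_t col = 0;
        while (std::getline(ss, val, sep)) {
            if (md.numRows == 0) // derive the schema from the first row only (simplification)
                md.colTypes.push_back(guessType(val));
            ++col;
        }
        md.numCols = std::max(md.numCols, col);
        ++md.numRows;
    }
    return md;
}

// Persist the inferred meta data next to the data set file (assumed side-file
// name "<file>.meta.cache") so that subsequent reads can skip the inference.
void writeMetaDataCache(const std::string &csvPath, const CsvMetaData &md) {
    std::ofstream out(csvPath + ".meta.cache");
    out << md.numRows << "\n" << md.numCols << "\n";
    for (const auto &t : md.colTypes)
        out << t << "\n";
}
```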
In a second step, design and implement 1-3 additional techniques (depending on their complexity) and experimentally evaluate their usefulness. Such techniques could be positional maps to speed up CSV parsing, caching of the read data or auxiliary structures in memory or on storage to save I/O, and replication of the read data in a different (e.g., more I/O-friendly) format. Inspiration can be found in the literature, e.g., in the NoDB paper by Alagiannis et al. [1].
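For the positional-map idea, a row-granular variant could look like the sketch below (NoDB [1] additionally records positions of individual fields within a row); the function names are illustrative and not part of the DAPHNE code base.

```cpp
// Sketch of a row-level positional map: a first full parse records the byte
// offset at which each CSV row starts; later reads can seek directly to a row
// range instead of re-scanning all preceding bytes.
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Build the positional map during an initial scan of the file.
std::vector<uint64_t> buildRowOffsets(const std::string &csvPath) {
    std::vector<uint64_t> offsets;
    std::ifstream in(csvPath, std::ios::binary);
    std::string line;
    uint64_t pos = 0;
    while (std::getline(in, line)) {
        offsets.push_back(pos);  // byte offset where this row starts
        pos += line.size() + 1;  // +1 for the '\n' stripped by getline
    }
    return offsets;
}

// Use the positional map to read only rows [firstRow, firstRow + numRows).
std::vector<std::string> readRowRange(const std::string &csvPath,
                                      const std::vector<uint64_t> &offsets,
                                      size_t firstRow, size_t numRows) {
    std::vector<std::string> rows;
    std::ifstream in(csvPath, std::ios::binary);
    in.seekg(static_cast<std::streamoff>(offsets.at(firstRow)));
    std::string line;
    for (size_t i = 0; i < numRows && std::getline(in, line); ++i)
        rows.push_back(line);
    return rows;
}
```

Note that the map itself consumes memory/storage proportional to the number of rows (and, for field-level maps, columns), which is exactly the kind of auxiliary structure the storage/memory budget should govern.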
Design and implement a way to discard the auxiliary information when the data set file is updated. Users should not be required to explicitly notify DAPHNE about the change.
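One simple way to detect updates without user involvement is to store a fingerprint of the data set file alongside the auxiliary structure and compare it on every read; the sketch below assumes that file size plus modification time is a sufficient change indicator, and its names are illustrative.

```cpp
// Sketch (illustrative names): keep a small fingerprint of the data set file
// together with the auxiliary structure; if it no longer matches on a later
// read, the auxiliary structure is considered stale and is discarded/rebuilt.
#include <chrono>
#include <cstdint>
#include <filesystem>
#include <string>

struct FileFingerprint {
    uint64_t sizeBytes = 0;
    int64_t mtimeNanos = 0;

    bool operator==(const FileFingerprint &other) const {
        return sizeBytes == other.sizeBytes && mtimeNanos == other.mtimeNanos;
    }
};

FileFingerprint fingerprintOf(const std::string &path) {
    namespace fs = std::filesystem;
    FileFingerprint fp;
    fp.sizeBytes = fs::file_size(path);
    fp.mtimeNanos = std::chrono::duration_cast<std::chrono::nanoseconds>(
                        fs::last_write_time(path).time_since_epoch())
                        .count();
    return fp;
}

// Returns true if the cached auxiliary structure may still be used.
bool auxStillValid(const std::string &dataPath, const FileFingerprint &stored) {
    return fingerprintOf(dataPath) == stored;
}
```

A content checksum would be more robust (e.g., against tools that preserve timestamps), but requires reading the whole file and thus partly defeats the purpose; a pragmatic design might hash only a few kilobytes at the beginning and end of the file.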
[1] Ioannis Alagiannis, Renata Borovica, Miguel Branco, Stratos Idreos, Anastasia Ailamaki: NoDB: efficient query execution on raw data files. SIGMOD Conference 2012: 241-252
Hints:
Have a look at the existing codebase, e.g., the read-kernel in src/runtime/local/kernels/Read.h, the CSV reader and file meta data in src/runtime/local/io/, the file meta data parser in src/parser/metadata/, and the main daphne executable in src/api/cli/daphne.cpp, where the program execution starts; to name just a few starting points.
Focus on the widely used CSV format for the input data set files.
Think of meaningful experiments to investigate the runtime-storage trade-off of each introduced technique. Use a variety of real and synthetic data sets with different sizes and characteristics for the experiments. Compare DAPHNE to itself (with and without the techniques turned on) and to other baseline systems.
The user-defined storage/memory budget for auxiliary structures should become part of the DAPHNE configuration.
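A possible shape for this, assuming the budget is added as a new entry to DAPHNE's existing user configuration (the field name and default below are assumptions, not existing options):

```cpp
#include <cstddef>

// Hypothetical extension of the user configuration; read kernels would consult
// this limit before materializing positional maps, cached binary copies, etc.
struct DaphneUserConfigSketch {
    // ... existing configuration entries ...
    std::size_t ioAuxStructureBudgetBytes = 256 * 1024 * 1024; // assumed default: 256 MiB
};
```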
All new features should be easily maintainable, i.e., they should be (1) covered by meaningful test cases, and (2) documented (developer docs page explaining the overall design and source code comments).
The contributions made in the context of this project should be split up into multiple meaningful pull requests.