daphne-eu / daphne

DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines
Apache License 2.0

Speed-up repeated read of data set files with storage budget #857

Open pdamme opened 1 month ago

Motivation: Data science workflows are exploratory by nature. While developing a data analysis pipeline, variants of the same script are executed repeatedly, often differing only slightly. In many cases, these variants access the same input data sets. In data-intensive applications, reading the input data sets from secondary storage can contribute substantially to the overall script execution time. Reading a data set involves multiple steps, such as file I/O and parsing the specific file format. For instance, when reading a CSV file for the first time, the data must be read from secondary storage, the number and data types of the columns must be determined, individual rows and values must be located by searching for row and column separators, and the values must be parsed into the desired data type (e.g., double, integer, or string). By default, subsequent reads of the same file must repeat all of these steps. However, subsequent reads could benefit from information about the file that was already obtained by the first read. For instance, the number and data types of all columns could easily be stored in a file and reused by subsequent reads. In theory, one could imagine an entire spectrum of options: from fully re-reading the file every time, through keeping some compact metadata or more detailed information, to keeping the entire parsed data in binary form. Each point in this spectrum comes with its own trade-off between the runtime improvement of subsequent reads on the one hand and, on the other hand, the runtime overhead of creating the auxiliary structure during the first read plus the additional storage/memory it requires. Thus, users should be able to specify a storage/memory budget for facilitating repeated reads.
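To make the cheapest point in this spectrum concrete, here is a minimal Python sketch of caching only the column types in a sidecar file. It is an illustration under stated assumptions, not DAPHNE's implementation: the function names `infer_column_types` and `load_schema`, the `.schema.json` suffix, the `budget_bytes` parameter, and the size/mtime fingerprint used to detect a changed input file are all hypothetical, and the CSV is assumed to have no header row and no ragged rows.

```python
import csv
import json
import os


def infer_column_types(csv_path):
    """Full scan of the CSV: the expensive work done by the first read."""
    def kind(value):
        # Try the narrowest type first; fall back to string.
        for name, cast in (("int", int), ("double", float)):
            try:
                cast(value)
                return name
            except ValueError:
                pass
        return "string"

    rank = {"int": 0, "double": 1, "string": 2}
    types = None
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            if not row:  # skip blank lines
                continue
            kinds = [kind(v) for v in row]
            if types is None:
                types = kinds
            else:
                # Widen each column to the most general type seen so far.
                types = [a if rank[a] >= rank[b] else b
                         for a, b in zip(types, kinds)]
    return types or []


def load_schema(csv_path, budget_bytes):
    """Return (column_types, cache_hit), using a sidecar file if it is valid."""
    sidecar = csv_path + ".schema.json"  # hypothetical naming scheme
    stat = os.stat(csv_path)
    # Assumption: size and modification time are enough to detect changes.
    fingerprint = {"size": stat.st_size, "mtime_ns": stat.st_mtime_ns}

    if os.path.exists(sidecar):
        with open(sidecar) as f:
            cached = json.load(f)
        if cached.get("fingerprint") == fingerprint:
            return cached["types"], True  # repeated read: no scan needed

    types = infer_column_types(csv_path)
    payload = json.dumps({"fingerprint": fingerprint, "types": types})
    # Only persist the auxiliary structure if it fits the user's budget.
    if len(payload.encode("utf-8")) <= budget_bytes:
        with open(sidecar, "w") as f:
            f.write(payload)
    return types, False
```

Calling `load_schema("data.csv", budget_bytes=4096)` twice on an unchanged file would scan the file the first time and return the cached types the second time. With a budget too small for the sidecar, every call falls back to a full scan. Richer auxiliary structures (e.g., row offsets or a binary copy of the parsed data) would sit further along the same spectrum and consume more of the budget.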

Task: This project is about exploring different techniques to retain information gathered about a data set file across independent executions of a DaphneDSL script to improve the performance of repeated reads given a storage/memory budget for auxiliary data structures. The impact of employing these techniques shall be evaluated through experiments.

[1] Ioannis Alagiannis, Renata Borovica, Miguel Branco, Stratos Idreos, Anastasia Ailamaki: NoDB: efficient query execution on raw data files. SIGMOD Conference 2012: 241-252

Hints: