daphne-eu / daphne

DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines
Apache License 2.0

Extend and optimize data transfer between Python and DAPHNE #899

Open pdamme opened 2 weeks ago

pdamme commented 2 weeks ago

Motivation: DAPHNE offers its own domain-specific language, DaphneDSL, for linear and relational algebra. However, the most popular language for data science nowadays is Python, and data scientists cannot be expected to switch to DaphneDSL. Therefore, DAPHNE offers a Python API, the so-called DaphneLib. DaphneLib offers DAPHNE’s operations in Python, while still guaranteeing DAPHNE’s unique performance characteristics through lazy evaluation. To allow a seamless integration of DAPHNE into the existing Python-based data science ecosystem, DaphneLib supports efficient bi-directional data exchange with established Python libraries for data science, such as numpy, pandas, TensorFlow, and PyTorch. However, the existing implementation has some limitations, e.g., a focus on numeric value types and a small range of data structures. Furthermore, a systematic and repeatable performance evaluation is still missing.

Task: This project is about benchmarking, optimizing, and extending the existing data transfer features of DaphneLib. It can be split into the following main sub-tasks:

  1. A systematic evaluation of DaphneLib’s data transfer features to find out whether the data transfer with all Python libraries supported so far works correctly (e.g., does the correct data arrive at the receiver, is the data still intact at the sender) and efficiently (e.g., is it really zero-copy or are there hidden overheads) in both directions (Python to DAPHNE, DAPHNE to Python). The dimensions of this systematic evaluation include:

    • different Python data structures: numpy.ndarray, pandas.DataFrame (including sparse and categorical dtypes), pandas.Series, tensorflow.Tensor, torch.Tensor, with the corresponding DAPHNE data types (DenseMatrix and Frame)
    • different value types: floating-point numbers (64 and 32 bits), signed/unsigned integers (64, 32, 16, 8 bits)
    • different shapes (including zero-length objects; 1d, 2d, >2d objects)
    • transferring DAPHNE views into matrices/frames from DAPHNE to Python

    In case the evaluation reveals bugs or performance bottlenecks, these should be fixed in the course of the project (depending on their complexity). The systematic tests should become a part of DAPHNE’s test suite (testing for correctness and ideally even for zero-copy transfer).
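
One way to organize such an evaluation is a small grid over value types and shapes that records, per case, whether the data arrived correctly, whether the sender's data is still intact, and whether source and destination share a buffer. The sketch below is only an illustration under stated assumptions: it does not call DaphneLib at all, the `transfer` argument is a hypothetical stand-in for a Python to DAPHNE to Python round trip, and comparing buffer start addresses is just one possible zero-copy indicator.

```python
import numpy as np

# Value types and shapes taken from the evaluation dimensions listed above.
VALUE_TYPES = [np.float64, np.float32,
               np.int64, np.int32, np.int16, np.int8,
               np.uint64, np.uint32, np.uint16, np.uint8]
SHAPES = [(0,), (5,), (3, 4), (2, 3, 4)]


def buffer_address(arr):
    """Start address of the array's data buffer."""
    return arr.__array_interface__["data"][0]


def check_transfer(transfer, src):
    """Run one transfer and report correctness, sender integrity, zero-copy."""
    snapshot = src.copy()  # reference copy to detect changes at the sender
    dst = transfer(src)    # hypothetical stand-in for the real round trip
    return {
        "correct": bool(dst.shape == src.shape
                        and dst.dtype == src.dtype
                        and np.array_equal(dst, src)),
        "sender_intact": bool(np.array_equal(src, snapshot)),
        # For zero-length arrays a buffer comparison is not meaningful.
        "zero_copy": None if src.size == 0
        else buffer_address(dst) == buffer_address(src),
    }


def run_grid(transfer):
    """Apply check_transfer to every (value type, shape) combination."""
    results = {}
    for vt in VALUE_TYPES:
        for shape in SHAPES:
            n = int(np.prod(shape))
            src = np.arange(n).astype(vt).reshape(shape)
            results[(np.dtype(vt).name, shape)] = check_transfer(transfer, src)
    return results


# Two stand-in transfers: a view (shares the buffer) and an explicit copy.
view_results = run_grid(lambda a: a.view())
copy_results = run_grid(lambda a: a.copy())
```

Replacing the two lambdas with an actual DaphneLib round trip would turn the same grid into a test case; the view and copy stand-ins merely confirm that the check distinguishes the two situations.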

  2. Extending DaphneLib’s data transfer features by the following:
    • String data transfer: So far, DaphneLib can only transfer numeric value types. However, we also need to be able to transfer string data. For instance, it should be possible to transfer numpy arrays of string dtype and pandas DataFrames with string columns to DAPHNE and back. To this end, find out which internal data layout is used by the mentioned Python libraries. If possible, adopt a zero-copy approach. Otherwise, try to make any necessary conversions and the transfer itself as efficient as possible.
    • Support for more Python data structures: While DaphneLib already supports several important Python data structures, there are still some missing, e.g.:
      • Ordinary Python lists: Supporting these would be super useful for writing test cases (currently, there is no simple way to have a small example matrix of arbitrary values in DaphneLib). It should be possible to do something like dc.from_python([1, 2, 3])
      • numpy.vector (see also #616)
      • pandas.Series (currently done via conversion to DataFrame; if this is inefficient, maybe it can be improved)
      • pandas categorical dtype (currently converted to standard DataFrame; if this is inefficient, maybe it can be improved)
      • Sparse matrices: support for efficient data transfer between structures like scipy.sparse.csr_matrix, pandas.arrays.SparseArray, tensorflow.sparse.SparseTensor and DAPHNE’s CSRMatrix
      • Optionally the data structures of other Python libraries from fields such as image processing (e.g., Pillow/PIL) or audio. These should be optional dependencies like TensorFlow/PyTorch are.
  3. End-to-end experiments showcasing DaphneLib’s data transfer in comparison to other systems that can interact with Python libraries. Here, one could implement some simple integrated data analysis pipelines partly in Python and partly in DAPHNE (or baselines). An example could be reading a data set in Python, transferring it to DAPHNE and processing it there using existing algorithm implementations (e.g., decision trees, linear regression), and transferring the results back to Python, maybe even plotting them in Python.
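
For the ordinary Python lists mentioned in sub-task 2, the Python-side preparation could look roughly like the sketch below. It is not DaphneLib code: `list_to_dense` and `infer_value_type` are hypothetical helpers, and treating a flat list as a column vector, defaulting empty input to float64, and rejecting ragged or mixed-type input are all assumptions made for illustration. The helper only normalizes a list into a shape, a row-major flat list, and a value type name, which a call like the proposed dc.from_python([1, 2, 3]) could then hand to the existing transfer path.

```python
def infer_value_type(flat):
    """Pick a value type name for a flat list of scalars (assumed naming)."""
    if not flat:
        return "float64"  # assumption: default value type for empty input
    if all(isinstance(v, str) for v in flat):
        return "str"
    if all(isinstance(v, (int, float)) for v in flat):  # bool counts as int
        return "float64" if any(isinstance(v, float) for v in flat) else "int64"
    raise TypeError("unsupported or mixed element types")


def list_to_dense(values):
    """Normalize a flat or nested list to (shape, row-major data, value type)."""
    if not isinstance(values, list):
        raise TypeError("expected a Python list")
    nested = [isinstance(v, list) for v in values]
    if any(nested) and not all(nested):
        raise ValueError("cannot mix scalars and rows")
    # Assumption: a flat list [a, b, c] becomes a column vector of shape (3, 1).
    rows = values if values and all(nested) else [[v] for v in values]
    num_cols = len(rows[0]) if rows else 0
    if any(len(r) != num_cols for r in rows):
        raise ValueError("ragged rows are not supported")
    flat = [v for r in rows for v in r]
    if any(isinstance(v, list) for v in flat):
        raise ValueError("more than two dimensions are not supported")
    return (len(rows), num_cols), flat, infer_value_type(flat)


shape, data, value_type = list_to_dense([[1, 2.5], [3, 4]])
print(shape, data, value_type)  # (2, 2) [1, 2.5, 3, 4] float64
```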

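For the sparse-matrix item in sub-task 2, it helps to recall what a CSR representation consists of: a values array, a column-index array, and a row-offsets array (scipy.sparse.csr_matrix exposes these as `data`, `indices`, and `indptr`). The following pure-Python sketch builds and expands these three arrays. The function names are hypothetical, and whether DAPHNE's CSRMatrix can adopt such arrays without copying (e.g., regarding index widths) is exactly what the project would need to find out.

```python
def dense_to_csr(rows):
    """Convert a list of equally long rows into CSR arrays."""
    values, col_idxs, row_offsets = [], [], [0]
    for row in rows:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idxs.append(j)
        # Offset of the first non-zero of the next row.
        row_offsets.append(len(values))
    return values, col_idxs, row_offsets


def csr_to_dense(values, col_idxs, row_offsets, num_cols):
    """Expand CSR arrays back into a list of dense rows."""
    rows = []
    for i in range(len(row_offsets) - 1):
        row = [0] * num_cols
        for k in range(row_offsets[i], row_offsets[i + 1]):
            row[col_idxs[k]] = values[k]
        rows.append(row)
    return rows


dense = [[0, 5, 0],
         [0, 0, 0],
         [7, 0, 9]]
values, col_idxs, row_offsets = dense_to_csr(dense)
print(values, col_idxs, row_offsets)  # [5, 7, 9] [1, 0, 2] [0, 1, 1, 3]
```

If both sides agree on these three arrays and their element types, a transfer only needs to pass three buffers plus the matrix shape.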
Hints: