Connecting DAPHNE to the data science ecosystem: Efficient data exchange with popular Python libs

pdamme commented 1 year ago

Motivation: DAPHNE has its own domain-specific language, DaphneDSL, for linear and relational algebra, which serves research on new language features and abstractions. However, the most popular language for data science nowadays is Python, and data scientists cannot be expected to switch to DaphneDSL. Therefore, DAPHNE offers a Python API, the so-called DaphneLib. DaphneLib offers DAPHNE’s operations in Python, while still guaranteeing DAPHNE’s unique performance characteristics through lazy evaluation. To allow a seamless integration of DAPHNE into the existing Python-based data science ecosystem, exchanging data between DaphneLib and established Python libraries for data science, such as numpy, pandas, TensorFlow, and PyTorch, must be simple. A naive solution could be to write data in one of these libraries to a file and to read it again in DAPHNE (or vice versa). However, this generic approach is unacceptably inefficient and would hinder complex data analysis tasks combining DAPHNE and other libraries. Instead, an efficient data transfer via means such as shared memory (ideally in a zero-copy manner) or inter-process communication should be favored. Recently, the initial infrastructure for a zero-copy data exchange between DAPHNE and numpy has been developed in the context of a bachelor thesis in DAPHNE.

Task: This project is about extending the existing infrastructure by supporting the efficient data exchange with more data science libraries, especially pandas, TensorFlow, and PyTorch. This will directly contribute to the connection of DAPHNE with the existing data science ecosystem and, thereby, make DAPHNE simpler to use. Implementation in Python and C++.

Hints on approaching this task:

Get familiar with (1) the data models and in-memory data representations of DAPHNE and other data science libraries, such as numpy, pandas, TensorFlow, and PyTorch, and (2) the existing solution for efficient data exchange with numpy in DAPHNE.
Following the lines of the efficient data exchange with numpy and building upon it, design approaches for the efficient data exchange with those additional Python libraries. To what extent can their data representations be reused in DAPHNE, and to what extent do they need to be converted? Could numpy’s data representation be used as a mediator? How could n-dimensional tensors in other libraries be mapped to 2-dimensional matrices in DAPHNE? How can data objects of various dtypes (value types in DAPHNE terminology) be supported?
Implement your design, including tests and documentation.
Think of meaningful experiments to evaluate how efficient the data exchange between DAPHNE and the other libraries is (in both directions). If you have devised multiple approaches, compare them to each other. Conduct the experiments, visualize, and interpret the results.
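As a starting point for the tensor-mapping and dtype questions in the hints above, here is a minimal sketch that uses numpy as the mediator. It is not part of DaphneLib: the helper names `to_matrix_2d` and `from_matrix_2d` are hypothetical, and the set of supported dtypes is an assumption about which numpy dtypes have a matching DAPHNE value type. The idea is to keep the first axis as rows, fold all remaining axes into columns, and remember the original shape so the tensor can be restored on the way back.

```python
import math

import numpy as np

# ASSUMPTION: numpy dtypes presumed to have a matching DAPHNE value type.
# Check this against the value types your DAPHNE build actually supports.
SUPPORTED_DTYPES = {
    np.dtype(t)
    for t in (np.int8, np.int32, np.int64,
              np.uint8, np.uint32, np.uint64,
              np.float32, np.float64)
}


def to_matrix_2d(arr):
    """Map an n-d array to a 2-d matrix (hypothetical helper).

    The first axis becomes the rows; all remaining axes are folded into
    the columns. Returns the matrix and the original shape. For a
    C-contiguous input, reshape returns a view, so no data is copied.
    """
    arr = np.asarray(arr)
    if arr.dtype not in SUPPORTED_DTYPES:
        raise TypeError(f"unsupported dtype: {arr.dtype}")
    original_shape = arr.shape
    rows = original_shape[0] if arr.ndim >= 1 else 1
    cols = math.prod(original_shape[1:])  # empty product is 1
    return arr.reshape(rows, cols), original_shape


def from_matrix_2d(matrix, original_shape):
    """Restore the n-d shape recorded by to_matrix_2d (hypothetical helper)."""
    return np.asarray(matrix).reshape(original_shape)


if __name__ == "__main__":
    tensor = np.arange(24, dtype=np.float64).reshape(2, 3, 4)
    matrix, shape = to_matrix_2d(tensor)
    print(matrix.shape, np.shares_memory(tensor, matrix))
    print(from_matrix_2d(matrix, shape).shape)
```

Folding along the first axis is only one possible convention; which axis should become the row dimension depends on how the data is consumed on the DAPHNE side. Libraries that can expose their data as a numpy array without copying could reuse the same path, which is worth verifying per library as part of the experiments.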

Suggested task extensions for larger teams:

Support the efficient data exchange with even more Python libraries for data science or special data modalities such as images or audio.

pdamme commented 1 year ago

Please note that the existing solution for efficient data exchange with numpy still needs to be merged into main, which will be done in due time.

pdamme commented 1 year ago

FYI: The pioneering work for efficient data exchange between DAPHNE and numpy has just been merged into main and is now ready to be looked at and extended (see src/api/python/).

While finalizing it, I noticed a few more interesting problems in that context, which could well be addressed in this project:

Transferring DAPHNE's sparse CSRMatrix to Python. The most straightforward solution would be to convert it to a DenseMatrix first. Or, in case the Python/numpy side is ready to handle sparse matrices, more efficient techniques could be devised.
Memory management: Currently, we make sure that data transferred from numpy to DAPHNE is not freed by DAPHNE, which is good. We also make sure that data transferred from DAPHNE to numpy is not freed by DAPHNE, which might imply a memory leak. This case would need some further investigation.
DAPHNE allows slicing out zero-copy views of any rectangular shape from existing matrices and frames. The memory underlying a view is not necessarily contiguous (in the case of DenseMatrix: when a view doesn't cover all columns of the original matrix). Most likely, such cases are not correctly supported yet.
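The non-contiguous view problem has a direct counterpart on the numpy side, which makes it easy to reproduce in a test. The sketch below is plain numpy, not DaphneLib code, and the helper names `describe_layout` and `prepare_for_transfer` are hypothetical. It shows how to detect that a 2-d array does not cover all columns of its base array, how to express the row stride in elements (assuming a receiver that can handle a row stride larger than the number of columns), and how to fall back to a contiguous copy otherwise.

```python
import numpy as np


def describe_layout(arr):
    """Return (rows, cols, row_stride) with row_stride counted in elements.

    Hypothetical helper. It accepts only layouts where the elements
    within a row are adjacent in memory, which is the case for a
    rectangular view sliced out of a row-major matrix.
    """
    if arr.ndim != 2:
        raise ValueError("expected a 2-d array")
    if arr.shape[1] > 1 and arr.strides[1] != arr.itemsize:
        raise ValueError("elements within a row must be adjacent")
    row_stride = arr.strides[0] // arr.itemsize
    return arr.shape[0], arr.shape[1], row_stride


def prepare_for_transfer(arr):
    """Return (array, copied) where array is C-contiguous.

    Hypothetical helper. The input is passed through untouched when it
    is already contiguous; otherwise a contiguous copy is made, which
    gives up zero-copy transfer for this case.
    """
    if arr.flags["C_CONTIGUOUS"]:
        return arr, False
    return np.ascontiguousarray(arr), True


if __name__ == "__main__":
    base = np.arange(12, dtype=np.float64).reshape(3, 4)
    view = base[:, 1:3]  # covers only 2 of the 4 columns
    print(view.flags["C_CONTIGUOUS"], describe_layout(view))
    out, copied = prepare_for_transfer(view)
    print(copied, np.shares_memory(out, base))
```

Passing the row stride along keeps the transfer zero-copy but requires the receiving side to honor it. Copying is simpler but costs time and memory, so comparing the two options would fit well into the experiments.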

danielwetzel commented 1 year ago

Leaving a comment to confirm that I am working on this issue as part of my LDE Student Project.

pdamme commented 1 year ago

Great, thanks for letting everyone know! We're always happy about new contributors.

Niklas-Ventker commented 1 year ago

Also leaving a comment to confirm that I am working on this issue as part of my LDE Student Project.

pdamme commented 1 year ago

Thanks, I assigned you as well.

danielwetzel commented 1 year ago

Pull Request created: #585

daphne-eu / daphne

Connecting DAPHNE to the data science ecosystem: Efficient data exchange with popular Python libs #499