daphne-eu / daphne

DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines
Apache License 2.0

Connecting DAPHNE to the data science ecosystem: Efficient data exchange with popular Python libs #499

Closed pdamme closed 5 months ago

pdamme commented 1 year ago

Motivation: DAPHNE has its own domain-specific language, DaphneDSL, for linear and relational algebra, which serves research on new language features and abstractions. However, the most popular language for data science nowadays is Python, and data scientists cannot be expected to switch to DaphneDSL. Therefore, DAPHNE offers a Python API, the so-called DaphneLib. DaphneLib offers DAPHNE’s operations in Python, while still guaranteeing DAPHNE’s unique performance characteristics through lazy evaluation. To allow a seamless integration of DAPHNE into the existing Python-based data science ecosystem, exchanging data between DaphneLib and established Python libraries for data science, such as numpy, pandas, TensorFlow, and PyTorch, must be simple. A naive solution could be to write data in one of these libraries to a file and to read it again in DAPHNE (or vice versa). However, this generic approach is unacceptably inefficient and would hinder complex data analysis tasks combining DAPHNE and other libraries. Instead, an efficient data transfer via means such as shared memory (ideally in a zero-copy manner) or inter-process communication should be favored. Recently, the initial infrastructure for a zero-copy data exchange between DAPHNE and numpy has been developed in the context of a bachelor thesis in DAPHNE.
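To make the term "zero-copy" concrete, here is a minimal sketch using only the Python standard library. It is not DAPHNE's or DaphneLib's actual implementation; the `array.array` buffer is merely a stand-in for a library-owned array (such as a numpy ndarray), and all variable names are made up for illustration. The idea is that only an address and an element count are handed over, and the receiving side reinterprets the same memory instead of copying it.

```python
import array
import ctypes

# Producer side: a contiguous buffer of float64 values (a stand-in for a
# library-owned array; this is an assumption for illustration only).
values = array.array("d", [1.0, 2.0, 3.0, 4.0])

# Only the address and the element count cross the boundary, not the data.
address, length = values.buffer_info()

# Consumer side: reinterpret the same memory as a C array of doubles.
# from_address() neither copies nor keeps `values` alive, so the producer
# must outlive this view.
view = (ctypes.c_double * length).from_address(address)

# A write through the view is visible to the producer: both share memory.
view[0] = 42.0
shared = values[0] == 42.0

# For contrast, an explicit copy is independent of the original.
copied = array.array("d", values)
copied[1] = -1.0
independent = values[1] == 2.0
```

The lifetime caveat in the comments is the main design concern with this style of exchange: whoever owns the buffer has to keep it alive for as long as the other side uses the view.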

Task: This project is about extending the existing infrastructure to support efficient data exchange with more data science libraries, especially pandas, TensorFlow, and PyTorch. This will directly contribute to connecting DAPHNE with the existing data science ecosystem and thereby make DAPHNE simpler to use. Implementation in Python and C++.

Hints on approaching this task:

Suggested task extensions for larger teams:

pdamme commented 1 year ago

Please note that the existing solution for efficient data exchange with numpy still needs to be merged into main, which will be done in due time.

pdamme commented 1 year ago

FYI: The pioneering work for efficient data exchange between DAPHNE and numpy has just been merged into main and is now ready to be looked at and extended (see src/api/python/).

While finalizing it, I noticed a few more interesting problems in that context, which could well be addressed in this project:

danielwetzel commented 1 year ago

Leaving a comment to confirm that I am working on this issue as part of my LDE Student Project.

pdamme commented 1 year ago

Great, thanks for letting everyone know! We're always happy about new contributors.

Niklas-Ventker commented 1 year ago

Also leaving a comment to confirm that I am working on this issue as part of my LDE Student Project.

pdamme commented 1 year ago

Thanks, I assigned you as well.

danielwetzel commented 1 year ago

Pull Request created: #585