Closed by pdamme 5 months ago.
Please note that the existing solution for efficient data exchange with numpy still needs to be merged into main, which will be done in due time.
FYI: The pioneering work for efficient data exchange between DAPHNE and numpy has just been merged into main and is now ready to be looked at and extended (see src/api/python/).
While finalizing it, I noticed a few more interesting problems in that context, which could well be addressed in this project:
- Transferring a CSRMatrix to Python. The most straightforward solution would be to convert it to a DenseMatrix first. Or, in case the Python/numpy side is ready to handle sparse matrices, more efficient techniques could be devised.
- Transferring views into a DenseMatrix (e.g., when a view doesn't cover all columns of the original matrix). Most likely, such cases are not correctly supported yet.

Leaving a comment to confirm that I am working on this issue as part of my LDE Student Project.
Great, thanks for letting everyone know! We're always happy about new contributors.
Also leaving a comment to confirm that I am working on this issue as part of my LDE Student Project.
Thanks, I assigned you as well.
Pull Request created: #585
Motivation: DAPHNE has its own domain-specific language, DaphneDSL, for linear and relational algebra, which serves research on new language features and abstractions. However, the most popular language for data science nowadays is Python, and data scientists cannot be expected to switch to DaphneDSL. Therefore, DAPHNE offers a Python API, the so-called DaphneLib. DaphneLib offers DAPHNE’s operations in Python, while still guaranteeing DAPHNE’s unique performance characteristics through lazy evaluation. To allow a seamless integration of DAPHNE into the existing Python-based data science ecosystem, exchanging data between DaphneLib and established Python libraries for data science, such as numpy, pandas, TensorFlow, and PyTorch, must be simple. A naive solution could be to write data in one of these libraries to a file and to read it again in DAPHNE (or vice versa). However, this generic approach is unacceptably inefficient and would hinder complex data analysis tasks combining DAPHNE and other libraries. Instead, an efficient data transfer via means such as shared memory (ideally in a zero-copy manner) or inter-process communication should be favored. Recently, the initial infrastructure for a zero-copy data exchange between DAPHNE and numpy has been developed in the context of a bachelor thesis in DAPHNE.
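To make the zero-copy idea concrete, here is a minimal sketch using only numpy and ctypes. It is not DaphneLib's actual code: the helpers export_matrix and import_matrix are hypothetical names chosen for illustration, and the sketch assumes a dense, row-major float64 matrix. It hands over a raw address plus shape instead of the data itself, and falls back to a single copy when the input is a view that does not cover all columns (and is therefore not contiguous in memory).

```python
import ctypes

import numpy as np


def export_matrix(arr):
    """Return (owner, address, rows, cols) for a 2-D float64 matrix.

    Hypothetical helper, not part of DaphneLib. The returned owner array
    must be kept alive for as long as the address is in use.
    """
    if arr.ndim != 2 or arr.dtype != np.float64:
        raise TypeError("expected a 2-D float64 array")
    if not arr.flags["C_CONTIGUOUS"]:
        # A view skipping columns has gaps between rows; make one copy.
        arr = np.ascontiguousarray(arr)
    return arr, arr.ctypes.data, arr.shape[0], arr.shape[1]


def import_matrix(address, rows, cols):
    """Wrap existing memory as a numpy array without copying it."""
    buf = (ctypes.c_double * (rows * cols)).from_address(address)
    return np.frombuffer(buf, dtype=np.float64).reshape(rows, cols)


# Contiguous case: both arrays refer to the same memory.
original = np.arange(6, dtype=np.float64).reshape(2, 3)
owner, addr, rows, cols = export_matrix(original)
shared = import_matrix(addr, rows, cols)
shared[0, 0] = 42.0  # the write is visible through `original` as well

# View case: the first two columns only, so rows are not adjacent in memory.
view = original[:, :2]
view_owner, vaddr, vrows, vcols = export_matrix(view)
copied = import_matrix(vaddr, vrows, vcols)
```

Passing the strides along with the address would avoid the copy in the view case, provided the receiving side can interpret them; the sketch deliberately takes the simpler fallback.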
Task: This project is about extending the existing infrastructure to support efficient data exchange with more data science libraries, especially pandas, TensorFlow, and PyTorch. This will directly contribute to connecting DAPHNE with the existing data science ecosystem and thereby make DAPHNE simpler to use. Implementation in Python and C++.
Hints on approaching this task:
Suggested task extensions for larger teams: