daphne-eu / daphne

DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines
Apache License 2.0

String as value type of DenseMatrix #230

Open daphne-eu opened 2 years ago

daphne-eu commented 2 years ago

In GitLab by @pdamme on Mar 18, 2022, 11:42

So far, DAPHNE supports strings only as scalars, not as the values in a matrix or frame. However, data sets with string columns are commonplace and play an important role in both machine learning and database applications, so DAPHNE needs to support string values in matrices and frames as well.

The task is to implement, in C++, a specialization of DenseMatrix for the string value type. The major challenge is that each value can have a different length. Instead of storing each value as a separately allocated string, a more compact and cache-friendly structure is desired. For instance, one could store the concatenation of all string values in one large buffer and make the entries of the matrix pointers into that buffer. The data structure shall implement the typical interface of a matrix in DAPHNE, including get/set/append for individual values, direct access to the underlying buffers, and the creation of zero-copy views (row/column segments) into the data.
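To make the buffer idea concrete, here is a minimal sketch of such a layout. It is not DAPHNE's DenseMatrix interface: the class name `StringMatrixSketch` and its methods are hypothetical, and it covers only get/set, not append or views. It also stores per-cell offsets rather than raw pointers, which is a deliberate deviation from the wording above, because a growing buffer may reallocate and would invalidate stored pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch, NOT DAPHNE's DenseMatrix API.
// All string values live back to back in one buffer, each NUL-terminated.
// Each cell stores the offset of its value within that buffer.
class StringMatrixSketch {
    std::size_t numRows_;
    std::size_t numCols_;
    std::string buffer_;                // concatenated values
    std::vector<std::size_t> offsets_;  // one start offset per cell, row-major

public:
    StringMatrixSketch(std::size_t numRows, std::size_t numCols)
        : numRows_(numRows), numCols_(numCols),
          buffer_(1, '\0'),                   // offset 0 holds the empty string
          offsets_(numRows * numCols, 0) {}   // every cell starts out empty

    std::size_t numRows() const { return numRows_; }
    std::size_t numCols() const { return numCols_; }

    // Appends a copy of value to the buffer and redirects the cell to it.
    // The bytes of a previously stored value are not reclaimed here; a real
    // implementation would need some compaction strategy.
    void set(std::size_t r, std::size_t c, const char *value) {
        assert(r < numRows_ && c < numCols_ && value != nullptr);
        offsets_[r * numCols_ + c] = buffer_.size();
        buffer_.append(value);
        buffer_.push_back('\0');
    }

    // Returns a NUL-terminated view into the buffer. The pointer is only
    // valid until the next call to set(), which may reallocate the buffer.
    const char *get(std::size_t r, std::size_t c) const {
        assert(r < numRows_ && c < numCols_);
        return buffer_.data() + offsets_[r * numCols_ + c];
    }

    // Total number of bytes currently held by the shared buffer.
    std::size_t bufferSize() const { return buffer_.size(); }
};
```

With offsets, a zero-copy row segment could share the buffer and reference a sub-range of the offset array, but that part is left out of the sketch.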

In addition to the data representation itself, a handful of important operations on a matrix of strings shall be implemented. Which operations to implement can be chosen depending on personal interests and team size. Interesting candidates include:

daphne-eu commented 2 years ago

In GitLab by @pdamme on Mar 22, 2022, 18:27

mentioned in commit 783dd44680f7c8483ab768b7c38dd952f90a05f8

akroviakov commented 2 years ago

Assigned to @akroviakov.

pdamme commented 2 years ago

@akroviakov: I recommend specializing for DenseMatrix<const char *>, since we use const char * (not std::string) to represent string scalars at the moment. But if that turns out to be difficult, we can talk about it.
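For readers unfamiliar with the mechanism, a full class template specialization for `const char *` could look roughly like the sketch below. `DenseMatrixSketch` is a hypothetical stand-in, not DAPHNE's actual DenseMatrix class, and its constructor and get/set signatures are assumptions made for illustration. The generic template stores values directly, while the specialization copies the characters into a buffer it owns, so the matrix does not depend on the lifetime of the caller's string.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for a dense matrix template, NOT DAPHNE's DenseMatrix.
template <typename VT>
class DenseMatrixSketch {
    std::size_t numCols_;
    std::vector<VT> values_;  // row-major, value-initialized

public:
    DenseMatrixSketch(std::size_t numRows, std::size_t numCols)
        : numCols_(numCols), values_(numRows * numCols) {}

    VT get(std::size_t r, std::size_t c) const { return values_[r * numCols_ + c]; }
    void set(std::size_t r, std::size_t c, VT v) { values_[r * numCols_ + c] = v; }
};

// Full specialization for const char *: same get/set shape as above, but the
// characters are copied into one buffer owned by the matrix.
template <>
class DenseMatrixSketch<const char *> {
    std::size_t numCols_;
    std::string buffer_;                // concatenated, NUL-terminated values
    std::vector<std::size_t> offsets_;  // start offset per cell, row-major

public:
    DenseMatrixSketch(std::size_t numRows, std::size_t numCols)
        : numCols_(numCols), buffer_(1, '\0'), offsets_(numRows * numCols, 0) {}

    // The returned pointer is valid only until the next set().
    const char *get(std::size_t r, std::size_t c) const {
        return buffer_.data() + offsets_[r * numCols_ + c];
    }

    void set(std::size_t r, std::size_t c, const char *v) {
        assert(v != nullptr);
        offsets_[r * numCols_ + c] = buffer_.size();
        buffer_.append(v);        // copy the characters, not the pointer
        buffer_.push_back('\0');
    }
};
```

Code that is generic over the value type can then use `DenseMatrixSketch<double>` and `DenseMatrixSketch<const char *>` through the same get/set calls.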