daphne-eu / daphne

DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines
Apache License 2.0
65 stars 58 forks source link

Kernel for ctable operation #390

Closed pdamme closed 2 years ago

pdamme commented 2 years ago

This task is to implement a kernel for DaphneIR's CTableOp. The contingency table of two (n x 1) matrices lhs and rhs is a matrix res of shape (max(lhs) + 1, max(rhs) + 1), and res[i, j] = |{ k | A[k] = i and B[k] = j, 0 ≤ k ≤ n-1 }|.

Example:

lhs: 2
     4
     5
     4

rhs: 0
     3
     1
     3

res: 0 0 0 0
     0 0 0 0
     1 0 0 0
     0 0 0 0
     0 0 0 2
     0 1 0 0

We need:

Note that this operation is inspired by SystemDS's table() built-in function (see https://apache.github.io/systemds/site/dml-language-reference). However, its definition is slightly different here, since we start counting rows/columns at zero (not at one).

We currently need this kernel to make the PCA DaphneDSL script in scripts/algorithms/pca.daph run.

akroviakov commented 2 years ago

Assigned to @akroviakov.

akroviakov commented 2 years ago

@pdamme Shouldn't the resulting matrix per definition be the following:

res:  0 0 0 0
      0 0 0 0
      0 0 0 0
      0 0 0 0
      0 3 0 1
      0 2 0 0

e.g., res[4, 3] = 1 for k=1, since A[1] = 4 and B[1] = 3? Also, res gets filled with k values, what could be the reason for k to start at 0? Should the loop explicitly start from 0 in case of already zeroed DenseMatrix in constructor, should some coordinate in CSRMatrix be explicitly set to 0?

pdamme commented 2 years ago

If I'm not mistaken, my initial example should match the definition... (BTW: A and B should be lhs and rhs, respectively.) There are three distinct pairs of corresponding elements in lhs and rhs: (2, 0), (4, 3), and (5, 1). So these are the positions of non-zero elements in the result. (2, 0) and (5, 1) occur just once, but (4, 3) occurs twice, thus res[2, 0] == 1, res[4, 3] == 2, and res[5, 1] == 1.

Counting starts at zero refers to the positions within a matrix. In DAPHNE, the first row is row 0, while in SystemDS it is row 1. Same for columns.

If you zero the DenseMatrix on creation, you can simply set() the individual elements. For CSRMatrix you should not store zeros explicitly.

I hope that answers your question...

akroviakov commented 2 years ago

Yes sorry, my mistake, I've misinterpreted the brackets in the definition.