daphne-eu / daphne

DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines
Apache License 2.0
67 stars 62 forks source link

DaphneDSL slicing semantics #871

Closed philipportner closed 1 month ago

philipportner commented 1 month ago

I ran into some problems debugging #865 which lead me to commit cad4ac38990eb2e72ff61257d42e81f0ee8e7312 and 734655c49a2c7bc2b3ced7e08bff2fc6d73930f9 . In these commits I added a throw exception for slicing operations that would result in a zero row or column matrix, DenseMatrix(0x4, int64_t) or DenseMatrix(4x0, int64_t).

Apparently, I missed the TAG_INDEXING tests, and according to those we explicitly support this.

Is this intended? Do we actually want to support data structures with no rows, cols, or neither?

X = fill(1, 4, 4);
X = X[1:1,1:1];
print(X - 1);

Evaluates to

DenseMatrix(0x0, int64_t)

This behavior is mimicking pythons numpy behavior.

I'll revert the two commits from above, but I'd like some clarification on how DAPHNE wants to handle this.

Reason for why I'm looking into this, is a DaphneDSL script that runs fine when run without --vec, but crashes with --vec (#865). But it runs "fine" because this zero-length indexing silently happens in a nested loop which modifies the slice indexes. When run without --vec this does not appear to cause a problem, but with --vec we try to vectorize a pipeline with an input matrix that has no rows, causing the error in #865.

Should this be handled by DAPHNE, by being able to handle matrices with possibly multiple zero-length dimensions? The same way numpy does it? I'm not sure if all kernels are written in this way, some explicitly validate arguments, like SliceRow.h. Besides normal operator kernels, --vec would need to be modified, but, not sure how that would make any sense there.

Or should we not allow such objects and make this an error that has to be fixed by the user?

pdamme commented 1 month ago

Yes, any dimension of a DAPHNE data object may be zero. A trivial example where this policy makes sense is a relational selection with an empty result: we must be able to represent that the result has zero rows (while the number of columns, or the concrete schema, is still important, because it determines compatibility for possible subsequent operations like rbind() or set operations). There are also some operations which may return an "empty" data object with zero rows and/or columns, e.g., fill(), rand(), seq(), ctable(), and oneHot(). Consistently, right indexing (Slice[Row|Col]Op) should support empty results, too; note also that the inclusive lower bound and exclusive upper bound may not be known before run-time.

In general, I think not supporting zero row/column data objects would be an unnecessary limitation. At the same time, I don't think that the unit test cases for the kernels systematically test the behavior for zero row/column inputs or results. Something we should improve in the future.

Regarding the vectorized engine, I would expect that a zero-row input data object (that shall be splitted into row segments) will lead to an empty list of tasks...

philipportner commented 1 month ago

Thanks @pdamme. I'll create issues for the code that is not able to handle zero-length dimensions.