finch-tensor / finch-tensor-python

Sparse and Structured Tensor Programming in Python
MIT License

API: Add fully-dense no-copy and `scipy/numpy` input `Tensor` constructors #10

Closed mtsokol closed 8 months ago

mtsokol commented 8 months ago

Hi @willow-ahrens @hameerabbasi,

This PR contains:

  1. finch.Tensor(scipy.sparse.spmatrix) and finch.Tensor(numpy.ndarray) constructors.
  2. Support for no-copy constructor for fully dense Tensors.
  3. levels.py and utils.py files to clean up the code a bit.
  4. Storage class for device support.

willow-ahrens commented 8 months ago

Shouldn't we also have Tensor(arr), where arr is a numpy or scipy array, and do a no-copy if possible?

willow-ahrens commented 8 months ago

like, shouldn't Tensor(np.Array) do the no-copy?

hameerabbasi commented 8 months ago

Shouldn't we also have Tensor(arr), where arr is a numpy or scipy array, and do a no-copy if possible?

Since Python doesn't have function overloading, this is considered bad practice. I'd recommend type.from_numpy and type.from_scipy_sparse instead.

mtsokol commented 8 months ago

I think a no-copy from_numpy that gives a dense Tensor also makes sense.

mtsokol commented 8 months ago

like, shouldn't Tensor(np.Array) do the no-copy?

Right, that makes more sense (passing arr only) than building this lvl structure manually.

willow-ahrens commented 8 months ago

@hameerabbasi Could we use a dispatch pattern like:

def Tensor(arr):
    return arr.to_finch_tensor()

Or maybe something like:

no_copy_constructors = {numpy.ndarray: from_numpy, scipy.sparse.csr_matrix: from_scipy_CSR}
def Tensor(arr):
    return no_copy_constructors[type(arr)](arr)

willow-ahrens commented 8 months ago

@hameerabbasi I mean, we don't do add_numpy_and_scipy_arrays or add_numpy_and_numpy_arrays separately; I think there's a similar pattern used there, right?

hameerabbasi commented 8 months ago

There can only be one variant of the __init__, so what would need to be done is to "reinterpret" the first argument, with an "incorrect" arg name.

I mean, we don't do add_numpy_and_scipy_arrays or add_numpy_and_numpy_arrays separately; I think there's a similar pattern used there, right?

Well, scipy arrays just error on mixed ops, unless I'm mistaken.

willow-ahrens commented 8 months ago

I think this runs up against https://data-apis.org/array-api/2022.12/design_topics/static_typing.html

willow-ahrens commented 8 months ago

though I don't see it mentioned anywhere else in the standard

willow-ahrens commented 8 months ago

There can only be one variant of the __init__, so what would need to be done is to "reinterpret" the first argument, with an "incorrect" arg name.

I think what we're talking about here is just the case where arr is not None but lvl is indeed None

Well, scipy arrays just error on mixed ops, unless I'm mistaken.

I feel like there's a real opportunity here to design a simple, extensible system, perhaps with my dictionary approach above, that can route all of these different conversions from NumPy, SciPy sparse, pydata/sparse, and all the formats each supports through the same constructor function, and then print a helpful error message when we have to densify to convert.
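
A minimal sketch of that dictionary approach, assuming hypothetical _from_numpy/_from_scipy_csr converters (illustrative names, not the actual API):

import numpy as np
import scipy.sparse

def _from_numpy(arr):
    ...  # hypothetical: wrap the buffer zero-copy

def _from_scipy_csr(arr):
    ...  # hypothetical: reuse data/indices/indptr zero-copy

# Extensible registry: other libraries could register their own converters.
no_copy_constructors = {
    np.ndarray: _from_numpy,
    scipy.sparse.csr_matrix: _from_scipy_csr,
}

def Tensor(arr):
    try:
        return no_copy_constructors[type(arr)](arr)
    except KeyError:
        raise TypeError(
            f"no zero-copy conversion from {type(arr).__name__}; "
            "densify first (e.g. np.asarray(arr)) to convert"
        ) from None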

hameerabbasi commented 8 months ago

I think what we're talking about here is just the case where arr is not None but lvl is indeed None

An alternative here would be positional-only arguments, like def __init__(self, arr, /, lvl, ...). That way the user can't specify arr as a keyword and we're free to reinterpret it.
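
A quick sketch of that behavior (toy class, lvl as a placeholder):

class Tensor:
    def __init__(self, arr, /, lvl=None):
        # `arr` is positional-only, so callers can never write Tensor(arr=...),
        # leaving us free to reinterpret the first argument by its type.
        self.arr, self.lvl = arr, lvl

Tensor([1, 2, 3])      # fine
Tensor(arr=[1, 2, 3])  # TypeError: got some positional-only arguments
                       # passed as keyword arguments: 'arr'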

willow-ahrens commented 8 months ago

How does the following know which + to call?

>>> A = scipy.sparse.random(5, 5, 0.5)
>>> B = numpy.ones((5, 5))
>>> A + B
matrix([[1.        , 1.        , 1.        , 1.        , 1.49019306],
        [1.39644749, 1.        , 1.        , 1.        , 1.87599184],
        [1.85115063, 1.        , 1.97588302, 1.47926403, 1.        ],
        [1.65714027, 1.        , 1.        , 1.69535585, 1.        ],
        [1.        , 1.34376335, 1.00957799, 1.50217521, 1.19109791]])
>>> B + A
matrix([[1.        , 1.        , 1.        , 1.        , 1.49019306],
        [1.39644749, 1.        , 1.        , 1.        , 1.87599184],
        [1.85115063, 1.        , 1.97588302, 1.47926403, 1.        ],
        [1.65714027, 1.        , 1.        , 1.69535585, 1.        ],
        [1.        , 1.34376335, 1.00957799, 1.50217521, 1.19109791]])

hameerabbasi commented 8 months ago

How does the following know which + to call?

Right, so there's an __add__ and __radd__, and a sentinel NotImplemented is returned whenever we don't know how to handle the specific types.
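
As a toy sketch of that protocol (not the actual scipy/numpy machinery):

class Dense:
    def __add__(self, other):
        if isinstance(other, (Dense, Sparse)):
            return "Dense.__add__"
        return NotImplemented  # tells Python to try other.__radd__(self)

class Sparse:
    def __radd__(self, other):
        if isinstance(other, Dense):
            return "Sparse.__radd__"
        return NotImplemented  # if both sides fail, Python raises TypeError

Dense() + Sparse()  # -> 'Dense.__add__': the left operand handled it first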

willow-ahrens commented 8 months ago

Hameer, I think I'm misunderstanding your opposition to allowing Tensor(arr=np.Array) to use the type of arr to figure out which conversion to call. Are you opposed to this? There are many ways we could design it to make it extensible, and maybe there's even existing conversion functionality in Python we can use?

willow-ahrens commented 8 months ago

Python already has many such conversion functions, like float(x), which attempts to convert many different kinds of types to a float.

hameerabbasi commented 8 months ago

Hameer, I think I'm misunderstanding your opposition to allowing Tensor(arr=np.Array) to use the type of arr to figure out which conversion to call.

I'm not opposing it per se, just saying we should disallow Tensor(arr=some_array) and do Tensor(some_array) instead. I'm okay with the dictionary approach.

willow-ahrens commented 8 months ago

Oh, I see. In that case, what are the full set of constructors you envision?

hameerabbasi commented 8 months ago

Here's a small stub:

class Tensor:
    def __init__(self, arr, /, *, levels=None, copy: bool | None = None):
        """Permissive constructor."""
        ...

    @classmethod
    def from_numpy(cls, arr, /, copy: bool | None = None):
        """Only accepts NumPy arrays."""

    @classmethod
    def from_scipy_sparse(cls, arr, /, copy: bool | None = None):
        """Only accepts SciPy sparse arrays."""

copy=None would be permissive, zero-copy where possible. copy=False prohibits copying, and copy=True would force a copy.
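
A sketch of how that tri-state could be enforced inside from_numpy (the contiguity test and _wrap helper are stand-ins, not the real implementation):

import numpy as np

class Tensor:
    @classmethod
    def from_numpy(cls, arr, /, copy: bool | None = None):
        shareable = arr.flags["C_CONTIGUOUS"]  # stand-in for a real zero-copy check
        if copy is None:
            data = arr if shareable else arr.copy()  # permissive: share when possible
        elif copy:
            data = arr.copy()  # copy=True: always copy
        elif not shareable:
            raise ValueError("zero-copy construction is not possible for this array")
        else:
            data = arr  # copy=False and shareable: share the buffer
        return cls._wrap(data)  # hypothetical internal constructor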

willow-ahrens commented 8 months ago

I see. One challenge with this approach is that we might want to specify lvls and not arr.

hameerabbasi commented 8 months ago

I see. One challenge with this approach is that we might want to specify lvls and not arr.

Do you see a case where we might need to specify both?

willow-ahrens commented 8 months ago

Yes. We specify both when we want to copy arr into a tensor with storage lvls; see https://github.com/willow-ahrens/finch-tensor/blob/183db78289e1d571128586e62ad0dee450d9e865/src/finch/tensor.py#L19-L43

hameerabbasi commented 8 months ago

That, I feel, is going a bit far; it's like we're overloading one function with way too much functionality, which may be confusing for the average user. However, if we really want to do it in __init__, I'd suggest the signature __init__(self, /, *, arrs=None, levels=None); that way we have to be explicit and specify both as keywords.

An alternative would be to make a zero-copy version, then have the format conversion as a separate step: Tensor(arr).asformat(levels=..., order=...).
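
i.e., roughly (asformat here is a sketch, reusing the level constructors from this thread):

t = Tensor(arr)  # zero-copy wrap in whatever format `arr` already has
t = t.asformat(levels=Dense(Sparse(Element(0.0))), order="F")  # explicit, may copy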

willow-ahrens commented 8 months ago

I want to do whatever feels "pythonic". What example constructors can we draw inspiration from? So far, I have https://numpy.org/doc/stable/reference/generated/numpy.array.html#numpy.array and https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_array.html#scipy.sparse.csr_array, both of which can construct from several different argument types.

There's also of course https://sparse.pydata.org/en/stable/construct.html, with different constructors for each input, which I'd like to avoid because we can just figure it out from the type of the argument.

Note that in scipy, csr_array(S) is S.tocsr() when supported.
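
For example, with scipy itself:

import numpy as np
import scipy.sparse as sp

S = sp.random(5, 5, density=0.5, format="coo")
A = sp.csr_array(S)          # format conversion, like S.tocsr()
B = sp.csr_array(np.eye(3))  # also accepts dense input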

hameerabbasi commented 8 months ago

I want to do whatever feels "pythonic". What example constructors can we draw inspiration from?

I guess different formats are different types, but I'm guessing here we want to mush them into one type? One option is per-format types/classes via metaclasses; another is to have one type expose a .levels or .format property and do format conversions via .asformat(...).

As far as Python array API support goes, .device can be "overloaded" with functionality to add extra information about the format somehow, see the page on Array API support. There is precedent for this, e.g. Dask sets the device to a Lazy device if the task graph hasn't been executed.

willow-ahrens commented 8 months ago

From the meeting, here's the way we'll implement constructors:

import numpy as np

np.empty((11, 22, 33), dtype=np.float64, device=Storage(Dense(Sparse(Element(0.0))), order="F"))

A = scipy.sparse.csr_matrix(...)

B = Tensor(Storage(Dense(Sparse(Element(0.0))), order="F"))
B = Tensor(DenseStorage(order="F"))
B = Tensor(CSFStorage(order="F"))
C = Tensor(A)
reduce(B * C, axis=1)

@compile function foo(B, C)
    reduce(B * C, axis=1)
end

D = finch.asarray(A, device=Storage(Dense(Sparse(Element(0.0))), order="F"))
E = D.to_device(Storage(Dense(Sparse(Element(0.0))), order="C"))  # Transposes, no zero-copy conversion
F = D.to_device(Storage(Dense(Sparse(Element(0.0))), order="F"))  # Zero-copy conversion
F = D.to_device(Storage(Dense(Sparse(Element(0.0))), order="F"), copy=True)  # Copies, but same format

sparse.asarray(Storage(Dense(Sparse(Element(0.0))), order="F"))
Tensor(scipy.sparse.csr_matrix(...))

We also agreed to use an if/else chain to dispatch on numpy, scipy, etc., with a __finch_tensor__ fallback protocol.
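
A sketch of that dispatch (from_numpy/from_scipy_sparse follow the stub above; the error message is illustrative):

import numpy as np
import scipy.sparse

def asarray(obj):
    if isinstance(obj, np.ndarray):
        return Tensor.from_numpy(obj)
    if isinstance(obj, scipy.sparse.spmatrix):
        return Tensor.from_scipy_sparse(obj)
    if hasattr(obj, "__finch_tensor__"):
        return obj.__finch_tensor__()  # fallback protocol for third-party types
    raise TypeError(f"cannot convert {type(obj).__name__} to a finch Tensor")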

mtsokol commented 8 months ago

Hi @willow-ahrens @hameerabbasi,

I pushed the latest changes from today, and this PR currently contains:

  1. finch.Tensor(scipy.sparse.spmatrix) constructor and finch.Tensor(numpy.ndarray) no-copy constructor.
  2. Support for no-copy constructor for fully dense Tensors.
  3. levels.py and utils.py files to clean up the code a bit.
  4. Storage class for device/format support.
  5. Implementation of to_device.

Tomorrow I will bump the Finch.jl version to make the SciPy constructor also non-copying, and add asarray and some tests to cover all the cases from the last comment.

You can take a look at the tests to see how it looks right now, e.g.:

a = Tensor(np_array)  # no copy
Tensor(np_array).to_device(Storage(Dense(SparseList(Element(0.0)))))

I made finch.Tensor a function to allow constructing both Tensors and COO/CSC/CSR classes from the same "constructor". Please share your feedback!

willow-ahrens commented 8 months ago

I guess one question I'm left with is why the format aliases aren't just functions, e.g. finch.csr_storage() or finch.csr_tensor(). We could keep the function signature the same as Tensor, but just drop the lvl or order arguments accordingly.

willow-ahrens commented 8 months ago

One question also arises: is Storage serving the same role as Tensor? There are some cases where we pass order to Storage and others where we pass it to Tensor. I would expect us to choose one of the following:

  1.
     np.empty((11, 22, 33), dtype=np.float64, device=Storage(Dense(Sparse(Element(0.0))), order="F"))
     B = Tensor(Storage(Dense(Sparse(Element(0.0))), order="F"))
     B = Tensor(dense_storage(2, order="F"))
     C = Tensor(A)
     D = finch.asarray(A, device=Storage(Dense(Sparse(Element(0.0))), order="F"))

  2.
     np.empty((11, 22, 33), dtype=np.float64, device=Tensor(Dense(Sparse(Element(0.0))), order="F"))
     B = Tensor(Dense(Sparse(Element(0.0))), order="F")
     B = dense_tensor(2, order="F")
     C = Tensor(A)
     D = finch.asarray(A, device=Tensor(Dense(Sparse(Element(0.0))), order="F"))
What I'm seeing in the PR feels like a mixture of these two, and I'm confused.

mtsokol commented 8 months ago

There are some cases where we pass order to Storage and others where we pass it to Tensor.

Good point - let's keep order as a part of Storage only.

mtsokol commented 8 months ago

Ok, the no-copy SciPy constructors with tests are completed. I applied most of the review comments, such as removing the CSC and CSR classes (the only one left is to make order present only in the Storage object).

mtsokol commented 8 months ago

I guess, one question I'm left with is why the format aliases aren't just functions. e.g. finch.csr_storage() or finch.csr_tensor(). We could keep the function signature the same as Tensor, but just drop the lvl or order arguments accordingly.

Good point - now it's finch.Tensor.construct_csr((data, indices, indptr), shape), to match the SciPy API.

mtsokol commented 8 months ago

Would it be possible to move Storage from the keyword to a single positional argument? Consider the difference between ...

Sure! I can make Tensor() accept only one positional argument. But then I need to get rid of the Tensor(arr=Tensor, storage=storage) constructor, which is used in e.g. the to_device implementation:

def to_device(self, device: Storage) -> "Tensor":
    return Tensor(arr=self, storage=device)

But I can make this constructor a class method; it's mostly used internally here, since the public API recommends Tensor(arr).to_device(storage) instead.
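
e.g., a sketch of the classmethod variant (_from_tensor_with_storage is a hypothetical name):

class Tensor:
    @classmethod
    def _from_tensor_with_storage(cls, tensor, storage):
        """Internal: copy/convert `tensor` into the format given by `storage`."""
        ...

    def to_device(self, device: "Storage") -> "Tensor":
        return type(self)._from_tensor_with_storage(self, device)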

willow-ahrens commented 8 months ago

Oh, I see the point of the keyword. I suppose there's also finch.asarray(arr, device=Storage(...)). But I agree, the to_device approach is a better way to achieve that.

mtsokol commented 8 months ago

Oh, I see the point of the keyword. I suppose there's also finch.asarray(arr, device=Storage(...)). But I agree, the to_device approach is a better way to achieve that.

Sure! Now Tensor() accepts only one positional argument, obj, and proceeds according to its type. For now, instead of asarray, I added _from_other_tensor(), as I want to implement the Array API constructor functions separately, if that's OK.

The PR is again ready from my side!