mtsokol closed this 8 months ago
Shouldn't we also have `Tensor(arr)`, where `arr` is a NumPy or SciPy array, and do a no-copy conversion if possible? Like, shouldn't `Tensor(np.Array)` do the no-copy?
> Shouldn't we also have `Tensor(arr)`, where `arr` is a NumPy or SciPy array, and do a no-copy conversion if possible?
Since Python doesn't have function overloading, this is considered a bad practice. I'd recommend `type.from_numpy` and `type.from_scipy_sparse` instead.
I think a no-copy `from_numpy` that gives a dense Tensor also makes sense.
> Like, shouldn't `Tensor(np.Array)` do the no-copy?
Right, that makes more sense (passing `arr` only) than building this `lvl` structure manually.
@hameerabbasi Could we use a dispatch pattern like:

```python
def Tensor(arr):
    return arr.to_finch_tensor()
```

Or maybe something like:

```python
no_copy_constructors = {np.ndarray: from_numpy, scipy.sparse.csr_array: from_scipy_CSR}

def Tensor(arr):
    return no_copy_constructors[type(arr)](arr)
```
@hameerabbasi I mean, we don't do `add_numpy_and_scipy_arrays` or `add_numpy_and_numpy_arrays` separately; I think there's a similar pattern used there, right?
There can only be one variant of `__init__`, so what would need to be done is to "reinterpret" the first argument, with an "incorrect" arg name.
> I mean, we don't do `add_numpy_and_scipy_arrays` or `add_numpy_and_numpy_arrays` separately; I think there's a similar pattern used there, right?
Well, scipy arrays just error on mixed ops, unless I'm mistaken.
I think this runs up against https://data-apis.org/array-api/2022.12/design_topics/static_typing.html, though I don't see it mentioned anywhere else in the standard.
> There can only be one variant of `__init__`, so what would need to be done is to "reinterpret" the first argument, with an "incorrect" arg name.
I think what we're talking about here is just the case where `arr` is not `None` but `lvl` is indeed `None`.
> Well, scipy arrays just error on mixed ops, unless I'm mistaken.
I feel like there's a real opportunity here to design a simple, extensible system, perhaps with my dictionary approach above, that can route all of these different conversions (from NumPy, SciPy sparse, PyData Sparse, and all the formats each supports) through the same constructor function, and then just print a helpful error message when we have to densify to convert.
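As a hypothetical sketch of that extensible, dictionary-based design (the registry, `register_converter` decorator, and placeholder return values below are illustrative assumptions, not finch-tensor's actual API):

```python
import numpy as np

# Registry mapping input types to converter functions. Third-party code
# could register its own converters; unsupported types get a helpful error.
_converters = {}

def register_converter(input_type):
    def decorator(func):
        _converters[input_type] = func
        return func
    return decorator

@register_converter(np.ndarray)
def _from_numpy(arr):
    return ("dense", arr)  # placeholder for a real no-copy wrap

def Tensor(arr):
    # Walk the MRO so subclasses of registered types dispatch correctly.
    for cls in type(arr).__mro__:
        if cls in _converters:
            return _converters[cls](arr)
    raise TypeError(
        f"no registered conversion from {type(arr).__name__}; "
        "densify first or register a converter"
    )
```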
> I think what we're talking about here is just the case where `arr` is not `None` but `lvl` is indeed `None`.
An alternative here would be positional-only arguments, like `def __init__(arr, /, lvl, ...)`. That way the user can't specify `arr` as a keyword and we're free to reinterpret it.
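A minimal sketch of the positional-only idea, assuming a toy class body (the attribute names and the `isinstance` check are placeholders, not the real implementation):

```python
import numpy as np

class Tensor:
    # arr is positional-only (it sits before the /), so Tensor(arr=...)
    # raises a TypeError and the constructor is free to reinterpret the
    # first positional argument by its type.
    def __init__(self, arr, /, lvl=None):
        if isinstance(arr, np.ndarray):
            self.data, self.lvl = arr, lvl  # zero-copy: just keep a reference
        else:
            raise TypeError(f"unsupported input type: {type(arr).__name__}")
```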
How does the following know which `+` to call?

```python
>>> A = scipy.sparse.random(5, 5, 0.5)
>>> B = numpy.ones((5, 5))
>>> A + B
matrix([[1.        , 1.        , 1.        , 1.        , 1.49019306],
        [1.39644749, 1.        , 1.        , 1.        , 1.87599184],
        [1.85115063, 1.        , 1.97588302, 1.47926403, 1.        ],
        [1.65714027, 1.        , 1.        , 1.69535585, 1.        ],
        [1.        , 1.34376335, 1.00957799, 1.50217521, 1.19109791]])
>>> B + A
matrix([[1.        , 1.        , 1.        , 1.        , 1.49019306],
        [1.39644749, 1.        , 1.        , 1.        , 1.87599184],
        [1.85115063, 1.        , 1.97588302, 1.47926403, 1.        ],
        [1.65714027, 1.        , 1.        , 1.69535585, 1.        ],
        [1.        , 1.34376335, 1.00957799, 1.50217521, 1.19109791]])
```
> How does the following know which `+` to call?
Right, so there's an `__add__` and an `__radd__`, and a sentinel `NotImplemented` is returned whenever we don't know how to handle the specific types.
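To illustrate the protocol with a toy class (not Finch's actual types): when `__add__` returns `NotImplemented`, Python tries the other operand's `__radd__`, and only raises `TypeError` if both decline.

```python
class Toy:
    def __init__(self, v):
        self.v = v

    def __add__(self, other):
        if isinstance(other, (Toy, int, float)):
            return Toy(self.v + getattr(other, "v", other))
        return NotImplemented  # decline; let the other operand try

    # 1 + Toy(2) lands here after int.__add__ returns NotImplemented
    __radd__ = __add__
```

This is exactly how `scipy.sparse + numpy` interop works in the example above: one of the two `__add__`/`__radd__` implementations knows how to handle the mixed pair.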
Hameer, I think I'm misunderstanding your opposition to allowing `Tensor(arr=np.Array)` to use the type of `arr` to figure out which conversion to call. Are you opposed to this? There are many ways we could design it to make it extensible, and maybe there's even an existing conversion mechanism in Python we can use? Python already has many such methods, like `float(x)`, which attempts to convert many different kinds of types to a float.
> Hameer, I think I'm misunderstanding your opposition to allowing `Tensor(arr=np.Array)` to use the type of `arr` to figure out which conversion to call.
I'm not opposing it per se, just saying we should disallow `Tensor(arr=some_array)` and instead do `Tensor(some_array)`. I'm okay with the dictionary approach.
Oh, I see. In that case, what is the full set of constructors you envision?
Here's a small stub:
```python
class Tensor:
    def __init__(self, arr, /, *, levels=None, copy: bool | None = None):
        """Permissive constructor."""
        ...

    @classmethod
    def from_numpy(cls, arr, /, copy: bool | None = None):
        """Only accepts NumPy arrays."""
        ...

    @classmethod
    def from_scipy_sparse(cls, arr, /, copy: bool | None = None):
        """Only accepts SciPy sparse arrays."""
        ...
```
`copy=None` would be permissive: zero-copy where possible. `copy=False` prohibits copying, and `copy=True` would force a copy.
I see. One challenge with this approach is that we might want to specify `lvls` and not `arr`.
> I see. One challenge with this approach is that we might want to specify `lvls` and not `arr`.
Do you see a case where we might need to specify both?
Yes. We specify both when we want to copy `arr` into a tensor with storage `lvls`. See https://github.com/willow-ahrens/finch-tensor/blob/183db78289e1d571128586e62ad0dee450d9e865/src/finch/tensor.py#L19-L43
That, I feel, is going a bit far; it's like we're overloading one function with way too much functionality, which may be confusing for the average user. However, if we really want to do it in `__init__`, I'd suggest the signature `__init__(self, /, *, arrs=None, levels=None)`; that way we have to be explicit and specify both as keywords.
An alternative would be to make a zero-copy version, then have the format conversion as a separate step: `Tensor(arr).asformat(levels=..., order=...)`.
I want to do whatever feels "pythonic". What example constructors can we draw inspiration from? So far, I have https://numpy.org/doc/stable/reference/generated/numpy.array.html#numpy.array and https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_array.html#scipy.sparse.csr_array, both of which seem to be able to construct from different argument types.
There's also of course https://sparse.pydata.org/en/stable/construct.html, with a different constructor for each input, which I'd like to avoid because we can just figure it out from the type of the argument.
Note that in SciPy, `csr_array(S)` is `S.tocsr()` when supported.
> I want to do whatever feels "pythonic". What example constructors can we draw inspiration from?
I guess different formats are different types, but I'm guessing here we want to mush them into one type? One option is per-format types/classes via metaclasses; another is to have one type expose a `.levels` or `.format` property and do format conversions via `.asformat(...)`.
As far as Python Array API support goes, `.device` can be "overloaded" with functionality to add extra information about the format somehow; see the page on Array API support. There is precedent for this, e.g. Dask sets the device to a lazy device if the task graph hasn't been executed.
From the meeting, here's the way we'll implement constructors (pseudocode):

```
import numpy as np

np.empty((11, 22, 33), dtype = np.float64, device = Storage(Dense(Sparse(Element(0.0))), order = "F"))

A = scipy.sparse.csr_matrix(...)
B = Tensor(Storage(Dense(Sparse(Element(0.0)))), order = "F")
B = Tensor(DenseStorage(order = "F"))
B = Tensor(CSFStorage(order = "F"))
C = Tensor(A)

reduce(B * C, axis = 1)

@compile function foo(B, C)
    reduce(B * C, axis = 1)
end

D = finch.asarray(A, device = Storage(Dense(Sparse(Element(0.0))), order = "F"))
E = D.to_device(Storage(Dense(Sparse(Element(0.0))), order = "C"))  # transposes, no zero-copy conversion
F = D.to_device(Storage(Dense(Sparse(Element(0.0))), order = "F"))  # zero-copy conversion
F = D.to_device(Storage(Dense(Sparse(Element(0.0))), order = "F"), copy = true)  # copies, but same format

sparse.asarray(Storage(Dense(Sparse(Element(0.0))), order = "F"))
Tensor(scipy.sparse.csr_matrix(...))
```

We also agreed to use an if/else to dispatch on NumPy, SciPy, etc., with a `__finch_tensor__` fallback protocol.
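A minimal sketch of that agreed dispatch (the `__finch_tensor__` protocol name comes from this thread; the `tensor_from` helper and its return values are placeholders, and a SciPy sparse branch would be added alongside the NumPy one):

```python
import numpy as np

def tensor_from(obj):
    # Explicit checks for known container types first (NumPy here; SciPy
    # sparse would get a similar branch)...
    if isinstance(obj, np.ndarray):
        return ("from_numpy", obj)
    # ...then a protocol fallback so third-party types can opt in.
    to_finch = getattr(type(obj), "__finch_tensor__", None)
    if to_finch is not None:
        return to_finch(obj)
    raise TypeError(f"cannot convert {type(obj).__name__} to a Tensor")

class MySparse:
    # A third-party type opting into the protocol.
    def __finch_tensor__(self):
        return ("from_protocol", self)
```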
Hi @willow-ahrens @hameerabbasi,

I pushed the latest changes from today and this PR currently contains:

- `finch.Tensor(scipy.sparse.spmatrix)` constructor and `finch.Tensor(numpy.ndarray)` no-copy constructor.
- `levels.py` and `utils.py` files to clean up the code a bit.
- `Storage` class for device/formats support.
- `to_device`.

Tomorrow I will bump the `Finch.jl` version to make the SciPy constructor also non-copying, and add `asarray` and some tests to cover all cases from the last comment.
You can take a look at the tests to see how it looks right now, e.g.:

```python
a = Tensor(np_array)  # no copy
Tensor(np_array).to_device(Storage(Dense(SparseList(Element(0.0)))))
```
I made `finch.Tensor` a function to allow constructing both Tensors and COO/CSC/CSR classes from the same "constructor". Please share your feedback!
I guess one question I'm left with is why the format aliases aren't just functions, e.g. `finch.csr_storage()` or `finch.csr_tensor()`. We could keep the function signature the same as `Tensor`, but just drop the `lvl` or `order` arguments accordingly.
One question also arises: is `Storage` serving the same role as `Tensor`? There are some cases where we pass `order` to `Storage` and others where we pass it to `Tensor`. I would expect us to choose between one of the following:

1.

```python
np.empty((11, 22, 33), dtype = np.float64, device = Storage(Dense(Sparse(Element(0.0))), order = "F"))
B = Tensor(Storage(Dense(Sparse(Element(0.0))), order = "F"))
B = Tensor(dense_storage(2, order = "F"))
C = Tensor(A)
D = finch.asarray(A, device = Storage(Dense(Sparse(Element(0.0))), order = "F"))
```

2.

```python
np.empty((11, 22, 33), dtype = np.float64, device = Tensor(Dense(Sparse(Element(0.0))), order = "F"))
B = Tensor(Dense(Sparse(Element(0.0))), order = "F")
B = dense_tensor(2, order = "F")
C = Tensor(A)
D = finch.asarray(A, device = Tensor(Dense(Sparse(Element(0.0))), order = "F"))
```

What I'm seeing in the PR feels like a mixture of these two, and I'm confused.
> There are some cases where we pass `order` to `Storage` and others where we pass it to `Tensor`.
Good point - let's keep `order` as a part of `Storage` only.
OK, no-copy SciPy constructors with tests are completed. I applied most of the review comments, such as removing the `CSC` and `CSR` classes (the only one left is to make `order` present only in the `Storage` object).
> I guess one question I'm left with is why the format aliases aren't just functions, e.g. `finch.csr_storage()` or `finch.csr_tensor()`. We could keep the function signature the same as `Tensor`, but just drop the `lvl` or `order` arguments accordingly.
Good point - now it's `finch.Tensor.construct_csr((data, indices, indptr), shape)` to match the SciPy API.
Would it be possible to move Storage from the keyword to a single positional argument? Consider the difference between ...
Sure! I can make `Tensor()` accept only one positional argument. But then I need to get rid of the `Tensor(arr=Tensor, storage=storage)` constructor, which is used in e.g. the `to_device` implementation:

```python
def to_device(self, device: Storage) -> "Tensor":
    return Tensor(arr=self, storage=device)
```

But I can make this constructor a class method; it's mostly used internally here, as the public API recommends `Tensor(arr).to_device(storage)` instead.
Oh, I see the point of the keyword. I suppose there's also `finch.asarray(arr, device=Storage(...))`. But I agree, the `to_device` approach is a better way to achieve that.
> Oh, I see the point of the keyword. I suppose there's also `finch.asarray(arr, device=Storage(...))`. But I agree, the `to_device` approach is a better way to achieve that.
Sure! Now `Tensor()` accepts only one positional argument, `obj`, and proceeds according to its type. For now, instead of `asarray`, I added `_from_other_tensor()`, as I want to implement the Array API constructor functions separately, if that's OK.
The PR is again ready from my side!
Hi @willow-ahrens @hameerabbasi,

This PR contains:

- `finch.Tensor(scipy.sparse.spmatrix)` and `finch.Tensor(numpy.ndarray)` constructors.
- `levels.py` and `utils.py` files to clean up the code a bit.
- `Storage` class for device support.