Open TomAugspurger opened 3 years ago
Hi all,

A group of folks from various projects / companies have put together a draft of an Array API standard, and are looking for feedback.

We have lots of experience working with arrays, both as consumers of arrays implementing the NumPy API (cupy, sparse) and as implementers of the API ourselves. Our feedback on the proposed standard would be welcome in the GitHub repo linked above.

We can use this issue to discuss if and how `dask.array` should adopt this standard. The document gives some guidance at https://data-apis.github.io/array-api/latest/purpose_and_scope.html#how-to-adopt-this-api, but this is fairly uncertain territory.
Do we have someone representing us on the array side? If not, I'd be happy to volunteer. Though would need to get plugged into the meetings somehow
cc @rgommers for visibility
Thanks @jakirkham! So far it's been mainly @TomAugspurger and @shoyer who contributed Dask expertise. Would be great to have you join. Also happy to have a call soon to catch you up on what's been happening. I'll try and collect a few of the Dask-related points (there weren't many at all in the past few months) and will get back to you.
Also, let me point out the NumPy Enhancement Proposal for adoption of the array API standard: https://github.com/numpy/numpy/pull/18456
Sounds good. Happy to do some reading as well (guessing there's lots to catch up on). That would be great. Thank you Ralf! 😄
Yep saw that. Will take a look at that as well 🙂
The one issue with an unresolved Dask-related angle is https://github.com/data-apis/array-api/issues/97 (related to data/value-dependent output shapes and how to represent "unknown shape" - Dask uses `nan` and TensorFlow uses `None`).
https://github.com/data-apis/array-api/issues/84 on including boolean indexing, which is hard for Dask to support, may also be good to review. Discussion concluded with color-coding that indexing mode as "not all libraries may support this, it's hard in combination with delayed evaluation and JIT compilation" (not implemented yet, that's why the issue is still open).
Other than that, I'd suggest looking at https://data-apis.github.io/array-api/latest/ and https://numpy.org/neps/nep-0047-array-api-standard.html for things that are suboptimal or missing from Dask's perspective.
Commented on the `nan` issue. For context, this was added in PR ( https://github.com/dask/dask/pull/1838 ). Maybe it can change, but to set expectations, it is likely a decent lift and an error-prone change. Additionally, other maintainers may be wary of such a change as it could be a source of bugs going forward. Admittedly, other libraries may feel similarly depending on how heavily they use this feature. Dask is fairly reliant on it, as you have seen with masking, raveling, or other operations where the content of the array may affect the result (`unique`).
Dask does support a fair bit of boolean indexing (though we may find some edge cases that don't work). However, as the mask itself may have unknown values, this falls in the category of things where the array's content affects the array shape, and so the shape of the result is unknown. I'm not sure I'd write this off as not supported, but it's also possible I don't understand what supported would mean in the context of the spec or the cases you have in mind. So more context here would be welcome.
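To make that concrete, here is a minimal sketch with plain `dask.array` showing a lazy mask producing an unknown shape (the printed values are just for illustration):

```python
import dask.array as da

x = da.arange(10, chunks=5)
mask = x % 2 == 0   # the mask's values are not known until compute time
y = x[mask]         # boolean indexing is supported...
print(y.shape)      # (nan,) - ...but the result's shape is unknown
print(y.compute())  # [0 2 4 6 8] - the real shape only exists after computing
```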
Ok will read up on this before we talk more 🙂
Edit: It would be good to discuss `ravel` somewhere. Have held off opening an issue, but I think trying to match NumPy too closely here is a pain point for Dask. If we could come up with a sensible way to relax that, which we agree would be acceptable, some (though certainly not all) unknown shape issues would go away. This would also offer a performance boost for users. See issue ( https://github.com/dask/dask/issues/4855 ) for more details.
Also related to the Data API framework: generally we reach for and write functions to check whether an object is an Array, DataFrame, Series, etc. ( https://github.com/dask/dask/issues/7327 ). Not sure if there are general tools scoped out in this framework to make those checks easier, but that would be very helpful for Dask.
Interesting. AFAIK there is not yet a way to distinguish a Data API array from a Data API dataframe. But since the latter is still in draft status (more "drafty" than the former, as it's not yet public), perhaps we can discuss it?
> Edit: It would be good to discuss `ravel` somewhere.
There is no `ravel` function. The NumPy API is messy in this regard: `ravel`, `flatten`, `flat`, and `reshape` are all similar, with some random differences like returning a copy vs. a view (which cannot be a distinguishing feature for the API standard, since views are an implementation detail). The standard just has `reshape`; `reshape(-1)` is equivalent to `ravel`.
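For example, a standard-compliant flattening (shown here against `numpy.array_api`, the reference implementation, but the same call works for any compliant namespace):

```python
import numpy.array_api as xp

x = xp.asarray([[1, 2], [3, 4]])
flat = xp.reshape(x, (-1,))  # portable replacement for np.ravel(x)
print(flat)                  # 1-d array [1, 2, 3, 4]
```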
> Generally we reach for and write functions to check whether an object is an Array, DataFrame, Series, etc. ( dask/dask#7327 ). Not sure if there are general tools scoped out in this framework to make those checks easier, but that would be very helpful for Dask
Arrays which support the standard, or from which you can at least retrieve the standard namespace with a compliant implementation, have an `__array_namespace__` attribute. That should be the easiest way to detect an array. For static typing we may develop a protocol (see https://data-apis.org/array-api/latest/design_topics/static_typing.html), but given that there's no plan to have a shared reference library that other libraries can depend on, there's no good way to design an `isinstance` check. Which is fine I think; a simple utility function that uses `hasattr(x, '__array_namespace__')` should suffice.
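Such a utility could be as small as this (a sketch; the function name here is made up):

```python
def is_array_api_obj(x):
    # Any object exposing __array_namespace__ is treated as a
    # standard-compliant array.
    return hasattr(x, "__array_namespace__")
```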
OK. With the ongoing discussion on the dataframe API, a similar check can be done for dataframes: `hasattr(x, '__dataframe__')`. But it's important to note that neither `__array_namespace__` nor `__dataframe__` may have any side effect, or it'd be triggered when calling `hasattr`. Not sure if it's possible to enforce this requirement in the Array/Dataframe API Standard...
> But it's important to note that neither `__array_namespace__` nor `__dataframe__` may have any side effect, or it'd be triggered when calling `hasattr`. Not sure if it's possible to enforce this requirement in the Array/Dataframe API Standard...
Already enforced by them being methods I think - attributes can easily have side effects, methods can't. Of course Python is dynamic in nature, so you could come up with some hack that has an unwanted side effect, but it's highly unlikely to occur in practice.
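A small sketch of why method lookup is side-effect free (the class here is illustrative only):

```python
import numpy.array_api as xp

class MyArray:
    def __array_namespace__(self, *, api_version=None):
        # This body only runs when the method is *called*, not when it is
        # looked up, so the hasattr() below triggers no side effects.
        return xp

print(hasattr(MyArray(), "__array_namespace__"))  # True; body never executed
```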
If folks wanted to start using the compliance suite `array-api-tests`, you might like to check out this simple wrapper that implements some crucial things most tests require. I wasn't sure if `dask.array` was also going to be your Array API namespace, a la PyTorch; that's what I assumed with this at least. https://gist.github.com/honno/70623811b88c0f5b224c668ec4ed86b7
Generally, please feel free to reach out via issues for any questions/bugs. We want to make the suite easier to use for implementors, so any feedback would be appreciated.
Thanks @honno for providing the wrapper and the test suite.
There are likely to be some compatibility issues with using `dask.array` as the Array API namespace. To pick one example, in Dask the signature for `eye` is

```
eye(N, chunks='auto', M=None, k=0, dtype=<class 'float'>)
```

whereas in the Array API it is

```
eye(n_rows, n_cols, /, *, k=0, dtype=None, device=None)
```

This means that an expression like `eye(100, 10)` means two different things: in Dask, 10 refers to the chunk size, but in the Array API it's the number of columns (`M`). It's not clear if these can be made compatible.
It might therefore be sensible to use a different module, like `dask.array_api`, just like NumPy, CuPy, and MXNet do. As a proof of concept I've implemented a wrapper to do that in this branch. (File naming and structure is based on `numpy.array_api`.) The array object is a regular `dask.array.Array` though.
Running the test suite I get:

```
ARRAY_API_TESTS_MODULE=dask.array_api pytest --max-examples=2
[snip]
====================================================================== short test summary info =======================================================================
FAILED array_api_tests/test_array_object.py::test_getitem_masking - IndexError: Too many indices for array
FAILED array_api_tests/test_creation_functions.py::test_eye - TypeError: dtype must be known for auto-chunking
FAILED array_api_tests/test_creation_functions.py::test_linspace - ZeroDivisionError: float division by zero
FAILED array_api_tests/test_data_type_functions.py::test_can_cast - AssertionError: out=True, but should be False [can_cast(bool, int8)]
FAILED array_api_tests/test_linalg.py::test_tensordot - AssertionError: out.dtype=uint64, but should be uint8 [tensordot(uint8, uint8)]
FAILED array_api_tests/test_manipulation_functions.py::test_concat - TypeError: '<' not supported between instances of 'NoneType' and 'int'
FAILED array_api_tests/test_operators_and_elementwise_functions.py::test_positive[positive] - IndexError: Too many indices for array
FAILED array_api_tests/test_operators_and_elementwise_functions.py::test_positive[__pos__] - IndexError: Too many indices for array
FAILED array_api_tests/test_searching_functions.py::test_argmax - ValueError: At least one iterator operand must be non-NULL
FAILED array_api_tests/test_searching_functions.py::test_argmin - ValueError: At least one iterator operand must be non-NULL
FAILED array_api_tests/test_searching_functions.py::test_nonzero - AssertionError: out[0].size=1, but should be out[0].size=nan
FAILED array_api_tests/test_set_functions.py::test_unique_all - AssertionError: out.indices.shape=(nan,), but should be out.values.shape=(nan,)
FAILED array_api_tests/test_set_functions.py::test_unique_counts - AssertionError: out.counts.shape=(nan,), but should be out.values.shape=(nan,)
FAILED array_api_tests/test_set_functions.py::test_unique_inverse - TypeError: 'float' object cannot be interpreted as an integer
FAILED array_api_tests/test_set_functions.py::test_unique_values - TypeError: 'float' object cannot be interpreted as an integer
FAILED array_api_tests/test_signatures.py::test_has_names[__array_namespace__] - AssertionError: The array object is missing the method __array_namespace__()
FAILED array_api_tests/test_signatures.py::test_function_positional_args[eye] - AssertionError: Unexpected exception TypeError('dtype must be known for auto-chunki...
FAILED array_api_tests/test_signatures.py::test_function_positional_args[matmul] - TypeError: dtype must be known for auto-chunking
FAILED array_api_tests/test_signatures.py::test_function_positional_args[matrix_transpose] - TypeError: dtype must be known for auto-chunking
FAILED array_api_tests/test_signatures.py::test_function_positional_args[tensordot] - TypeError: dtype must be known for auto-chunking
FAILED array_api_tests/test_signatures.py::test_function_positional_args[vecdot] - TypeError: dtype must be known for auto-chunking
FAILED array_api_tests/test_signatures.py::test_function_keyword_only_args[eye] - AssertionError: Unexpected exception TypeError('dtype must be known for auto-chun...
FAILED array_api_tests/test_signatures.py::test_function_keyword_only_args[matmul] - TypeError: dtype must be known for auto-chunking
FAILED array_api_tests/test_signatures.py::test_function_keyword_only_args[matrix_transpose] - TypeError: dtype must be known for auto-chunking
FAILED array_api_tests/test_signatures.py::test_function_keyword_only_args[tensordot] - TypeError: dtype must be known for auto-chunking
FAILED array_api_tests/test_signatures.py::test_function_keyword_only_args[vecdot] - TypeError: dtype must be known for auto-chunking
================================================ 26 failed, 3544 passed, 541 skipped, 5 warnings in 83.74s (0:01:23) =================================================
```
So: a few failures, as might be expected, but also lots of tests passing, which is very encouraging.
A few caveats: I ran with only a maximum of two hypothesis examples, just to do a quick check at this stage. Increasing it would almost certainly find more failing cases. I also used the following `skips.txt` to exclude functions that are known to be missing in Dask (and also special cases, which are failing).

```
# expand_dims is not implemented in Dask
array_api_tests/test_manipulation_functions.py::test_expand_dims
array_api_tests/test_signatures.py::test_has_names[expand_dims]
# sorting is not implemented in Dask (https://github.com/dask/dask/issues/4368)
array_api_tests/test_sorting_functions.py::test_argsort
array_api_tests/test_sorting_functions.py::test_sort
array_api_tests/test_signatures.py::test_has_names[argsort]
array_api_tests/test_signatures.py::test_has_names[sort]
# in-place array operations are not implemented in Dask (https://github.com/dask/dask/issues/5199)
array_api_tests/test_signatures.py::test_has_names[__iadd__]
array_api_tests/test_signatures.py::test_has_names[__iand__]
array_api_tests/test_signatures.py::test_has_names[__ifloordiv__]
array_api_tests/test_signatures.py::test_has_names[__ilshift__]
array_api_tests/test_signatures.py::test_has_names[__imatmul__]
array_api_tests/test_signatures.py::test_has_names[__imod__]
array_api_tests/test_signatures.py::test_has_names[__imul__]
array_api_tests/test_signatures.py::test_has_names[__ior__]
array_api_tests/test_signatures.py::test_has_names[__ipow__]
array_api_tests/test_signatures.py::test_has_names[__irshift__]
array_api_tests/test_signatures.py::test_has_names[__isub__]
array_api_tests/test_signatures.py::test_has_names[__itruediv__]
array_api_tests/test_signatures.py::test_has_names[__ixor__]
# DLPack is not implemented in Dask
array_api_tests/test_signatures.py::test_has_names[__dlpack__]
array_api_tests/test_signatures.py::test_has_names[__dlpack_device__]
array_api_tests/test_signatures.py::test_has_names[from_dlpack]
# device support is not implemented in Dask
array_api_tests/test_signatures.py::test_has_names[to_device]
array_api_tests/test_signatures.py::test_has_names[device]
# don't run special cases yet
array_api_tests/special_cases
```
The next steps might be to investigate the failures, and to implement some of the functions that are missing in Dask (some definitely easier than others).
Thanks for pushing on this @tomwhite! 😄
Nice to see some initial success here. Agree we would want a different `array_api` module like other libraries have. Just like the other libraries, we wouldn't want to break existing functionality and would want users to opt in at some level.
Yeah, this is a good point that really hasn't come up in our discussions thus far @rgommers @kgryte (likely because it is very Dask-specific), but creation functions in Dask need a `chunks` argument. There really isn't something like this in the Array API and I'm not sure it makes much sense for other libraries, but perhaps we could allow these functions to take some additional arguments that the spec leaves open-ended? Anyway, would be good to get your thoughts on this.
Looking at the failures: guessing `getitem` exercises some selection case that Dask doesn't implement (though it would be good to know what it is and link it to a Dask issue). With `eye`, probably the Array API needs to skip passing `dtype` for arguments that are `None`, or Dask should replace `None` with the default type internally. Same with other functions that see similar issues. `linspace` smells like a bug (would just need to look into the input arguments and file a related issue). Dask doesn't implement `can_cast`, so it appears that failure comes from NumPy. Dask's `tensordot` promotes types. `unique` has `nan` in `shape` for unknown dimensions (don't really see that changing), though the Array API wrapper could translate `nan` to `None` (would most likely require some kind of wrapper `Array` object). Not really sure about the other error messages (they probably require deeper dives).
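The `nan`-to-`None` translation itself would be a one-liner; the work is in wrapping every place a shape is exposed. A sketch of the mapping (the helper name is hypothetical):

```python
import math

def _standard_shape(dask_shape):
    # Map Dask's nan markers for unknown dimensions to the standard's None.
    return tuple(None if math.isnan(d) else d for d in dask_shape)

print(_standard_shape((float("nan"), 4)))  # (None, 4)
```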
Seems we fall back to auto-chunking in a lot of cases. Idk how desirable that is. It is simpler to implement, but may have performance costs depending on what chunks are needed vs. what are used.
Great to see Dask is decently compliant already! Thanks for the wrapper @tomwhite.
> Guessing `getitem` exercises some selection case that Dask doesn't implement (though it would be good to know what it is and link it to a Dask issue).
Ah, assuming you mean the failing `test_getitem_masking` test and not `test_getitem`: the former is for the opt-in boolean array indexing support, so it's fine if it fails. In the future there'll be a way to specify whether boolean masking should be skipped (pytest flag?), along with the other opt-in functions like `nonzero`.
> Dask doesn't implement `can_cast`, so it appears that failure comes from NumPy.
Yep, this is known; hopefully you can keep relying on `numpy.array_api` here if https://github.com/numpy/numpy/pull/20883 goes through.
> `unique` has `nan` in `shape` for unknown dimensions (don't really see that changing).
Ah, this is interesting. It might be a good idea then for the suite to test shapes last, so it can still test that the dtype and the output values are as expected.
> In the future there'll be a way to specify whether boolean masking should be skipped (pytest flag?), along with the other opt-in functions like `nonzero`.

> Ah, this is interesting. It might be a good idea then for the suite to test shapes last, so it can still test that the dtype and the output values are as expected.
`unique()` also falls into that category of functions that have value-based shapes that we should allow skipping entirely with Dask.
Thanks for all the comments.
I've filed Dask issues for some of these errors: https://github.com/dask/dask/issues/8667, https://github.com/dask/dask/issues/8668, https://github.com/dask/dask/issues/8669, https://github.com/dask/dask/issues/8670.
> With `eye`, probably the Array API needs to skip passing `dtype` for arguments that are `None`, or Dask should replace `None` with the default type internally.
Filed https://github.com/dask/dask/issues/8669. I tried a local fix, and it fixes all the `TypeError: dtype must be known for auto-chunking` errors, not just the one for the `eye` test.
> `linspace` smells like a bug
This is https://github.com/dask/dask/issues/8670.
> `unique()` also falls into that category of functions that have value-based shapes that we should allow skipping entirely with Dask.
+1
I also noticed that all of the special case tests fail because they rely on boolean indexing, which, while supported in Dask, is lazy and causes these tests to fail because array shapes are unknown. I made a temporary local change to force shape computation (by calling `compute_chunk_sizes()` on the Dask side), and the tests pass, showing that the special cases are working (for two hypothesis examples at least). I wonder if the special case tests could avoid using boolean indexing? I filed https://github.com/data-apis/array-api-tests/issues/94 for this.
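For reference, `compute_chunk_sizes()` resolves the unknown dimensions in place by running a (potentially expensive) computation of the chunk sizes; a sketch continuing the earlier masking example:

```python
import dask.array as da

x = da.arange(10, chunks=5)
y = x[x % 2 == 0]
print(y.shape)           # (nan,)
y.compute_chunk_sizes()  # computes the chunk sizes (and hence the shape)
print(y.shape)           # (5,)
```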
> I also noticed that all of the special case tests fail because they rely on boolean indexing, which, while supported in Dask, is lazy and causes these tests to fail because array shapes are unknown. I made a temporary local change to force shape computation (by calling `compute_chunk_sizes()` on the Dask side), and the tests pass, showing that the special cases are working (for two hypothesis examples at least). I wonder if the special case tests could avoid using boolean indexing? I filed data-apis/array-api-tests#94 for this.
Yep we've been phasing out masking throughout the suite, and it's just special cases left! (Special cases will get generally overhauled in the next two weeks, we just left it alone because the spec's RST conversion was blocking until now.)
> causes these tests to fail because array shapes are unknown. I made a temporary local change to force shape computation (by calling `compute_chunk_sizes()` on the Dask side)
Is it OK to always have to call `compute_chunk_sizes()` for boolean indexing, though? There are several cases where it's "easy" to figure out the shape of the masked array (see https://github.com/dask/dask/issues/8638); why shouldn't those just work?
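One such "easy" case is indexing with a concrete (non-lazy) NumPy mask, where the shape is knowable without any compute; a small illustration:

```python
import numpy as np
import dask.array as da

x = da.ones((4, 4), chunks=2)
mask = np.array([True, False, True, False])  # concrete values, not lazy
print(x[mask].shape)  # (2, 4): no computation needed to know this
```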
Not really. Calling `compute_chunk_sizes` is quite expensive.
> Yeah, this is a good point that really hasn't come up in our discussions thus far @rgommers @kgryte (likely because it is very Dask-specific), but creation functions in Dask need a `chunks` argument. There really isn't something like this in the Array API and I'm not sure it makes much sense for other libraries, but perhaps we could allow these functions to take some additional arguments that the spec leaves open-ended?
Yes, it is always fine to extend the array API standard signatures with extra keywords that are library-specific. At that point of course you lose portability, but that's fine. You have a default that is reasonable, and then a Dask-specific way to tweak `chunks`.

The array API standard signatures making heavy use of keyword-only arguments helps here. If you keep `chunks` keyword-only, then there won't be a backwards-compatibility issue in the future in case the standard grows another keyword.
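A minimal sketch of what that could look like for a creation function (a hypothetical wrapper, not Dask's actual implementation; the float64 default is assumed from the standard's default real floating-point dtype):

```python
import dask.array as da

def ones(shape, *, dtype=None, device=None, chunks="auto"):
    # Standard-compliant call:  ones((2, 3))
    # Dask-specific extension:  ones((2, 3), chunks=(1, 3))
    # `device` is accepted for signature compatibility and ignored here.
    if dtype is None:
        dtype = "float64"
    return da.ones(shape, dtype=dtype, chunks=chunks)
```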
> I've filed Dask issues for some of these errors: dask/dask#8667, dask/dask#8668, dask/dask#8669, dask/dask#8670.
With these fixes the number of unexpected failures is now down to 11: https://github.com/tomwhite/dask/runs/5142135752?check_suite_focus=true. Most of them seem to be indexing issues.
@tomwhite Ah, it looks like the `argmin`/`argmax` tests are wrong (they're generating a non-`None` axis argument for 0d arrays). Will fix!
Ah, and `test_positive`/`test_asarray_arrays` are failing because an internal util uses masking (`ph.assert_array`); will have a think about that.
Have reviewed/merged 3 of those. The 4th one still has a pending review comment.
Thanks @honno and @jakirkham!
@tomwhite I pushed some improvements to the test suite per your work here, thanks!
I stopped generating 0d arrays in `test_argmin`/`test_argmax` for now... need to mull it over. I had first removed the original bug of passing `axis=0` with 0d array inputs, but was seeing this "bug":
```
>>> from dask import array_api as dxp
>>> x = dxp.asarray(0)
>>> out = dxp.argmin(x)
>>> int(out)
ValueError: At least one iterator operand must be non-NULL
...
>>> from numpy import array_api as nxp
>>> nxp.argmin(nxp.asarray(0))
Array(0, dtype=int64)  # i.e. the first index
```
Can Dask not support 0d inputs for `argmin`/`argmax`? Or maybe the conceptual model of having a first-and-only index of a 0d array is nonsensical anywho.
RE `test_positive`/`test_asarray_arrays` failing due to using masks, I've now removed the use of boolean masking. `test_positive` passes, but `test_asarray_arrays` is seemingly failing due to erroneous mutability when `copy=True`.
Per @asmeurer's suggestion I've added a `--disable-data-dependent-shapes` flag to skip the unique-related and `nonzero` tests completely. We do rely on the shape for the values testing, but maybe in the future we can still test the values for these functions... not quite sure.
I'm not sure how Dask works, but I would assume that so long as you avoid operations that produce data-dependent shapes (`unique`, boolean masking, etc.), the `.shape` of anything should be statically computable from the inputs, so Dask should be able to give a shape even if it hasn't computed the values.
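That matches how Dask behaves in practice: shape metadata flows through ordinary operations without any computation, and only the value-dependent ones lose it. A small check:

```python
import dask.array as da

x = da.ones((4, 4), chunks=2)
print((x + 1).T.shape)  # (4, 4): statically known, nothing computed
print(x[x > 0].shape)   # (nan,): data-dependent, unknown without computing
```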
I've opened https://github.com/dask/dask/pull/8750 to provide the Dask implementation. Comments very welcome.