dask / community


Python Array API Standard #109

Open TomAugspurger opened 3 years ago

TomAugspurger commented 3 years ago

Hi all,

A group of folks from various projects / companies have put together a draft of an Array API standard, and are looking for feedback.

We have lots of experience working with arrays, both as consumers of arrays implementing the NumPy API (CuPy, sparse) and as implementers of that API ourselves. Our feedback on the proposed standard would be welcome in the GitHub repo linked above.

We can use this issue to discuss if and how dask.array should adopt this standard. The document gives some guidance at https://data-apis.github.io/array-api/latest/purpose_and_scope.html#how-to-adopt-this-api, but this is fairly uncertain territory.

jakirkham commented 3 years ago

Do we have someone representing us on the array side? If not, I'd be happy to volunteer. Though would need to get plugged into the meetings somehow

kkraus14 commented 3 years ago

cc @rgommers for visibility

rgommers commented 3 years ago

Thanks @jakirkham! So far it's been mainly @TomAugspurger and @shoyer who contributed Dask expertise. Would be great to have you join. Also happy to have a call soon to catch you up on what's been happening. I'll try and collect a few of the Dask-related points (there weren't many at all in the past few months) and will get back to you.

Also, let me point out the NumPy Enhancement Proposal for adoption of the array API standard: https://github.com/numpy/numpy/pull/18456

jakirkham commented 3 years ago

Sounds good. Happy to do some reading as well (guessing there's lots to catch up on). That would be great. Thank you Ralf! 😄

Yep saw that. Will take a look at that as well 🙂

rgommers commented 3 years ago

The one issue with an unresolved Dask-related angle is https://github.com/data-apis/array-api/issues/97 (related to data/value-dependent output shapes and how to represent an "unknown shape" - Dask uses nan, TensorFlow uses None).

https://github.com/data-apis/array-api/issues/84 on including boolean indexing, which is hard for Dask to support, may also be good to review. The discussion concluded with color-coding that indexing mode as "not all libraries may support this; it's hard in combination with delayed evaluation and JIT compilation" (not implemented yet, which is why the issue is still open).

Other than that, I'd suggest looking at https://data-apis.github.io/array-api/latest/ and https://numpy.org/neps/nep-0047-array-api-standard.html for things that are suboptimal or missing from Dask's perspective.

jakirkham commented 3 years ago

Commented on the nan issue. For context, this was added in PR ( https://github.com/dask/dask/pull/1838 ). Maybe it can change, but to set expectations: it is likely a decent lift and an error-prone change. Additionally, other maintainers may be wary of such a change, as it could be a source of bugs going forward. Admittedly, other libraries may feel similarly, depending on how heavily they use this feature. Dask is fairly reliant on it, as you have seen with masking, raveling, and other operations where the content of the array may affect the result (unique).

Dask does support a fair bit of boolean indexing (though we may find some edge cases that don't work). However, as the mask's values may themselves be unknown, this falls into the category of things where the array's content affects the result's shape, and so the shape of the output array is unknown. I'm not sure I'd write this off as not supported, but it's also possible I don't understand what "supported" would mean in the context of the spec or the cases you have in mind, so more context here would be welcome.
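
For concreteness, a minimal sketch of what that looks like with a plain dask.array (assuming current behavior):

import dask.array as da

x = da.arange(10, chunks=5)
mask = x % 2 == 0      # lazy, so the number of True values is unknown
y = x[mask]            # boolean indexing works...
print(y.shape)         # (nan,) -- ...but the result's shape depends on the data
print(y.compute())     # [0 2 4 6 8]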

Ok will read up on this before we talk more 🙂


Edit: It would be good to discuss ravel somewhere. I have held off opening an issue, but I think trying to match NumPy too closely here is a pain point for Dask. If we could come up with a sensible way to relax that, which we agree would be acceptable, some (though certainly not all) unknown-shape issues would go away. This would also offer a performance boost for users. See issue ( https://github.com/dask/dask/issues/4855 ) for more details.

jakirkham commented 3 years ago

Also related: the Data API framework. Generally we reach for and write functions to check whether something is an Array, DataFrame, Series, etc. ( https://github.com/dask/dask/issues/7327 ). Not sure if there are general tools scoped out in this framework to make those checks easier, but that would be very helpful for Dask.

leofang commented 3 years ago

Interesting. AFAIK there is not yet a way to distinguish a Data API array from a Data API dataframe. But since the latter is still in draft status (more "drafty" than the former, as it's not yet public), perhaps we can discuss it?

rgommers commented 3 years ago

Edit: It would be good to discuss ravel somewhere.

There is no ravel function. The NumPy API is messy in this regard: ravel, flatten, flat, and reshape are all similar, with some random differences like returning a copy vs. a view (which cannot be a distinguishing feature for the API standard, views are an implementation detail). The standard just has reshape; reshape(-1) is equivalent to ravel.
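
For example, in NumPy terms:

import numpy as np

x = np.arange(6).reshape(2, 3)
np.array_equal(np.ravel(x), np.reshape(x, -1))  # True: reshape covers ravel's use case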

Generally we reach for and write functions to check whether something is an Array, DataFrame, Series, etc. ( dask/dask#7327 ). Not sure if there are general tools scoped out in this framework to make those checks easier, but that would be very helpful for Dask

Arrays which support the standard, or from which you can at least retrieve the standard namespace with a compliant implementation, have an __array_namespace__ attribute. That should be the easiest way to detect an array. For static typing we may develop a protocol (see https://data-apis.org/array-api/latest/design_topics/static_typing.html), but given that there's no plan to have a shared reference library that other libraries can depend on, there's no good way to design an isinstance check. Which is fine, I think; a simple utility function that uses hasattr(x, '__array_namespace__') should suffice.
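
Such a utility could be as small as this sketch (hypothetical name):

def is_array_api_obj(x) -> bool:
    # Objects exposing __array_namespace__ claim to provide a
    # standard-compliant namespace; everything else is "not an array".
    return hasattr(x, "__array_namespace__")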

leofang commented 3 years ago

OK. With the ongoing discussion on the dataframe API, a similar check can be done for dataframes: hasattr(x, '__dataframe__'). But it's important to note that neither of the attributes __array_namespace__ and __dataframe__ can have any side effects, or they'd be triggered when calling hasattr. Not sure if it's possible to enforce this requirement in the Array/Dataframe API Standard...

rgommers commented 3 years ago

But it's important to note that neither of the attributes __array_namespace__ and __dataframe__ can have any side effects, or they'd be triggered when calling hasattr. Not sure if it's possible to enforce this requirement in the Array/Dataframe API Standard...

Already enforced by them being methods, I think - merely accessing an attribute can have side effects (e.g. via a property getter), while accessing a method can't. Of course Python is dynamic in nature, so you could come up with some hack that has an unwanted side effect, but it's highly unlikely to occur in practice.
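
A toy illustration of the difference (generic names, just showing the mechanics):

class WithProperty:
    @property
    def attr(self):
        print("side effect!")   # a property getter runs on attribute access...
        return "namespace"

class WithMethod:
    def attr(self):
        return "namespace"      # ...but a method body never runs under hasattr

hasattr(WithProperty(), "attr")  # prints "side effect!"
hasattr(WithMethod(), "attr")    # no output: only a bound method is created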

honno commented 2 years ago

If folks wanted to start using the compliance suite array-api-tests, you might like to check out this simple wrapper that implements some crucial things most tests require. I wasn't sure if dask.array was also going to be your Array API namespace, à la PyTorch - that's what I assumed with this, at least. https://gist.github.com/honno/70623811b88c0f5b224c668ec4ed86b7

Generally, please feel free to reach out via issues for any questions/bugs. We want to make the suite easier to use for implementors, so any feedback would be appreciated.

tomwhite commented 2 years ago

Thanks @honno for providing the wrapper and the test suite.

There are likely to be some compatibility issues with using dask.array as the Array API namespace. To pick one example, in Dask the signature for eye is

eye(N, chunks='auto', M=None, k=0, dtype=<class 'float'>)

whereas in the Array API it is

eye(n_rows, n_cols, /, *, k=0, dtype=None, device=None)

This means that an expression like eye(100, 10) means two different things: in Dask 10 refers to the chunk size, but in the Array API it's the number of columns (M). It's not clear if these can be made compatible.
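
To spell out the clash (assuming Dask's current signature):

import dask.array as da

da.eye(100, 10)   # Dask: a 100x100 identity matrix in 10x10 chunks
                  # Standard: would mean n_cols=10, i.e. a 100x10 matrix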

It might be sensible therefore to use a different module, like dask.array_api, just like NumPy, CuPy, and MXNet do. As a proof of concept I've implemented a wrapper to do that in this branch. (File naming and structure is based on numpy.array_api.) The array object is a regular dask.array.Array though.

Running the test suite I get:

ARRAY_API_TESTS_MODULE=dask.array_api pytest --max-examples=2
[snip]
====================================================================== short test summary info =======================================================================
FAILED array_api_tests/test_array_object.py::test_getitem_masking - IndexError: Too many indices for array
FAILED array_api_tests/test_creation_functions.py::test_eye - TypeError: dtype must be known for auto-chunking
FAILED array_api_tests/test_creation_functions.py::test_linspace - ZeroDivisionError: float division by zero
FAILED array_api_tests/test_data_type_functions.py::test_can_cast - AssertionError: out=True, but should be False [can_cast(bool, int8)]
FAILED array_api_tests/test_linalg.py::test_tensordot - AssertionError: out.dtype=uint64, but should be uint8 [tensordot(uint8, uint8)]
FAILED array_api_tests/test_manipulation_functions.py::test_concat - TypeError: '<' not supported between instances of 'NoneType' and 'int'
FAILED array_api_tests/test_operators_and_elementwise_functions.py::test_positive[positive] - IndexError: Too many indices for array
FAILED array_api_tests/test_operators_and_elementwise_functions.py::test_positive[__pos__] - IndexError: Too many indices for array
FAILED array_api_tests/test_searching_functions.py::test_argmax - ValueError: At least one iterator operand must be non-NULL
FAILED array_api_tests/test_searching_functions.py::test_argmin - ValueError: At least one iterator operand must be non-NULL
FAILED array_api_tests/test_searching_functions.py::test_nonzero - AssertionError: out[0].size=1, but should be out[0].size=nan
FAILED array_api_tests/test_set_functions.py::test_unique_all - AssertionError: out.indices.shape=(nan,), but should be out.values.shape=(nan,)
FAILED array_api_tests/test_set_functions.py::test_unique_counts - AssertionError: out.counts.shape=(nan,), but should be out.values.shape=(nan,)
FAILED array_api_tests/test_set_functions.py::test_unique_inverse - TypeError: 'float' object cannot be interpreted as an integer
FAILED array_api_tests/test_set_functions.py::test_unique_values - TypeError: 'float' object cannot be interpreted as an integer
FAILED array_api_tests/test_signatures.py::test_has_names[__array_namespace__] - AssertionError: The array object is missing the method __array_namespace__()
FAILED array_api_tests/test_signatures.py::test_function_positional_args[eye] - AssertionError: Unexpected exception TypeError('dtype must be known for auto-chunki...
FAILED array_api_tests/test_signatures.py::test_function_positional_args[matmul] - TypeError: dtype must be known for auto-chunking
FAILED array_api_tests/test_signatures.py::test_function_positional_args[matrix_transpose] - TypeError: dtype must be known for auto-chunking
FAILED array_api_tests/test_signatures.py::test_function_positional_args[tensordot] - TypeError: dtype must be known for auto-chunking
FAILED array_api_tests/test_signatures.py::test_function_positional_args[vecdot] - TypeError: dtype must be known for auto-chunking
FAILED array_api_tests/test_signatures.py::test_function_keyword_only_args[eye] - AssertionError: Unexpected exception TypeError('dtype must be known for auto-chun...
FAILED array_api_tests/test_signatures.py::test_function_keyword_only_args[matmul] - TypeError: dtype must be known for auto-chunking
FAILED array_api_tests/test_signatures.py::test_function_keyword_only_args[matrix_transpose] - TypeError: dtype must be known for auto-chunking
FAILED array_api_tests/test_signatures.py::test_function_keyword_only_args[tensordot] - TypeError: dtype must be known for auto-chunking
FAILED array_api_tests/test_signatures.py::test_function_keyword_only_args[vecdot] - TypeError: dtype must be known for auto-chunking
================================================ 26 failed, 3544 passed, 541 skipped, 5 warnings in 83.74s (0:01:23) =================================================

So: a few failures, as might be expected, but also lots of tests passing, which is very encouraging.

A few caveats: I ran with only a maximum of two hypothesis examples just to do a quick check at this stage. Increasing it would almost certainly find more failing cases. I also used the following skips.txt to exclude functions that are known to be missing in Dask (and also special cases which are failing).

# expand_dims is not implemented in Dask
array_api_tests/test_manipulation_functions.py::test_expand_dims
array_api_tests/test_signatures.py::test_has_names[expand_dims]

# sorting is not implemented in Dask (https://github.com/dask/dask/issues/4368)
array_api_tests/test_sorting_functions.py::test_argsort
array_api_tests/test_sorting_functions.py::test_sort
array_api_tests/test_signatures.py::test_has_names[argsort]
array_api_tests/test_signatures.py::test_has_names[sort]

# in-place array operations are not implemented in Dask (https://github.com/dask/dask/issues/5199)
array_api_tests/test_signatures.py::test_has_names[__iadd__]
array_api_tests/test_signatures.py::test_has_names[__iand__]
array_api_tests/test_signatures.py::test_has_names[__ifloordiv__]
array_api_tests/test_signatures.py::test_has_names[__ilshift__]
array_api_tests/test_signatures.py::test_has_names[__imatmul__]
array_api_tests/test_signatures.py::test_has_names[__imod__]
array_api_tests/test_signatures.py::test_has_names[__imul__]
array_api_tests/test_signatures.py::test_has_names[__ior__]
array_api_tests/test_signatures.py::test_has_names[__ipow__]
array_api_tests/test_signatures.py::test_has_names[__irshift__]
array_api_tests/test_signatures.py::test_has_names[__isub__]
array_api_tests/test_signatures.py::test_has_names[__itruediv__]
array_api_tests/test_signatures.py::test_has_names[__ixor__]

# DLPack is not implemented in Dask
array_api_tests/test_signatures.py::test_has_names[__dlpack__]
array_api_tests/test_signatures.py::test_has_names[__dlpack_device__]
array_api_tests/test_signatures.py::test_has_names[from_dlpack]

# device support is not implemented in Dask
array_api_tests/test_signatures.py::test_has_names[to_device]
array_api_tests/test_signatures.py::test_has_names[device]

# don't run special cases yet
array_api_tests/special_cases

The next steps might be to investigate the failures, and to implement some of the functions that are missing in Dask (some definitely easier than others).

jakirkham commented 2 years ago

Thanks for pushing on this @tomwhite! 😄

Nice to see some initial success here. Agree we would want a different array_api module like other libraries have. Just like the other libraries, we wouldn't want to break existing functionality and would want users to opt in at some level.

Yeah, this is a good point that really hasn't come up in our discussions thus far @rgommers @kgryte (likely because it is very Dask-specific), but creation functions in Dask need a chunks argument. There really isn't anything like this in the Array API, and I'm not sure it makes much sense for other libraries, but perhaps we could allow these functions to take some additional arguments that the spec leaves open-ended? Anyway, it would be good to get your thoughts on this.

Looking at the failures:

- Guessing getitem exercises some selection case that Dask doesn't implement (though it would be good to know what it is and link that to a Dask issue).
- With eye, probably the Array API needs to skip passing dtype for arguments that are None, or Dask should replace None with the default dtype internally. The same goes for other functions that see similar issues.
- linspace smells like a bug (would just need to look into the input arguments and file a related issue).
- Dask doesn't implement can_cast, so it appears that failure comes from NumPy.
- Dask's tensordot promotes types.
- unique has nan in the shape for unknown dimensions (don't really see that changing), though the Array API wrapper could translate nan to None (which would most likely require some kind of wrapper Array object; sketched below).
- Not really sure about the other error messages (they probably require deeper dives).
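
For illustration, that nan-to-None translation might look roughly like this (a hypothetical sketch, not a proposal):

import math

class WrappedArray:
    # Hypothetical wrapper around a dask.array.Array that reports
    # unknown dimensions as None instead of Dask's float('nan').
    def __init__(self, dask_array):
        self._array = dask_array

    @property
    def shape(self):
        return tuple(None if isinstance(d, float) and math.isnan(d) else d
                     for d in self._array.shape)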

Seems we fall back to auto-chunking in a lot of cases. I don't know how desirable that is. It is simpler to implement, but may have performance costs depending on what chunks are needed versus what chunks are used.

honno commented 2 years ago

Great to see Dask is decently compliant already! Thanks for the wrapper @tomwhite.

Guessing getitem exercises some selection case that Dask doesn't implement (though it would be good to know what it is and link that to a Dask issue).

Ah, assuming you mean the failing test_getitem_masking test and not test_getitem: the former is for the opt-in boolean array indexing support, so it's fine if it fails. In the future there'll be a way to specify whether boolean masking should be skipped (a pytest flag?), along with the other opt-in functions like nonzero.

Dask doesn't implement can_cast, so it appears that failure comes from NumPy.

Yep this is known, hopefully you can keep relying on numpy.array_api here if https://github.com/numpy/numpy/pull/20883 goes through.

unique has nan in the shape for unknown dimensions (don't really see that changing).

Ah, this is interesting; it might be a good idea then for the suite to test shapes last, so it can still test that the dtype and the output values are as expected.

asmeurer commented 2 years ago

In the future there'll be a way to specify whether boolean masking should be skipped (a pytest flag?), along with the other opt-in functions like nonzero.

Ah, this is interesting; it might be a good idea then for the suite to test shapes last, so it can still test that the dtype and the output values are as expected.

unique() also falls into that category of functions that have value-based shapes that we should allow skipping entirely with dask.

tomwhite commented 2 years ago

Thanks for all the comments.

I've filed Dask issues for some of these errors: https://github.com/dask/dask/issues/8667, https://github.com/dask/dask/issues/8668, https://github.com/dask/dask/issues/8669, https://github.com/dask/dask/issues/8670.

With eye, probably the Array API needs to skip passing dtype for arguments that are None, or Dask should replace None with the default dtype internally.

Filed https://github.com/dask/dask/issues/8669. I tried a local fix, and it fixes all the TypeError: dtype must be known for auto-chunking errors - not just the one for the eye test.
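
The wrapper-level version of that fix would look roughly like this (a sketch, not the actual patch; resolving dtype=None to np.float64 is an assumption):

import numpy as np
import dask.array as da

def eye(n_rows, n_cols=None, /, *, k=0, dtype=None, device=None):
    # dtype=None means "default floating-point dtype" in the standard,
    # but Dask's auto-chunking needs a concrete dtype, so resolve it first.
    if dtype is None:
        dtype = np.float64
    # device is accepted for signature compatibility and ignored here.
    return da.eye(n_rows, M=n_cols, k=k, dtype=dtype)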

linspace smells like a bug

This is https://github.com/dask/dask/issues/8670.

unique() also falls into that category of functions that have value-based shapes that we should allow skipping entirely with dask.

+1

I also noticed that all of the special case tests fail because they rely on boolean indexing, which - while supported in Dask - is lazy and causes these tests to fail because array shapes are unknown. I made a temporary local change to force shape computation (by calling compute_chunk_sizes() on the Dask side) - and the tests pass, showing that the special cases are working (for two hypothesis examples at least). I wonder if the special case tests could avoid using boolean indexing? I filed https://github.com/data-apis/array-api-tests/issues/94 for this.
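
For reference, a minimal sketch of that workaround:

import dask.array as da

x = da.arange(10, chunks=5)
y = x[x > 3]
print(y.shape)             # (nan,) -- unknown after boolean indexing
y.compute_chunk_sizes()    # forces a (potentially expensive) computation
print(y.shape)             # (6,) -- now concrete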

honno commented 2 years ago

I also noticed that all of the special case tests fail because they rely on boolean indexing, which - while supported in Dask - is lazy and causes these tests to fail because array shapes are unknown. I made a temporary local change to force shape computation (by calling compute_chunk_sizes() on the Dask side) - and the tests pass, showing that the special cases are working (for two hypothesis examples at least). I wonder if the special case tests could avoid using boolean indexing? I filed data-apis/array-api-tests#94 for this.

Yep, we've been phasing out masking throughout the suite, and it's just special cases left! (Special cases will get a general overhaul in the next two weeks; we just left them alone because the spec's RST conversion was blocking until now.)

Illviljan commented 2 years ago

causes these tests to fail because array shapes are unknown. I made a temporary local change to force shape computation (by calling compute_chunk_sizes() on the Dask side)

Is it OK to always have to call compute_chunk_sizes() for boolean indexing, though? There are several cases where it's "easy" to figure out the shape of the masked array (see https://github.com/dask/dask/issues/8638) - why shouldn't those just work?

jakirkham commented 2 years ago

Not really. Calling compute_chunk_sizes is quite expensive.

rgommers commented 2 years ago

Yeah, this is a good point that really hasn't come up in our discussions thus far @rgommers @kgryte (likely because it is very Dask-specific), but creation functions in Dask need a chunks argument. There really isn't anything like this in the Array API, and I'm not sure it makes much sense for other libraries, but perhaps we could allow these functions to take some additional arguments that the spec leaves open-ended?

Yes, it is always fine to extend the array API standard signatures with extra keywords that are library-specific. At that point of course you lose portability, but that's fine. You have a default that is reasonable, and then a Dask-specific way to tweak chunks.

The array API standard signatures making heavy use of keyword-only arguments helps here. If you keep chunks as keyword-only, then there won't be a backwards compatibility issue in the future in case the standard grows another keyword.
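
Concretely, a creation function along those lines might look like this (a sketch using zeros; the names and defaults are just for illustration):

import dask.array as da

def zeros(shape, *, dtype=None, device=None, chunks="auto"):
    # Standard-compatible layout, plus a keyword-only, Dask-specific
    # `chunks` extension with a reasonable default.
    if dtype is None:
        dtype = float
    return da.zeros(shape, dtype=dtype, chunks=chunks)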

tomwhite commented 2 years ago

I've filed Dask issues for some of these errors: dask/dask#8667, dask/dask#8668, dask/dask#8669, dask/dask#8670.

With these fixes the number of unexpected failures is now down to 11: https://github.com/tomwhite/dask/runs/5142135752?check_suite_focus=true. Most of them seem to be indexing issues.

honno commented 2 years ago

@tomwhite Ah, it looks like the argmin/argmax tests are wrong (it's generating a non-None axis argument for 0d arrays). Will fix!

Ah, and test_positive/test_asarray_arrays is failing because an internal util (ph.assert_array) uses masking; will have a think about that.

jakirkham commented 2 years ago

Have reviewed/merged 3 of those. The 4th one still has a pending review comment.

tomwhite commented 2 years ago

Thanks @honno and @jakirkham!

honno commented 2 years ago

@tomwhite I pushed some improvements to the test suite per your work here, thanks!


I stopped generating 0d arrays in test_argmin/test_argmax for now... need to mull it over. I had first removed the original bug of passing axis=0 with 0d array inputs, but was seeing this "bug":

>>> from dask import array_api as dxp
>>> x = dxp.asarray(0)
>>> out = dxp.argmin(x)
>>> int(out)
ValueError: At least one iterator operand must be non-NULL
...
>>> from numpy import array_api as nxp
>>> nxp.argmin(nxp.asarray(0))
Array(0, dtype=int64)  # i.e. the first index

Can Dask not support 0d inputs for argmin/argmax? Or maybe the conceptual model of having a first-and-only index of a 0d array is nonsensical anywho.


RE test_positive/test_asarray_arrays failing due to using masks, I've now removed the use of boolean masking. test_positive passes, but test_asarray_arrays is seemingly failing due to erroneous mutability when copy=True.

Per @asmeurer's suggestion I've added a --disable-data-dependent-shapes flag to skip the unique-related and nonzero tests completely. We do rely on the shape for the values testing, but maybe in the future we can still test the values for these functions... not quite sure.

asmeurer commented 2 years ago

I'm not sure how dask works, but I would assume that so long as you avoid operations that produce data-dependent shapes (unique, boolean masking, etc.), the .shape of anything should be statically computable from the inputs, so dask should be able to give a shape even if it hasn't computed the values.
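
A quick sketch of that expectation with dask.array:

import dask.array as da

x = da.ones((10, 5), chunks=5)
print((x + 1).shape)    # (10, 5) -- statically known, nothing computed
print((x @ x.T).shape)  # (10, 10) -- likewise
print(x[x > 0].shape)   # (nan,) -- only data-dependent shapes are unknown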

tomwhite commented 2 years ago

I've opened https://github.com/dask/dask/pull/8750 to provide the Dask implementation. Comments very welcome.