PyO3 / rust-numpy

PyO3-based Rust bindings of the NumPy C-API
BSD 2-Clause "Simplified" License
1.07k stars 98 forks source link

PyArrayLike type introduced #383

Closed 124C41p closed 9 months ago

124C41p commented 1 year ago

Summary

Extracts a read only reference if the correct numpy array type is given. Tries to convert the input into the correct type otherwise.

Resolves #382

Disclaimer

This implementation seems to work as intended and covers my personal use case. However, I am not sure how well it performs (you certainly want it to be not significantly slower than calling np.asarray(...) on Python side, and then extracting PyReadonlyArray<...> on Rust side). Also, there are several tasks remaining like documentation and proper error handling.

Do you want to take over at this point for finalizing the branch in your own style?

124C41p commented 1 year ago

I added some smaller comments inline, but the main point I am wonder now that I have look at the pull request is whether the converted case should be a plain ndarray::Array or rather a NumPy array as well. This would have the benefit of a more uniform API and would allow the result to be passed to functions expecting a NumPy array. I am not sure if the cost of creating a NumPy array versus an ndarray::Array is prohibitive though.

I think spelled out this would look something like

pub struct PyArrayLike<'py, T, D>(PyReadonlyArray<'py, T, D>);

impl Deref for PyArrayLike<'py, T, D> {
  type Target = PyReadonlyArray<'py, T, D>;

  ...
}

impl FromPyObject<'py> for PyArrayLike<'py, T, D> {
   // ... as before, just constructing `PyArray` ...
}

What do you think?

I like this idea, and I just published a commit implementing it. There might be one downside however: If the array is created by the np.asarray fallback, we only get a PyReadonlyArray, but we should actually own a PyArray. If the user calls .to_owned_array() afterwards, we are creating a copy of a copy. Maybe it would be better to define it that way?

pub struct PyArrayLike<T, D>(Cow<PyArray<T, D>>)
adamreichold commented 1 year ago

If the array is created by the np.asarray fallback, we only get a PyReadonlyArray, but we should actually own a PyArray.

PyReadonlyArray<'py, T, D> is just a wrapper around a &'py PyArray<T, D> and derefs into that. A bare reference is as much ownership as Rust code can get for Python types. So the call to to_owned_array will do the same whether applied to PyReadonlyArray or &PyArray.

(Note that even constructors like PyArray::from_vec return only &PyArray, not an owned PyArray in the Rust sense since this type always lives on the Python heap.)

adamreichold commented 1 year ago

Just out of curiosity: You mentioned ob.downcast::<PyArray<T, D>>()?.readonly() being faster than ob.extract::<PyReadonlyArray<T, D>>(). But is there any downside? Is it less safe? Is there any situation where you would prefer the second over the first?

Functionally they are equivalent. There is one performance paper cut that if you do not use the error returned by ob.extract::<PyReadonlyArray<T, D>>(), e.g. in

if let Ok(array) = ob.extract<PyReadonlyArray<T, D>>() {
  return Ok(Self(array));
}

then this is measurably slower than ob.downcast::<PyArray<T, D>>()?.readonly() because extract always returns Result<_, PyErr> which means the PyDowncastError returned by downcast (which extract uses internally) needs to be wrapped up in a PyErr, only to be thrown away again.

This is a general PyO3 issue which have not been able to resolve thus far. (FromPyObject will probably need an associated type for the error, but combining multiple FromPyObject implementations into new ones still needs to be possible. No rocket science, but a lot of code churn over the whole ecosystem to make it happen.)

You can also see rust-numpy trying to work around this here: https://github.com/PyO3/rust-numpy/blob/3843fa94c5b9e0b035a0b1cb8a70c164ef153a21/src/array.rs#L135-L137

124C41p commented 1 year ago

The review request was a misclick, sorry.

kngwyu commented 1 year ago

@124C41p Thank you for this PR, and @adamreichold thank you for your encouraging tutoring. I have one question about the design: is this really need to be struct? I mean, if we don't need C: Coerce information after extracting, maybe it's better to just have a function, say, extract_arraylike(obj, allow_typecast).

124C41p commented 1 year ago

@kngwyu To my understanding having a struct is necessary for conveniently using it inside the header of a pyfunction. Also, since the struct has only two fields, one being a phantom type, it should get optimized away by the compiler anyway. So extracting a PyArrayLike should not be more expensive (in the fast path) than extracting a PyReadonlyArray.

kngwyu commented 1 year ago

I see, you're right that we definitely need them to use as function arguments.

kngwyu commented 1 year ago

BTW, I checked asarray and _array_fromobject_generic implementations, finding that it just tries some faster paths (e.g., obj is already an array) and then calls PyArray_CheckFromAny_int, which is an internal version of PyArray_CheckFromAny. So, I think that using PyArray_CheckFromAny is better because it is an exposed C-API, if we can handle faster paths correctly. So now the problem is whether we can handle faster paths simply enough. The code was a bit complex to read for me, but it looks like only handles the cases where the given object is already a numpy array or an instance of an array subclass. Maybe it's worth considering to use it?

adamreichold commented 1 year ago

BTW, I checked asarray and _array_fromobject_generic implementations, finding that it just tries some faster paths (e.g., obj is already an array) and then calls PyArray_CheckFromAny_int, which is an internal version of PyArray_CheckFromAny. So, I think that using PyArray_CheckFromAny is better because it is an exposed C-API, if we can handle faster paths correctly. So now the problem is whether we can handle faster paths simply enough. The code was a bit complex to read for me, but it looks like only handles the cases where the given object is already a numpy array or an instance of an array subclass. Maybe it's worth considering to use it?

I agree in principle and also started looking into which FFI code paths this ends up eventually. But as you say, it is somewhat convoluted and not at all obvious how to call PyArray_CheckFromAny. Therefore, I would suggest finishing this using the asarray fallback and then working on the FFI integration as an optimization follow-up. (Basically, this needs docs and examples to ready for review, but I have not had the time to write that yet.)

kngwyu commented 1 year ago

I agree, but let me confirm one thing...

not at all obvious how to call PyArray_CheckFromAny

We already have it, right? Do you mean that it's not obvious how to use it?

adamreichold commented 1 year ago

We already have it, right? Do you mean that it's not obvious how to use it?

Yes, I do not refer to the signature, just how to use it correctly. Which code paths to handle ourselves. Which parameters to pass and so on.

adamreichold commented 9 months ago

@kngwyu I added docs and examples and think this is good to go now. Could you have another look? Thanks!