apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0
2.55k stars 766 forks source link

Use `'static` lifetime in `BoxFuture` for all object-store APIs #6587

Open kylebarron opened 4 days ago

kylebarron commented 4 days ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I'm writing a new Python binding for the object-store crate (differences from the existing one detailed here).

I'm trying to present streaming APIs to the user instead of materializing an entire result stream upfront. This worked for get, where we can expose the stream returned by GetResult::into_stream as a Python async iterable.

When I tried to do present a streaming result API for list, I tried to cast the result of ObjectStore::list to

let stream: BoxStream<'static, object_store::Result<ObjectMeta>> = store.list(prefix);

and got that the store does not live long enough.

error[E0597]: `store` does not live long enough
   --> object-store-rs/src/list.rs:175:18
    |
168 |     store: PyObjectStore,
    |     ----- binding `store` declared here
...
175 |     let stream = store.as_ref().list(prefix.map(|s| s.into()).as_ref());
    |                  ^^^^^ borrowed value does not live long enough
176 |     let stream2: BoxStream<'static, object_store::Result<ObjectMeta>> = stream;
    |                  ---------------------------------------------------- type annotation requires that `store` is borrowed for `'static`
...
189 | }
    | - `store` dropped here while still borrowed

It's not possible to use a lifetime other than 'static in a struct exported to Python, so I'm not sure if it's possible to wrap list as a stream currently.

@tustvold mentioned in discord that it may be possible to change other methods to use 'static in the next major object-store release.

Describe the solution you'd like

The most straightforward solution would be to change the ObjectStore trait to use the 'static lifetime. But I'm open to other solutions. I don't fully understand async lifetimes well enough to know all the implications here.

Describe alternatives you've considered

Additional context

tustvold commented 3 days ago

I'm curious how python handles the lifetimes present in the futures and if it possible to do something similar for streams?

Currently all the futures have a non-static lifetime, but your comments indicate it is only the streams where this causes issue?

kylebarron commented 3 days ago

To the extent I understand it's okay with futures because they're evaluated immediately, either asynchronously via https://github.com/PyO3/pyo3-async-runtimes or synchronously via a tokio runtime. It matters whether the result of the future can be materialized as data and presented to Python or whether it should be preserved as a stream.

With the stream returned from ObjectStore::get, I'm storing the BoxStream itself in a struct exported to Python https://github.com/developmentseed/object-store-rs/blob/c157fe2cb8da14c1e8eee297014bb4695d683cc3/object-store-rs/src/get.rs#L111-L115. Then when the Python async iteration calls __aiter__, that creates a new future based on polling the stream.

But I think this is only possible because BoxStream is 'static. For another lifetime, you'd have to annotate the struct containing the stream with that lifetime, which pyo3 immediately rejects.

tustvold commented 3 days ago

Ok so if I follow, this would mean that in order for tokio to poll the stream, the python async code would need to invoke it in a timely manner? Or to put it differently, if the python code is busy doing something else, the tokio work would get starved. I wonder if this is going to lead to the same sorts of issues we've run into multiplexing CPU bound tasks on the same threadpool? Perhaps there needs to be some sort of buffering between the two regardless, which would obviate any lifetime shenanigans? The same issue would also potentially apply to regular futures

TBC I am not opposed to changing the signature, just trying to ensure the python bindings work as well as possible 😅

kylebarron commented 7 hours ago

I'm not fully clear on all the details on how the pyo3-async-runtimes integration works. From its readme:

in pyo3-asyncio, we decided the best way to handle Rust/Python interop was to just surrender the main thread to Python and run Rust's event loops in the background.

And I know there's revamped async work going on in https://github.com/PyO3/pyo3/issues/1632 and https://github.com/wyfo/pyo3-async, which is suppose to remove some overhead of integrating rust and python async.


To answer your questions:

Ok so if I follow, this would mean that in order for tokio to poll the stream, the python async code would need to invoke it in a timely manner? Or to put it differently, if the python code is busy doing something else, the tokio work would get starved.

I think so, yes. At least in my current implementation of this, if __anext__ is never called, then the tokio future's next() is never called. So it relies on Python to drive this async.

I wonder if this is going to lead to the same sorts of issues we've run into multiplexing CPU bound tasks on the same threadpool?

Potentially, but I'd say this is up to the user to ensure the Python async runtime doesn't get blocked, and that they're not running CPU bound tasks on the main thread.

Perhaps there needs to be some sort of buffering between the two regardless, which would obviate any lifetime shenanigans?

I'm not sure how to implement this 😅 , or else I'd try to implement it now.

kylebarron commented 5 hours ago

just trying to ensure the python bindings work as well as possible

Trying to think of other things on my wish list to mention for the Python bindings...

One thing that would be great but I certainly don't expect to change is if object_store used something like an arrow::buffer::Buffer, which permits externally-allocated memory. Right now in a put we always have to make a copy from Python-backed memory to Rust-backed memory first. Or is there some way to use external memory with a bytes::Bytes?

tustvold commented 4 hours ago

Or is there some way to use external memory with a bytes::Bytes?

There have been various discussions about exposing the bytes vtable externally, but they've not gotten anywhere AFAICT. This would really be my preferred path, it seems unfortunate to break from the ecosystem with our own wrapper

kylebarron commented 4 hours ago

it seems unfortunate to break from the ecosystem with our own wrapper

Absolutely agree, even though it's unfortunate for my use case.