apache / arrow-nanoarrow

Helpers for Arrow C Data & Arrow C Stream interfaces
https://arrow.apache.org/nanoarrow
Apache License 2.0

'Open' stream example ? #386

Open eddelbuettel opened 8 months ago

eddelbuettel commented 8 months ago

The package contains examples of creating ArrayStream objects given a schema and a vector or list of arrays. That helps for chunks returned via, say, RecordBatchReader as this may not require the contiguous memory an unchunked approach would need. But as we instantiate with the whole vector (or list) we still require a similar total amount of memory at instantiation.

But can we create, say, a RecordBatchReader in a more 'streaming' fashion? Could we hand this back to the caller with only the initially-known list of Arrays and also support further data? So, say, the first call of next() would be covered, but thereafter a more 'lazy' approach is used and the RecordBatchReader supplies updates in true batches. Obviously a more complicated setup, but is something like this feasible / supported / planned / ... ?
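To make the idea concrete, here is a minimal pure-Python sketch of such an 'open' stream. It does not use nanoarrow or Arrow at all: plain lists stand in for arrays, and `OpenStream`, `fetch_more`, and `make_counter_source` are hypothetical names invented for illustration. The only thing it borrows from the Arrow C Stream interface is the pull model, where the consumer repeatedly asks for the next batch until the stream signals the end.

```python
from typing import Callable, Iterable, List, Optional

Batch = List[int]  # stand-in for one Arrow array / record batch


class OpenStream:
    """Toy model of an 'open' stream: known batches first, then lazy ones."""

    def __init__(self, schema: str, initial: Iterable[Batch],
                 fetch_more: Callable[[], Optional[Batch]]):
        self.schema = schema              # known up front, like a stream's schema
        self._initial = iter(initial)     # batches available at instantiation
        self._fetch_more = fetch_more     # hypothetical callback for later batches
        self._finished = False

    def get_next(self) -> Optional[Batch]:
        """Return the next batch, or None once the stream is exhausted."""
        if self._finished:
            return None
        batch = next(self._initial, None)
        if batch is None:
            # Initial batches are used up: ask the callback for more.
            batch = self._fetch_more()
        if batch is None:
            self._finished = True
        return batch


def make_counter_source(n_batches: int, size: int) -> Callable[[], Optional[Batch]]:
    """Hypothetical producer that builds each batch only when asked."""
    state = {"i": 0}

    def fetch() -> Optional[Batch]:
        if state["i"] >= n_batches:
            return None                   # signals end of stream
        start = state["i"] * size
        state["i"] += 1
        return list(range(start, start + size))

    return fetch
```

A consumer would call `get_next()` in a loop, for example `list(iter(stream.get_next, None))`, so at most one lazily produced batch needs to exist at a time.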

I may be explaining myself poorly here, but are there other references in the Arrow context that handle this as a more 'open' subscription (in the sense of 'total payload unknown at instantiation') with later callbacks to provide chunked updates? Or do I have the wrong mental model and should rather think about, say, a pub/sub model where a 'middle man' holds on to the data and passes it along? (I have done such things with Redis.)
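For comparison, the 'middle man' model could be sketched as below. This is only an illustration under loud assumptions: an in-process `queue.Queue` stands in for Redis, plain lists stand in for Arrow batches, and `producer`, `batches`, and `run` are hypothetical names. The producer pushes batches as they become available, and the consumer does not know the total payload in advance.

```python
import queue
import threading
from typing import Iterator, List

_DONE = object()  # sentinel marking the end of the stream


def producer(q: queue.Queue, n_batches: int) -> None:
    """Hypothetical data source pushing batches as they become available."""
    for i in range(n_batches):
        q.put([i, i * 10])  # blocks while the queue is full
    q.put(_DONE)


def batches(q: queue.Queue) -> Iterator[List[int]]:
    """Yield batches until the producer signals completion."""
    while True:
        item = q.get()  # blocks until a batch (or the sentinel) arrives
        if item is _DONE:
            return
        yield item


def run(n_batches: int) -> List[List[int]]:
    """Start a producer thread and collect everything it publishes."""
    q: queue.Queue = queue.Queue(maxsize=2)  # bounded: caps buffered batches
    t = threading.Thread(target=producer, args=(q, n_batches))
    t.start()
    out = list(batches(q))
    t.join()
    return out
```

The bounded queue is the design choice that matters here: it keeps the middle man from accumulating the whole payload when the consumer is slower than the producer.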

Thanks in advance for any pointers, and apologies for posting such a vague and rambling issue.

paleolimbot commented 8 months ago

Got it!

This existed in the initial version: https://github.com/paleolimbot/narrow/blob/master/R/array-stream-function.R + https://github.com/paleolimbot/narrow/blob/master/src/array-stream-function.c , although there are a few more helpers in nanoarrow that might help with this.

You can also do this in the Arrow package: https://github.com/apache/arrow/blob/main/r/src/recordbatchreader.cpp#L70-L106

In general it would be a very useful thing to be able to do!

eddelbuettel commented 8 months ago

Dang, now that you show it to me in narrow I think I recall having seen it there. And nice and elegant use of environments and function callbacks. (But we also have to think about how it could be done in Python off the same C level base layer.)

Overall this seems like a good thing for, say, a database wrapper (when I briefly played with the adbc packages I noticed how they return a RecordBatchReader from queries, but I have yet to dive in there and see more closely what they do).

Appreciate the quick and thoughtful reply.

WillAyd commented 2 months ago

> Dang, now that you show it to me in narrow I think I recall having seen it there. And nice and elegant use of environments and function callbacks. (But we also have to think about how it could be done in Python off the same C level base layer.)

I've been doing youtube videos on nanoarrow, and I think this is partially covered in one of the videos:

https://www.youtube.com/watch?v=pW_WLvxmztQ&t

With accompanying code here (particularly the ProduceArray and ProduceStream functions):

https://github.com/WillAyd/bearly/pull/1/files

What isn't covered that you are asking for is the ability to stream more than just the initial record batch, but these may be easy to extend for that purpose.

eddelbuettel commented 2 months ago

Thanks for the links to these, and for doing them -- they are well done. Now, I have (much) more of an R / C++ focus so I am not exactly your target audience on the Python output. That said, I am trying to get my Python-using colleagues to pay more attention to nanoarrow and your videos may be perfect as an intro (if they can sit down and watch these -- I was fading a little at the end).

(I have been doing a lot of similar obs videos for a class I teach a term a year, but I have not 'spliced' pieces together. Can you recommend an equally-easy piece of software?)

Again, thanks a lot for these. Well done. I'm also quite jealous about your languageserver integration into Emacs which is slicker than what I use (esp in mixed-language projects).

WillAyd commented 2 months ago

Thanks! Yea some of the videos are long (especially the third...). Hoping to make them more succinct going forward.

For the video I use OpenShot. I am not very skilled at video, but it seems pretty easy to use.

For Emacs I use eglot with the clangd language server

eddelbuettel commented 2 months ago

Did you mean OBS Studio there? I am happy with that (on Linux/Ubuntu) to record and even stream, but I have not looked into 'combining' different snippets. You managed to 'splice' when waiting for sum() or with the int32() / int64() error in the third. Can one do that editing in OBS as well?

WillAyd commented 2 months ago

Sorry for the typo. It's openshot - https://www.openshot.org/

eddelbuettel commented 2 months ago

Perfect. Just got there or thereabouts via a quick google search.