beacon-biosignals / OndaBatches.jl

Local and distributed batch loading for Onda datasets

Make `materialize_batch` extendable without piracy #18

Open ericphanson opened 1 year ago

ericphanson commented 1 year ago

It would be nice to be able to extend `materialize_batch`, in case e.g. something else needs to happen when collecting individual batch items for a particular type of `Batcher` or `BatchItem`. I think the issue is that we now have a table instead of a single item, so we can't just dispatch off the item type. One option is to allow users to pass a singleton type that gets passed around and used for dispatch, or something like that.
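Roughly what I have in mind (all of these names are made up for the sketch; none of them are existing OndaBatches API):

```julia
# Rough sketch only: a user-supplied singleton threaded through batch assembly,
# so behavior is extended by adding methods on your own type rather than by piracy.
abstract type AbstractBatchStrategy end
struct DefaultStrategy <: AbstractBatchStrategy end

# Fallback: stands in for whatever the package does today for a single item.
materialize_item(::DefaultStrategy, item) = item

function materialize_batch(strategy::AbstractBatchStrategy, batch_items)
    # Downstream packages define their own strategy type and add
    # `materialize_item(::MyStrategy, item)` methods to customize this step.
    return asyncmap(item -> materialize_item(strategy, item), batch_items)
end
```

Then a custom `Batcher` could just carry its strategy along, and downstream code extends behavior by adding methods on its own type instead of overwriting methods it doesn't own.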

In my case, I have two needs that the current method doesn't meet:

1. I want to use `map` instead of `asyncmap`, since my `materialize_batch_item` is doing a bit more work and it's not IO bound (I think).
2. I don't want to actually return `Array`s yet but rather a `Vector{Samples}` and an array, since I still want the `Samples` wrapper for a bit longer.

Edit: I also realized that `materialize_batch_item` tends to need piracy too, since the item comes in as a `NamedTuple`, not a `BatchItem` (or similar) object, at least with the Legolas v0.4 branch.

kleinschmidt commented 1 year ago

The extension approach so far has been focused on dispatching on the fields of the batch item (e.g. `batch_channels` can be something other than a `Vector{String}`). But yeah, I can see how that doesn't work at the level of batch assembly like you're talking about here.

> I want to use `map` instead of `asyncmap`, since my `materialize_batch_item` is doing a bit more work and it's not IO bound (I think).

Is there a downside to using `asyncmap` in your case? I could foresee that if each batch item is chunky you may need to do them sequentially to fit in RAM, but otherwise I can't really see a downside.

> I don't want to actually return `Array`s yet but rather a `Vector{Samples}` and an array, since I still want the `Samples` wrapper for a bit longer.

This one is harder, and I think your idea of wrapping the output of `iterate_batch` from your custom batches struct in a type you control is probably the best option. It looks like that would require (at the moment) an additional bit of indirection to make `get_channel_data` return the `Samples` instead of the array, but that can be controlled with dispatch on `batch_channels`.
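Very roughly, something like this (the names and signature are just guesses at where the hook would live, not the actual internals):

```julia
# Sketch only: a channel-selector type you control, used in place of a
# Vector{String} in `batch_channels`, so the data-loading step can dispatch on
# it and keep the Samples wrapper instead of unwrapping to a plain array.
struct KeepSamples
    channels::Vector{String}
end

# Default path (illustrative): hand back the decoded array.
get_channel_data(samples, channels::Vector{String}) = samples.data

# Custom path: keep the Samples (possibly restricted to the selected channels).
get_channel_data(samples, selector::KeepSamples) = samples
```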

ericphanson commented 1 year ago

For `asyncmap`, I am calling into XGBoost to try to get some heuristic labels via another model (…really not sure this will help 😄) and I don't really know how well the C wrapper handles concurrency.

I also would like to pass an RNG through `materialize_batch` to do some augmentation there.

kleinschmidt commented 1 year ago

IMO the Right Way to handle threading the RNG through for augmentation is a "channel selector" type that "freezes" the RNG state/augmentation parameters at `iterate_batch` time (in general, we can't guarantee that `materialize_batch` happens in any particular order, so the batch itself needs to contain all the state necessary to reproducibly load the data if that's important).
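Sketching that out (everything here is made up for illustration, not existing API):

```julia
using Random

# The selector captures a seed (and any augmentation parameters) when the batch
# is iterated, so materializing the item later is reproducible no matter where
# or in what order it happens.
struct FrozenAugment
    channels::Vector{String}
    seed::UInt64         # drawn at iterate_batch time
    noise_scale::Float64
end

freeze(channels, rng::AbstractRNG; noise_scale=0.1) =
    FrozenAugment(channels, rand(rng, UInt64), noise_scale)

# At materialize time, rebuild an RNG from the frozen seed and apply the
# augmentation deterministically.
function augment!(data::AbstractArray{<:AbstractFloat}, sel::FrozenAugment)
    rng = Xoshiro(sel.seed)
    data .+= sel.noise_scale .* randn(rng, size(data)...)
    return data
end
```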

kleinschmidt commented 1 year ago

> I don’t really know how well the C wrapper handles concurrency.

I guess whether that's an issue will depend on whether the Julia scheduler will switch to another task while the C call is happening.
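For example (here `Libc.systemsleep` just stands in for a C call that never yields):

```julia
# `sleep` yields to the Julia scheduler, so the tasks overlap under asyncmap;
# `Libc.systemsleep` blocks in C without yielding, so they run back to back.
@time asyncmap(_ -> sleep(1), 1:4)            # ~1 second total
@time asyncmap(_ -> Libc.systemsleep(1), 1:4) # ~4 seconds total
```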

ericphanson commented 1 year ago

I think it also depends on how well global state etc. is managed in the C lib. Because even if Julia doesn't task-switch at `ccall`, it could switch in between setup and execution calls (if there are any; I haven't looked into it). Hopefully it would be OK.