ericphanson opened 1 year ago
The extension approach so far has been focused on dispatching on the fields of the batch item (e.g. the `batch_channels` can be something other than a `Vector{String}`). But yeah I can see how that doesn't work at the level of batch assembly like you're talking about here.
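To make the dispatch idea concrete, here is a minimal sketch (hypothetical names; `get_channel_data` is the only identifier taken from the thread, and the `AllChannels` selector and `Dict`-based samples are stand-ins):

```julia
# Stand-in for per-channel sample data keyed by channel name.
samples = Dict("eeg1" => [1.0, 2.0], "eeg2" => [3.0, 4.0])

# Default path: `batch_channels` is a Vector{String} naming the channels.
get_channel_data(s, channels::Vector{String}) = [s[c] for c in channels]

# Custom path: `batch_channels` can be something else entirely, and a
# user-defined selector type dispatches to different behavior.
struct AllChannels end  # hypothetical selector carrying no channel names
get_channel_data(s, ::AllChannels) = [s[c] for c in sort!(collect(keys(s)))]

get_channel_data(samples, ["eeg2"])        # just the named channel
get_channel_data(samples, AllChannels())   # every channel, sorted by name
```

The point is that extension happens per-field: the selector value itself carries the dispatch type, so no method owned by the library needs to be pirated.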
> I want to use `map` instead of `asyncmap` since my `materialize_batch_item` is doing a bit more work and it's not IO bound (I think)

Is there a downside to using `asyncmap` in your case? I could foresee that if each batch item is chunky then you may need to do them sequentially to fit in RAM, but otherwise I can't really see a downside.
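For reference, the two calls are drop-in replacements for each other when the per-item function is pure; `asyncmap` overlaps tasks (which only helps when items block on IO), while plain `map` runs items one at a time, bounding peak memory. A tiny sketch with a stand-in work function:

```julia
# Stand-in for per-item work (hypothetical; the real materialize_batch_item
# does samples loading and is where the IO-bound question arises).
materialize_batch_item(x) = x^2

items = 1:4

# asyncmap interleaves tasks; map runs sequentially. Results are identical
# for pure functions -- the difference is scheduling and peak memory.
asyncmap(materialize_batch_item, items) == map(materialize_batch_item, items)
```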
> I don't want to actually return `Array`s yet but rather a `Vector{Samples}` and an array, since I still want the samples wrapper for a bit longer

This one is harder, and I think your idea of wrapping the output of `iterate_batch` from your custom batches struct in a type you control is probably the best option. Looks like that would require (at the moment) an additional bit of indirection to make `get_channel_data` return the samples instead of the array, but that can be controlled with dispatch on `batch_channels`.
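A rough sketch of that wrapping approach, with all names hypothetical except `Samples` (here reduced to a stub) -- the user-owned wrapper type gives you a method to dispatch on without piracy:

```julia
# Stub standing in for the real Samples wrapper from the thread.
struct Samples
    data::Vector{Float64}
end

# User-controlled wrapper around whatever iterate_batch yields.
struct MyBatchChannels
    channels::Vector{String}
end

# Library-default path: a plain Vector{String} materializes raw arrays.
materialize_channels(channels::Vector{String}) = [rand(3) for _ in channels]

# User-owned path: the wrapper's method keeps the Samples objects around.
materialize_channels(b::MyBatchChannels) = [Samples(rand(3)) for _ in b.channels]
```

Because `MyBatchChannels` is owned by the user, adding the second method is ordinary dispatch rather than type piracy.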
For `asyncmap`, I am calling into XGBoost to try to get some heuristic labels via another model (…really not sure this will help 😄) and I don't really know how well the C wrapper handles concurrency.
I also would like to pass an RNG through `materialize_batch` to do some augmentation there.
IMO the Right Way to handle the augmentation threading is in a "channel selector" type that "freezes" the RNG state/augmentation parameters at `iterate_batch` time (in general, we can't guarantee that `materialize_batch` happens in any particular order, so the batch itself needs to contain all the state necessary to reproducibly load the data if that's important).
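One way to sketch that freezing idea (all names hypothetical): capture a per-item seed when the batch is iterated, so materialization can happen in any order, or concurrently, and remain reproducible:

```julia
using Random

# Each batch item carries its own seed, captured at iterate_batch time.
struct FrozenItem
    id::Int
    seed::UInt64
end

# Iterating the batch consumes the shared RNG exactly once per item.
iterate_batch(ids, rng) = [FrozenItem(id, rand(rng, UInt64)) for id in ids]

# Materialization derives a fresh RNG from the frozen seed -- no shared state.
materialize_item(item::FrozenItem) = randn(MersenneTwister(item.seed), 3)

rng = MersenneTwister(42)
batch = iterate_batch(1:4, rng)

# Order of materialization no longer matters:
forward  = map(materialize_item, batch)
shuffled = map(materialize_item, reverse(batch))
forward == reverse(shuffled)  # same per-item results either way
```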
> I don't really know how well the C wrapper handles concurrency.

I guess whether that's an issue will depend on whether the Julia scheduler will switch to another task while the C call is happening.

I think it also depends on how well global state etc. is managed in the C lib. Because even if Julia doesn't task switch at `ccall`, it could switch in between setup and execution calls (if there are any; I haven't looked into it). Hopefully it would be OK.
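If the C library's concurrency behavior is unknown, one conservative option is to serialize all calls into it behind a single lock (a hypothetical wrapper, not an API of XGBoost.jl), so setup and execution can never interleave across tasks:

```julia
# One lock guarding every entry into the (possibly non-concurrent) C library.
const XGB_LOCK = ReentrantLock()

# Hypothetical guard: run `f` (a call into the C wrapper) while holding the
# lock, so a task switch can't interleave another call's setup/execution.
function call_serialized(f, args...)
    lock(XGB_LOCK) do
        f(args...)
    end
end

call_serialized(+, 1, 2)  # stand-in for e.g. an XGBoost predict call
```

This trades throughput for safety; if the library later turns out to be task-safe, the lock can simply be dropped.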
It would be nice to be able to extend it, in case e.g. something else needs to happen when collecting individual batch items for a particular type of Batcher or BatchItem. I think the issue is now we have a table instead of a single item, so we can't just dispatch off the item type. One option is allowing users to pass a singleton type to use for dispatch which gets passed around, or something like that.
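The singleton-for-dispatch option could look like this (entirely hypothetical names; the idea is a user-supplied token threaded through batch assembly):

```julia
# Library-provided default token.
struct DefaultAssembly end

# User-defined token, owned by the user, so the method below isn't piracy.
struct MyAssembly end

# Batch-level collection dispatches on the token, not on the item type.
collect_batch(::DefaultAssembly, items) = reduce(hcat, items)
collect_batch(::MyAssembly, items) = items  # e.g. skip concatenation

items = [[1, 2], [3, 4]]
collect_batch(DefaultAssembly(), items)  # 2x2 matrix
collect_batch(MyAssembly(), items)       # untouched vector of vectors
```

Since the table replaces the single item, dispatching on a token the caller passes in sidesteps the "can't dispatch off the item type" problem entirely.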
In my case, I have 2 needs that the current method doesn't do:
1. I want to use `map` instead of `asyncmap` since my `materialize_batch_item` is doing a bit more work and it's not IO bound (I think)
2. I don't want to actually return `Array`s yet but rather a `Vector{Samples}` and an array, since I still want the samples wrapper for a bit longer. [This part I could avoid by refactoring to move more work into this portion since I do need a raw array in the end, but it would be nice to be able to do that more incrementally without piracy]

edit: also realized `materialize_batch_item` tends to need piracy too, since it comes in as a `NamedTuple`, not a `BatchItem` (or such) object, at least w/ the Legolas v0.4 branch