apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

AppendableRecordBatch #3142

Open avantgardnerio opened 2 years ago

avantgardnerio commented 2 years ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

At Space and Time, we would like to make an OLTP columnar DB based on Arrow (see noisepage for precedent). Unfortunately, Arrow RecordBatches are immutable by design.

Describe the solution you'd like

Fortunately however, append-only and immutable are not necessarily at odds, as long as one takes immutable RecordBatch snapshots at a point-in-time of the append-only version. See this gist for a working PoC implementation:

[Screenshot of the PoC implementation, 2022-11-19]
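For illustration only (this is not the gist's code, which avoids copying by preallocating buffers; the type and field names below are made up), the snapshot semantics can be sketched as an append-only store that hands out immutable RecordBatches:

    use std::sync::Arc;
    use arrow::array::{ArrayRef, Int64Array};
    use arrow::datatypes::{DataType, Field, Schema, SchemaRef};
    use arrow::record_batch::RecordBatch;

    /// Illustrative append-only column of i64 values.
    struct AppendOnlyInts {
        schema: SchemaRef,
        values: Vec<i64>,
    }

    impl AppendOnlyInts {
        fn new() -> Self {
            let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));
            Self { schema, values: Vec::new() }
        }

        /// Appending never invalidates snapshots taken earlier.
        fn append(&mut self, v: i64) {
            self.values.push(v);
        }

        /// Point-in-time snapshot: an immutable RecordBatch of all rows so far.
        /// Unlike the PoC, this copies the values on every call.
        fn snapshot(&self) -> RecordBatch {
            let col: ArrayRef = Arc::new(Int64Array::from(self.values.clone()));
            RecordBatch::try_new(self.schema.clone(), vec![col]).unwrap()
        }
    }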

Describe alternatives you've considered

  1. using arrow2
  2. trying some trickery to make MutableBuffer work, vs. preallocating buffers from the immutable package
avantgardnerio commented 2 years ago

Edit: cleaner API with .extend(other: RecordBatch): https://gist.github.com/avantgardnerio/48d977ea6bd28c790cfb6df09250336d
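A copying version of such an extend could be sketched with the concat kernel (this is not the gist's implementation; the free function below and its assumption of matching schemas are illustrative only):

    use std::sync::Arc;
    use arrow::array::Int64Array;
    use arrow::compute::concat;
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow::error::ArrowError;
    use arrow::record_batch::RecordBatch;

    /// Return a new RecordBatch containing the rows of `batch` followed by the
    /// rows of `other`, concatenating column by column (schemas must match).
    fn extend(batch: &RecordBatch, other: &RecordBatch) -> Result<RecordBatch, ArrowError> {
        let columns = batch
            .columns()
            .iter()
            .zip(other.columns())
            .map(|(a, b)| concat(&[a.as_ref(), b.as_ref()]))
            .collect::<Result<Vec<_>, _>>()?;
        RecordBatch::try_new(batch.schema(), columns)
    }

    fn main() -> Result<(), ArrowError> {
        let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int64, false)]));
        let a = RecordBatch::try_new(schema.clone(), vec![Arc::new(Int64Array::from(vec![1, 2]))])?;
        let b = RecordBatch::try_new(schema, vec![Arc::new(Int64Array::from(vec![3]))])?;
        assert_eq!(extend(&a, &b)?.num_rows(), 3);
        Ok(())
    }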

tustvold commented 2 years ago

Like the general idea, just a couple of comments/questions:

One potentially simpler way to implement something similar to this would be to add a non-consuming finish method to the builders. This would entail copying the buffers, but in my experience of implementing the write path for IOx, this copy is insignificant in the grand scheme of query execution - even ignoring the heavy hitters like sorts and groups, all kernels involve copying values to an output array. This is especially true if after a certain number of rows you rotate the builders and just keep the immutable RecordBatch, thereby bounding the copy. What do you think?

Edit: I created https://github.com/apache/arrow-rs/issues/3154 for this
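The bounded-copy rotation described above might look roughly like this (the threshold, names, and single-column schema are made up; the non-consuming finish itself is what #3154 proposes and is not shown here):

    use std::sync::Arc;
    use arrow::array::{ArrayBuilder, ArrayRef, Int64Builder};
    use arrow::datatypes::{DataType, Field, Schema, SchemaRef};
    use arrow::record_batch::RecordBatch;

    /// Illustrative threshold after which the active builder is rotated.
    const ROTATE_AT: usize = 1024;

    /// Keep appending to a builder; once it reaches ROTATE_AT rows, finish it
    /// into an immutable RecordBatch. With a non-consuming finish (per #3154),
    /// a point-in-time snapshot would only ever copy the rows still sitting in
    /// `builder`, while the `sealed` batches could be reused as-is.
    struct RotatingWriter {
        schema: SchemaRef,
        builder: Int64Builder,
        sealed: Vec<RecordBatch>,
    }

    impl RotatingWriter {
        fn new() -> Self {
            let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));
            Self { schema, builder: Int64Builder::new(), sealed: Vec::new() }
        }

        fn append(&mut self, v: i64) {
            self.builder.append_value(v);
            if self.builder.len() >= ROTATE_AT {
                // finish() takes the accumulated values and resets the builder.
                let col: ArrayRef = Arc::new(self.builder.finish());
                self.sealed
                    .push(RecordBatch::try_new(self.schema.clone(), vec![col]).unwrap());
            }
        }
    }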

alamb commented 2 years ago

Fortunately however, append-only and immutable are not necessarily at odds, as long as one takes immutable RecordBatch snapshots at a point-in-time of the append-only version.

I think the general pattern of appending to a mutable structure and periodically snapshotting it into an immutable RecordBatch makes sense with arrow. In fact, this is what we do in IOx on the write path.

This pattern can be done with the various *Builders -- such as https://docs.rs/arrow/27.0.0/arrow/array/type.Int64Builder.html

So something like:

let mut builder = Int64Builder::new();
builder.append_value(1);

// get an array to use in datafusion, etc
// (finish() consumes the accumulated values and resets the builder)
let array = builder.finish();

// make a new builder and start building the second batch
builder = Int64Builder::new();
builder.append_value(2);
builder.append_value(3);

I think what @tustvold is suggesting with a "non-consuming finish method to the builders" means you could do something like:

let mut builder = Int64Builder::new();
builder.append_value(1);

// get an array to use in datafusion, etc (by copying the underlying values);
// `snapshot()` is the proposed non-consuming finish and does not exist yet
let array = builder.snapshot();

// the existing builder can still be used to append
builder.append_value(2);
builder.append_value(3);

I don't fully follow the proposal in https://gist.github.com/avantgardnerio/48d977ea6bd28c790cfb6df09250336d

As soon as you have this function:

    pub fn as_slice(&self) -> RecordBatch {
        self.record_batch.slice(0, self.len())
    }

This means that multiple locations may now be able to read that data, so trying to update it as well will likely be an exercise in frustration: you will be fighting the Rust compiler, trying to teach it that you have somehow guaranteed the memory remains valid and that concurrent reads and modification are ok.

Maybe we could start with the "make a snapshot-based copy" approach, and if the copying turns out to be too expensive, figure out some different approach.

avantgardnerio commented 2 years ago

Others might still find it useful, but I've convinced myself that, given how we are planning to handle MVCC, we'll need to make copies anyway, so the builder approach is probably fine (given a non-consuming .finish()).