apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

What purpose does ArrowRecordBatch solve? #14303

Open shivam-880 opened 2 years ago

shivam-880 commented 2 years ago

I was going through the Flight Java example and was wondering whether we can keep a VectorSchemaRoot directly in the Dataset instead of a list of ArrowRecordBatch objects.

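```java
import java.util.List;

import org.apache.arrow.util.AutoCloseables;
import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;
import org.apache.arrow.vector.types.pojo.Schema;

// Holds the record batches of one uploaded dataset together with its schema.
class Dataset implements AutoCloseable {
    private final List<ArrowRecordBatch> batches;
    private final Schema schema;
    private final long rows;

    public Dataset(List<ArrowRecordBatch> batches, Schema schema, long rows) {
        this.batches = batches;
        this.schema = schema;
        this.rows = rows;
    }

    public List<ArrowRecordBatch> getBatches() {
        return batches;
    }

    public Schema getSchema() {
        return schema;
    }

    public long getRows() {
        return rows;
    }

    // Releases the buffers held by every stored batch.
    @Override
    public void close() throws Exception {
        AutoCloseables.close(batches);
    }
}
```
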
lwhite1 commented 2 years ago

An ArrowRecordBatch represents a RecordBatch IPC message, which is the form in which the Dataset's data is transferred. VectorSchemaRoot isn't implemented in a way that would let the conversion step be skipped.
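
To make the conversion concrete, here is a minimal sketch of the round trip between the two types, using `VectorUnloader` to get an `ArrowRecordBatch` out of a `VectorSchemaRoot` and `VectorLoader` to load it into another root. The class name `RoundTripSketch`, the method `roundTrip`, and the single `id` column are made up for illustration and are not part of the Flight example.

```java
import java.util.Collections;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorLoader;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.VectorUnloader;
import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;

// Hypothetical helper, for illustration only.
public class RoundTripSketch {

    // Writes the values into a root, unloads them into a record batch,
    // loads that batch into a second root, and returns what was read back.
    static int[] roundTrip(int[] values) {
        // Assumed schema: one nullable 32-bit signed integer column named "id".
        Schema schema = new Schema(Collections.singletonList(
                Field.nullable("id", new ArrowType.Int(32, true))));

        try (BufferAllocator allocator = new RootAllocator();
             VectorSchemaRoot source = VectorSchemaRoot.create(schema, allocator)) {

            IntVector ids = (IntVector) source.getVector("id");
            ids.allocateNew(values.length);
            for (int i = 0; i < values.length; i++) {
                ids.set(i, values[i]);
            }
            source.setRowCount(values.length);

            // VectorSchemaRoot -> ArrowRecordBatch (the IPC message form).
            try (ArrowRecordBatch batch = new VectorUnloader(source).getRecordBatch();
                 VectorSchemaRoot target = VectorSchemaRoot.create(schema, allocator)) {

                // ArrowRecordBatch -> VectorSchemaRoot.
                new VectorLoader(target).load(batch);

                IntVector loaded = (IntVector) target.getVector("id");
                int[] result = new int[target.getRowCount()];
                for (int i = 0; i < result.length; i++) {
                    result[i] = loaded.get(i);
                }
                return result;
            }
        }
    }

    public static void main(String[] args) {
        int[] out = roundTrip(new int[] {1, 2, 3});
        System.out.println(out.length + " rows read back");
    }
}
```

A batch obtained this way holds its own references to the underlying buffers, which is why both the batch and the roots are closed here and why `Dataset.close()` closes every stored batch.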

lidavidm commented 1 year ago

@iamsmkr do you still have questions here?