apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Java] Exception when trying to load next batch while reading a parquet file #41893

Open vinayvenk opened 1 month ago

vinayvenk commented 1 month ago

Describe the bug, including details regarding any error messages, version, and platform.

Getting this exception when trying to load the next batch while reading a parquet file. The parsing works if the batch size is big enough to process all the parquet contents in one shot, but if I use a smaller batch size, the code breaks with the exception below.

java.lang.IllegalArgumentException: should have as many children as in the schema: found 0 expected 8
    at org.apache.arrow.util.Preconditions.checkArgument(Preconditions.java:282)
    at org.apache.arrow.vector.VectorLoader.loadBuffers(VectorLoader.java:127)
    at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:84)
    at org.apache.arrow.c.Data.importIntoVectorSchemaRoot(Data.java:334)
    at org.apache.arrow.dataset.jni.NativeScanner$NativeReader.loadNextBatch(NativeScanner.java:151)
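[Editor's note] To make the error message concrete: the check that fires here verifies that the target VectorSchemaRoot still has one child vector per schema field before buffers are loaded, so "found 0 expected 8" means the root had been emptied of its eight column vectors. A minimal stdlib-only sketch of that precondition (ChildCountCheck, checkArgument, and load are simplified stand-ins, not the Arrow API):

```java
import java.util.List;

public class ChildCountCheck {
    // Simplified stand-in for org.apache.arrow.util.Preconditions.checkArgument
    static void checkArgument(boolean ok, String template, Object... args) {
        if (!ok) {
            throw new IllegalArgumentException(String.format(template, args));
        }
    }

    // Mirrors the shape of the VectorLoader.loadBuffers check: the root's
    // child vectors must match the schema's field count before loading.
    static void load(List<String> rootChildren, int schemaFieldCount) {
        checkArgument(rootChildren.size() == schemaFieldCount,
            "should have as many children as in the schema: found %s expected %s",
            rootChildren.size(), schemaFieldCount);
    }

    public static void main(String[] args) {
        try {
            load(List.of(), 8); // an emptied root vs. an 8-column schema
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```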

Component(s)

Java

amoeba commented 1 month ago

Hi @vinayvenk, what Arrow Java version are you on? Can you share the code you ran, ideally with code that can generate the data causing this?

vinayvenk commented 1 month ago

GM @amoeba, it is pretty much the standard code I got from the example, and I tried versions 16.0.1, 16.0.0, and 15.0.1:

String uri = parquetFile.toURI().toString();
ScanOptions options = new ScanOptions(/*batchSize*/ 32768);
try (BufferAllocator allocator = new RootAllocator();
     DatasetFactory datasetFactory = new FileSystemDatasetFactory(
         allocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
     Dataset dataset = datasetFactory.finish();
     Scanner scanner = dataset.newScan(options);
     ArrowReader reader = scanner.scanBatches()) {
    int batchCount = 0;
    while (reader.loadNextBatch()) {
        try (VectorSchemaRoot root = reader.getVectorSchemaRoot()) {
            // function to create csv data
            createCSV();
        }
        batchCount++;
    }
}

It fails when it tries to load the second batch.
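[Editor's note] The failure on the second batch is consistent with a resource-ownership pitfall: the ArrowReader owns the VectorSchemaRoot it hands out, so closing that root inside the loop (via try-with-resources per batch) releases its child vectors, and the next loadNextBatch() then finds a root with zero children. A stdlib-only sketch of that pattern, where Reader and Root are hypothetical mock classes standing in for the Arrow types, not the real API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in: a reader that owns one reusable root and, like
// VectorLoader, requires the root's child count to match the schema.
class Root implements AutoCloseable {
    final List<String> children = new ArrayList<>(List.of("c1", "c2"));
    @Override public void close() { children.clear(); } // releasing empties the root
}

class Reader {
    private final Root root = new Root();   // reader-owned, reused for every batch
    private int remainingBatches = 2;

    Root getRoot() { return root; }

    boolean loadNextBatch() {
        if (remainingBatches-- <= 0) return false;
        if (root.children.size() != 2) {    // mirrors the precondition in the trace
            throw new IllegalArgumentException(
                "should have as many children as in the schema: found "
                + root.children.size() + " expected 2");
        }
        return true;
    }
}

public class OwnershipDemo {
    public static void main(String[] args) {
        Reader reader = new Reader();
        try {
            while (reader.loadNextBatch()) {
                // Closing the reader-owned root each iteration breaks the next load:
                try (Root root = reader.getRoot()) {
                    // process batch...
                }
            }
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // fails on the second batch
        }
    }
}
```

Under that assumption, the fix would be to keep using the root inside the loop without closing it per batch and let the reader's own close() release it.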