datafusion-contrib / datafusion-java

Java binding to Apache Arrow DataFusion
Apache License 2.0

option has_header true is ignored #146

Open alexradzin opened 2 months ago

I tried to run a simple example with a CSV file that has a header row.

name,age
Alice,29
Bob,31
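
For anyone reproducing this, the sample file can be written with a few lines of plain Java. This is only a convenience sketch: the class name `SampleCsvWriter` and the `writeSample` helper are hypothetical and not part of datafusion-java; the directory `/tmp/test` matches the LOCATION used in the statement that follows.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class SampleCsvWriter {

  /** Writes the three-line sample (header plus two rows) to dir/test.csv and returns the file path. */
  static Path writeSample(Path dir) throws IOException {
    Files.createDirectories(dir);
    Path file = dir.resolve("test.csv");
    // Files.write terminates every line, including the last one, with a line separator.
    return Files.write(file, List.of("name,age", "Alice,29", "Bob,31"));
  }

  public static void main(String[] args) throws IOException {
    System.out.println("wrote " + writeSample(Path.of("/tmp/test")));
  }
}
```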

I then created an external table as follows:

      context
          .sql("CREATE EXTERNAL TABLE test_table (name VARCHAR, age INT) STORED AS CSV LOCATION '/tmp/test/test.csv' OPTIONS ('has_header' 'true');")
          .thenComposeAsync(df -> df.collect(allocator))
          .join();

... and then executed the query:

      context.sql("select * from test_table").thenComposeAsync(DataFrame::show).join();

As a result, I got the following exception:

Exception in thread "main" java.util.concurrent.CompletionException: java.lang.RuntimeException: Arrow error: Parser error: Error while parsing value age for column 1 at line 0
    at java.base/java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:368)
    at java.base/java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:377)
    at java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1152)
    at java.base/java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:483)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:387)
    at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1312)
    at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1843)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1808)
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188)
Caused by: java.lang.RuntimeException: Arrow error: Parser error: Error while parsing value age for column 1 at line 0
    at org.apache.arrow.datafusion.DefaultDataFrame$RuntimeExceptionCallback.accept(DefaultDataFrame.java:127)
    at org.apache.arrow.datafusion.DefaultDataFrame$RuntimeExceptionCallback.accept(DefaultDataFrame.java:117)
    at org.apache.arrow.datafusion.DataFrames.showDataframe(Native Method)
    at org.apache.arrow.datafusion.DefaultDataFrame.show(DefaultDataFrame.java:70)
    at java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1150)
    ... 6 more

I have also implemented my own "show()" method:

  private static void show(ArrowReader reader) {
    try {
      VectorSchemaRoot root = reader.getVectorSchemaRoot();
      System.out.println(root.getSchema().getFields());
      while (reader.loadNextBatch()) {
        int n = root.getFieldVectors().size();
        System.out.println(root.getFieldVectors().stream().map(v -> v.getField().getName() + ":" + v.getField().getFieldType().getType()).collect(Collectors.joining("|")));
        int rows = root.getRowCount();
        for (int r = 0; r < rows; r++) {
          for (int i = 0; i < n; i++) {
            FieldVector nameVector = root.getVector(i);
            System.out.print(nameVector.getObject(r) + " | ");
          }
          System.out.println();
        }
      }
      reader.close();
    } catch (IOException e) {
      logger.warn("got IO Exception", e);
    }
  }

and used it as follows:

      context
          .sql("select * from test_table")
          .thenComposeAsync(df -> df.collect(allocator))
          .thenAccept(ExampleMain::show)
          .join();

In this case the error message looks like this:

thread '<unnamed>' panicked at src/dataframe.rs:29:14:
failed to collect dataframe: ArrowError(ParseError("Error while parsing value age for column 1 at line 0"))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fatal runtime error: failed to initiate panic, error 5

Both examples run without an exception if the CSV file has no header row, or if the age column is declared as VARCHAR. In the latter case the code works, but the header is returned as the first row of data. Using format.has_header instead of has_header does not help.
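
The symptom is consistent with the header row being handed to the INT parser as if it were data. The stdlib-only sketch below is not DataFusion code; it only illustrates that reading of the error. The class `HeaderSkipSketch`, the `parseAges` helper and its `hasHeader` flag are hypothetical, and the message text is copied from the exception above purely for comparison.

```java
import java.util.ArrayList;
import java.util.List;

public class HeaderSkipSketch {

  /** Parses column 1 of each CSV line as an int, skipping the first line only when hasHeader is true. */
  static List<Integer> parseAges(List<String> lines, boolean hasHeader) {
    List<Integer> ages = new ArrayList<>();
    for (int i = hasHeader ? 1 : 0; i < lines.size(); i++) {
      String[] cells = lines.get(i).split(",");
      try {
        ages.add(Integer.parseInt(cells[1].trim()));
      } catch (NumberFormatException e) {
        throw new IllegalStateException(
            "Error while parsing value " + cells[1] + " for column 1 at line " + i, e);
      }
    }
    return ages;
  }

  public static void main(String[] args) {
    List<String> lines = List.of("name,age", "Alice,29", "Bob,31");

    // Header honoured: only the two data rows reach the int parser.
    System.out.println(parseAges(lines, true)); // prints [29, 31]

    // Header ignored: the literal string "age" is parsed as an int and fails on line 0.
    try {
      parseAges(lines, false);
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
      // prints Error while parsing value age for column 1 at line 0
    }
  }
}
```

With the flag off, the failure lands on line 0 with the value "age", which matches the reported message. With age declared as VARCHAR nothing needs to be parsed, so the header would simply come back as a data row.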

Note that the same scenario works correctly for me with datafusion-cli. It looks like OPTIONS ('has_header' 'true') is simply ignored when running with datafusion-java. This is strange because, as far as I can see, datafusion-java is just a thin JNI wrapper over the native DataFusion API.

I am running on Ubuntu with Java 21 (in case it matters).