apache / arrow-java

Official Java implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

[Java] Enhancements for Java Dataset API #154

Open chitralverma opened 1 year ago

chitralverma commented 1 year ago

Describe the enhancement requested

The list below suggests some changes to improve the developer experience with the Dataset API of java/arrow. Most of these suggestions, if implemented, would also bring it in line with the pyarrow dataset API.

  1. Support for providing Filesystem options like access_key etc. programmatically. Currently only env vars are supported.
  2. Support for globbed paths and directories
  3. Excluding invalid files
  4. Additional documentation for already implemented functionality
    1. Reading and writing to remote/cloud stores (HDFS, S3, GCS ...)
    2. Clarification of behaviour when reading multiple files. When 2 or more files are supplied, they may have different schemas. Currently, only the schema of the last file is shown by .inspect(), and this is not documented anywhere. This behaviour is the same in pyarrow. Maybe it's a good idea to allow users to provide a strategy like Error, Merge, LastFile etc.
    3. Reading and writing partitioned datasets
    4. Difference between FileSystemDatasetFactory.inspect() and FileSystemDatasetFactory.finish().newScan(...).schema(). Which one to use in which case?
    5. Env vars for Filesystem are not documented
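
To make the strategy idea in item 4.2 more concrete, here is a rough, self-contained sketch in plain Java. Everything in it is hypothetical: the `Strategy` enum, the `resolve` and `schema` helpers, and the modelling of a schema as an ordered map of field name to type name (a stand-in for Arrow's `Schema`) are assumptions made for illustration and do not exist in the Dataset API. It does not show how the native layer resolves schemas today.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SchemaStrategySketch {

    // Hypothetical strategy names taken from the proposal; not part of any Arrow API.
    enum Strategy { ERROR, MERGE, LAST_FILE }

    // Builds an ordered stand-in "schema" from alternating field-name / type-name arguments.
    static Map<String, String> schema(String... nameTypePairs) {
        if (nameTypePairs.length % 2 != 0) {
            throw new IllegalArgumentException("expected name/type pairs");
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < nameTypePairs.length; i += 2) {
            fields.put(nameTypePairs[i], nameTypePairs[i + 1]);
        }
        return fields;
    }

    // Picks a single schema out of the per-file schemas according to the chosen strategy.
    static Map<String, String> resolve(List<Map<String, String>> schemas, Strategy strategy) {
        if (schemas.isEmpty()) {
            throw new IllegalArgumentException("no schemas supplied");
        }
        switch (strategy) {
            case LAST_FILE: {
                // Mirrors the behaviour described above: the last file wins.
                return schemas.get(schemas.size() - 1);
            }
            case ERROR: {
                // Fail as soon as any file disagrees with the first one.
                // Note: Map.equals compares entries only, so field order is ignored here.
                Map<String, String> first = schemas.get(0);
                for (Map<String, String> s : schemas) {
                    if (!first.equals(s)) {
                        throw new IllegalStateException("schemas differ: " + first + " vs " + s);
                    }
                }
                return first;
            }
            case MERGE: {
                // Union of all fields in first-seen order; the same name with two types is an error.
                Map<String, String> merged = new LinkedHashMap<>();
                for (Map<String, String> s : schemas) {
                    for (Map.Entry<String, String> field : s.entrySet()) {
                        String previous = merged.putIfAbsent(field.getKey(), field.getValue());
                        if (previous != null && !previous.equals(field.getValue())) {
                            throw new IllegalStateException(
                                "conflicting types for field " + field.getKey());
                        }
                    }
                }
                return merged;
            }
            default:
                throw new AssertionError("unknown strategy: " + strategy);
        }
    }

    public static void main(String[] args) {
        Map<String, String> first = schema("id", "int64", "name", "utf8");
        Map<String, String> last = schema("id", "int64", "price", "float64");
        List<Map<String, String>> both = Arrays.asList(first, last);

        System.out.println(resolve(both, Strategy.LAST_FILE));
        System.out.println(resolve(both, Strategy.MERGE));
    }
}
```

With two files that share `id` but otherwise differ, `LAST_FILE` returns only the second file's fields, `MERGE` returns the union of both, and `ERROR` throws. A real implementation would presumably need to plumb the choice through to the C++ layer rather than resolve it on the Java side.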

Please let me know if the above makes sense; I can help with PRs for the same.

Component(s)

Java

danepitkin commented 1 year ago

Hey @chitralverma, I'm in favor of improving the Java dataset APIs to provide functionality similar to pyarrow's. Both are bindings to the C++ implementation, so they should be able to provide the same functionality.

> Please let me know if the above make sense, I can help with PRs for the same.

Thank you, I look forward to your contributions!