apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Java] Unable to read S3 files using Arrow Dataset #34071

Open marun224 opened 1 year ago

marun224 commented 1 year ago

Describe the usage question you have. Please include as many useful details as possible.

Even though all AWS credentials are set, I get the error below when trying to read S3 files using the Arrow Dataset Java API. Please advise which AWS properties need to be set for this to work correctly.

java.lang.RuntimeException: When resolving region for bucket : AWS Error NETWORK_CONNECTION during HeadBucket operation: Encountered network error when sending http request
    at org.apache.arrow.dataset.file.JniWrapper.makeFileSystemDatasetFactory(Native Method)
    at org.apache.arrow.dataset.file.FileSystemDatasetFactory.createNative(FileSystemDatasetFactory.java:35)
    at org.apache.arrow.dataset.file.FileSystemDatasetFactory.<init>(FileSystemDatasetFactory.java:31)
    at com.arrow.dataset.ParquetFileToArrowReader.readS3Parquet(ParquetFileToArrowReader.java:114)
    at com.arrow.dataset.ParquetFileToArrowReader.main(ParquetFileToArrowReader.java:36)

Component(s)

Java

kou commented 1 year ago

Could you share your code? Alternatively, the implementation of this process may help you: https://github.com/apache/arrow/blob/apache-arrow-11.0.0/cpp/src/arrow/filesystem/s3fs.cc#L546-L577

marun224 commented 1 year ago

I am using the standard code from the Arrow Java cookbook. An updated ~/.aws/credentials file is available in my user directory.

I also tried setting the environment variables AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN as described at https://cran.r-project.org/web/packages/arrow/vignettes/fs.html, but I still get the same error. Any help on this would be highly appreciated.

String uri = "s3://bucket_name/sub-directory";
ScanOptions options = new ScanOptions(/*batchSize*/ 32768);
try (
    BufferAllocator allocator = new RootAllocator();
    DatasetFactory datasetFactory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
    Dataset dataset = datasetFactory.finish();
    Scanner scanner = dataset.newScan(options)
) {
    System.out.println(StreamSupport.stream(scanner.scan().spliterator(), false).count());
} catch (Exception e) {
    e.printStackTrace();
}
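
One thing worth ruling out with the environment-variable approach is whether the variables are actually visible to the JVM, since a process launched from an IDE or a service manager does not necessarily inherit what was exported in a shell. The following is a minimal sketch of such a check, using only the variable names mentioned above; the class and method names are hypothetical and this is not part of the Arrow API. AWS_SESSION_TOKEN is treated as optional here because it is only needed for temporary credentials.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Arrow): reports which credential
// variables are absent or blank in a given environment map.
class AwsEnvCheck {
    // AWS_SESSION_TOKEN is deliberately not listed: it is only required
    // when temporary credentials are in use.
    static final List<String> REQUIRED =
            Arrays.asList("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY");

    static List<String> missingVariables(Map<String, String> env) {
        List<String> result = new ArrayList<>();
        for (String name : REQUIRED) {
            String value = env.get(name);
            if (value == null || value.trim().isEmpty()) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // System.getenv() reflects the environment of this JVM process,
        // which may differ from the interactive shell's environment.
        List<String> missing = missingVariables(System.getenv());
        if (missing.isEmpty()) {
            System.out.println("Credential variables are visible to the JVM.");
        } else {
            System.out.println("Not visible to the JVM: " + missing);
        }
    }
}
```

Running this from the same launcher as the failing program shows whether the credentials reach the process at all. Note that the reported error is a NETWORK_CONNECTION failure during HeadBucket, so visible credentials alone may not resolve it.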

kou commented 1 year ago

Does String uri = "s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/part-0.parquet"; work?

Are you running your program on EC2?

Does String uri = "s3://${access_key}:${secret_key}@${bucket_name}/${path}"; work?
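
For the last form, secret keys often contain characters such as "/" or "+" that are not valid inside the user-info part of a URI, so they most likely need to be percent-encoded before being substituted in. Below is a minimal sketch of building such a URI in Java. The class and method names are hypothetical. The optional `region` query parameter is, to my knowledge, recognized by Arrow's C++ S3 URI parsing and would avoid the region lookup that fails in the stack trace, but treat that as an assumption to verify against your Arrow version.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper (not part of Arrow): assembles an S3 URI of the form
// s3://access_key:secret_key@bucket/path[?region=...].
class S3UriSketch {
    // Percent-encodes a URI component. URLEncoder targets form encoding,
    // so spaces come out as "+"; they are rewritten to "%20" here.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    static String buildS3Uri(String accessKey, String secretKey,
                             String bucket, String path, String region) {
        StringBuilder uri = new StringBuilder("s3://");
        if (accessKey != null && secretKey != null) {
            uri.append(encode(accessKey)).append(':')
               .append(encode(secretKey)).append('@');
        }
        // Bucket and path are appended as-is; this sketch assumes they
        // contain no characters that need escaping.
        uri.append(bucket).append('/').append(path);
        if (region != null && !region.isEmpty()) {
            // Assumption: the "region" query parameter is honored by the
            // Arrow version in use.
            uri.append("?region=").append(encode(region));
        }
        return uri.toString();
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only.
        System.out.println(buildS3Uri("AKIAEXAMPLE", "abc/def+ghi",
                "bucket_name", "sub-directory", "us-east-1"));
    }
}
```

The resulting string can be passed as the `uri` argument to `FileSystemDatasetFactory` in the snippet above. Embedding credentials in a URI is mainly useful as a diagnostic step, since the URI can end up in logs and stack traces.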