apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++][Java] When I use the DatasetFileWriter::write method to write a file, I can't specify a file name #43489

Open shouriken opened 1 month ago

shouriken commented 1 month ago

Describe the usage question you have. Please include as many useful details as possible.

I use this static method to write a Parquet file to a filesystem. I pass an empty partition array so that the output is a single file, and I pass "test.parquet" as the baseNameTemplate argument to specify the file name, but this leads to an error: basename_template did not contain '{i}'.

public static void write(BufferAllocator allocator, ArrowReader reader, FileFormat format, String uri,
    String[] partitionColumns, int maxPartitions, String baseNameTemplate) {
  try (final ArrowArrayStream stream = ArrowArrayStream.allocateNew(allocator)) {
    Data.exportArrayStream(allocator, reader, stream);
    JniWrapper.get().writeFromScannerToFile(stream.memoryAddress(),
        format.id(), uri, partitionColumns, maxPartitions, baseNameTemplate);
  }
}

After reading the C++ JNI code and the C++ dataset code, it seems that FileSystemDatasetWriteOptions::basename_template does not support specifying a file name without {i}.

/// \brief Options for writing a dataset.
struct ARROW_DS_EXPORT FileSystemDatasetWriteOptions {
  // ...
  /// Template string used to generate fragment basenames.
  /// {i} will be replaced by an auto incremented integer.
  std::string basename_template;
  // ...
};

Is there any way to specify the file name when there are no partitions?

Component(s)

Java

vibhatha commented 1 month ago

cc @felipecrv

felipecrv commented 1 month ago

This is the case because the names of files in a dataset have to follow a convention to ensure they can be read as a dataset. Maybe you should use directories and files directly to have your own conventions?

The {i} is how order is imposed in the list of files. The order is needed when reading the dataset files.
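To illustrate the convention described here, below is a minimal Java sketch of how a basename template with an {i} placeholder might be validated and expanded. It is not Arrow's implementation: the class name BasenameTemplateSketch and the method render are hypothetical, and the only details taken from this thread are the {i} token and the error message text.

```java
// Illustrative sketch only; BasenameTemplateSketch and render are hypothetical names,
// not part of Apache Arrow.
public final class BasenameTemplateSketch {
    // Placeholder token that the writer replaces with an incrementing integer.
    private static final String TOKEN = "{i}";

    // Expands the template for the file with the given index, rejecting
    // templates that lack the placeholder (as the error in this issue does).
    public static String render(String template, int index) {
        if (!template.contains(TOKEN)) {
            throw new IllegalArgumentException(
                "basename_template did not contain '" + TOKEN + "'");
        }
        return template.replace(TOKEN, Integer.toString(index));
    }

    public static void main(String[] args) {
        // A template with the placeholder yields one distinct name per index.
        System.out.println(render("part-{i}.parquet", 0)); // prints part-0.parquet
        System.out.println(render("part-{i}.parquet", 1)); // prints part-1.parquet

        // A fixed name such as "test.parquet" is rejected.
        try {
            render("test.parquet", 0);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Because every file written gets a distinct index, a fixed name like "test.parquet" cannot be used as the template: a second file would collide with the first.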

shouriken commented 1 month ago

Yes, I want to specify the path of the target file directly and write the whole dataset, without column partitioning, to that file. But I can't find a method that does this. Is there one available? Thanks, @felipecrv

felipecrv commented 1 month ago

@vibhatha knows more about the Java APIs. It looks like filesystem APIs aren't exposed yet in Arrow Java. (?)
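One possible workaround, sketched here under assumptions rather than confirmed by the maintainers: write the dataset into an empty staging directory with a template that contains {i} (for example "part-{i}.parquet"), then rename the resulting file to the desired name using the JDK's java.nio.file API. The sketch assumes a local filesystem and that the writer produced exactly one file in the staging directory; the class name SingleFileRenameSketch and the method moveSingleOutput are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative sketch only; not part of Apache Arrow. Works on local paths.
public final class SingleFileRenameSketch {

    // Moves the single regular file found in stagingDir to target.
    // Fails if the directory holds zero files or more than one file.
    public static Path moveSingleOutput(Path stagingDir, Path target) throws IOException {
        List<Path> files;
        try (Stream<Path> entries = Files.list(stagingDir)) {
            files = entries.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        if (files.size() != 1) {
            throw new IllegalStateException(
                "expected exactly one file in " + stagingDir + ", found " + files.size());
        }
        return Files.move(files.get(0), target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path staging = Files.createTempDirectory("dataset-staging");
        // In real use, this file would come from a call such as
        // DatasetFileWriter.write(allocator, reader, format, stagingUri,
        //     new String[0], maxPartitions, "part-{i}.parquet");
        // here a stand-in file is created so the sketch runs without Arrow.
        Files.write(staging.resolve("part-0.parquet"), new byte[] {1, 2, 3});

        Path target = staging.resolveSibling("test-" + System.nanoTime() + ".parquet");
        Path moved = moveSingleOutput(staging, target);
        System.out.println("moved to " + moved);
    }
}
```

The explicit single-file check is deliberate: if the writer ever emits several files, the rename fails loudly instead of silently keeping only one of them.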