apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[R][C++] Reporting progress from copy_files()? #30157

Open asfimport opened 2 years ago

asfimport commented 2 years ago

Would it be possible to have copy_files(), which calls CopyFiles from FileSystem, report its progress? When copying huge files, the R session appears to hang and the user can't tell whether it is still working.

Reporter: Nicola Crane / @thisisnic

Note: This issue was originally created as ARROW-14611. Please see the migration documentation for further details.

asfimport commented 2 years ago

Dewey Dunnington / @paleolimbot:

I took the opportunity to learn a bit about the C++ sources! I didn’t find a way to put a callback into CopyFiles that gives any progress info, but perhaps there is one.

C++ source from the R package: https://github.com/apache/arrow/blob/master/r/src/filesystem.cpp#L267-L275

Implementation for arrow::fs::CopyFiles: https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/filesystem.cc#L586-L607


library(cpp11)
Sys.setenv(
  PKG_CXXFLAGS = paste0("-I", Sys.getenv("ARROW_HOME"), "/include"),
  PKG_LIBS = paste0("-L", Sys.getenv("ARROW_HOME"), "/lib", " -larrow")
)

cpp11::cpp_source(code = '

#include <cpp11.hpp>
#include <arrow/filesystem/api.h>
using namespace cpp11;
using namespace arrow;

[[cpp11::register]]
void copy_files2(std::string src_dir, std::string dst_dir) {
  auto fs = std::make_shared<fs::LocalFileSystem>();
  fs::FileSelector source_sel;
  source_sel.base_dir = src_dir;

  Status status = fs::CopyFiles(fs, source_sel, fs, dst_dir);

  if (!status.ok()) {
    std::string s = status.ToString();
    stop("%s", s.c_str());
  }
}

')

source_dir <- tempfile()
dest_dir <- tempfile()

dir.create(source_dir)
for (i in 1:1000) {
  write(
    as.character(1:i),
    sprintf("%s/file%03d.txt", source_dir, i)
  )
}
dir.create(dest_dir)

copy_files2(source_dir, dest_dir)
waldo::compare(list.files(source_dir), list.files(dest_dir))
#> ✓ No differences

If there is a way to do this with a callback, C++ progress bars using the progress package might be useful? https://github.com/r-lib/progress#c-api

asfimport commented 2 years ago

Weston Pace / @westonpace:

I'll express some reluctance here about adding to the C++ implementation. Arrow isn't really meant to be a cross-platform filesystem library. In the past we've shied away from adding things to the filesystem abstraction that aren't strictly needed by Arrow's internals. Keep in mind that any capability added to one filesystem needs to be added to all the rest (e.g. S3, GCS, and any fsspec-compatible filesystems that users have already created).

However, you should be able to implement this yourself by opening an input file and output file and manually doing the copy.
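As a rough illustration of the manual approach, here is a minimal sketch of a chunked copy that reports progress after every chunk. It uses standard C++ file streams on local files as a stand-in; with Arrow, the same loop would presumably read from and write to the streams returned by the filesystem's OpenInputStream and OpenOutputStream. The names CopyWithProgress and ProgressCallback and the default chunk size are hypothetical, chosen only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical callback type: (bytes copied so far, total bytes in the source).
using ProgressCallback = std::function<void(std::uintmax_t, std::uintmax_t)>;

// Copy `src` to `dst` in fixed-size chunks, calling `on_progress` after each
// chunk is written. Returns the number of bytes copied.
// Assumption: both paths are local files; this is not Arrow's CopyFiles.
std::uintmax_t CopyWithProgress(const std::string& src, const std::string& dst,
                                const ProgressCallback& on_progress,
                                std::size_t chunk_size = 1 << 20) {
  assert(chunk_size > 0);

  std::ifstream in(src, std::ios::binary);
  if (!in) throw std::runtime_error("cannot open input file: " + src);
  std::ofstream out(dst, std::ios::binary | std::ios::trunc);
  if (!out) throw std::runtime_error("cannot open output file: " + dst);

  const std::uintmax_t total = std::filesystem::file_size(src);
  std::vector<char> buffer(chunk_size);
  std::uintmax_t copied = 0;

  while (in) {
    in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
    const std::streamsize n = in.gcount();  // bytes actually read this round
    if (n <= 0) break;                      // nothing left to copy
    out.write(buffer.data(), n);
    if (!out) throw std::runtime_error("write failed: " + dst);
    copied += static_cast<std::uintmax_t>(n);
    if (on_progress) on_progress(copied, total);
  }
  if (in.bad()) throw std::runtime_error("read failed: " + src);

  out.flush();
  if (!out) throw std::runtime_error("flush failed: " + dst);
  return copied;
}
```

A caller could pass a lambda that updates a progress bar or prints a percentage. Copying a whole directory would mean listing the source files first and calling this once per file.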

In fact, CopyFiles doesn't appear to be used internally anywhere (or CopyFile for that matter). It might actually make sense to remove these capabilities from the filesystem abstraction.