apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[R] arrow failing on mac prerel #41267

Open tdhock opened 4 months ago

tdhock commented 4 months ago

Describe the bug, including details regarding any error messages, version, and platform.

Hi! I expected the R arrow package to pass checks on all CRAN machines, including r-prerel-macos-arm64. However, I observe that the check is failing for R arrow, and also for my package nc, which suggests arrow and uses it in an example:

  > ### Name: capture_first_glob
  > ### Title: capture first glob
  > ### Aliases: capture_first_glob
  > 
  > ### ** Examples
  > 
  > 
  > data.table::setDTthreads(1)
  > 
  > ## Example 0: iris data, one file per species.
  > library(data.table)
  > dir.create(iris.dir <- tempfile())
  > icsv <- function(sp)file.path(iris.dir, paste0(sp, ".csv"))
  > data.table(iris)[, fwrite(.SD, icsv(Species)), by=Species]
  Empty data.table (0 rows and 1 cols): Species
  > dir(iris.dir)
  [1] "setosa.csv"     "versicolor.csv" "virginica.csv" 
  > data.table::fread(file.path(iris.dir,"setosa.csv"), nrows=2)
     Sepal.Length Sepal.Width Petal.Length Petal.Width
            <num>       <num>        <num>       <num>
  1:          5.1         3.5          1.4         0.2
  2:          4.9         3.0          1.4         0.2
  > (iglob <- file.path(iris.dir,"*.csv"))
  [1] "/var/folders/k4/0jwzxmln0nb8y6rkzprptb640000gq/T//RtmpxNPzlE/file463c41145017/*.csv"
  > nc::capture_first_glob(iglob, Species="[^/]+", "[.]csv")
         Species Sepal.Length Sepal.Width Petal.Length Petal.Width
          <char>        <num>       <num>        <num>       <num>
    1:    setosa          5.1         3.5          1.4         0.2
    2:    setosa          4.9         3.0          1.4         0.2
    3:    setosa          4.7         3.2          1.3         0.2
    4:    setosa          4.6         3.1          1.5         0.2
    5:    setosa          5.0         3.6          1.4         0.2
   ---                                                            
  146: virginica          6.7         3.0          5.2         2.3
  147: virginica          6.3         2.5          5.0         1.9
  148: virginica          6.5         3.0          5.2         2.0
  149: virginica          6.2         3.4          5.4         2.3
  150: virginica          5.9         3.0          5.1         1.8
  > 
  > ## Example 1: four files, two capture groups, custom read function.
  > db <- system.file("extdata/chip-seq-chunk-db", package="nc", mustWork=TRUE)
  > suffix <- if(interactive())"gz" else "head"
  > (glob <- paste0(db, "/*/*/counts/*", suffix))
  [1] "/Volumes/Builds/packages/big-sur-arm64/results/4.4/nc.Rcheck/nc/extdata/chip-seq-chunk-db/*/*/counts/*head"
  > Sys.glob(glob)
  [1] "/Volumes/Builds/packages/big-sur-arm64/results/4.4/nc.Rcheck/nc/extdata/chip-seq-chunk-db/H3K36me3_AM_immune/9/counts/McGill0101.bedGraph.head"
  [2] "/Volumes/Builds/packages/big-sur-arm64/results/4.4/nc.Rcheck/nc/extdata/chip-seq-chunk-db/H3K36me3_TDH_other/1/counts/McGill0019.bedGraph.head"
  [3] "/Volumes/Builds/packages/big-sur-arm64/results/4.4/nc.Rcheck/nc/extdata/chip-seq-chunk-db/H3K4me3_TDH_immune/9/counts/McGill0024.bedGraph.head"
  [4] "/Volumes/Builds/packages/big-sur-arm64/results/4.4/nc.Rcheck/nc/extdata/chip-seq-chunk-db/H3K4me3_XJ_immune/2/counts/McGill0024.bedGraph.head" 
  > read.bedGraph <- function(f)data.table::fread(
  +   f, skip=1, col.names = c("chrom","start", "end", "count"))
  > data.chunk.pattern <- list(
  +   data="H.*?",
  +   "/",
  +   chunk="[0-9]+", as.integer)
  > (data.chunk.dt <- nc::capture_first_glob(glob, data.chunk.pattern, READ=read.bedGraph))
                    data chunk  chrom     start       end count
                  <char> <int> <char>     <int>     <int> <int>
   1: H3K36me3_AM_immune     9  chr10 111456281 111456338     2
   2: H3K36me3_AM_immune     9  chr10 111456338 111456381     1
   3: H3K36me3_AM_immune     9  chr10 111456381 111459312     0
   4: H3K36me3_AM_immune     9  chr10 111459312 111459316     5
   5: H3K36me3_AM_immune     9  chr10 111459316 111459409    10
   6: H3K36me3_AM_immune     9  chr10 111459409 111459411     8
   7: H3K36me3_AM_immune     9  chr10 111459411 111459415     5
   8: H3K36me3_AM_immune     9  chr10 111459415 111463412     0
   9: H3K36me3_AM_immune     9  chr10 111463412 111463512     2
  10: H3K36me3_AM_immune     9  chr10 111463512 111466726     0
  11: H3K36me3_TDH_other     1  chr21  43119165  43119386     0
  12: H3K36me3_TDH_other     1  chr21  43119386  43119407     1
  13: H3K36me3_TDH_other     1  chr21  43119407  43119475     2
  14: H3K36me3_TDH_other     1  chr21  43119475  43119502     1
  15: H3K36me3_TDH_other     1  chr21  43119502  43119987     0
  16: H3K36me3_TDH_other     1  chr21  43119987  43120007     1
  17: H3K36me3_TDH_other     1  chr21  43120007  43120086     2
  18: H3K36me3_TDH_other     1  chr21  43120086  43120107     1
  19: H3K36me3_TDH_other     1  chr21  43120107  43120743     0
  20: H3K36me3_TDH_other     1  chr21  43120743  43120789     1
  21: H3K4me3_TDH_immune     9   chr1  36926536  36926549    10
  22: H3K4me3_TDH_immune     9   chr1  36926549  36926554     9
  23: H3K4me3_TDH_immune     9   chr1  36926554  36926565    11
  24: H3K4me3_TDH_immune     9   chr1  36926565  36926569     9
  25: H3K4me3_TDH_immune     9   chr1  36926569  36926571     8
  26: H3K4me3_TDH_immune     9   chr1  36926571  36926580     7
  27: H3K4me3_TDH_immune     9   chr1  36926580  36926593     8
  28: H3K4me3_TDH_immune     9   chr1  36926593  36926606     7
  29: H3K4me3_TDH_immune     9   chr1  36926606  36926622     8
  30: H3K4me3_TDH_immune     9   chr1  36926622  36926634     9
  31:  H3K4me3_XJ_immune     2  chr22  20688396  20688502     0
  32:  H3K4me3_XJ_immune     2  chr22  20688502  20688602     1
  33:  H3K4me3_XJ_immune     2  chr22  20688602  20688869     0
  34:  H3K4me3_XJ_immune     2  chr22  20688869  20688932     2
  35:  H3K4me3_XJ_immune     2  chr22  20688932  20688934     3
  36:  H3K4me3_XJ_immune     2  chr22  20688934  20688936     4
  37:  H3K4me3_XJ_immune     2  chr22  20688936  20688963     5
  38:  H3K4me3_XJ_immune     2  chr22  20688963  20688968     7
  39:  H3K4me3_XJ_immune     2  chr22  20688968  20688969     6
  40:  H3K4me3_XJ_immune     2  chr22  20688969  20688979     5
                    data chunk  chrom     start       end count
  > 
  > ## Write same data set in Hive partition, then re-read.
  > if(requireNamespace("arrow")){
  +   path <- tempfile()
  +   max_rows_per_file <- if(interactive())3 else 1000
  +   arrow::write_dataset(
  +     dataset=data.chunk.dt,
  +     path=path,
  +     format="csv",
  +     partitioning=c("data","chunk"),
  +     max_rows_per_file=max_rows_per_file)
  +   hive.glob <- file.path(path, "*", "*", "*.csv")
  +   hive.pattern <- list(
  +     nc::field("data","=",".*?"),
  +     "/",
  +     nc::field("chunk","=",".*?", as.integer),
  +     "/",
  +     nc::field("part","-","[0-9]+", as.integer))
  +   hive.dt <- nc::capture_first_glob(hive.glob, hive.pattern)
  +   hive.dt[, .(rows=.N), by=.(data,chunk,part)]
  + }
  Loading required namespace: arrow
  Error in dataset___HivePartitioning(schm, null_fallback = null_fallback_or_default(null_fallback),  : 
    Cannot call dataset___HivePartitioning(). See https://arrow.apache.org/docs/r/articles/install.html for help installing Arrow C++ libraries. 
  Calls: <Anonymous> -> <Anonymous> -> dataset___HivePartitioning
  Execution halted

It looks like the arrow C++ library is not installed correctly. Can you please investigate and fix?

In the arrow check results at https://cloud.r-project.org/web/checks/check_results_arrow.html, I see that it does not install:

Version: 15.0.1
Check: whether package can be installed
Result: ERROR
  Installation failed.
Flavors: [r-prerel-macos-arm64](https://www.r-project.org/nosvn/R.check/r-prerel-macos-arm64/arrow-00check.html), [r-prerel-macos-x86_64](https://www.r-project.org/nosvn/R.check/r-prerel-macos-x86_64/arrow-00check.html)

Component(s)

R

assignUser commented 4 months ago

This is caused by gnulibtool being on the path. Brew specifically warns against doing this and it has to be added manually.

We have a check for this in 16.0.0 (which is currently in the release process), as this happened on the other Mac platforms as well but was silently corrected by CRAN at some point.

dawsonv commented 3 months ago

I'm still encountering this issue with r-release-macos-arm64 15.0.1 on my 2020 M1 MacBook Air. I got arrow up and running with install_arrow(), like so:

install.packages("arrow") # install arrow
arrow::install_arrow() # reinstall arrow to fix issues
.rs.restartR() # restart R session (RStudio)

This isn't the expected/desired behavior, but it seems to work for now.

tdhock commented 1 month ago

This is still an issue for me (an error during CRAN checks of my package nc, which Suggests arrow): https://www.r-project.org/nosvn/R.check/r-release-macos-arm64/nc-00check.html If there is no way to fix it on your end, can you please tell me how to write code that tests whether the installed arrow binary can call dataset___HivePartitioning()? Right now my condition is

if(requireNamespace("arrow")){

but maybe I could change it to something like below?

if(requireNamespace("arrow") && arrow::binary_supports("dataset___HivePartitioning")){

Is that error message stable (that is, not likely to change in the future)? If so, I could wrap everything in a tryCatch.
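A minimal sketch of the tryCatch idea, written so that it catches any error rather than matching the message text, and therefore does not depend on that message staying stable. The wrapper name `try_arrow` is hypothetical and not part of arrow; the data frame is a stand-in for the real example data.

```r
# Hypothetical wrapper (not part of arrow): evaluate an arrow-dependent
# expression and return NULL, with a message, if it fails for any reason,
# for example because the installed binary lacks a C++ feature.
try_arrow <- function(code) {
  tryCatch(code, error = function(e) {
    message("skipping arrow example: ", conditionMessage(e))
    NULL
  })
}

result <- try_arrow({
  if (!requireNamespace("arrow", quietly = TRUE)) stop("arrow not installed")
  arrow::write_dataset(data.frame(x = 1:3), tempfile(), format = "csv")
  "ok"
})
```

Because `code` is evaluated lazily inside `tryCatch`, any error raised while running the block is handled by the wrapper. The trade-off is that unrelated errors are also swallowed, so a capability check before running the block is more precise where one is available.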

assignUser commented 1 month ago

Hm, I was unable to reproduce the issue on my Mac with the current 16.1.0 arrow binary from CRAN. Looking at the R version ("using R version 4.4.0 alpha (2024-03-31 r86238)"), maybe the runner doesn't have the recent arrow version but rather the previous version that was built without dataset support?

You can use arrow::arrow_with_dataset() in addition to requireNamespace to guard that section. That should resolve this issue even with an outdated version.
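As a rough sketch of that guard, assuming the example only needs the dataset component: the condition below checks both that arrow is installed and that its C++ library was built with dataset support. The helper name `can_use_arrow_dataset` and the small data frame are illustrative and not taken from nc or arrow.

```r
# Hypothetical helper (not part of arrow): TRUE only when the arrow package
# is installed AND its C++ library includes dataset support.
can_use_arrow_dataset <- function() {
  requireNamespace("arrow", quietly = TRUE) && arrow::arrow_with_dataset()
}

if (can_use_arrow_dataset()) {
  path <- tempfile()
  arrow::write_dataset(
    dataset = data.frame(data = c("a", "b"), chunk = 1:2, count = c(0L, 5L)),
    path = path,
    format = "csv",
    partitioning = c("data", "chunk"))
  # List the files written under the Hive-style partition directories.
  print(dir(path, recursive = TRUE))
} else {
  message("arrow dataset support is not available; skipping example")
}
```

The `&&` operator short-circuits, so `arrow::arrow_with_dataset()` is only called when the namespace can be loaded.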