apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[R] Error when reading parquet file using FileSystem object #12118

Open everron opened 2 years ago

everron commented 2 years ago

Hello,

I am encountering an issue when trying to read a parquet file using read_parquet with an S3FileSystem created with s3_bucket().

I created a worker that gets the ID of the last parquet file uploaded to an S3 bucket (using the S3 API) and then tries to read the file by calling the $path() method with my filename as the argument. I created a custom function, read_table_fromS3, to do this. The error occurs on the second call to read_parquet() when running my script with Rscript inside a Docker container built with all the required dependencies and access to AWS credentials:

The error I catch:

Error in fs___FileSystem__OpenInputFile(self, clean_path_rel(path)) :
   ignoring SIGPIPE signal
Calls: read_table_fromS3 ... make_readable_file -> <Anonymous> -> fs___FileSystem__OpenInputFile

Here is a (non-reproducible) sample of what my script is doing:

library(arrow)

s3target <- s3_bucket(my_aws_s3_bucket,
  access_key = my_aws_access_key,
  secret_key = my_aws_secret_key
)

# oversimplified version of my actual function
read_table_fromS3 <- function(s3bucket, id) {
  file_path <-  paste0(id, ".parquet")
  path <-  s3bucket$path(file_path)
  read_parquet(path)
}

# first read_parquet() call: OK
read_table_fromS3(s3target, <a file id>)

# get last file identifier
last_file_id <- get_last_file_id(...)

#
# instructions to test if last_file_id is the latest available (can take several minutes/hours)
#

# second read_parquet() call: fails
read_table_fromS3(s3target, last_file_id)
#> Error in fs___FileSystem__OpenInputFile(self, clean_path_rel(path)) :
#>   ignoring SIGPIPE signal

Here is the sessionInfo() from my running container:

> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] optparse_1.7.1      googlesheets4_1.0.0 lubridate_1.8.0
 [4] futile.logger_1.4.3 stringr_1.4.0       arrow_6.0.1
 [7] paws.storage_0.1.12 data.table_1.14.2   here_1.0.1
[10] glue_1.6.0

loaded via a namespace (and not attached):
 [1] cellranger_1.1.0     compiler_4.1.1       pillar_1.6.4
 [4] formatR_1.11         futile.options_1.0.1 tools_4.1.1
 [7] digest_0.6.29        bit_4.0.4            googledrive_2.0.0
[10] jsonlite_1.7.2       gargle_1.2.0         lifecycle_1.0.1
[13] tibble_3.1.6         pkgconfig_2.0.3      rlang_0.4.12
[16] yaml_2.2.1           curl_4.3.2           xml2_1.3.3
[19] httr_1.4.2           askpass_1.1          generics_0.1.1
[22] fs_1.5.2             vctrs_0.3.8          rprojroot_2.0.2
[25] bit64_4.0.5          tidyselect_1.1.1     getopt_1.20.3
[28] R6_2.5.1             fansi_0.5.0          paws.common_0.3.15
[31] purrr_0.3.4          lambda.r_1.2.4       magrittr_2.0.1
[34] ellipsis_0.3.2       assertthat_0.2.1     config_0.3.1
[37] utf8_1.2.2           stringi_1.7.6        openssl_1.4.6
[40] crayon_1.4.2

dragosmg commented 2 years ago

Hi @everron and thanks for raising this issue. We use Jira to track issues. Would you mind if we moved the conversation there?

everron commented 2 years ago

Hi, no problem. I found a workaround in the meantime but I was not sure if this was a real issue or not.

dragosmg commented 2 years ago

Thanks. I think the "ignoring SIGPIPE signal" message points to an issue with a CLI pipe. This likely happens when a connection is broken, closed, or invalid, e.g. when an R worker crashes, and it might not have anything to do with S3. "SIGPIPE means that R is trying to write somewhere which doesn't listen." (Simon Urbanek in this thread) More on the topic:

everron commented 2 years ago

This was my first thought since this error only occurred after a second connection attempt.

Indeed, this is probably not related to S3. As a workaround, I added a retry for when the call to read_parquet() fails. It's pretty ugly, but I cannot figure out how to keep the connection to S3 alive.
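
For anyone hitting the same error, a retry of the kind described above could look roughly like this. It is only a minimal sketch in base R, not the exact code used in my script: with_retry() and its max_tries and wait_seconds parameters are hypothetical names made up for illustration, and nothing here is part of the arrow package.

```r
# Hypothetical helper (not part of arrow): call `f()` up to `max_tries` times,
# sleeping `wait_seconds` between failed attempts. If every attempt fails, the
# last error is re-raised.
with_retry <- function(f, max_tries = 3, wait_seconds = 1) {
  stopifnot(is.function(f), max_tries >= 1)
  last_error <- NULL
  for (attempt in seq_len(max_tries)) {
    # Capture either the value or the error so the loop can decide what to do.
    result <- tryCatch(
      list(ok = TRUE, value = f()),
      error = function(e) list(ok = FALSE, error = e)
    )
    if (result$ok) {
      return(result$value)
    }
    last_error <- result$error
    message(sprintf(
      "Attempt %d/%d failed: %s",
      attempt, max_tries, conditionMessage(last_error)
    ))
    if (attempt < max_tries) Sys.sleep(wait_seconds)
  }
  stop(last_error)
}

# Usage sketch (not run here), reusing the variable names from the script in
# the first comment. Rebuilding the bucket handle on every attempt is an
# assumption: it may avoid reusing a stale connection, but this is unverified.
# df <- with_retry(function() {
#   s3target <- s3_bucket(my_aws_s3_bucket,
#     access_key = my_aws_access_key,
#     secret_key = my_aws_secret_key
#   )
#   read_parquet(s3target$path(paste0(last_file_id, ".parquet")))
# })
```

Passing the read as a function keeps the retry logic separate from the S3 code, so the same wrapper can be reused around any call that fails intermittently.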

sonomatechDS commented 1 year ago

I'm seeing a similar issue to this post (error thrown due to SIGPIPE), but when deploying a shiny app on either shinyapps.io or deploy.

The attached SO post documents well the exact behavior that I'm seeing, but I don't see how to find a workaround when I don't have control over command-line arguments. The SO poster found a difference in whether the AWS bucket was public or private, positing that the issue could be related to how Arrow maintains AWS credentialed access to a private bucket.

Any thoughts?

Link to SO post