apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[R] Weird R error: Error in fs___FileSystem__GetTargetInfos_FileSelector(self, x) : ignoring SIGPIPE signal #32026

Open asfimport opened 2 years ago

asfimport commented 2 years ago

Okay apologies, this is a bit of a weird error but is annoying the heck out of me.  The following block of all R code, when run with Rscript (or embedded into any form of Rmd, quarto, knitr doc) produces the error below (at least most of the time):

 


library(arrow)
library(dplyr)

Sys.setenv(AWS_EC2_METADATA_DISABLED = "TRUE")
Sys.unsetenv("AWS_ACCESS_KEY_ID")
Sys.unsetenv("AWS_SECRET_ACCESS_KEY")
Sys.unsetenv("AWS_DEFAULT_REGION")
Sys.unsetenv("AWS_S3_ENDPOINT")
s3 <- arrow::s3_bucket(bucket = "scores/parquet",
                       endpoint_override = "data.ecoforecast.org")
ds <- arrow::open_dataset(s3, partitioning = c("theme", "year"))
ds |> dplyr::filter(theme == "phenology") |> dplyr::collect()

Gives the error

 

 


Error in fs___FileSystem__GetTargetInfos_FileSelector(self, x) : 
  ignoring SIGPIPE signal
Calls: %>% ... <Anonymous> -> fs___FileSystem__GetTargetInfos_FileSelector 

But only when run as a script! When run interactively in an R console, this code runs just fine.  Even as a script the code seems to run fine, but erroneously seems to be attempting this sigpipe I don't understand.  

If the script is executed with littler (https://dirk.eddelbuettel.com/code/littler.html) then it runs fine, since littler handles sigpipe but Rscript doesn't.  But I have no idea why the above code throws a pipe in the first place.  Worse, if I choose a different filter for the above, like "aquatics", it (usually) works without the error.  

I have no idea why fs___FileSystem__GetTargetInfos_FileSelector results in this, but would really appreciate any hints on how to avoid this as it makes it very hard to use arrow in workflows right now! 

 

thanks for all you do!

 

Reporter: Carl Boettiger / @cboettig

Note: This issue was originally created as ARROW-16680. Please see the migration documentation for further details.

asfimport commented 2 years ago

Nicola Crane / @thisisnic: Thanks for reporting this @cboettig.  Not sure what's going on here, looks like it could be similar to another issue we have open which is currently unresolved: https://github.com/apache/arrow/issues/12118#issuecomment-1027823802

I've seen something here about a completely different use case (not using Arrow); someone reading from a TSV file and it not having finished reading causes a similar error: https://www.mail-archive.com/r-help@r-project.org/msg261632.html

Out of interest, how many rows does the query return that causes the issue versus a couple that don't?

@pitrou - does anything come to mind here?

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: Intuitively, this does not seem to be a problem with Arrow, especially if it happens to other people in unrelated use cases.

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: Is this reproducible using the snippet above? @paleolimbot

asfimport commented 2 years ago

Nicola Crane / @thisisnic: @pitrou Perhaps not an Arrow issue, though I'm wondering if there's something about running it in a script and the reading from S3 being incomplete which is causing the issue.

asfimport commented 2 years ago

Dewey Dunnington / @paleolimbot: Hmm...I get a few things depending on what I try. I can get this to run without error, and I also got the below error once:


code <- '
library(arrow)
library(dplyr)

Sys.setenv(AWS_EC2_METADATA_DISABLED = "TRUE")
Sys.unsetenv("AWS_ACCESS_KEY_ID")
Sys.unsetenv("AWS_SECRET_ACCESS_KEY")
Sys.unsetenv("AWS_DEFAULT_REGION")
Sys.unsetenv("AWS_S3_ENDPOINT")
s3 <- arrow::s3_bucket(
  bucket = "scores/parquet",
  endpoint_override = "data.ecoforecast.org"
)
ds <- arrow::open_dataset(s3, partitioning = c("theme", "year"))
print(ds |> dplyr::filter(theme == "phenology") |> dplyr::collect())
'

tf <- tempfile()
write(code, tf)

callr::rscript(tf)

Error in `dplyr::collect()`:
! Invalid: Could not open Parquet input source 'scores/parquet/phenology/2022/phenology-2022-05-05-PEG_FUSION_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:470  iterator_.Next()
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:337  ReadNext(&batch)
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:351  ToRecordBatches()
Backtrace:
    ▆
 1. ├─base::print(dplyr::collect(dplyr::filter(ds, theme == "phenology")))
 2. ├─dplyr::collect(dplyr::filter(ds, theme == "phenology"))
 3. └─arrow:::collect.arrow_dplyr_query(...)
 4.   └─base::tryCatch(...) at r/R/dplyr-collect.R:22:2
 5.     └─base tryCatchList(expr, classes, parentenv, handlers)
 6.       └─base tryCatchOne(expr, names, parentenv, handlers[[1L]])
 7.         └─value[[3L]](cond)
 8.           └─arrow:::handle_csv_read_error(e, x$.data$schema, call) at r/R/dplyr-collect.R:27:6
 9.             └─rlang::abort(msg, call = call) at r/R/util.R:212:2
Execution halted
* Closing connection 0
* Closing connection 0
* Closing connection 0
* Closing connection 0
* Closing connection 0
* Closing connection 0
* Closing connection 0
* Closing connection 0
* Closing connection 0
Error in (function (command = NULL, args = character(), error_on_status = TRUE,  : 
  System command 'Rscript' failed, exit status: 1, stdout & stderr were printed
Type .Last.error.trace to see where the error occurred

asfimport commented 2 years ago

Carl Boettiger / @cboettig: Hi folks, thanks for testing, I know this is a weird issue.  It does not reproduce for me every time either, but most of the time. 

Some machines reproduce the error more frequently for me than others (I think ones with slightly lower-speed network connections).  If I remove the filter in the last command, it is a bit easier to reproduce the error.  (with no filter, the data has ~ 4M rows)

I think the error above about magic bytes is unrelated; I have seen that occasionally too, and I think it might just be some kind of chance lost packet / network error? 

 

The error trace from the SIGPIPE error is thrown by the arrow function, https://github.com/apache/arrow/blob/master/r/R/filesystem.R#L204

 

Anyway, I appreciate you taking a look at this weird issue at all.  Any other pointers on how to trace this down, or why fs___FileSystem__GetTargetInfos_FileSelector might involve a sigpipe in the first place?  (really tricky without being able to reproduce in an interactive session!)

asfimport commented 2 years ago

Vitalie Spinu: I am not seeing this from fs___FileSystem__GetTargetInfos_FileSelector. In fact, all I see is "Error: ignoring SIGPIPE signal" followed by "Execution halted", which pops up after my entire R script completes.

To my eye this comes from aws -> curl. This is the gdb backtrace:


Thread 12 "R" received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x7fffd37fe700 (LWP 207327)]
__libc_write (nbytes=31, buf=0x7fffc05828a3, fd=15) at ../sysdeps/unix/sysv/linux/write.c:26
26      ../sysdeps/unix/sysv/linux/write.c: No such file or directory.
(gdb) backtrace
#0  __libc_write (nbytes=31, buf=0x7fffc05828a3, fd=15) at ../sysdeps/unix/sysv/linux/write.c:26
#1  __libc_write (fd=15, buf=0x7fffc05828a3, nbytes=31) at ../sysdeps/unix/sysv/linux/write.c:24
#2  0x00007fffec3ad459 in ?? () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
#3  0x00007fffec3a863e in ?? () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
#4  0x00007fffec3a7654 in ?? () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
#5  0x00007fffec3a7b17 in BIO_write () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
#6  0x00007fffec113dde in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#7  0x00007fffec114cd9 in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#8  0x00007fffec11e88e in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#9  0x00007fffec11ca65 in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#10 0x00007fffec127ec3 in SSL_shutdown () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#11 0x00007fffec2d37c5 in ?? () from /usr/lib/x86_64-linux-gnu/libcurl.so.4
#12 0x00007fffec2d3835 in ?? () from /usr/lib/x86_64-linux-gnu/libcurl.so.4
#13 0x00007fffec2918ce in ?? () from /usr/lib/x86_64-linux-gnu/libcurl.so.4
#14 0x00007fffec294216 in ?? () from /usr/lib/x86_64-linux-gnu/libcurl.so.4
#15 0x00007fffec2a6ecf in ?? () from /usr/lib/x86_64-linux-gnu/libcurl.so.4
#16 0x00007fffec2a7d31 in curl_multi_perform () from /usr/lib/x86_64-linux-gnu/libcurl.so.4
#17 0x00007fffec29e1bb in curl_easy_perform () from /usr/lib/x86_64-linux-gnu/libcurl.so.4
#18 0x00007fffee3e247b in Aws::Http::CurlHttpClient::MakeRequest(std::shared_ptr<Aws::Http::HttpRequest> const&, Aws::Utils::RateLimits::RateLimiterInterface*, Aws::Utils::RateLimits::RateLimiterInterface*) const ()
   from /path/to/renv/library/R-4.2/x86_64-pc-linux-gnu/arrow/libs/arrow.so
#19 0x00007fffee1a721a in Aws::Client::AWSClient::AttemptOneRequest(std::shared_ptr<Aws::Http::HttpRequest> const&, Aws::AmazonWebServiceRequest const&, char const*, char const*, char const*) const ()
   from /path/to/renv/library/R-4.2/x86_64-pc-linux-gnu/arrow/libs/arrow.so
#20 0x00007fffee1bc1a3 in Aws::Client::AWSClient::AttemptExhaustively(Aws::Http::URI const&, Aws::AmazonWebServiceRequest const&, Aws::Http::HttpMethod, char const*, char const*, char const*) const ()
   from /path/to/renv/library/R-4.2/x86_64-pc-linux-gnu/arrow/libs/arrow.so
#21 0x00007fffee1bd448 in Aws::Client::AWSClient::MakeRequestWithUnparsedResponse(Aws::Http::URI const&, Aws::AmazonWebServiceRequest const&, Aws::Http::HttpMethod, char const*, char const*, char const*) const ()
   from /path/to/renv/library/R-4.2/x86_64-pc-linux-gnu/arrow/libs/arrow.so
#22 0x00007fffee2e5933 in Aws::S3::S3Client::GetObject(Aws::S3::Model::GetObjectRequest const&) const () from /path/to/renv/library/R-4.2/x86_64-pc-linux-gnu/arrow/libs/arrow.so
#23 0x00007fffedfb6234 in arrow::fs::(anonymous namespace)::ObjectInputFile::ReadAt(long, long, void*) () from /path/to/renv/library/R-4.2/x86_64-pc-linux-gnu/arrow/libs/arrow.so
#24 0x00007fffedfb6931 in arrow::fs::(anonymous namespace)::ObjectInputFile::ReadAt(long, long) () from /path/to/renv/library/R-4.2/x86_64-pc-linux-gnu/arrow/libs/arrow.so
#25 0x00007fffed378474 in arrow::internal::FnOnce<void ()>::FnImpl<std::_Bind<arrow::detail::ContinueFuture (arrow::Future<std::shared_ptr<arrow::Buffer> >, arrow::io::RandomAccessFile::ReadAsync(arrow::io::IOContext const&, long, long)::{lambda()#1})> >::invoke() () from /path/to/renv/library/R-4.2/x86_64-pc-linux-gnu/arrow/libs/arrow.so
#26 0x00007fffed402ca7 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::{lambda()#1}> > >::_M_run() ()
   from /path/to/renv/library/R-4.2/x86_64-pc-linux-gnu/arrow/libs/arrow.so
#27 0x00007ffff59d0de4 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#28 0x00007ffff77c7609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#29 0x00007ffff76ec133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

I am using an arrow dataset on an S3 location, and since the error does not occur on a specific call, I cannot apply a catch-and-retry strategy.

I have also seen this error when using the Athena ODBC driver from R.

asfimport commented 2 years ago

Vitalie Spinu: Some more digging with more debug symbols reveals that the error happens in the finalizer.

I am a complete noob around these topics, but it looks to me that at some point the finalizer is triggered, and when Aws::S3::S3Client::S3Client is destructed, curl still attempts to write to a disconnected socket in curl_easy_cleanup.

 


Thread 1 "R" received signal SIGPIPE, Broken pipe.
__libc_write (nbytes=31, buf=0x555576a55b83, fd=43) at ../sysdeps/unix/sysv/linux/write.c:26
26      ../sysdeps/unix/sysv/linux/write.c: No such file or directory.
(gdb) break
break        break-range  
(gdb) backtrace 
#0  __libc_write (nbytes=31, buf=0x555576a55b83, fd=43) at ../sysdeps/unix/sysv/linux/write.c:26
#1  __libc_write (fd=43, buf=0x555576a55b83, nbytes=31) at ../sysdeps/unix/sysv/linux/write.c:24
#2  0x00007fffe92b1459 in ?? () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
#3  0x00007fffe92ac63e in ?? () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
#4  0x00007fffe92ab654 in ?? () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
#5  0x00007fffe92abb17 in BIO_write () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
#6  0x00007fffe90f7dde in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#7  0x00007fffe90f8cd9 in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#8  0x00007fffe910288e in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#9  0x00007fffe9100a65 in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#10 0x00007fffe910bec3 in SSL_shutdown () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#11 0x00007fffe91cc7c5 in ?? () from /usr/lib/x86_64-linux-gnu/libcurl.so.4
#12 0x00007fffe91cc835 in ?? () from /usr/lib/x86_64-linux-gnu/libcurl.so.4
#13 0x00007fffe918a8ce in ?? () from /usr/lib/x86_64-linux-gnu/libcurl.so.4
#14 0x00007fffe91b795b in ?? () from /usr/lib/x86_64-linux-gnu/libcurl.so.4
#15 0x00007fffe919f336 in curl_multi_cleanup () from /usr/lib/x86_64-linux-gnu/libcurl.so.4
#16 0x00007fffe918ab43 in ?? () from /usr/lib/x86_64-linux-gnu/libcurl.so.4
#17 0x00007fffe91972ed in curl_easy_cleanup () from /usr/lib/x86_64-linux-gnu/libcurl.so.4
#18 0x00007fffec96aa14 in Aws::Http::CurlHandleContainer::~CurlHandleContainer (this=0x55556bd25ff8, __in_chrg=<optimized out>)
    at ~/vc/arrow/cpp/build/awssdk_ep-prefix/src/awssdk_ep/aws-cpp-sdk-core/source/http/curl/CurlHandleContainer.cpp:31
#19 0x00007fffec952377 in Aws::Http::CurlHttpClient::~CurlHttpClient (this=0x55556bd25f90, __in_chrg=<optimized out>) at /usr/include/c++/9/ext/new_allocator.h:89
#20 0x00007fffeabde0f2 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x55556bd25f80) at /usr/include/c++/9/bits/shared_ptr_base.h:155
#21 0x00007fffec7f4ba5 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x5555758352d0, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr_base.h:1169
#22 std::__shared_ptr<Aws::Http::HttpClient, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x5555758352c8, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr_base.h:1169
#23 std::shared_ptr<Aws::Http::HttpClient>::~shared_ptr (this=0x5555758352c8, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr.h:103
#24 Aws::Client::AWSClient::~AWSClient (this=0x5555758352a0, __in_chrg=<optimized out>) at ~/vc/arrow/cpp/build/awssdk_ep-prefix/src/awssdk_ep/aws-cpp-sdk-core/include/aws/core/client/AWSClient.h:106
#25 Aws::Client::AWSXMLClient::~AWSXMLClient (this=0x5555758352a0, __in_chrg=<optimized out>) at ~/vc/arrow/cpp/build/awssdk_ep-prefix/src/awssdk_ep/aws-cpp-sdk-core/include/aws/core/client/AWSClient.h:401
#26 Aws::S3::S3Client::~S3Client (this=0x5555758352a0, __in_chrg=<optimized out>) at ~/vc/arrow/cpp/build/awssdk_ep-prefix/src/awssdk_ep/aws-cpp-sdk-s3/source/S3Client.cpp:160
#27 0x00007fffec52daf4 in arrow::fs::(anonymous namespace)::S3Client::~S3Client (this=0x5555758352a0, __in_chrg=<optimized out>) at ~/vc/arrow/cpp/src/arrow/filesystem/s3fs.cc:554
#28 0x00007fffec52eefd in __gnu_cxx::new_allocator<arrow::fs::(anonymous namespace)::S3Client>::destroy<arrow::fs::(anonymous namespace)::S3Client> (this=0x5555758352a0, __p=0x5555758352a0)
    at /usr/include/c++/9/ext/new_allocator.h:152
#29 0x00007fffec52eb6f in std::allocator_traits<std::allocator<arrow::fs::(anonymous namespace)::S3Client> >::destroy<arrow::fs::(anonymous namespace)::S3Client> (__a=..., __p=0x5555758352a0)
    at /usr/include/c++/9/bits/alloc_traits.h:496
#30 0x00007fffec52e4e1 in std::_Sp_counted_ptr_inplace<arrow::fs::(anonymous namespace)::S3Client, std::allocator<arrow::fs::(anonymous namespace)::S3Client>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x555575835290)
    at /usr/include/c++/9/bits/shared_ptr_base.h:557
#31 0x00007fffeabde0f2 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x555575835290) at /usr/include/c++/9/bits/shared_ptr_base.h:155
#32 0x00007fffeabda9bb in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x5555685ba510, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr_base.h:730
#33 0x00007fffec511cc4 in std::__shared_ptr<arrow::fs::(anonymous namespace)::S3Client, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x5555685ba508, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr_base.h:1169
#34 0x00007fffec511ce0 in std::shared_ptr<arrow::fs::(anonymous namespace)::S3Client>::~shared_ptr (this=0x5555685ba508, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr.h:103
#35 0x00007fffec57c38e in arrow::fs::S3FileSystem::Impl::~Impl (this=0x5555685ba0d0, __in_chrg=<optimized out>) at ~/vc/arrow/cpp/src/arrow/filesystem/s3fs.cc:1647
#36 0x00007fffec57c3e0 in __gnu_cxx::new_allocator<arrow::fs::S3FileSystem::Impl>::destroy<arrow::fs::S3FileSystem::Impl> (this=0x5555685ba0d0, __p=0x5555685ba0d0) at /usr/include/c++/9/ext/new_allocator.h:152
#37 0x00007fffec57aff5 in std::allocator_traits<std::allocator<arrow::fs::S3FileSystem::Impl> >::destroy<arrow::fs::S3FileSystem::Impl> (__a=..., __p=0x5555685ba0d0) at /usr/include/c++/9/bits/alloc_traits.h:496
#38 0x00007fffec57995f in std::_Sp_counted_ptr_inplace<arrow::fs::S3FileSystem::Impl, std::allocator<arrow::fs::S3FileSystem::Impl>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x5555685ba0c0)
    at /usr/include/c++/9/bits/shared_ptr_base.h:557
#39 0x00007fffeabde0f2 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x5555685ba0c0) at /usr/include/c++/9/bits/shared_ptr_base.h:155
#40 0x00007fffeabda9bb in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x555568408930, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr_base.h:730
#41 0x00007fffec536ec6 in std::__shared_ptr<arrow::fs::S3FileSystem::Impl, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x555568408928, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr_base.h:1169
#42 0x00007fffec536ee6 in std::shared_ptr<arrow::fs::S3FileSystem::Impl>::~shared_ptr (this=0x555568408928, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr.h:103
#43 0x00007fffec51b3da in arrow::fs::S3FileSystem::~S3FileSystem (this=0x5555684088e0, __in_chrg=<optimized out>) at ~/vc/arrow/cpp/src/arrow/filesystem/s3fs.cc:2206
#44 0x00007fffec51b406 in arrow::fs::S3FileSystem::~S3FileSystem (this=0x5555684088e0, __in_chrg=<optimized out>) at ~/vc/arrow/cpp/src/arrow/filesystem/s3fs.cc:2206
#45 0x00007fffec57ab32 in std::_Sp_counted_ptr<arrow::fs::S3FileSystem*, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x555565dc4cb0) at /usr/include/c++/9/bits/shared_ptr_base.h:377
#46 0x00007fffee754ed8 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x555565dc4cb0) at /usr/include/c++/9/bits/shared_ptr_base.h:148
#47 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x555565dc4cb0) at /usr/include/c++/9/bits/shared_ptr_base.h:148
#48 0x00007fffee35eba9 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x555570637880, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr_base.h:730
#49 0x00007fffee37b412 in std::__shared_ptr<arrow::fs::FileSystem, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x555570637878, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr_base.h:1169
#50 0x00007fffee37b45c in std::shared_ptr<arrow::fs::FileSystem>::~shared_ptr (this=0x555570637878, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr.h:103
#51 0x00007fffec49b75a in arrow::fs::SubTreeFileSystem::~SubTreeFileSystem (this=0x555570637810, __in_chrg=<optimized out>) at ~/vc/arrow/cpp/src/arrow/filesystem/filesystem.cc:276
#52 0x00007fffee754ed8 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x555570637800) at /usr/include/c++/9/bits/shared_ptr_base.h:148
#53 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x555570637800) at /usr/include/c++/9/bits/shared_ptr_base.h:148
#54 0x00007fffee35eba9 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x555568218770, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr_base.h:730
#55 0x00007fffee37b412 in std::__shared_ptr<arrow::fs::FileSystem, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x555568218768, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr_base.h:1169
#56 0x00007fffee37b45c in std::shared_ptr<arrow::fs::FileSystem>::~shared_ptr (this=0x555568218768, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr.h:103
#57 0x00007fffee3c23d4 in arrow::dataset::FileSystemDataset::~FileSystemDataset (this=0x555568218720, __in_chrg=<optimized out>) at ~/vc/arrow/cpp/src/arrow/dataset/file_base.h:222
#58 0x00007fffee3c2410 in arrow::dataset::FileSystemDataset::~FileSystemDataset (this=0x555568218720, __in_chrg=<optimized out>) at ~/vc/arrow/cpp/src/arrow/dataset/file_base.h:222
#59 0x00007fffee3d19e2 in std::_Sp_counted_ptr<arrow::dataset::FileSystemDataset*, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x55556a3ca000) at /usr/include/c++/9/bits/shared_ptr_base.h:377
#60 0x00007fffee7ad357 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x55556a3ca000) at /usr/include/c++/9/bits/shared_ptr_base.h:148
#61 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x55556a3ca000) at /usr/include/c++/9/bits/shared_ptr_base.h:148
#62 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x555568170738, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr_base.h:730
#63 std::__shared_ptr<arrow::dataset::Dataset, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x555568170730, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr_base.h:1169
#64 std::shared_ptr<arrow::dataset::Dataset>::~shared_ptr (this=0x555568170730, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/shared_ptr.h:103
#65 cpp11::default_deleter<std::shared_ptr<arrow::dataset::Dataset> > (obj=0x555568170730) at /path/to/renv/library/R-4.2/x86_64-pc-linux-gnu/cpp11/include/cpp11/external_pointer.hpp:17
#66 cpp11::external_pointer<std::shared_ptr<arrow::dataset::Dataset>, &cpp11::default_deleter<std::shared_ptr<arrow::dataset::Dataset> > >::r_deleter (p=<optimized out>) at /renv/library/R-4.2/x86_64-pc-linux-gnu/cpp11/include/cpp11/external_pointer.hpp:47
#67 cpp11::external_pointer<std::shared_ptr<arrow::dataset::Dataset>, &cpp11::default_deleter<std::shared_ptr<arrow::dataset::Dataset> > >::r_deleter (p=<optimized out>) at /renv/library/R-4.2/x86_64-pc-linux-gnu/cpp11/include/cpp11/external_pointer.hpp:36
#68 0x00005555556dcb77 in R_RunWeakRefFinalizer (w=<optimized out>) at memory.c:1500
#69 0x00005555556dcde3 in RunFinalizers () at memory.c:1567
#70 0x00005555556dcf85 in R_RunPendingFinalizers () at memory.c:1603
#71 0x00005555556937ba in bc_check_sigint () at eval.c:5550
#72 bcEval (body=<optimized out>, rho=<optimized out>, useCache=<optimized out>) at eval.c:6744
#73 0x000055555569cd58 in Rf_eval (e=0x5555678ed528, rho=rho@entry=0x55557677c728) at eval.c:748
#74 0x000055555569ea8f in R_execClosure (call=call@entry=0x555567b579f8, newrho=newrho@entry=0x55557677c728, sysparent=<optimized out>, rho=rho@entry=0x5555767b7008, arglist=arglist@entry=0x55557677c6b8, op=op@entry=0x5555678ed1a8) at eval.c:1918
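The shape of this backtrace can be imitated outside R and the AWS SDK. The following is a minimal C++ sketch, not the actual SDK or curl code: `FakeClient` is a hypothetical stand-in for an object that, like the `SSL_shutdown` call in the trace, sends a final message from its destructor, and a Unix socket pair stands in for the TLS connection. It assumes a POSIX system.

```cpp
#include <csignal>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Set from the signal handler; sig_atomic_t is safe to write there.
static volatile std::sig_atomic_t g_sigpipe_seen = 0;

extern "C" void record_sigpipe(int) { g_sigpipe_seen = 1; }

// Hypothetical stand-in for a client that owns a connection and sends a
// final "goodbye" message from its destructor.
class FakeClient {
 public:
  explicit FakeClient(int fd) : fd_(fd) {}
  ~FakeClient() {
    const char bye[] = "close-notify";
    // If the peer is already gone, this write raises SIGPIPE.
    ssize_t n = ::write(fd_, bye, sizeof(bye));
    (void)n;
    ::close(fd_);
  }

 private:
  int fd_;
};

// Returns true if destroying the client after the peer closed raised SIGPIPE.
bool destructor_raises_sigpipe() {
  g_sigpipe_seen = 0;
  auto previous = std::signal(SIGPIPE, record_sigpipe);
  int fds[2];
  if (::socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
    std::signal(SIGPIPE, previous);
    return false;
  }
  {
    FakeClient client(fds[0]);
    ::close(fds[1]);  // the "server" side goes away first
  }                   // ~FakeClient runs here and writes to a dead socket
  std::signal(SIGPIPE, previous);
  return g_sigpipe_seen == 1;
}
```

On Linux, `destructor_raises_sigpipe()` is expected to return true: the peer is closed before the object is destroyed, so the destructor's write hits a dead socket. In the sketch the destructor runs at a known point; in the trace above it runs from an R finalizer, whose timing is up to the garbage collector, which may be why the failure is intermittent.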

asfimport commented 2 years ago

Carl Boettiger / @cboettig: Hi arrow devs, apologies that this one is hard to write a reprex for, but this issue is still killing me.  The issue happens when running as an external command (Rscript, knit, and now quarto as well) for most non-trivial scripts that touch S3 using arrow.  At the moment, the only successful workaround I've found has been using littler's r instead of Rscript, which understands sigpipes and thus doesn't error under these conditions.  Unfortunately that does not help for standard workflows that rely on things like quarto or blogdown or the many other tools in the RStudio markdown ecosystem that all interpret this as an error. 

Here's another attempt at a reprex:


download.file("https://github.com/cboettig/forecasts-darts-framework/raw/main/weather-covariates.qmd", "sigpipe.qmd")
quarto::quarto_render("sigpipe.qmd")
# ...
#>  $ include: logi FALSE
#> 
#> 
#> output file: sigpipe.knit.md
#> 
#> Error: ignoring SIGPIPE signal
#> Execution halted
#> 
#> Error in "processx::run(quarto_bin, args, echo = TRUE)": ! System command 'quarto' failed

Created on 2022-08-03 by the reprex package (v2.0.1) 

asfimport commented 2 years ago

Carl Boettiger / @cboettig: Additional note – the behavior seems to be specific to Linux.  Here's a GH-Actions task that runs the identical script on Linux, Mac, and Windows, reproducing this error on Linux but succeeding with the same script on Windows: https://github.com/cboettig/scratch/runs/7662520931?check_suite_focus=true .  Even then the issue is slightly difficult to reproduce, it occurs frequently but not every time, as visible in the GH-Actions log there.  Apologies, I realize that makes it hard to reproduce and debug.

asfimport commented 2 years ago

Carl Boettiger / @cboettig: FWIW, the sigpipe error also appears more likely with larger data, e.g. if I add an additional filter to run the same code on a smaller subset of the data, I can almost always avoid the error. Definitely a nuisance to debug.  Feels to me like there is some kind of race condition occurring, where the finalizer is closing a task before curl is done listening.  (Maybe the discussion here is relevant: https://stackoverflow.com/questions/28915838/piping-rscript-gives-error-after-output, but I'm well out of my depth.) 

asfimport commented 2 years ago

Dewey Dunnington / @paleolimbot: Thanks for keeping on this!

FWIW, the deletion that seems to cause the sigpipe happens here: https://github.com/aws/aws-sdk-cpp/blob/main/aws-cpp-sdk-core/source/http/curl/CurlHandleContainer.cpp#L25-L33

...and there is a way to disable sigpipe errors that was broken at one point: https://github.com/curl/curl/issues/3138 . That issue described a race condition that happens when objects get deleted that triggers a sigpipe, which seems consistent with what you're seeing (intermittent failure coming from a deleter).

That fix looks like it was in CURL 7.62.0, and Ubuntu focal is at 7.68.0 at least (and you're running on newer Ubuntu than that).

It seems like adding curl_easy_setopt(curl, CURLOPT_NOSIGNAL, 1L); right here https://github.com/aws/aws-sdk-cpp/blob/main/aws-cpp-sdk-core/source/http/curl/CurlHandleContainer.cpp#L89 might work?

asfimport commented 2 years ago

Neal Richardson / @nealrichardson: Is it possible for @cboettig to set that curl option outside of the aws-sdk (like the curl timeout issue) and prove out that it works?

If so, at a minimum we could introduce a patch step in our aws-sdk-cpp build to add that in, as well as upstream the fix (IDK when we'll upgrade to the latest aws-sdk-cpp even if they do accept the PR).

asfimport commented 2 years ago

Vitalie Spinu: AWS does set the option https://github.com/aws/aws-sdk-cpp/blob/main/aws-cpp-sdk-core/source/http/curl/CurlHandleContainer.cpp#L142 to 1. The curl doc is a bit confusing though:

 


Setting CURLOPT_NOSIGNAL to 1 makes libcurl NOT ask the system to
ignore SIGPIPE signals, which otherwise are sent by the system
when trying to send data to a socket which is closed in the
other end. libcurl makes an effort to never cause such SIGPIPEs
to trigger, but some operating systems have no way to avoid them
and even on those that have there are some corner cases when
they may still happen, contrary to our desire. In addition,
using CURLAUTH_NTLM_WB authentication could cause a SIGCHLD
signal to be raised.

It looks like there is no way to reliably avoid those sigpipes. Hence, maybe the right approach would be to handle sigpipes the way the plasma code does?
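For code that owns the socket call itself, Linux also offers a per-call alternative to signal handling: the `MSG_NOSIGNAL` flag to `send()`. The sketch below is an illustration only, with a hypothetical function name and a Unix socket pair standing in for a real connection. It would not help with the backtraces above, where the write goes through `BIO_write` and plain `write()`, which takes no flags.

```cpp
#include <cerrno>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Sends to a socket whose peer has already closed, passing MSG_NOSIGNAL so
// that the failure is reported through errno instead of a SIGPIPE signal.
// Returns the errno of the failed send, 0 if the send unexpectedly
// succeeded, or -1 if the socket pair could not be created.
int send_to_closed_peer_errno() {
  int fds[2];
  if (::socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return -1;
  ::close(fds[1]);  // peer goes away before we send

  const char msg[] = "hello";
  errno = 0;
  ssize_t n = ::send(fds[0], msg, sizeof(msg), MSG_NOSIGNAL);
  int err = (n < 0) ? errno : 0;

  ::close(fds[0]);
  return err;
}
```

Under these assumptions the function should return `EPIPE` without any signal being delivered, even when SIGPIPE still has its default (process-terminating) disposition.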

asfimport commented 2 years ago

Carl Boettiger / @cboettig: Hi folks, not to nag but this issue is still killing us.  It seems only to occur when accessing relatively large remote S3 data, and even then isn't 100% repeatable, but I can't avoid it by setting CURLOPT_NOSIGNAL.  This prevents us from using arrow in large automated workflows...

We can avoid it by running using littler instead of R / RScript, as littler can accept the sigpipe, but that is of no use in other tools that control how R is called,  such as quarto notebooks.  We reported this to the quarto team (https://github.com/quarto-dev/quarto-cli/issues/1667#issuecomment-1204554958) but after some trial mechanisms to avoid it JJ suggested it really needs to be resolved upstream instead...

asfimport commented 2 years ago

Dewey Dunnington / @paleolimbot: It sounds like ignoring sigpipe unconditionally (i.e., in Arrow C++ or AWS SDK C++ code) is generally considered a bad idea; however, ignoring it within the session where you're running into this problem is probably fine. I can't reproduce this locally but you could try something like this as a workaround:


cpp11::cpp_source(code = '
#include <csignal>
#include <cpp11.hpp>

[[cpp11::register]] void ignore_sigpipes() {
  signal(SIGPIPE, SIG_IGN);
}
')

ignore_sigpipes()
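
What this workaround changes can be seen in isolation with plain C++. The sketch below is a hypothetical, R-free illustration (the function name is made up and a POSIX system is assumed): once SIGPIPE is set to `SIG_IGN`, a write to a pipe with no reader fails with `EPIPE` instead of delivering a signal, so the caller sees an ordinary error return.

```cpp
#include <cerrno>
#include <csignal>
#include <sys/types.h>
#include <unistd.h>

// Ignores SIGPIPE process-wide (the same call as in the workaround), then
// writes to a pipe whose read end is closed.
// Returns the errno of the failed write, 0 if the write unexpectedly
// succeeded, or -1 if the pipe could not be created.
int write_to_closed_pipe_with_sigpipe_ignored() {
  std::signal(SIGPIPE, SIG_IGN);

  int fds[2];
  if (::pipe(fds) != 0) return -1;
  ::close(fds[0]);  // no reader left

  const char msg[] = "x";
  errno = 0;
  ssize_t n = ::write(fds[1], msg, sizeof(msg));
  int err = (n < 0) ? errno : 0;

  ::close(fds[1]);
  return err;
}
```

The function is expected to return `EPIPE`. Note that the disposition stays ignored after the call, just as it does for the rest of the R session after `ignore_sigpipes()`.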

asfimport commented 2 years ago

Carl Boettiger / @cboettig: Wow, thanks Dewey!  That looks like black magic to me but I can definitely confirm that it works!

 

Still a bit stuck on the right thing to do in cases where we are providing user-facing packages that rely on arrow functions to access large external data, like you say I don't mind doing this in my scripts but it seems poor form to invisibly impose this on users where it may have side-effects with their other stuff?
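One way to limit the side effects, sketched here as an assumption rather than anything arrow provides, is to ignore SIGPIPE only for a bounded scope and restore the previous disposition afterwards. `ScopedIgnoreSigpipe` and `sigpipe_is_ignored` are hypothetical names, and a POSIX system is assumed.

```cpp
#include <signal.h>

// Hypothetical helper: ignores SIGPIPE for the lifetime of the object and
// restores whatever disposition was installed before it was created.
class ScopedIgnoreSigpipe {
 public:
  ScopedIgnoreSigpipe() {
    struct sigaction ignore = {};
    ignore.sa_handler = SIG_IGN;
    sigemptyset(&ignore.sa_mask);
    active_ = (sigaction(SIGPIPE, &ignore, &previous_) == 0);
  }
  ~ScopedIgnoreSigpipe() {
    if (active_) sigaction(SIGPIPE, &previous_, nullptr);
  }
  ScopedIgnoreSigpipe(const ScopedIgnoreSigpipe&) = delete;
  ScopedIgnoreSigpipe& operator=(const ScopedIgnoreSigpipe&) = delete;

 private:
  struct sigaction previous_ = {};
  bool active_ = false;
};

// Returns true if SIGPIPE is currently set to be ignored.
bool sigpipe_is_ignored() {
  struct sigaction current = {};
  if (sigaction(SIGPIPE, nullptr, &current) != 0) return false;
  return current.sa_handler == SIG_IGN;
}
```

A caveat: the earlier backtrace shows the signal being raised from an R finalizer, which can run well after the call that created the object. A scoped guard like this would only cover the failure if the objects are destroyed while the guard is still alive, so it is a partial answer at best.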

vhh711 commented 1 year ago

Hi all,

I've encountered this same issue in work we're doing to convert data to parquet files as part of our data pipeline. I'm curious if any progress has been made on a resolution? Much like @cboettig , we're unable to pinpoint exactly what causes the SIGPIPE error.

We first encountered the error in deploying arrow to Posit Connect, but subsequently encountered the error when knitting locally.

Edited: We did implement the workaround mentioned here (https://github.com/apache/arrow/issues/32026#issuecomment-1378103197). We had previously isolated the code chunk that seemed to trigger the error, so we inserted the workaround chunk just before it.

Implementing this work around did allow us to successfully knit our document.

cboettig commented 1 year ago

Yup, we're relying on the above work-around too. @vhh711 is your code also accessing remote data (e.g. over the S3 bucket interface?) By the way, my understanding from the arrow team is that this issue comes from how the Amazon SDK chooses to handle sigpipe signals from the curl C library, and it's not obviously a bug, so it's not really an issue the arrow devs can do much about.

I understand why arrow doesn't want to just go about setting this ignore behavior by default, but I do wonder if it might at least be nice to have the CPP decor trick to ignore sigpipes as an opt-in ignore_sigpipe() function within the R package. That would probably make it easier for users encountering the error to discover and implement this work-around at least?

vhh711 commented 1 year ago

@cboettig Seems I spoke too soon. While the resolution solved the problem locally, once we deployed the resolution on Posit Connect (where our analytics content is hosted), we received the same error.

The process we have set up does involve retrieving data from and saving to S3.

I do agree with your suggestion about an opt-in function. This was quite the rabbit to chase down, and having something in the R package documentation would have been immensely helpful!

mderoy commented 1 year ago

We also hit this issue, but with a pure C++ application (no R). Our scenario, with many parallel reads and high volumes of data, reproduces this issue pretty consistently. (We're using arrow 6.0.1 at the moment though, which is a bit old.)

westonpace commented 1 year ago

Just scanning through this issue again: do we think this is possibly a shutdown bug? In other words, is this crash always happening during shutdown? One of the stack traces above seemed to be happening at shutdown. It would also explain the "crashes when run as a script but not when run interactively" behavior.

westonpace commented 1 year ago

Does adding a sleep at the end of the script fix it (not that this is a great workaround either)?