apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++] Segfault when reading a Parquet file as a Dataset but not when read as an individual file #36807

Open thisisnic opened 1 year ago

thisisnic commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

I'm using the Arrow R package version 12.0.1.1 and am getting a segfault when trying to read a Parquet file. Here's the output with the debugger attached:

> library(fs)
library(arrow)
library(dplyr)
[New Thread 0x7ffff33ff640 (LWP 480350)]
[New Thread 0x7fffe99ff640 (LWP 480356)]
Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.

Attaching package: ‘arrow’

The following object is masked from ‘package:utils’:

    timestamp

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

> all_files <- dir_ls("/data/nyc-taxi", recurse=TRUE)
parquet_files <- all_files[endsWith(all_files, "parquet")]
> parquet_files[86]
/data/nyc-taxi/year=2016/month=10/part-0.parquet
> ds <- open_dataset(parquet_files[86]) %>% head(6) %>% collect()
[New Thread 0x7fffe9007640 (LWP 480358)]
[New Thread 0x7fffe8806640 (LWP 480359)]
[New Thread 0x7fffd7b7f640 (LWP 480360)]
[New Thread 0x7fffd6b7f640 (LWP 480361)]
[New Thread 0x7fffd637e640 (LWP 480362)]
[New Thread 0x7fffd5b7d640 (LWP 480363)]
[New Thread 0x7fffd537c640 (LWP 480364)]
[New Thread 0x7fffd4b7b640 (LWP 480365)]
[New Thread 0x7fffcd7ff640 (LWP 480366)]
[New Thread 0x7fffccffe640 (LWP 480367)]
[New Thread 0x7fffb3fff640 (LWP 480368)]
[New Thread 0x7fffb37fe640 (LWP 480369)]
[New Thread 0x7fffb2ffd640 (LWP 480370)]
> nrow(ds)
[1] 6
> parquet_files[87]
/data/nyc-taxi/year=2016/month=11/part-0.parquet
> ds <- open_dataset(parquet_files[87]) %>% head(6) %>% collect()
> 
Thread 13 "R" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffccffe640 (LWP 480367)]
0x00007ffff00fbf38 in arrow::internal::Executor::Submit<parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr<parquet::arrow::(anonymous namespace)::FileReaderImpl>, const std::vector<int>&, const std::vector<int>&, arrow::internal::Executor*)::<lambda(size_t, std::shared_ptr<parquet::arrow::ColumnReaderImpl>)>&, long unsigned int&, std::shared_ptr<parquet::arrow::ColumnReaderImpl> >(arrow::internal::TaskHints, arrow::StopToken, struct {...} &) (this=0x2e1c00000008, hints=..., stop_token=..., func=...) at /home/nic2/arrow/cpp/src/arrow/util/thread_pool.h:159
159     ARROW_RETURN_NOT_OK(SpawnReal(hints, std::move(task), std::move(stop_token),

If I read the file in via read_parquet(), there is no problem and it loads fine. I'm happy to supply the file if necessary, though I wasn't sure whether it's possible or desirable to attach a 150 MB file to an issue ticket.

Component(s)

C++

thisisnic commented 1 year ago

I tried again, and the file read in fine without head(). When I tried again with head() followed by collect(), it read in the data successfully and then segfaulted immediately afterwards.

> open_dataset("/data/nyc-taxi/year=2016/month=11/part-0.parquet") %>% head() %>% collect()
# A tibble: 6 × 22
  vendor_name pickup_datetime     dropoff_datetime    passenger_count
  <chr>       <dttm>              <dttm>                        <int>
1 VTS         2016-11-10 20:14:06 2016-11-10 20:19:37               1
2 VTS         2016-11-10 20:14:06 2016-11-10 20:43:31               1
3 VTS         2016-11-10 20:14:06 2016-11-10 20:17:24               1
4 VTS         2016-11-10 20:14:06 2016-11-10 20:20:12               1
5 CMT         2016-11-10 20:14:07 2016-11-10 20:20:23               1
6 CMT         2016-11-10 20:14:07 2016-11-10 21:13:19               2
# ℹ 18 more variables: trip_distance <dbl>, pickup_longitude <dbl>,
#   pickup_latitude <dbl>, rate_code <chr>, store_and_fwd <chr>,
#   dropoff_longitude <dbl>, dropoff_latitude <dbl>, payment_type <chr>,
#   fare_amount <dbl>, extra <dbl>, mta_tax <dbl>, tip_amount <dbl>,
#   tolls_amount <dbl>, total_amount <dbl>, improvement_surcharge <dbl>,
#   congestion_surcharge <dbl>, pickup_location_id <int>,
#   dropoff_location_id <int>
> 
Thread 11 "R" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffbffff640 (LWP 480578)]
0x0000000000000000 in ?? ()

thisisnic commented 1 year ago

I've now tested this with 13.0.0 and get the same result.

mapleFU commented 1 year ago

Would you mind providing the backtrace, or a way to reproduce this in C++ or Python? I'm not very familiar with R.

thisisnic commented 1 year ago

I'm afraid I have no idea how to do it in C++ or Python. The backtrace is here though:

Thread 8 "R" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffddbfe640 (LWP 490914)]
0x00007fffefafb182 in arrow::internal::Executor::Submit<parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr<parquet::arrow::(anonymous namespace)::FileReaderImpl>, const std::vector<int>&, const std::vector<int>&, arrow::internal::Executor*)::<lambda(size_t, std::shared_ptr<parquet::arrow::ColumnReaderImpl>)>&, long unsigned int&, std::shared_ptr<parquet::arrow::ColumnReaderImpl> >(arrow::internal::TaskHints, arrow::StopToken, struct {...} &) (this=0x2db800000008, hints=..., stop_token=..., func=...) at /home/nic2/arrow/cpp/src/arrow/util/thread_pool.h:159
159     ARROW_RETURN_NOT_OK(SpawnReal(hints, std::move(task), std::move(stop_token),
(gdb) bt
#0  0x00007fffefafb182 in arrow::internal::Executor::Submit<parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr<parquet::arrow::(anonymous namespace)::FileReaderImpl>, const std::vector<int>&, const std::vector<int>&, arrow::internal::Executor*)::<lambda(size_t, std::shared_ptr<parquet::arrow::ColumnReaderImpl>)>&, long unsigned int&, std::shared_ptr<parquet::arrow::ColumnReaderImpl> >(arrow::internal::TaskHints, arrow::StopToken, struct {...} &) (this=0x2db800000008, hints=..., stop_token=..., func=...) at /home/nic2/arrow/cpp/src/arrow/util/thread_pool.h:159
#1  0x00007fffefaf9ec9 in arrow::internal::Executor::Submit<parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr<parquet::arrow::(anonymous namespace)::FileReaderImpl>, const std::vector<int>&, const std::vector<int>&, arrow::internal::Executor*)::<lambda(size_t, std::shared_ptr<parquet::arrow::ColumnReaderImpl>)>&, long unsigned int&, std::shared_ptr<parquet::arrow::ColumnReaderImpl> >(struct {...} &) (
    this=0x2db800000008, func=...) at /home/nic2/arrow/cpp/src/arrow/util/thread_pool.h:186
#2  0x00007fffefaf8b9b in arrow::internal::ParallelForAsync<parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr<parquet::arrow::(anonymous namespace)::FileReaderImpl>, const std::vector<int>&, const std::vector<int>&, arrow::internal::Executor*)::<lambda(size_t, std::shared_ptr<parquet::arrow::ColumnReaderImpl>)>&, std::shared_ptr<parquet::arrow::ColumnReaderImpl> >(std::vector<std::shared_ptr<parquet::arrow::ColumnReaderImpl>, std::allocator<std::shared_ptr<parquet::arrow::ColumnReaderImpl> > >, struct {...} &, arrow::internal::Executor *) (inputs=std::vector of length 22, capacity 22 = {...}, func=..., executor=0x2db800000008)
    at /home/nic2/arrow/cpp/src/arrow/util/parallel.h:56
#3  0x00007fffefaf6c82 in arrow::internal::OptionalParallelForAsync<parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr<parquet::arrow::(anonymous namespace)::FileReaderImpl>, const std::vector<int>&, const std::vector<int>&, arrow::internal::Executor*)::<lambda(size_t, std::shared_ptr<parquet::arrow::ColumnReaderImpl>)>&, std::shared_ptr<parquet::arrow::ColumnReaderImpl> >(bool, std::vector<std::shared_ptr<parquet::arrow::ColumnReaderImpl>, std::allocator<std::shared_ptr<parquet::arrow::ColumnReaderImpl> > >, struct {...} &, arrow::internal::Executor *) (use_threads=true, inputs=std::vector of length 0, capacity 0, func=..., 
    executor=0x2db800000008) at /home/nic2/arrow/cpp/src/arrow/util/parallel.h:91
#4  0x00007fffefaf35fd in parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups (this=0x7fffd4cf3330, 
    self=std::shared_ptr<parquet::arrow::(anonymous namespace)::FileReaderImpl> (use count 48, weak count 0) = {...}, row_groups=std::vector of length 1, capacity 1 = {...}, 
    column_indices=std::vector of length 22, capacity 22 = {...}, cpu_executor=0x2db800000008) at /home/nic2/arrow/cpp/src/parquet/arrow/reader.cc:1280
#5  0x00007fffefaf1fbd in parquet::arrow::RowGroupGenerator::ReadOneRowGroup (cpu_executor=0x2db800000008, self=std::shared_ptr<parquet::arrow::(anonymous namespace)::FileReaderImpl> (use count 48, weak count 0) = {...}, 
    row_group=354, column_indices=std::vector of length 22, capacity 22 = {...}) at /home/nic2/arrow/cpp/src/parquet/arrow/reader.cc:1166
#6  0x00007fffefb06978 in parquet::arrow::RowGroupGenerator::FetchNext()::{lambda()#1}::operator()() const (__closure=0x7fffc4153b18) at /home/nic2/arrow/cpp/src/parquet/arrow/reader.cc:1139
#7  0x00007fffefb452e6 in arrow::detail::ContinueFuture::operator()<parquet::arrow::RowGroupGenerator::FetchNext()::{lambda()#1}, , arrow::Future<std::function<arrow::Future<std::shared_ptr<arrow::RecordBatch> > ()> >, arrow::Future<std::function<arrow::Future<std::shared_ptr<arrow::RecordBatch> > ()> > >(arrow::Future<std::function<arrow::Future<std::shared_ptr<arrow::RecordBatch> > ()> >, parquet::arrow::RowGroupGenerator::FetchNext()::{lambda()#1}&&) const (this=0x7fffddbfd71f, next=..., f=...) at /home/nic2/arrow/cpp/src/arrow/util/future.h:178
#8  0x00007fffefb44443 in arrow::detail::ContinueFuture::IgnoringArgsIf<parquet::arrow::RowGroupGenerator::FetchNext()::{lambda()#1}, arrow::Future<std::function<arrow::Future<std::shared_ptr<arrow::RecordBatch> > ()> >, arrow::internal::Empty const&>(std::integral_constant<bool, true>, arrow::Future<std::function<arrow::Future<std::shared_ptr<arrow::RecordBatch> > ()> >&&, parquet::arrow::RowGroupGenerator::FetchNext()::{lambda()#1}&&, arrow::internal::Empty const&) const (this=0x7fffddbfd71f, next=..., f=...) at /home/nic2/arrow/cpp/src/arrow/util/future.h:188
#9  0x00007fffefb4326a in arrow::Future<arrow::internal::Empty>::ThenOnComplete<parquet::arrow::RowGroupGenerator::FetchNext()::{lambda()#1}, arrow::Future<arrow::internal::Empty>::PassthruOnFailure<parquet::arrow::RowGroupGenerator::FetchNext()::{lambda()#1}> >::operator()(arrow::Result<arrow::internal::Empty> const&) && (this=0x7fffc4153b18, result=...) at /home/nic2/arrow/cpp/src/arrow/util/future.h:545
#10 0x00007fffefb41b0b in arrow::Future<arrow::internal::Empty>::WrapResultyOnComplete::Callback<arrow::Future<arrow::internal::Empty>::ThenOnComplete<parquet::arrow::RowGroupGenerator::FetchNext()::{lambda()#1}, arrow::Future<arrow::internal::Empty>::PassthruOnFailure<parquet::arrow::RowGroupGenerator::FetchNext()::{lambda()#1}> > >::operator()(arrow::FutureImpl const&) && (this=0x7fffc4153b18, impl=...)
    at /home/nic2/arrow/cpp/src/arrow/util/future.h:442
#11 0x00007fffefb413d1 in arrow::internal::FnOnce<void (arrow::FutureImpl const&)>::FnImpl<arrow::Future<arrow::internal::Empty>::WrapResultyOnComplete::Callback<arrow::Future<arrow::internal::Empty>::ThenOnComplete<parquet::arrow::RowGroupGenerator::FetchNext()::{lambda()#1}, arrow::Future<arrow::internal::Empty>::PassthruOnFailure<parquet::arrow::RowGroupGenerator::FetchNext()::{lambda()#1}> > > >::invoke(arrow::FutureImpl const&) (
    this=0x7fffc4153b10, a#0=...) at /home/nic2/arrow/cpp/src/arrow/util/functional.h:152
#12 0x00007fffec14af78 in arrow::internal::FnOnce<void (arrow::FutureImpl const&)>::operator()(arrow::FutureImpl const&) && (this=0x7fffc4151f40, a#0=...) at /home/nic2/arrow/cpp/src/arrow/util/functional.h:140
#13 0x00007fffec14a5bd in arrow::ConcreteFutureImpl::RunOrScheduleCallback (self=std::shared_ptr<arrow::FutureImpl> (use count 2, weak count 1) = {...}, callback_record=..., in_add_callback=false)
    at /home/nic2/arrow/cpp/src/arrow/util/future.cc:109
#14 0x00007fffec14a916 in arrow::ConcreteFutureImpl::DoMarkFinishedOrFailed (this=0x7fffd4e4d4b0, state=arrow::FutureState::SUCCESS) at /home/nic2/arrow/cpp/src/arrow/util/future.cc:147
#15 0x00007fffec149d39 in arrow::ConcreteFutureImpl::DoMarkFinished (this=0x7fffd4e4d4b0) at /home/nic2/arrow/cpp/src/arrow/util/future.cc:38
#16 0x00007fffec147dd0 in arrow::FutureImpl::MarkFinished (this=0x7fffd4e4d4b0) at /home/nic2/arrow/cpp/src/arrow/util/future.cc:193
#17 0x00007ffff102c506 in arrow::Future<arrow::internal::Empty>::DoMarkFinished (this=0x7fffc4141618, res=...) at /home/nic2/arrow/cpp/src/arrow/util/future.h:658
#18 0x00007ffff101a09c in arrow::Future<arrow::internal::Empty>::MarkFinished<arrow::internal::Empty, void> (this=0x7fffc4141618, s=...) at /home/nic2/arrow/cpp/src/arrow/util/future.h:409
#19 0x00007fffefb15d68 in arrow::internal::Executor::DoTransfer<arrow::internal::Empty, arrow::Future<arrow::internal::Empty>, arrow::Status>(arrow::Future<arrow::internal::Empty>, bool)::{lambda(arrow::Status const&)#1}::operator()(arrow::Status const&) (__closure=0x7fffc4141618, result=...) at /home/nic2/arrow/cpp/src/arrow/util/thread_pool.h:231
#20 0x00007fffefb419dd in arrow::Future<arrow::internal::Empty>::WrapStatusyOnComplete::Callback<arrow::internal::Executor::DoTransfer<arrow::internal::Empty, arrow::Future<arrow::internal::Empty>, arrow::Status>(arrow::Future<arrow::internal::Empty>, bool)::{lambda(arrow::Status const&)#1}>::operator()(arrow::FutureImpl const&) && (this=0x7fffc4141618, impl=...) at /home/nic2/arrow/cpp/src/arrow/util/future.h:455
#21 0x00007fffefb41195 in arrow::internal::FnOnce<void (arrow::FutureImpl const&)>::FnImpl<arrow::Future<arrow::internal::Empty>::WrapStatusyOnComplete::Callback<arrow::internal::Executor::DoTransfer<arrow::internal::Empty, arrow::Future<arrow::internal::Empty>, arrow::Status>(arrow::Future<arrow::internal::Empty>, bool)::{lambda(arrow::Status const&)#1}> >::invoke(arrow::FutureImpl const&) (this=0x7fffc4141610, a#0=...)
    at /home/nic2/arrow/cpp/src/arrow/util/functional.h:152
#22 0x00007fffec14af78 in arrow::internal::FnOnce<void (arrow::FutureImpl const&)>::operator()(arrow::FutureImpl const&) && (this=0x7fffc4158048, a#0=...) at /home/nic2/arrow/cpp/src/arrow/util/functional.h:140
#23 0x00007fffec14a335 in arrow::ConcreteFutureImpl::RunOrScheduleCallback(std::shared_ptr<arrow::FutureImpl> const&, arrow::FutureImpl::CallbackRecord&&, bool)::{lambda()#1}::operator()() (__closure=0x7fffc4158038)
    at /home/nic2/arrow/cpp/src/arrow/util/future.cc:105
#24 0x00007fffec15065e in arrow::internal::FnOnce<void ()>::FnImpl<arrow::ConcreteFutureImpl::RunOrScheduleCallback(std::shared_ptr<arrow::FutureImpl> const&, arrow::FutureImpl::CallbackRecord&&, bool)::{lambda()#1}>::invoke() (this=0x7fffc4158030) at /home/nic2/arrow/cpp/src/arrow/util/functional.h:152

thisisnic commented 1 year ago

The input file is at this URL: "https://voltrondata-labs-datasets.s3.us-east-2.amazonaws.com/nyc-taxi/year=2016/month=11/part-0.parquet"

I tried downloading it again to ensure it wasn't a bad download, but got the same error.

westonpace commented 1 year ago

I suspect this is related to https://github.com/apache/arrow/issues/31486 .

thisisnic commented 1 year ago

> I suspect this is related to #31486 .

Thanks @westonpace! Would you mind expanding a bit? I read that issue but didn't really understand it - any idea what the likely source of the issue is?