Open ex791857 opened 2 years ago
I also tried this example and got the same result as above: a segmentation fault when calling parquet::arrow::OpenFile
std::shared_ptr<arrow::io::ReadableFile> infile;
PARQUET_ASSIGN_OR_THROW(infile,
arrow::io::ReadableFile::Open("/root/path/to/file.parquet",
arrow::default_memory_pool()));
std::unique_ptr<parquet::arrow::FileReader> reader;
PARQUET_THROW_NOT_OK(parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));
std::shared_ptr<arrow::Table> table;
PARQUET_THROW_NOT_OK(reader->ReadTable(&table));
I looked into the Arrow source code and traced the crash to
int Schema::num_fields() const { return static_cast<int>(impl_->fields_.size()); }
Maybe impl_ is a nullptr, which would cause the segmentation fault. But impl_ should be initialized when the Schema is constructed.
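To make the hypothesis above concrete, here is a toy model of a class that keeps its state behind an `impl_` pointer. It is only a sketch: `ToySchema`, `has_impl()` and the `-1` sentinel are hypothetical names invented for illustration, not Arrow's actual `Schema`. It shows that a normally constructed object always has a valid `impl_`, so a null `impl_` has to come from somewhere else (here, a moved-from object).

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a pimpl-style class. NOT Arrow's real Schema.
class ToySchema {
 public:
  // The constructor always allocates the implementation object.
  explicit ToySchema(std::vector<std::string> fields)
      : impl_(new Impl{std::move(fields)}) {}

  // Moving transfers ownership; the moved-from impl_ becomes nullptr.
  ToySchema(ToySchema&&) = default;

  bool has_impl() const { return impl_ != nullptr; }

  // Unlike a bare impl_->fields_.size(), this guards against a null impl_
  // and returns -1 (a hypothetical sentinel) instead of crashing.
  int num_fields() const {
    return impl_ ? static_cast<int>(impl_->fields_.size()) : -1;
  }

 private:
  struct Impl {
    std::vector<std::string> fields_;
  };
  std::unique_ptr<Impl> impl_;
};
```

In this toy model, an unguarded `impl_->fields_.size()` on the moved-from object would dereference a null pointer, which is the kind of crash suspected above.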
It turns out std::make_shared<arrow::Schema> is responsible for the segmentation fault. After I commented out the following code, the program ran normally.
auto expected_schema = std::make_shared<arrow::Schema>(schema_vector);
if (!expected_schema->Equals(*table->schema()))
{
// The table doesn't have the expected schema thus we cannot directly
// convert it to our target representation.
spdlog::error("Schemas in {} are not matching order!", parFile);
exit(-1);
}
But I still have a question: how could a statement that never runs affect this library's memory?
This is example code from the official docs: https://arrow.apache.org/docs/cpp/examples/row_columnar_conversion.html
Is there any chance you could share your complete application and a sample parquet file so that we could run it and try it out for ourselves?
@westonpace Of course. I have extracted a minimal example of the segmentation fault into this repo: https://github.com/ex791857/arrow-report You can follow the README to run it. Feel free to try.
Thank you so much for the detailed reproducer. Unfortunately, I have tried a few different variations (both just uncommenting the lines and actually running causeSegmenetationFault) and I have not been able to reproduce the issue. I am testing against the 9.0.0 release.
Do you know what version you are using?
Based on the strangeness of the error (e.g. triggered by code that is not executed) I wonder if the problem might be with memory corruption elsewhere or possibly by compiling and linking against different versions of the Arrow library.
Do you still get a segmentation fault when you run the example you shared?
Can you explain a bit more how you are compiling and linking Arrow (in particular, are you sure the header files you are compiling with come from the exact same version as the library you are linking with?)
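One way to check this from inside the program is to compare the version baked into the headers at compile time with the version the linked library reports at run time. The following is a minimal sketch, assuming the `ARROW_VERSION_STRING` macro and `arrow::GetBuildInfo()` made available by `arrow/config.h`. The helpers `SameVersion`, `DescribeVersions`, `ArrowHeadersMatchLibrary` and the `CHECK_WITH_ARROW` compile switch are hypothetical names introduced here. The Arrow-specific part is only compiled when `CHECK_WITH_ARROW` is defined and the program is linked against Arrow.

```cpp
#include <iostream>
#include <string>

#ifdef CHECK_WITH_ARROW
#include <arrow/config.h>  // arrow::GetBuildInfo(), ARROW_VERSION_STRING
#endif

// Hypothetical helper: true when the header and library versions agree.
bool SameVersion(const std::string& header_version,
                 const std::string& library_version) {
  return header_version == library_version;
}

// Hypothetical helper: one-line report suitable for a log or bug report.
std::string DescribeVersions(const std::string& header_version,
                             const std::string& library_version) {
  const bool same = SameVersion(header_version, library_version);
  return "headers=" + header_version + " library=" + library_version +
         (same ? " (match)" : " (MISMATCH)");
}

#ifdef CHECK_WITH_ARROW
// Compares the version seen by the compiler (macro from the headers) with
// the version reported by the shared library actually loaded at run time.
bool ArrowHeadersMatchLibrary() {
  const std::string header_version = ARROW_VERSION_STRING;
  const std::string library_version = arrow::GetBuildInfo().version_string;
  std::cerr << DescribeVersions(header_version, library_version) << std::endl;
  return SameVersion(header_version, library_version);
}
#endif
```

Calling `ArrowHeadersMatchLibrary()` at the start of `main` would print both versions. Note that matching version strings only rule out one kind of mismatch; they say nothing about the compiler or flags the library was built with.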
I still get a segfault when I run the example. I compiled the example using the CMakeLists.txt provided in the repo. There is only one version of Arrow in my env.
Here is the cmake find_package debug info:
Checking file [/usr/lib64/cmake/arrow/ArrowConfig.cmake]
Checking file [/usr/lib64/cmake/arrow/ParquetConfig.cmake]
I checked the version info in the above .cmake files; both are 9.0.0.
Here is my package info:
[root@3df06b313282 mdl]# yum list installed | grep arrow
apache-arrow-release.noarch 9.0.0-1.el7 installed
arrow-dataset-devel.x86_64 9.0.0-1.el7 @apache-arrow-centos
arrow-dataset-glib-devel.x86_64 9.0.0-1.el7 @apache-arrow-centos
arrow-devel.x86_64 9.0.0-1.el7 @apache-arrow-centos
arrow-glib-devel.x86_64 9.0.0-1.el7 @apache-arrow-centos
arrow9-dataset-glib-libs.x86_64 9.0.0-1.el7 @apache-arrow-centos
arrow9-dataset-libs.x86_64 9.0.0-1.el7 @apache-arrow-centos
arrow9-glib-libs.x86_64 9.0.0-1.el7 @apache-arrow-centos
arrow9-libs.x86_64 9.0.0-1.el7 @apache-arrow-centos
parquet-devel.x86_64 9.0.0-1.el7 @apache-arrow-centos
parquet-glib-devel.x86_64 9.0.0-1.el7 @apache-arrow-centos
parquet9-glib-libs.x86_64 9.0.0-1.el7 @apache-arrow-centos
parquet9-libs.x86_64 9.0.0-1.el7 @apache-arrow-centos
@ex791857 Can you check that you don't have another Arrow install somewhere? locate arrow/api.h might help locate it.
@pitrou There is only one arrow/api.h in my env.
# locate arrow/api.h
/usr/include/arrow/api.h
I had pyarrow in my env, but after pip uninstall pyarrow, the bug still exists.
Maybe I can provide you with a Docker image later if I can reproduce this bug in it.
Hi, I am new to Arrow and Parquet. I installed Arrow 9.0.0 and Parquet following the guide in this repo.
When I tried to open the Parquet file using the following C++ code (the same as the example code in the docs), I got the segmentation fault shown below when calling parquet::arrow::OpenFile
I inspected the stack trace in gdb:
But I can open and read the same file using pandas in Python.
What did I miss? I really appreciate any help you can provide.