apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[cpp][parquet] Segmentation fault when open parquet file #13801

Open ex791857 opened 2 years ago

ex791857 commented 2 years ago

Hi, I am new to Arrow and Parquet. I installed Arrow 9.0.0 and Parquet following the guide in this repo.

When I tried to open a Parquet file using the following C++ code (the same as the example code in the docs), I got the segmentation fault shown below when calling parquet::arrow::OpenFile:

    arrow::Status st;
    arrow::MemoryPool *pool = arrow::default_memory_pool();
    arrow::fs::LocalFileSystem file_system;
    std::shared_ptr<arrow::io::RandomAccessFile> input =
        file_system.OpenInputFile("/root/path/to/file.parquet").ValueOrDie();

    // Open Parquet file reader
    std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
    st = parquet::arrow::OpenFile(input, pool, &arrow_reader);

Here is the backtrace from gdb:

(gdb) bt
#0  0x00007ffff5ed1f74 in arrow::Schema::num_fields() const () from /lib64/libarrow.so.800
#1  0x00007ffff7788a41 in parquet::arrow::SchemaManifest::Make(parquet::SchemaDescriptor const*, std::shared_ptr<arrow::KeyValueMetadata const> const&, parquet::ArrowReaderProperties const&, parquet::arrow::SchemaManifest*) () from /lib64/libparquet.so.800
#2  0x00007ffff7755395 in parquet::arrow::FileReader::Make(arrow::MemoryPool*, std::unique_ptr<parquet::ParquetFileReader, std::default_delete<parquet::ParquetFileReader> >, parquet::ArrowReaderProperties const&, std::unique_ptr<parquet::arrow::FileReader, std::default_delete<parquet::arrow::FileReader> >*) () from /lib64/libparquet.so.800
#3  0x00007ffff7755601 in parquet::arrow::FileReaderBuilder::Build(std::unique_ptr<parquet::arrow::FileReader, std::default_delete<parquet::arrow::FileReader> >*) () from /lib64/libparquet.so.800
#4  0x00007ffff77563b9 in parquet::arrow::OpenFile(std::shared_ptr<arrow::io::RandomAccessFile>, arrow::MemoryPool*, std::unique_ptr<parquet::arrow::FileReader, std::default_delete<parquet::arrow::FileReader> >*) () from /lib64/libparquet.so.800
#5  0x000000000040a4b6 in main (argc=1, argv=0x7fffffffdba8) at /root/mini-sim/mdl/src/main.cpp:65
(gdb)

But I can open and read the same file using pandas in Python:

pandas.read_parquet("/root/path/to/file.parquet")

What did I miss? I really appreciate any help you can provide.

ex791857 commented 2 years ago

I also tried this example and got the same result as above: a segmentation fault when calling parquet::arrow::OpenFile.

    std::shared_ptr<arrow::io::ReadableFile> infile;
    PARQUET_ASSIGN_OR_THROW(infile,
                            arrow::io::ReadableFile::Open("/root/path/to/file.parquet",
                                                          arrow::default_memory_pool()));

    std::unique_ptr<parquet::arrow::FileReader> reader;
    PARQUET_THROW_NOT_OK(parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));
    std::shared_ptr<arrow::Table> table;
    PARQUET_THROW_NOT_OK(reader->ReadTable(&table));

ex791857 commented 2 years ago

I looked into the Arrow source code and traced the crash to

int Schema::num_fields() const { return static_cast<int>(impl_->fields_.size()); }

Maybe impl_ is a null pointer, which would cause the segmentation fault. But impl_ should be initialized when the Schema is constructed.

ex791857 commented 2 years ago

It turns out std::make_shared<arrow::Schema> is responsible for the segmentation fault. After I commented out the following code, the program ran normally.

    auto expected_schema = std::make_shared<arrow::Schema>(schema_vector);

    if (!expected_schema->Equals(*table->schema()))
    {
        // The table doesn't have the expected schema thus we cannot directly
        // convert it to our target representation.
        spdlog::error("Schemas in {} are not matching order!", parFile);
        exit(-1);
    }

But I still have a question: how can a statement that never runs affect this library's memory?

This is an example code from official docs https://arrow.apache.org/docs/cpp/examples/row_columnar_conversion.html

westonpace commented 2 years ago

Is there any chance you could share your complete application and a sample parquet file so that we could run it and try it out for ourselves?

ex791857 commented 2 years ago

@westonpace Of course. I have extracted a minimal example of the segmentation fault in this repo: https://github.com/ex791857/arrow-report You can follow the README to run this example. Feel free to try.

westonpace commented 2 years ago

Thank you so much for the detailed reproducer. Unfortunately, I have tried a few different variations (both just uncommenting the lines and actually running causeSegmenetationFault) and I have not been able to reproduce the issue. I am testing against the 9.0.0 release.

Do you know what version you are using?

Based on the strangeness of the error (e.g. triggered by code that is not executed) I wonder if the problem might be with memory corruption elsewhere or possibly by compiling and linking against different versions of the Arrow library.

Do you still get a segmentation fault when you run the example you shared?

Can you explain a bit more how you are compiling and linking Arrow (in particular, are you sure the header files you are compiling with come from the exact same version as the library you are linking with?)

ex791857 commented 2 years ago

I still get a segfault when I run the example. I compiled the example using the CMakeLists.txt provided in the repo. There is only one version of Arrow in my environment.

Here is the CMake find_package debug info:

Checking file [/usr/lib64/cmake/arrow/ArrowConfig.cmake]
Checking file [/usr/lib64/cmake/arrow/ParquetConfig.cmake]

I checked the version info in the above .cmake files; they are both 9.0.0.

Here is my package info:

[root@3df06b313282 mdl]# yum list installed | grep arrow
apache-arrow-release.noarch         9.0.0-1.el7             installed           
arrow-dataset-devel.x86_64          9.0.0-1.el7             @apache-arrow-centos
arrow-dataset-glib-devel.x86_64     9.0.0-1.el7             @apache-arrow-centos
arrow-devel.x86_64                  9.0.0-1.el7             @apache-arrow-centos
arrow-glib-devel.x86_64             9.0.0-1.el7             @apache-arrow-centos
arrow9-dataset-glib-libs.x86_64     9.0.0-1.el7             @apache-arrow-centos
arrow9-dataset-libs.x86_64          9.0.0-1.el7             @apache-arrow-centos
arrow9-glib-libs.x86_64             9.0.0-1.el7             @apache-arrow-centos
arrow9-libs.x86_64                  9.0.0-1.el7             @apache-arrow-centos
parquet-devel.x86_64                9.0.0-1.el7             @apache-arrow-centos
parquet-glib-devel.x86_64           9.0.0-1.el7             @apache-arrow-centos
parquet9-glib-libs.x86_64           9.0.0-1.el7             @apache-arrow-centos
parquet9-libs.x86_64                9.0.0-1.el7             @apache-arrow-centos
pitrou commented 2 years ago

@ex791857 Can you check that you don't have another Arrow install somewhere? locate arrow/api.h might help locate it.

ex791857 commented 2 years ago

@pitrou There is only one arrow/api.h in my env.

# locate arrow/api.h
/usr/include/arrow/api.h

I had pyarrow in my environment, but after pip uninstall pyarrow the bug still exists. Maybe I can provide you with a Docker image later if I can reproduce this bug in it.
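For anyone hitting the same symptom: beyond locate arrow/api.h, a few other commands can reveal a header/library mismatch. This is a sketch assuming an RPM-based system like the one above; ./your_binary is a placeholder for the compiled example.

```shell
# 1. Where do Arrow headers live? More than one hit means multiple installs.
find /usr/include /usr/local/include -name api.h -path '*arrow*' 2>/dev/null

# 2. Which Arrow/Parquet shared libraries exist on the system?
find /usr/lib64 /usr/local/lib \( -name 'libarrow*' -o -name 'libparquet*' \) 2>/dev/null

# 3. Which libraries does the binary actually load at runtime?
ldd ./your_binary | grep -E 'arrow|parquet'

# 4. Which package versions are installed (RPM-based systems)?
rpm -qa | grep -iE 'arrow|parquet'
```

Step 3 is worth comparing carefully against the devel headers: the gdb backtrace earlier in this thread loads /lib64/libarrow.so.800 and /lib64/libparquet.so.800 (version-8 sonames), while the installed -devel packages report 9.0.0.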