Open asfimport opened 4 years ago
Wes McKinney / @wesm: Can you share a foo.pq that exhibits the problem?
Matt Calder: I attached an example of foo.pq. In case it isn't clear from my description of the problem, making the ODBC connection is necessary to trigger the error. Just reading the Parquet file works in both 0.25.3 and 1.0.1. Only after the ODBC connection is made does the read segfault, and only in pandas 1.0.1. I wrote foo.pq using both 0.25.3 and 1.0.1, and in both cases I saw the segfault in 1.0.1 and not in 0.25.3. In other words, I think the problem is in the read, not the write. That said, the files do differ:
xbk@499e30e4f63f:~$ diff foo_101.pq foo_25.pq
Binary files foo_101.pq and foo_25.pq differ
Matt Calder: If it would help, I can build pyarrow with debugging symbols and get a more detailed stack trace.
Wes McKinney / @wesm: That would help. If you can provide a pickle of the offending DataFrame prior to it being written to Parquet, that would also help.
Matt Calder: I added a pickle of the dataframe, as created in 1.0.1.
Matt Calder: I rebuilt pyarrow with debug symbols and now the backtrace has line numbers. I'm only pasting the first 28 levels of the stack below. The last point in arrow code is:
In arrow/cpp/src/parquet/metadata.cc:792
ApplicationVersion::ApplicationVersion(const std::string& created_by) {
regex app_regex{ApplicationVersion::APPLICATION_FORMAT};
Here is the stacktrace:
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007ffff7a24801 in __GI_abort () at abort.c:79
#2 0x00007ffff63c1957 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x00007ffff63c7ab6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007ffff63c7af1 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00007ffff63c7d24 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00007ffff63c6a52 in __cxa_bad_cast () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x00007ffff64131ec in std::__cxx11::collate<char> const& std::use_facet<std::__cxx11::collate<char> >(std::locale const&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8 0x00007fffb7bedd4e in std::__cxx11::regex_traits<char>::transform<char*> (this=0x12c7570, __first=0x10e36c0 "", __last=0x10e36c1 "\203\024\001") at /usr/include/c++/7/bits/regex.h:233
#9 0x00007fffb7beb6b7 in std::__cxx11::regex_traits<char>::transform_primary<char const*> (this=0x12c7570, __first=0x7fffffffacb8 "", __last=0x7fffffffacb9 "") at /usr/include/c++/7/bits/regex.h:266
#10 0x00007fffb7be6c14 in std::__detail::_BracketMatcher<std::__cxx11::regex_traits<char>, false, false>::_M_apply(char, std::integral_constant<bool, false>) const::{lambda()#1}::operator()() const (
__closure=0x7fffffffacb0) at /usr/include/c++/7/bits/regex_compiler.tcc:626
#11 0x00007fffb7be6da7 in std::__detail::_BracketMatcher<std::__cxx11::regex_traits<char>, false, false>::_M_apply (this=0x7fffffffae10, __ch=0 '\000')
at /usr/include/c++/7/bits/regex_compiler.tcc:634
#12 0x00007fffb7be21be in std::__detail::_BracketMatcher<std::__cxx11::regex_traits<char>, false, false>::_M_make_cache (this=0x7fffffffae10) at /usr/include/c++/7/bits/regex_compiler.h:556
#13 0x00007fffb7bddeb5 in std::__detail::_BracketMatcher<std::__cxx11::regex_traits<char>, false, false>::_M_ready (this=0x7fffffffae10) at /usr/include/c++/7/bits/regex_compiler.h:525
#14 0x00007fffb7bda724 in std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_insert_character_class_matcher<false, false> (this=0x7fffffffb250)
at /usr/include/c++/7/bits/regex_compiler.tcc:414
#15 0x00007fffb7bd6687 in std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_atom (this=0x7fffffffb250) at /usr/include/c++/7/bits/regex_compiler.tcc:327
#16 0x00007fffb7bd3775 in std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_term (this=0x7fffffffb250) at /usr/include/c++/7/bits/regex_compiler.tcc:139
#17 0x00007fffb7bd0c36 in std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_alternative (this=0x7fffffffb250) at /usr/include/c++/7/bits/regex_compiler.tcc:121
#18 0x00007fffb7bd0c59 in std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_alternative (this=0x7fffffffb250) at /usr/include/c++/7/bits/regex_compiler.tcc:124
#19 0x00007fffb7bce50e in std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_disjunction (this=0x7fffffffb250) at /usr/include/c++/7/bits/regex_compiler.tcc:97
#20 0x00007fffb7bcc0f9 in std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_Compiler (this=0x7fffffffb250,
__b=0x7fffb7c92c70 "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)", __e=0x7fffb7c92cb8 "", __loc=..., __flags=(unknown: 16))
at /usr/include/c++/7/bits/regex_compiler.tcc:82
#21 0x00007fffb7bc98bc in std::__detail::__compile_nfa<char const*, std::__cxx11::regex_traits<char> > (
__first=0x7fffb7c92c70 "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)", __last=0x7fffb7c92cb8 "", __loc=..., __flags=(unknown: 16))
at /usr/include/c++/7/bits/regex_compiler.h:203
#22 0x00007fffb7bc62e4 in std::__cxx11::basic_regex<char, std::__cxx11::regex_traits<char> >::basic_regex<char const*> (this=0x7fffffffb510,
__first=0x7fffb7c92c70 "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)", __last=0x7fffb7c92cb8 "", __loc=..., __f=(unknown: 16))
at /usr/include/c++/7/bits/regex.h:767
#23 0x00007fffb7bc1abb in std::__cxx11::basic_regex<char, std::__cxx11::regex_traits<char> >::basic_regex<char const*> (this=0x7fffffffb510,
__first=0x7fffb7c92c70 "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)", __last=0x7fffb7c92cb8 "", __f=(unknown: 16)) at /usr/include/c++/7/bits/regex.h:512
#24 0x00007fffb7bbcd66 in std::__cxx11::basic_regex<char, std::__cxx11::regex_traits<char> >::basic_regex (this=0x7fffffffb510,
__p=0x7fffb7c92c70 "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)", __f=(unknown: 16)) at /usr/include/c++/7/bits/regex.h:445
#25 0x00007fffb7bb200f in parquet::ApplicationVersion::ApplicationVersion (this=0x7fffffffb750, created_by="parquet-cpp version 1.5.1-SNAPSHOT") at /repos/arrow/cpp/src/parquet/metadata.cc:792
#26 0x00007fffb7bb6be0 in parquet::FileMetaData::FileMetaDataImpl::FileMetaDataImpl (this=0xec6df0, metadata=0x7fffbd63d120, metadata_len=0x7fffffffbb04,
decryptor=std::shared_ptr<parquet::Decryptor> (empty) = {...}) at /repos/arrow/cpp/src/parquet/metadata.cc:462
#27 0x00007fffb7bb1449 in parquet::FileMetaData::FileMetaData (this=0xed8c40, metadata=0x7fffbd63d120, metadata_len=0x7fffffffbb04, decryptor=std::shared_ptr<parquet::Decryptor> (empty) = {...})
at /repos/arrow/cpp/src/parquet/metadata.cc:651
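For readers following the trace: the pattern being compiled in frame #20 can be exercised outside Arrow to see what it is meant to extract from the `created_by` string in frame #25. Below is a minimal sketch in Python, assuming Python's `re` handles this particular pattern the same way as the C++ regex and that a whole-string match is intended. The pattern is copied from the trace with the C string escaping undone; the names `APPLICATION_FORMAT_PATTERN` and `parse_created_by` are hypothetical. It does not reproduce the crash, which the trace shows happening while the regex is being compiled (a `bad_cast` from `std::use_facet`), before any matching takes place.

```python
import re

# Pattern copied from frame #20 of the trace, with the C string escaping undone.
# Assumption: Python's `re` treats it the same way the C++ regex would.
APPLICATION_FORMAT_PATTERN = (
    r"(.*?)\s*(?:(version\s*(?:([^(]*?)\s*(?:\(\s*build\s*([^)]*?)\s*\))?)?)?)"
)


def parse_created_by(created_by):
    """Split a created_by string into (application, version, build).

    Hypothetical helper for illustration only; it is not Arrow's API.
    Fields that are absent from the string come back as None.
    """
    match = re.fullmatch(APPLICATION_FORMAT_PATTERN, created_by)
    if match is None:
        return None
    # Group 1 is the application name, group 3 the version, group 4 the build.
    return match.group(1), match.group(3), match.group(4)


# The created_by value shown in frame #25 of the trace.
print(parse_created_by("parquet-cpp version 1.5.1-SNAPSHOT"))
# -> ('parquet-cpp', '1.5.1-SNAPSHOT', None)
```

The pattern itself is unremarkable, which fits the observation that the failure is in locale facet lookup rather than in anything specific to this file's metadata.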
Wes McKinney / @wesm: Certainly strange, thanks. cc @pitrou @fsaintjacques
Antoine Pitrou / @pitrou: How did you compile pyodbc? I'm surprised about the following output (from GH bug you linked to):
linux-vdso.so.1 (0x00007ffe02bee000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f246c269000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f246c061000)
libltdl.so.7 => /usr/lib/x86_64-linux-gnu/libltdl.so.7 (0x00007f246be57000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f246ba66000)
/lib64/ld-linux-x86-64.so.2 (0x00007f246cd89000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f246b862000)
Antoine Pitrou / @pitrou: Here I'm getting the following output:
$ ldd venv-3.7/lib/python3.7/site-packages/pyodbc.cpython-37m-x86_64-linux-gnu.so
linux-vdso.so.1 (0x00007ffc0001f000)
libodbc.so.2 => /usr/lib/x86_64-linux-gnu/libodbc.so.2 (0x00007efbff044000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007efbfecbb000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007efbfe91d000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007efbfe705000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007efbfe314000)
libltdl.so.7 => /usr/lib/x86_64-linux-gnu/libltdl.so.7 (0x00007efbfe10a000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007efbfdeeb000)
/lib64/ld-linux-x86-64.so.2 (0x00007efbff4d7000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007efbfdce7000)
Matt Calder: @pitrou I install pyodbc using pip. Those libraries are the ones used specifically by the clickhouse-odbc driver. Btw, I looked through the clickhouse-odbc driver source and it never uses regex. The link between opening the ODBC connection and the subsequent segfault in the regex library really is odd.
Antoine Pitrou / @pitrou: [~mvcalder] OK, and how did you install Arrow?
Francois Saint-Jacques / @fsaintjacques: Feels like the libstdc++ link mismatch we had with TensorFlow.
Wes McKinney / @wesm: Seems so. Perhaps clickhouse-odbc has some statically linked libstdc++ stuff that is causing a problem?
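One quick way to compare the two `ldd` listings quoted earlier in this thread is to check which of them pulls in libstdc++ dynamically. The sketch below is only an illustration: the helper names `linked_libraries` and `links_libstdcxx` are hypothetical, and the two strings are the listings pasted above. A missing libstdc++ entry is consistent with the static-linking guess but does not prove it.

```python
# ldd output quoted from the GitHub issue (identified in this thread as the
# libraries used by the clickhouse-odbc driver).
DRIVER_LDD = """\
linux-vdso.so.1 (0x00007ffe02bee000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f246c269000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f246c061000)
libltdl.so.7 => /usr/lib/x86_64-linux-gnu/libltdl.so.7 (0x00007f246be57000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f246ba66000)
/lib64/ld-linux-x86-64.so.2 (0x00007f246cd89000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f246b862000)
"""

# ldd output for the pip-installed pyodbc extension module, as posted above.
PYODBC_LDD = """\
linux-vdso.so.1 (0x00007ffc0001f000)
libodbc.so.2 => /usr/lib/x86_64-linux-gnu/libodbc.so.2 (0x00007efbff044000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007efbfecbb000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007efbfe91d000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007efbfe705000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007efbfe314000)
libltdl.so.7 => /usr/lib/x86_64-linux-gnu/libltdl.so.7 (0x00007efbfe10a000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007efbfdeeb000)
/lib64/ld-linux-x86-64.so.2 (0x00007efbff4d7000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007efbfdce7000)
"""


def linked_libraries(ldd_output):
    """Return the set of library file names listed in `ldd` output."""
    names = set()
    for line in ldd_output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Entries look like "name => path (addr)" or "path (addr)".
        name = line.split("=>")[0].split("(")[0].strip()
        # Keep only the file name when the entry is a full path.
        names.add(name.rsplit("/", 1)[-1])
    return names


def links_libstdcxx(ldd_output):
    """True if the listing contains a dynamically linked libstdc++."""
    return any(n.startswith("libstdc++.so") for n in linked_libraries(ldd_output))


print(links_libstdcxx(DRIVER_LDD))  # False
print(links_libstdcxx(PYODBC_LDD))  # True
```

The driver's listing has no libstdc++ entry even though it is C++ code, while the pyodbc module links it dynamically.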
Matt Calder: I build Arrow from source and install pyarrow as part of that. I also build clickhouse-odbc from source. Possibly relevant: they are built on different Docker images. I'll try isolating the build process to a single container with minimal dependencies.
Antoine Pitrou / @pitrou: If you build Arrow and clickhouse-odbc from source, you may also want to build pyodbc from source.
Wes McKinney / @wesm: [~mvcalder] Were you able to resolve this?
Matt Calder: No, we have so far kept pandas at version 0.25.3. We're transitioning away from the odbc driver and to our own in-house version so the issue may be moot for us.
I posted this issue to the pandas GitHub: https://github.com/pandas-dev/pandas/issues/31981
We get a segfault when making a call to pd.read_parquet after having made a connection to clickhouse via odbc. Like so,
This happens with pandas version 1.0.1 but not with pandas 0.25.3. Here's a stacktrace:
Environment: Ubuntu 18.04
Reporter: Matt Calder
Original Issue Attachments:
Externally tracked issue: https://github.com/pandas-dev/pandas/issues/31981
Note: This issue was originally created as ARROW-7873. Please see the migration documentation for further details.