Open IceS2 opened 6 years ago
Can you provide us with a backtrace related to the segfault?
On Linux you can get it with:
ulimit -c unlimited
<run python code>
gdb python core
At the resulting gdb prompt, enter bt full and paste the output here (please make sure it does not contain credentials).
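If a core file turns out to be hard to get, a Python-level traceback can still narrow things down. A minimal sketch, assuming Python 3.3 or newer, where faulthandler is part of the standard library (on Python 2 it is not in the standard library). The function name and the log file name are made up for illustration:

```python
import faulthandler


def enable_crash_traceback(log_path="crash_traceback.log"):
    """Dump the Python traceback of all threads to log_path on a fatal signal.

    Hypothetical helper: call it at the top of the failing script, before
    the code that crashes. The returned file object must stay open for as
    long as the handler is enabled.
    """
    log_file = open(log_path, "w")
    # SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL are covered by enable().
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

This only shows which Python line was executing when the process died. It does not show the C++ frames, so it complements the gdb backtrace rather than replacing it.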
It seems I can't Oo... Any idea why?
$ ulimit -c unlimited
$ python test_turbodbc_pyarrow.py
[1] 26933 segmentation fault (core dumped) python test_turbodbc_pyarrow.py
$ gdb python core
GNU gdb (GDB) 8.0.1
Copyright © 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...(no debugging symbols found)...done.
/home/pablo/workspace/scratch/core: No such file or directory.
(gdb) bt full
No stack.
(gdb)
Hello @IceS2! Thanks for reporting! You did well :-).
I have a hunch that prefer_unicode=True in combination with fetchallarrow() is the culprit here, as I fear that this code path is not properly implemented yet. Even though prefer_unicode=True is the recommended setting for MSSQL, please check whether the segmentation fault disappears if this option is set to False.
As a workaround, you could use fetchallnumpy() instead of fetchallarrow(). Performance is comparable, and fetchallnumpy() has full support for prefer_unicode=True.
@IceS2 it could also be that your core is named core.26933 (taken from the message 26933 segmentation fault (core dumped)). Whether the numbered suffix is used depends a bit on your distribution.
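Where the core ends up is controlled by the kernel's core_pattern setting, so it can be worth checking that before hunting for the file. The following is a rough sketch based on the rules in the Linux core(5) man page. The function names are made up, and only the %p, %e and %% specifiers are handled; any others are left untouched:

```python
from pathlib import Path


def expected_core_name(pattern, pid, exe, uses_pid=False):
    """Return the core file name the kernel would use, or None if piped."""
    pattern = pattern.strip()
    if pattern.startswith("|"):
        # The dump is handed to a helper program (for example a
        # crash-collection service), so no core file appears in the
        # working directory.
        return None
    name = (
        pattern.replace("%%", "\0")  # protect literal percent signs
        .replace("%p", str(pid))
        .replace("%e", exe)
        .replace("\0", "%")
    )
    # With core_uses_pid set and no %p in the pattern, ".PID" is appended.
    if uses_pid and "%p" not in pattern:
        name += ".%d" % pid
    return name


def current_core_name(pid, exe):
    """Apply the rules above to this machine's settings (Linux only)."""
    pattern = Path("/proc/sys/kernel/core_pattern").read_text()
    uses_pid = Path("/proc/sys/kernel/core_uses_pid").read_text().strip() != "0"
    return expected_core_name(pattern, pid, exe, uses_pid)
```

For the crash above, current_core_name(26933, "python") would give the name to pass to gdb. A result of None means the dump was piped to a helper and has to be retrieved through that helper instead.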
@MathMagique, @xhochy, sorry for the delayed answer. I wasn't near my computer this past weekend!
So, I've run the code again with prefer_unicode=False and the result was the same: [1] 23037 segmentation fault (core dumped), without any backtrace.
It seems to work with cursor.fetchallnumpy(). I was testing turbodbc because I'm experimenting with pyarrow and I need to do some batch extractions from a database. turbodbc into an arrow table would be awesome!
My fallback plan is to work with SqlAlchemy and Pandas. I'm not sure how to transform the OrderedDict from cursor.fetchallnumpy() into a pyarrow table.
What version of FreeTDS and unixODBC are you using? Can you test using the Microsoft ODBC driver for Linux instead of FreeTDS? See: https://docs.microsoft.com/en-us/sql/connect/odbc/linux-mac/installing-the-microsoft-odbc-driver-for-sql-server
Hey @dirkjonker, I've just tested using the Microsoft ODBC driver you mentioned. The result was the same: [1] 3542 segmentation fault (core dumped)
The versions of the packages you asked about are:
extra/unixodbc 2.3.4-2
extra/freetds 1.00.44-1
local/msodbcsql 13.1.9.1-1
That's too bad; sometimes switching the driver resolves this type of problem.
What types of columns are in the table you are selecting from?
@IceS2 are you on Fedora 24+? There we have a known problem with pyarrow in combination with turbodbc.
It can be fixed by also building pyarrow from source, which is not totally simple: https://arrow.apache.org/docs/python/development.html#developing-on-linux-and-macos or we could continue to work on providing manylinux1 wheels for turbodbc: https://github.com/blue-yonder/turbodbc/pull/108
Alternatively, using a conda-based installation instead of a pip-based one will work.
@xhochy, I'm actually running Arch Linux! Do you think it'd be fixed as well by building pyarrow from source? I could try that as soon as I get some "me time"
@IceS2 It could be a possible fix. I guess the Fedora problem is due to Turbodbc being compiled with a different C++ ABI than the pyarrow wheel. Rebuilding both with the same ABI should fix the problems.
Hey @xhochy, Sorry for the late answer. I had to work on other stuff first.
I'm back at turbodbc, but after I upgraded pyarrow to 0.8.0, I was getting an error with turbodbc saying I didn't have the pyarrow support installed. So I uninstalled turbodbc and tried to install it back with pip, but I'm getting error: command 'gcc' failed with exit status 1
Can you help me out? Thanks!
@IceS2 Hi again! Have you tried using more recent versions of turbodbc/pyarrow in the meantime? Does this fix things?
Same error, on the same line (the last one):
from turbodbc import connect
import pyarrow
connection = connect(dsn='mysql_DNS_ANSI')
cursor = connection.cursor()
cursor.execute('SELECT col1 from test01;')
table = cursor.fetchallarrow()
Changing the last line to print cursor.fetchall() returns:
[[1L], [2L], [3L], [4L], [5L]]
Can be reproduced with this command:
docker run -it albertozgz/turbodbc_extrator:debian9 bash
(You only need to connect this Docker container to your database; I use MySQL 8.0.)
TIP1: table = cursor.fetchallnumpy() works fine
TIP2: tested the ANSI and UNICODE drivers
TIP3: tested fetchallarrow(adaptive_integers=True/False)
TIP4:
batches = cursor.fetcharrowbatches()
for batch in batches:
    print(batch)
segmentation fault (core dumped)
@xhochy Would you have the time to look at @albertoRamon 's reproducing example, please?
This is the same problem as above. Debian 9 builds by default with a different C++ ABI than the one the pyarrow wheels are built with. As long as we don't ship turbodbc manylinux1 wheels, these segfaults will persist.
Would it work to switch to the conda environment with our "blessed" builds?
Yes, using pyarrow and turbodbc both from conda-forge will work. They are both built in the same consistent environment.
@albertoRamon Could you try using the turbodbc conda package, please? https://anaconda.org/conda-forge/turbodbc
Yes, of course.
I can run any test you would like me to try. Or is the solution not to use Debian 9? (I tried Alpine 3.8 and Debian 10 and it did not work.)
Anything too modern will not work because the precompiled pyarrow wheel uses a "classic" version of the ABIs, while pip install turbodbc will compile stuff with the latest and greatest ABIs. Conda packages for turbodbc and pyarrow are built with consistent settings, and should work on any modern system.
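A rough way to see which side of this a given compiled module falls on, assuming the mismatch in question is libstdc++'s dual ABI (GCC 5 and later mangle the new std::string and std::list into symbol names containing cxx11): scan the binary for that marker. This is a heuristic sketch, not a turbodbc or pyarrow tool. The function name is made up, and a library built with the new ABI that never touches those types will still show no marker:

```python
def mentions_cxx11_abi(path, chunk_size=1 << 20):
    """Return True if the file at path contains the byte string b"cxx11".

    Symbols that use the new libstdc++ std::string or std::list carry
    "cxx11" in their mangled names, so a hit suggests (but does not
    prove) that the library was built with the new ABI.
    """
    marker = b"cxx11"
    tail = b""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                return False
            window = tail + chunk
            if marker in window:
                return True
            # Keep enough bytes to catch a marker split across two reads.
            tail = window[-(len(marker) - 1):]
```

Running it on the turbodbc extension module and on pyarrow's shared libraries (their locations depend on your installation) and getting different answers would be consistent with the mismatch described above.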
@MathMagique @xhochy, thanks! Your suggestion works fine:
pip uninstall pyarrow
pip uninstall turbodbc
wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
chmod +x Miniconda2-latest-Linux-x86_64.sh
./Miniconda2-latest-Linux-x86_64.sh
conda install -c conda-forge pyarrow
source ~/.bashrc
conda install -c conda-forge pyarrow
conda install -c conda-forge turbodbc
python:
table = cursor.fetchallarrow()
print table.num_rows
bash:> 5
If you think the best option for a production environment is to download the code from Git and compile it, I will be happy to modify the Dockerfile to carry out these steps.
BR
I would never download code from Git for production; if anything, download source packages from pypi.org. I'd suggest going down the conda route for production, however, as this has already solved the hassle of compiling stuff the right way.
Hello guys, it's the first time I post an Issue on a project, so I'm sorry if I'm doing it the wrong way, please correct me if wrong (=
I'm trying to use turbodbc with pyarrow and I'm running into a segmentation fault issue. I'm querying a SQLServer database using FreeTDS. After I assign cursor.fetchallarrow() to a variable, it runs automatically into a segmentation fault. If it doesn't run automatically into the segmentation fault, as soon as I try to do anything with that variable it runs into segmentation fault. My python version and installed packages:
You can use the following code to try to reproduce the issue. I just removed the database credentials.