cursor.fetchallarrow() followed by SegmentationFault

IceS2 commented 6 years ago

Hello guys, it's the first time I post an Issue on a project, so I'm sorry if I'm doing it the wrong way, please correct me if wrong (=

I'm trying to use turbodbc with pyarrow and I'm running into a segmentation fault issue. I'm querying a SQLServer database using FreeTDS. After I assign cursor.fetchallarrow() to a variable, it runs automatically into a segmentation fault. If it doesn't run automatically into the segmentation fault, as soon as I try to do anything with that variable it runs into segmentation fault. My python version and installed packages:

Python 3.6.3
ansible==2.4.0.0
asn1crypto==0.23.0
attrs==17.3.0
avro-python3==1.8.2
awscli==1.11.143
bcrypt==3.1.3
beautifulsoup4==4.6.0
boto==2.48.0
boto3==1.4.7
botocore==1.7.1
bs4==0.0.1
cached-property==1.3.0
certifi==2017.7.27.1
cffi==1.11.2
chardet==3.0.4
colorama==0.3.7
colorclass==2.2.0
configparser==3.5.0
cryptography==2.0.3
Cython==0.27.3
decorator==4.1.2
docker==2.5.1
docker-compose==1.15.0
docker-pycreds==0.2.1
dockerpty==0.4.1
docopt==0.6.2
docutils==0.14
formats==0.1.1
google-api-python-client==1.6.4
gspread==0.6.2
httplib2==0.10.3
idna==2.6
ipython==6.1.0
ipython-genutils==0.2.0
jedi==0.10.2
Jinja2==2.9.6
jmespath==0.9.3
jsonschema==2.6.0
MarkupSafe==1.0
mock==2.0.0
numpy==1.13.1
oauth2client==4.1.2
pandas==0.20.3
paramiko==2.3.1
pbr==3.1.1
pexpect==4.2.1
pickleshare==0.7.4
pluggy==0.6.0
prompt-toolkit==1.0.15
ptyprocess==0.5.2
py==1.5.2
pyarrow==0.7.1
pyasn1==0.3.7
pyasn1-modules==0.1.5
pybind11==2.2.1
pycairo==1.15.4
pycparser==2.18
pycrypto==2.6.1
Pygments==2.2.0
pymssql==2.1.3
PyMySQL==0.7.11
PyNaCl==1.1.2
pyOpenSSL==17.3.0
pytest==3.3.0
python-dateutil==2.6.1
pytz==2017.2
pywal==0.7.1
PyYAML==3.12
requests==2.18.4
rsa==3.4.2
s3transfer==0.1.10
simplegeneric==0.8.1
six==1.11.0
slacker==0.9.60
SQLAlchemy==1.1.13
texttable==0.8.8
tortilla==0.4.2
traitlets==4.3.2
turbodbc==2.4.1
ua-parser==0.7.3
Unidecode==0.4.21
uritemplate==3.0.0
urllib3==1.22
user-agents==1.1.0
wcwidth==0.1.7
websocket-client==0.44.0
xlrd==1.1.0

You can use the next code to try to reproduce the issue. I just took off the database credentials.

from turbodbc import connect, make_options

options = make_options(prefer_unicode=True)
connection = connect(driver='FreeTDS', server='<server>', port='<port>', database='<database>', uid='<uid>', pwd='<pwd>', turbodb_options=options)

cursor = connection.cursor()
cursor.execute('select * from <table>')

table = cursor.fetchallarrow()

xhochy commented 6 years ago

Can you provide us with a backtrace related to the segfault?

On Linux you can get it with:

ulimit -c unlimited
<run python code>
gdb python core

In the then resulting gdb prompt, enter bt full and paste the output here (please be careful that it does not contain credentials).

IceS2 commented 6 years ago

It seems I can't Oo... Any idea why?

$ ulimit -c unlimited
$ python test_turbodbc_pyarrow.py
[1]    26933 segmentation fault (core dumped)  python test_turbodbc_pyarrow.py
$ gdb python core
GNU gdb (GDB) 8.0.1
Copyright © 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...(no debugging symbols found)...done.
/home/pablo/workspace/scratch/core: No such file or directory.
(gdb) bt full
No stack.
(gdb)

MathMagique commented 6 years ago

Hello @IceS2! Thanks for reporting! You did well :-).

I have a hunch that the prefer_unicode=True in combination with fetchallarrow() is the culprit here, as I fear that this code path is not properly implemented yet. Even though prefer_unicode=True is the recommended setting for MSSQL, please check whether the segmentation fault disappears if this option is set to False.

As a workaround, you could use fetchallnumpy() instead of fetchallarrow(). Performance is comparable, and fetchallnumpy() has full support for prefer_unicode=True.

xhochy commented 6 years ago

@IceS2 it could also be that your core is named core.26933 (taken from the message 26933 segmentation fault (core dumped)). If the numbered suffix is used depends a bit on your distribution.

IceS2 commented 6 years ago

@MathMagique, @xhochy, Sorry for the delayed answer. Wasn't near my computer past weekend! So, I've run the code again setting prefer_unicode=False and the result was the same: [1] 23037 segmentation fault (core dumped) without any backtrace.

It seems to work with cursor.fetchallnumpy(). I was testing turbodbc because I'm experimenting with pyarrow and I need to do some batch extractions from a database. turbodbc into arrow table would be awesome! My fallback plan is to work with SqlAlchemy and Pandas. Not sure how to transform the OrderedDict from cursor.fetchallnumpy() to a pyarrow table.

dirkjonker commented 6 years ago

What version of FreeTDS and unixODBC are you using? Can you test using the Microsoft ODBC driver for Linux instead of FreeTDS? See: https://docs.microsoft.com/en-us/sql/connect/odbc/linux-mac/installing-the-microsoft-odbc-driver-for-sql-server

IceS2 commented 6 years ago

Hey @dirkjonker, I've just tested using the Microsoft ODBC driver you mentioned. The result was the same [1] 3542 segmentation fault (core dumped)

The version of the packages you asked are

extra/unixodbc 2.3.4-2
extra/freetds 1.00.44-1
local/msodbcsql 13.1.9.1-1

dirkjonker commented 6 years ago

That's too bad, sometimes switching the driver works to resolve this type of problem.

What types of columns are in the table you are selecting from?

xhochy commented 6 years ago

@IceS2 are you on Fedora 24+? There we have a known problem with pyarrow in combination with turbodbc.

xhochy commented 6 years ago

It can be fixed by also building pyarrow from source which is not totally simple: https://arrow.apache.org/docs/python/development.html#developing-on-linux-and-macos or we could continue to work on providing manylinux1 Wheels for turbodbc: https://github.com/blue-yonder/turbodbc/pull/108

Alternatively, using a conda based installation instead of a pip-based one will work.

IceS2 commented 6 years ago

@xhochy, I'm actually running Arch Linux! Do you think it'd be fixed as well by building pyarrow from source? I could try that as soon as I get some "me time"

xhochy commented 6 years ago

@IceS2 It could be a possible fix. I guess the Fedora problem is due to Turbodbc being compiled with a different C++ ABI than the pyarrow wheel. Rebuilding both with the same ABI should fix the problems.

IceS2 commented 6 years ago

Hey @xhochy, Sorry for the late answer. I had to work on other stuff first. I'm back at turbodbc, but after I upgraded pyarrow to 0.8.0, I was getting an error with turbodbc saying I didn't have the pyarrow support installed. So I uninstalled turbodbc and tried to install it back with pip, but I'm getting error: command 'gcc' failed with exit status 1 Can you help me out? Thanks!

MathMagique commented 6 years ago

@IceS2 Hi again! Have you tried using more recent versions of turbodbc/pyarrow in the mean time? Does this fix things?

albertoRamon commented 6 years ago

Same error, with same line (the last)

from turbodbc import connect
import  pyarrow
connection = connect(dsn='mysql_DNS_ANSI')
cursor = connection.cursor()
cursor.execute('SELECT col1 from test01;')
table = cursor.fetchallarrow()

change last time to print cursor.fetchall() returns:

[[1L], [2L], [3L], [4L], [5L]]

Can be reproduced with this command:

docker run -it albertozgz/turbodbc_extrator:debian9 bash

(You only need connect this Docker to your database, I uses MySQL 8.0)

TIP1: table=cursor.fetchallnumpy() works fine TIP2: tested ANSI and UNICODE driver TIP3: tested _fetchallarrow(adaptiveintegers=True/False) TIP4:

batches = cursor.fetcharrowbatches()
for batch in batches:
  print(batch)

segmentation fault (core dumped)

MathMagique commented 6 years ago

@xhochy Would you have the time to look at @albertoRamon 's reproducing example, please?

xhochy commented 6 years ago

This is the same problem as above. Debian 9 builds with by default with a different C++ ABI than the pyarrow wheels are built with. As long as we don't ship turbodbc manylinux1 wheels, these segfaults will persist.

MathMagique commented 6 years ago

Would it work to switch to the conda environment with our "blessed" builds?

xhochy commented 6 years ago

Yes using pyarrow and turbodbc both from conda-forge will work. They are both build in the same consistent environment.

MathMagique commented 6 years ago

@albertoRamon Could you try using the turbodbc conda package, please? https://anaconda.org/conda-forge/turbodbc

albertoRamon commented 6 years ago

Yes of course

Any test or test that they want to do I can prove it Or if the solution is not to use debian9 (I tried with Alpine3.8 and Debian10 and it did not work)

MathMagique commented 6 years ago

Anything too modern will not work because the precompiled pyarrow wheel uses a "classic" version of the ABIs, while pip install turbodbc will compile stuff with the latest and greatest ABIs. Conda packages for turbodbc and pyarrow are built with consistent settings, and should work on any modern system.

albertoRamon commented 6 years ago

@MathMagique @xhochy , Thanks Your suggestion works fine

 pip uninstall pyarrow
 pip uninstall turbodbc

 wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh 
 chmod +x  Miniconda2-latest-Linux-x86_64.sh 
 ./Miniconda2-latest-Linux-x86_64.sh 
 conda install -c conda-forge pyarrow
 source ~/.bashrc

 conda install -c conda-forge pyarrow
 conda install -c conda-forge turbodbc

python:

table = cursor.fetchallarrow()
print table.num_rows

bash:> 5

If you think that the best option for production environment is download code from Git and compile it. I will be happy to modify the docker file to realize these steps

BR

MathMagique commented 6 years ago

I never would download code from Git for production; if anything, download source packages from pypi.org. I'd suggest to go down the conda route for production, however, as this has already solved the hassle of compiling stuff the right way.

blue-yonder / turbodbc

cursor.fetchallarrow() followed by SegmentationFault #139