cloudera / impyla

Python DB API 2.0 client for Impala and Hive (HiveServer2 protocol)
Apache License 2.0

TypeError: byte string expected (isnull.frombytes(nulls)) #338

Open rav009 opened 5 years ago

rav009 commented 5 years ago

env: python 3.5.1, thrift 0.11.0, thrift-sasl 0.3.0, thriftpy 0.3.9, impyla 0.14.2.2

my code:

```python
from impala.dbapi import connect
from impala.util import as_pandas

icon = connect(host='bd-slave07-pe2.f.com', port=21050, user='username',
               auth_mechanism='GSSAPI', password='psd')
cs = icon.cursor()
cs.execute('select * from table limit 100')
df = as_pandas(cs)
```

error msg:

```
/opt/python3.5/lib/python3.5/site-packages/impala/hiveserver2.py in __init__(self, trowset, schema, convert_types)
    853
    854         is_null = bitarray(endian='little')
--> 855         is_null.frombytes(nulls)
    856
    857         # Ref HUE-2722, HiveServer2 sometimes does not add trailing '\x00'

TypeError: byte string expected
```

rav009 commented 5 years ago

I have a clue: it seems you cannot use `select *` on a table with nested-type columns (e.g. a map column), otherwise you get this error.
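If that is the cause, one way around it is to build the column list yourself and skip the nested types. This is a sketch, not impyla API: the helper names are mine, and the row shape `(name, type, comment)` assumes Impala's `DESCRIBE` output.

```python
# Hedged sketch: build a SELECT over scalar columns only, so nested
# types (map/array/struct) never reach the client. Helper names are
# illustrative, not part of impyla.
COMPLEX_PREFIXES = ("map<", "array<", "struct<")

def scalar_columns(described):
    """Keep column names whose declared type is not a nested type.

    `described` is expected to look like DESCRIBE output:
    an iterable of (name, type, comment) rows.
    """
    return [name for name, col_type, *_ in described
            if not col_type.lower().startswith(COMPLEX_PREFIXES)]

def build_select(table, described, limit=100):
    """Build a SELECT listing the scalar columns explicitly."""
    cols = ", ".join(scalar_columns(described))
    return "SELECT {} FROM {} LIMIT {}".format(cols, table, limit)
```

With a live connection one would run `cs.execute('DESCRIBE db.table')`, pass `cs.fetchall()` to `build_select`, and execute the generated query in place of `select *`.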

timarmstrong commented 5 years ago

Can you include the output of "DESCRIBE store_dw_des.dw_loc_sku_day_actual_e_business"?

EdTheEagle commented 4 years ago

Same error, different version:

Something strange is happening in my case: everything works when I set the LIMIT to 138, but when I change it to 139 I get the same error as rav009. I see exactly the same behavior when I change the query; it still fails once LIMIT reaches 139. Setting the LIMIT between 139 and roughly 500 yields the same error; 500 and up returns a different error (OverflowError: Python int too large to convert to C long), which I pasted below for reference.

Any ideas on what is causing this?

Many thanks!

env: python 3.6.2, thrift 0.13.0, thrift-sasl 0.4a1, thriftpy2 0.4.10, impyla 0.16.2

My code:

```python
import pandas as pd
from impala.dbapi import connect
from impala.util import as_pandas

conn_inter = connect(
    host=DRONA_IMPALA_HOST,
    port=DRONA_IMPALA_PORT,
    use_ssl=True,
    ca_cert=None,
    auth_mechanism='PLAIN',
    user=IMPALA_USER,
    password=IMPALA_PASSWORD,
)
cursor = conn_inter.cursor()

table = 'sw_os.min_data_kudu'

cursor.execute('SELECT * FROM table LIMIT 139')
data = as_pandas(cursor)
```

Error:

```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>()
     18 cursor.execute(K)
     19 print('connection exec')
---> 20 data = as_pandas(cursor)
     21 print('data dione')
     22 print(data.shape)

...\site-packages\impala\util.py in as_pandas(cursor, coerce_float)
     61     from pandas import DataFrame  # pylint: disable=import-error
     62     names = [metadata[0] for metadata in cursor.description]
---> 63     return DataFrame.from_records(cursor.fetchall(), columns=names,
     64                                   coerce_float=coerce_float)

...\site-packages\impala\hiveserver2.py in fetchall(self)
    533         log.debug('Fetching all result rows')
    534         try:
--> 535             return list(self)
    536         except StopIteration:
    537             return []

...\site-packages\impala\hiveserver2.py in __next__(self)
    581             self._buffer = self._last_operation.fetch(self.description,
    582                                                       self.buffersize,
--> 583                                                       convert_types=self.convert_types)
    584         if len(self._buffer) > 0:
    585             log.debug('__next__: popping row out of buffer')

...\site-packages\impala\hiveserver2.py in fetch(self, schema, max_rows, orientation, convert_types)
   1242         resp = self._rpc('FetchResults', req)
   1243         return self._wrap_results(resp.results, resp.hasMoreRows, schema,
-> 1244                                   convert_types=convert_types)

...\site-packages\impala\hiveserver2.py in _wrap_results(self, results, expect_more_rows, schema, convert_types)
   1247         if self.is_columnar:
   1248             log.debug('fetch_results: constructing CBatch')
-> 1249             return CBatch(results, expect_more_rows, schema, convert_types=convert_types)
   1250         else:
   1251             log.debug('fetch_results: constructing RBatch')

...\site-packages\impala\hiveserver2.py in __init__(self, trowset, expect_more_rows, schema, convert_types)
    921
    922             is_null = bitarray(endian='little')
--> 923             is_null.frombytes(nulls)
    924
    925             # Ref HUE-2722, HiveServer2 sometimes does not add trailing '\x00'

TypeError: bytes expected
```

ERROR when LIMIT is higher than 500:

```
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input> in <module>()
     18 cursor.execute(K)
     19 print('connection exec')
---> 20 data = as_pandas(cursor)
     21 print('data dione')
     22 print(data.shape)

[... frames through impala\util.py as_pandas, impala\hiveserver2.py
 fetchall / __next__ / fetch / _rpc / _execute, thriftpy2\thrift.py
 _req / _recv / read, repeated thriftpy2\protocol\binary.py
 read_struct / read_val, thrift_sasl\__init__.py read / _read_frame,
 and thriftpy2\transport\socket.py read elided ...]

...\lib\ssl.py in recv(self, buflen, flags)
    985                 "non-zero flags not allowed in calls to recv() on %s" %
    986                 self.__class__)
--> 987             return self.read(buflen)
    988         else:
    989             return socket.recv(self, buflen, flags)

...\lib\ssl.py in read(self, len, buffer)
    625             v = self._sslobj.read(len, buffer)
    626         else:
--> 627             v = self._sslobj.read(len)
    628         return v
    629

OverflowError: Python int too large to convert to C long
```
timarmstrong commented 4 years ago

@EdTheEagle can you share the version of Impala (i.e. the output of `select version()`) and the output of `describe table` (if not the column names, at least the types)?

MacJei commented 4 years ago

@timarmstrong I have the same problem. impala = 2.12.0

id - double
crt_mnemo - string

```python
from impala.dbapi import connect
from impala.util import as_pandas
import pandas as pd

impala_conn = connect(host='hostname', port=21050, auth_mechanism='GSSAPI',
                      timeout=100000, use_ssl=True, ca_cert=None,
                      ldap_user=None, ldap_password=None,
                      kerberos_service_name='impala')

df = pd.read_sql("select id, crt_mnemo from demo_db.stg_deals_opn LIMIT 1000", impala_conn)
print(df)
```

@EdTheEagle did you solve this problem?

EdTheEagle commented 4 years ago

@MacJei

I did not solve the problem. I work at a company where there might be a download limit set on these kinds of calls.

I am using the pyodbc package instead, which works for me, so I did not investigate further.
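For anyone trying the same route, here is a minimal sketch. It assumes the Cloudera Impala ODBC driver is installed; the driver name, host, and auth settings are placeholders, not values from this thread, and only the connection-string assembly is shown as runnable code.

```python
# Hedged sketch of the pyodbc alternative: assemble an ODBC connection
# string from keyword arguments. ODBC connection strings are
# semicolon-separated KEY=value pairs.
def build_conn_str(**kwargs):
    """Assemble a semicolon-separated ODBC connection string."""
    return ";".join("{}={}".format(k, v) for k, v in kwargs.items())
```

With a driver installed, usage would look something like `pyodbc.connect(build_conn_str(DRIVER="{Cloudera ODBC Driver for Impala}", HOST="impala-host.example.com", PORT=21050, AuthMech=1), autocommit=True)`, after which `pd.read_sql(...)` works as with impyla.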

Good luck!

schuderer commented 4 years ago

We appear to have the same problem (SSL: "OverflowError: Python int too large to convert to C long").

We are trying to connect from Python 3.6 (tried from both Windows 10 and Red Hat Linux) to Cloudera Impala on a kerberized Oracle Big Data Appliance.

Our code is pretty much the same as the example the OP gave, only without user/password since we're using ticketing/winkerberos. As in the OP's case, we don't get this error if we limit the size of the result (by querying small tables or using LIMIT), but we do if the result is anything more than a few KB or so.
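Since small results succeed and large ones fail, one stopgap is to pull rows in small `fetchmany()` chunks rather than one `fetchall()`. This is a sketch, not a fix: `cursor.fetchmany` is standard DB API 2.0, but the chunk size of 1000 is an arbitrary guess, not a tuned value.

```python
# Hedged sketch: stream rows in small fetchmany() chunks so each
# fetch stays well under the size at which the error appears.
def iter_rows(cursor, chunk_size=1000):
    """Yield rows one at a time, fetching chunk_size rows per call."""
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            return
        for row in rows:
            yield row
```

After `cursor.execute(query)`, a DataFrame can be built with `pd.DataFrame(iter_rows(cursor), columns=[d[0] for d in cursor.description])` instead of `as_pandas(cursor)`.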

I am happy to provide more details if needed.

PDXKor commented 4 years ago

Hello, I also have the same issue.

aconstantin2 commented 4 years ago

Hopefully this helps someone: I had pretty much the same issue, with two different errors depending on my LIMIT:

- `TypeError: bytes expected`
- `OverflowError: signed integer is greater than maximum`

What seems to have solved it was setting the buffer size and the Thrift request size to conservative values, as their defaults appear to allow an overflow.

```python
cursor.set_arraysize(10)
cursor.execute("set batch_size=10")
```

I was able to retrieve 10k rows like this, whereas before it was crashing at 60.

I got the idea from https://issues.apache.org/jira/browse/IMPALA-1618
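The two settings can be wrapped in a small helper. A sketch only: the helper name and the default of 10 are my own choices, while `set_arraysize` and the `batch_size` query option are taken straight from the workaround above.

```python
# Hedged sketch wrapping the workaround in a reusable helper. The
# value 10 is deliberately conservative, not a tuned number.
def configure_cursor(conn, batch_size=10):
    """Open a cursor with a small client fetch size and a small
    server-side batch_size, keeping each fetch response small."""
    cur = conn.cursor()
    cur.set_arraysize(batch_size)
    cur.execute("set batch_size={}".format(batch_size))
    return cur
```

Usage would then be `cur = configure_cursor(connect(host=..., port=21050))`, followed by `cur.execute(query)` and `as_pandas(cur)` as usual.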

n9986 commented 3 years ago

@aconstantin2 After digging for hours, your solution worked! Thank you!