cloudera / impyla

Python DB API 2.0 client for Impala and Hive (HiveServer2 protocol)
Apache License 2.0

Migrate to thriftpy away from 2.x-only thrift-python #147

Open · wesm opened this issue 8 years ago

wesm commented 8 years ago

Thriftpy also has a Cythonized binary protocol which is significantly faster according to the benchmarks. If we're serious about Python 3 support I think we should have one thrift code path.
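
For reference, thriftpy ships both implementations side by side, so the difference can be timed directly. A rough, self-contained sketch (assuming thriftpy 0.3.x built with its Cython extensions; the TCy* names come from thriftpy itself, while the IDL and struct are made up):

```python
import tempfile
import timeit

import thriftpy
from thriftpy.protocol import TBinaryProtocol, TCyBinaryProtocol
from thriftpy.transport import TMemoryBuffer, TCyMemoryBuffer

# Minimal IDL written to a temp file so the benchmark is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.thrift', delete=False) as f:
    f.write('struct Row { 1: string name, 2: i64 value }')
    idl_path = f.name

row_thrift = thriftpy.load(idl_path, module_name='row_thrift')
row = row_thrift.Row(name='x' * 64, value=42)

# Pair each protocol with its matching transport: the Cython protocol only
# accepts Cython transports (the type error discussed below).
for proto_cls, buf_cls in [(TBinaryProtocol, TMemoryBuffer),
                           (TCyBinaryProtocol, TCyMemoryBuffer)]:
    def encode():
        row.write(proto_cls(buf_cls()))
    print(proto_cls.__name__, timeit.timeit(encode, number=100000))
```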

laserson commented 8 years ago

Soliciting any thoughts from @caseyching @ishaan @romainr

caseyching commented 8 years ago

I've never used thriftpy before but it sounds like an improvement. I see you already filed https://github.com/eleme/thriftpy/issues/136 about SSL. SASL seems to be missing too.

romainr commented 8 years ago

Same here, I haven't looked at thriftpy yet either.

FlorianWilhelm commented 8 years ago

@wesm Judging from impala._thrift_api, the code path for Python 3 already uses thriftpy. Does that mean the fast cythonized binary protocol is used if I am using Python 3?

laserson commented 8 years ago

When initially implementing it, we turned off the cythonized version because it was causing problems.

https://github.com/cloudera/impyla/blob/master/impala/_thrift_api.py#L62

Unless thriftpy changed something to always use the cythonized version, I'm guessing it's not being used.
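
The toggle is just the import block at the top of that file; roughly something like this (a sketch of the Python 3 branch, not the exact file contents):

```python
# Pure-Python protocol/transport, currently active:
from thriftpy.protocol import TBinaryProtocol
from thriftpy.transport import TSocket, TBufferedTransport

# Cythonized equivalents, disabled when they caused problems; re-enabling
# means swapping these aliases back in:
# from thriftpy.protocol import TCyBinaryProtocol as TBinaryProtocol
# from thriftpy.transport import TCyBufferedTransport as TBufferedTransport
```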

FlorianWilhelm commented 8 years ago

@laserson Thanks for your fast reply. I am seeing quite high CPU usage even though I am only retrieving the result set of a larger query, and my effective transfer rate of 1-2 MB/s is quite low, but I am still investigating whether the problem is network-related. Anyway, what kind of problems did the cythonized version cause, and were they related to thriftpy, Cython itself, or impyla? Thanks for your help, btw.

laserson commented 8 years ago

I don't remember, but definitely try it out. It should just require uncommenting the relevant imports and commenting out the others.

wesm commented 8 years ago

The thriftpy Cython issues may be fixed at this point; it would just need some testing to verify that it can be re-enabled.

FlorianWilhelm commented 8 years ago

@wesm I tried it by just commenting/uncommenting the marked lines, as @laserson suggested, and got:

```
  File "/media/home/fwilhelm/miniconda3/envs/hive/lib/python3.5/site-packages/impyla-0.13.8-py3.5.egg/impala/dbapi.py", line 147, in connect
    auth_mechanism=auth_mechanism)
  File "/media/home/fwilhelm/miniconda3/envs/hive/lib/python3.5/site-packages/impyla-0.13.8-py3.5.egg/impala/hiveserver2.py", line 659, in connect
    protocol = TBinaryProtocol(transport)
  File "thriftpy/protocol/cybin/cybin.pyx", line 417, in cybin.TCyBinaryProtocol.__init__ (thriftpy/protocol/cybin/cybin.c:6121)
TypeError: Cannot convert TSaslClientTransport to thriftpy.transport.cybase.CyTransportBase
```

Any ideas?

FlorianWilhelm commented 8 years ago

Okay, I found out that, since we are using the Cython code path of thriftpy, the TSaslClientTransport, which is written in pure Python, also has to inherit from CyTransportBase. I added these changes in a thriftpy fork (sketched at the end of this comment) and tried again. That got rid of the error message above, but instead I got:

```
ERROR:impala.hiveserver2:Failed to open transport (tries_left=1)
Traceback (most recent call last):
  File "/media/home/fwilhelm/miniconda3/envs/hive/lib/python3.5/site-packages/impyla-0.13.8-py3.5.egg/impala/hiveserver2.py", line 832, in _execute
    return func(request)
  File "/media/home/fwilhelm/miniconda3/envs/hive/lib/python3.5/site-packages/thriftpy-0.3.9-py3.5-linux-x86_64.egg/thriftpy/thrift.py", line 195, in _req
    self._send(_api, **kwargs)
  File "/media/home/fwilhelm/miniconda3/envs/hive/lib/python3.5/site-packages/thriftpy-0.3.9-py3.5-linux-x86_64.egg/thriftpy/thrift.py", line 207, in _send
    self._oprot.trans.flush()
  File "/media/home/fwilhelm/miniconda3/envs/hive/lib/python3.5/site-packages/thrift_sasl/__init__.py", line 119, in flush
    message=self.sasl.getError())
thriftpy.transport.TTransportException: TTransportException(type=0, message=b'Error in sasl_encode (-7) SASL(-7): invalid parameter supplied: Parameter error in ../../lib/common.c near line 296')
```

Now I am at my wit's end. @wesm Could you point me in the right direction here, or tell me whether this is possible at all?
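
For reference, the change in my fork was along these lines (a hypothetical sketch from memory, not the exact diff; the earlier TypeError shows that CyTransportBase lives at thriftpy.transport.cybase):

```python
from thriftpy.transport.cybase import CyTransportBase

class TSaslClientTransport(CyTransportBase):
    """Pure-Python SASL transport from thrift_sasl, now deriving from
    CyTransportBase so that cybin's isinstance check in
    TCyBinaryProtocol.__init__ accepts it; the SASL wrap/unwrap logic
    itself is unchanged."""
```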

ecederstrand commented 7 years ago

Hi @FlorianWilhelm I just dug myself into the same hole as you. Did you find any solutions to this problem?

FlorianWilhelm commented 7 years ago

@ecederstrand No, not really, and I tried a lot. The problems with large result sets and hiveserver2 are (in my case) actually manifold. The best approach is to never retrieve a large result set directly: save the results with a CREATE TABLE ... AS SELECT ... statement and use libhdfs3 to retrieve them from HDFS (sketched below). This is much more efficient since all data nodes transfer the result set, but it comes at a price: you might need to handle many files, depending on the chunk size. It is easy to write a small abstraction layer for that, though.

That is the solution I opted for, but shortly afterwards I realized: why am I transferring that much data around anyway? Then I learned about Python UD(A)Fs and executed all aggregations on the Hadoop cluster itself, which reduced the data a lot. I hope that helps.
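
In code, the pattern looks roughly like this (a sketch with made-up host and table names, using impyla for the query and the hdfs3 bindings to libhdfs3 for retrieval):

```python
from impala.dbapi import connect
from hdfs3 import HDFileSystem

# Run the query, but materialize the result on the cluster instead of
# streaming it through hiveserver2 (auth settings depend on your setup).
conn = connect(host='hiveserver2.example.com', port=10000,
               auth_mechanism='PLAIN')
cur = conn.cursor()
cur.execute("""
    CREATE TABLE tmp_results STORED AS PARQUET
    AS SELECT ...  -- the actual aggregation goes here
""")

# Fetch the result files straight from the data nodes via libhdfs3.
hdfs = HDFileSystem(host='namenode.example.com', port=8020)
for path in hdfs.glob('/user/hive/warehouse/tmp_results/*'):
    with hdfs.open(path, 'rb') as f:
        chunk = f.read()  # hand each file to a Parquet reader
```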

wesm commented 7 years ago

@FlorianWilhelm saving Parquet to HDFS and reading it with libhdfs or libhdfs3 may be a good option, depending on the size of the dataset. Note that I generally see much better throughput and latency from libhdfs vs. libhdfs3, if you have that option: http://wesmckinney.com/blog/python-hdfs-interfaces/

Native Parquet reads in Python are also possible: http://wesmckinney.com/blog/python-parquet-update/

If you aren't using Kerberos, then you can also use hs2client to get better direct HS2 performance: https://github.com/cloudera/hs2client
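
Roughly, the libhdfs + native Parquet route looks like this (a sketch with placeholder host/path names, assuming pyarrow for both the libhdfs driver and the Parquet reader):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# JNI-based libhdfs driver; needs the Hadoop jars and CLASSPATH on the client.
fs = pa.hdfs.connect('namenode.example.com', 8020)

# Read all Parquet files under the result directory into one Arrow table.
dataset = pq.ParquetDataset('/user/hive/warehouse/tmp_results', filesystem=fs)
df = dataset.read().to_pandas()
```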

FlorianWilhelm commented 7 years ago

@wesm thanks for these insights. I read your blog post; it's really funny that the Java-based libhdfs beats the C-based libhdfs3. How does your HdfsClient handle HA? I had a lengthy discussion with the libhdfs3 developers about that, and it's in the master branch but still not released. I have never tried libhdfs since it seems to need all those Hadoop Java libs on the client side, and that's a bit too much for me.

wesm commented 7 years ago

@FlorianWilhelm it should support HA because libhdfs supports HA. If you run into bugs let me know -- I personally wouldn't depend on libhdfs3 in production.