cloudera / impyla

Python DB API 2.0 client for Impala and Hive (HiveServer2 protocol)
Apache License 2.0
725 stars 247 forks source link

Numeric paramstyle substitution bug #349

Open MarechJ opened 5 years ago

MarechJ commented 5 years ago

When using a list for passing parameters and one of the values contains a :\d+ the value itself gets substituted. The bug applies to all paramstyles. See ipython logs below. I fixed it in: #348

This is the same as #317

In [5]: cursor = conn.cursor()

In [6]: cursor.execute("INSERT INTO test_impyla VALUES (?, ?)", ['this will go wrong:2', 'im going to creep in'])
---------------------------------------------------------------------------
HiveServer2Error                          Traceback (most recent call last)
<ipython-input-6-4361277cbece> in <module>
----> 1 cursor.execute("INSERT INTO test_impyla VALUES (?, ?)", ['this will go wrong:2', 'im going to creep in'])

~/.pyenv/versions/3.6.5/envs/market-surveillance/lib/python3.6/site-packages/impala/hiveserver2.py in execute(self, operation, parameters, configuration)
    300         # PEP 249
    301         self.execute_async(operation, parameters=parameters,
--> 302                            configuration=configuration)
    303         log.debug('Waiting for query to finish')
    304         self._wait_to_finish()  # make execute synchronous

~/.pyenv/versions/3.6.5/envs/market-surveillance/lib/python3.6/site-packages/impala/hiveserver2.py in execute_async(self, operation, parameters, configuration)
    341             self._last_operation = op
    342
--> 343         self._execute_async(op)
    344
    345     def _debug_log_state(self):

~/.pyenv/versions/3.6.5/envs/market-surveillance/lib/python3.6/site-packages/impala/hiveserver2.py in _execute_async(self, operation_fn)
    360         self._reset_state()
    361         self._debug_log_state()
--> 362         operation_fn()
    363         self._last_operation_active = True
    364         self._debug_log_state()

~/.pyenv/versions/3.6.5/envs/market-surveillance/lib/python3.6/site-packages/impala/hiveserver2.py in op()
    338             op = self.session.execute(self._last_operation_string,
    339                                       configuration,
--> 340                                       async=True)
    341             self._last_operation = op
    342

~/.pyenv/versions/3.6.5/envs/market-surveillance/lib/python3.6/site-packages/impala/hiveserver2.py in execute(self, statement, configuration, async)
   1025                                    confOverlay=configuration,
   1026                                    runAsync=async)
-> 1027         return self._operation('ExecuteStatement', req)
   1028
   1029     def get_databases(self, schema='.*'):

~/.pyenv/versions/3.6.5/envs/market-surveillance/lib/python3.6/site-packages/impala/hiveserver2.py in _operation(self, kind, request)
    955
    956     def _operation(self, kind, request):
--> 957         resp = self._rpc(kind, request)
    958         return self._get_operation(resp.operationHandle)
    959

~/.pyenv/versions/3.6.5/envs/market-surveillance/lib/python3.6/site-packages/impala/hiveserver2.py in _rpc(self, func_name, request)
    923         response = self._execute(func_name, request)
    924         self._log_response(func_name, response)
--> 925         err_if_rpc_not_ok(response)
    926         return response
    927

~/.pyenv/versions/3.6.5/envs/market-surveillance/lib/python3.6/site-packages/impala/hiveserver2.py in err_if_rpc_not_ok(resp)
    702             resp.status.statusCode != TStatusCode.SUCCESS_WITH_INFO_STATUS and
    703             resp.status.statusCode != TStatusCode.STILL_EXECUTING_STATUS):
--> 704         raise HiveServer2Error(resp.status.errorMessage)
    705
    706

HiveServer2Error: AnalysisException: Syntax error in line 1:
...S ('this will go wrong'im going to creep in'', 'im goi...
                             ^
Encountered: IDENTIFIER
Expected: CROSS, FROM, FULL, GROUP, HAVING, INNER, JOIN, LEFT, LIMIT, OFFSET, ON, ORDER, RIGHT, STRAIGHT_JOIN, UNION, USING, WHERE, COMMA

CAUSED BY: Exception: Syntax error
timarmstrong commented 5 years ago

348 provided a workaround by allowing specification of a particular parameter style.

I think we should keep this open to track doing single-pass parameter substitution. We still shouldn't have double substitution if people use mixed parameter styles.

gjask commented 1 year ago

I have run into this bug yesterday and spent non-trivial amount of time diagnosing it. I got a bit irritated too. I found this issue only after fully analysing underlying problem. I consider developing a fix for this. However it seems there is no direct clear approach to fixing this. I'd like to discuss preferred solution.

I believe it's correct to always use single paramstyle at most. Sadly current implementation does not require user to select one. And changing that would broke user's code. Also obviously we cannot to just set one as existing code can expect different one.

Another option is to try and guess expected paramstyle if not explicitly set. For example use the one that first matches number of params with number of markers. This seem to be most promising solution as it requires no change in user's code. But doesn't support combination of markers of different paramstyles (which is currently theoretically possible). Combination of different paramstyles is quirky at best as current implementation matches params from the start for each (non-numeric) paramstyle anyway. So I cannot really think about this use-case as supported.

Final option is to try to replace all markers using different paramstyles in single pass. This may seem as best approach. But I think it is flawed. The main problem is to check number of parameters with number of markers. Because combination of numeric paramstyle with any other may result in different number of parameters and markers being correct. Which makes this check very complicated and counter-intuitive.

Bad thing is that implementing any of these options can theoretically break user code, if any depends on such quirks. But I consider current state worse.

csringhofer commented 1 year ago

@gjask Is the problem relevant if a paramstyle was selected?

I think that generally the solution is to always set paramstyle. My understanding of DB API wiki says that the default of paramstyle is driver dependent, and some drivers may allow setting it to another value. The Impyla behavior of doing all possible substitutions by default seems like an Impala specific convenience feature that can lead to weird bugs.

My preferred long term solution would be to throw an error if no paramstyle is set. While this could break some existing programs, fixing it is very easy and I don't think that mixed substitution is something that anyone should rely on.

gjask commented 1 year ago

@csringhofer No, the problem occurs only when is is not selected.

I am not sure if it is possible to require user to always pass a paramstyle. I guess that would violate DB API spec. I think the most correct solution is to set default paramstyle and then allow user to specify different one if needed.

I am not sure if you noticed, but I already sent PR https://github.com/cloudera/impyla/pull/508 implementing single-pass param substitution. It aims to provide same functionality as before just without bugs caused by multiple passes during substitution. As that shouldn't brake any existing code (unless it deliberately depends on broken behaviour). I thought you would want to go with least amount of breaking changes. In the end I even kinda improved matching number of markers with number of parameters. It should be good enough now.

csringhofer commented 1 year ago

I am not sure if it is possible to require user to always pass a paramstyle. I guess that would violate DB API spec. You are right, the way Impyla expects paramstyle would violate DB API 2.0 spec. In DB API 2.0 the assumption is that a driver supports one paramstyle, set in the global variable paramstyle. I saw some drivers where this can be overridden when establishing the connection.

I found a wiki about a proposal for DB API 3.0 where paramstyle could be modified after connection/cursor construction: https://wiki.python.org/moin/DbApi3#Preliminary_Consensus

I think that the solution above would be the ideal way (the user can use any paramstyle, but has to set it explicitly if it is not the default), but in DB API 2.0 Impyla should stick to the current way.

Thanks for the fix https://github.com/cloudera/impyla/pull/508 !

gjask commented 1 year ago

Thank you for merging my PR. How long it usually takes to get that released? I would like to remove workaround on our side.

csringhofer commented 1 year ago

Is it enough if it is released as an alpha release, e.g. 0.19a1? That would mean that you can get it with "pip install impyla==0.19a1" but "pip install impyla" will still install 0.18.0

gjask commented 1 year ago

Ok, that probably would be good enough.