cloudera / impyla

Python DB API 2.0 client for Impala and Hive (HiveServer2 protocol)
Apache License 2.0
730 stars 248 forks

Fix schema description returns for HiveServer2 when using dot in naming convention. #434

Closed danielewood closed 3 years ago

danielewood commented 3 years ago

This change has two parts:

  1. Always escape the schema name when running SHOW TABLES. This fixes an error in apache/superset when querying a Hive schema whose name contains a dot.

  2. If SHOW TABLES returns more than one column, assume it is Hive returning the database name alongside the table name, and take table names from the second column. This fixes https://github.com/apache/superset/issues/8027
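In code, the intent of part 1 could be sketched roughly like this (the `escape_schema` helper name is hypothetical, for illustration only, not impyla's actual API):

```python
def escape_schema(schema):
    """Backtick-quote a schema name so a dotted name such as
    'MyProject.Zendesk' is treated as one identifier, not db.table.
    Existing backticks are stripped first, making the helper idempotent."""
    return '`{0}`'.format(schema.strip('`'))

# The query that would then be issued:
query = 'SHOW TABLES IN {0}'.format(escape_schema('MyProject.Zendesk'))
# -> SHOW TABLES IN `MyProject.Zendesk`
```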

attilajeges commented 3 years ago

@danielewood Thanks for working on this!

It might be my ignorance of apache/superset, but I find the wording a bit confusing. What do you mean by "hive database partition"? Also, under what circumstances will SHOW TABLES return multiple columns in Hive?

danielewood commented 3 years ago

This is mostly my own ignorance of Hive. I can retract this PR, as I have made the necessary fixes in the SQLAlchemy engine specs and will submit a PR to Superset once we have confirmed that my changes provide all the needed functionality.

The code there is also cleaner and more intelligent, so I'll make this PR reflect those changes as well.

So, to describe my problem fully: we use Ascend.io, which provides a HiveServer2 interface via the Spark SQL Thrift JDBC/ODBC server.

SHOW SCHEMAS returns a namespace column with the format of "DataServiceName.DataFlowName"

For the purposes of discussion here, I will refer to "DataServiceName.DataFlowName" as "schema".

Example:

SHOW SCHEMAS

namespace
--------------------
MyProject
MyProject.Billing
MyProject.Compliance
MyProject.Mailchimp
MyProject.Zendesk

DEBUG:impala.hiveserver2:get_result_schema: schema=[('namespace', 'STRING', None, None, None, None, None)]

If I then run SHOW TABLES IN MyProject.Zendesk, I get an error saying the database name is not valid:

DEBUG:impala.hiveserver2:ExecuteStatement: resp=TExecuteStatementResp(status=TStatus(statusCode=3, 
infoMessages=['*org.apache.hive.service.cli.HiveSQLException:Error running query: org.apache.spark.sql.AnalysisException: The database name is not valid: MyProject.Zendesk;:25:24', 
'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute:SparkExecuteStatementOperation.scala:321', 
'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:runInternal:SparkExecuteStatementOperation.scala:202', 
'org.apache.hive.service.cli.operation.Operation:run:Operation.java:278', 
'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:org$apache$spark$sql$hive$thriftserver$SparkOperation$$super$run:SparkExecuteStatementOperation.scala:46', 
'org.apache.spark.sql.hive.thriftserver.SparkOperation:$anonfun$run$1:SparkOperation.scala:44', 
'scala.runtime.java8.JFunction0$mcV$sp:apply:JFunction0$mcV$sp.java:23', 
'org.apache.spark.sql.hive.thriftserver.SparkOperation:withLocalProperties:SparkOperation.scala:78', 
'org.apache.spark.sql.hive.thriftserver.SparkOperation:withLocalProperties$:SparkOperation.scala:62', 
'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:withLocalProperties:SparkExecuteStatementOperation.scala:46', 
'org.apache.spark.sql.hive.thriftserver.SparkOperation:run:SparkOperation.scala:44', 
'org.apache.spark.sql.hive.thriftserver.SparkOperation:run$:SparkOperation.scala:42', 
'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:run:SparkExecuteStatementOperation.scala:46', 
'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:484', 
'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementAsync:HiveSessionImpl.java:472', 
'org.apache.hive.service.cli.CLIService:executeStatementAsync:CLIService.java:310', 
'org.apache.hive.service.cli.thrift.ThriftCLIService:ExecuteStatement:ThriftCLIService.java:438', 
'org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1437', 
'org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1422', 
'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39', 
'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', 
'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:53', 
'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286', 
'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1128', 
'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:628', 
'java.lang.Thread:run:Thread.java:834', 
'*org.apache.spark.sql.AnalysisException:The database name is not valid: MyProject.Zendesk;:69:44', 
'org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog$$anonfun$apply$1:applyOrElse:ResolveSessionCatalog.scala:384', 
'org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog$$anonfun$apply$1:applyOrElse:ResolveSessionCatalog.scala:48', 
'org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper:$anonfun$resolveOperatorsUp$3:AnalysisHelper.scala:90', 
'org.apache.spark.sql.catalyst.trees.CurrentOrigin$:withOrigin:TreeNode.scala:72', 
'org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper:$anonfun$resolveOperatorsUp$1:AnalysisHelper.scala:90', 
'org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$:allowInvokingTransformsInAnalyzer:AnalysisHelper.scala:194', 
'org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper:resolveOperatorsUp:AnalysisHelper.scala:86', 
'org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper:resolveOperatorsUp$:AnalysisHelper.scala:84', 
'org.apache.spark.sql.catalyst.plans.logical.LogicalPlan:resolveOperatorsUp:LogicalPlan.scala:29', 
'org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog:apply:ResolveSessionCatalog.scala:48', 
'org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog:apply:ResolveSessionCatalog.scala:39', 
'org.apache.spark.sql.catalyst.rules.RuleExecutor:$anonfun$execute$2:RuleExecutor.scala:149', 
'scala.collection.LinearSeqOptimized:foldLeft:LinearSeqOptimized.scala:126', 
'scala.collection.LinearSeqOptimized:foldLeft$:LinearSeqOptimized.scala:122', 
'scala.collection.immutable.List:foldLeft:List.scala:89', 
'org.apache.spark.sql.catalyst.rules.RuleExecutor:$anonfun$execute$1:RuleExecutor.scala:146', 
'org.apache.spark.sql.catalyst.rules.RuleExecutor:$anonfun$execute$1$adapted:RuleExecutor.scala:138', 
'scala.collection.immutable.List:foreach:List.scala:392', 
'org.apache.spark.sql.catalyst.rules.RuleExecutor:execute:RuleExecutor.scala:138', 
'org.apache.spark.sql.catalyst.analysis.Analyzer:org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext:Analyzer.scala:176', 
'org.apache.spark.sql.catalyst.analysis.Analyzer:execute:Analyzer.scala:170', 
'org.apache.spark.sql.catalyst.analysis.Analyzer:execute:Analyzer.scala:130', 
'org.apache.spark.sql.catalyst.rules.RuleExecutor:$anonfun$executeAndTrack$1:RuleExecutor.scala:116', 
'org.apache.spark.sql.catalyst.QueryPlanningTracker$:withTracker:QueryPlanningTracker.scala:88', 
'org.apache.spark.sql.catalyst.rules.RuleExecutor:executeAndTrack:RuleExecutor.scala:116', 
'org.apache.spark.sql.catalyst.analysis.Analyzer:$anonfun$executeAndCheck$1:Analyzer.scala:154', 
'org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$:markInAnalyzer:AnalysisHelper.scala:201', 
'org.apache.spark.sql.catalyst.analysis.Analyzer:executeAndCheck:Analyzer.scala:153', 
'org.apache.spark.sql.execution.QueryExecution:$anonfun$analyzed$1:QueryExecution.scala:69', 
'org.apache.spark.sql.catalyst.QueryPlanningTracker:measurePhase:QueryPlanningTracker.scala:111', 
'org.apache.spark.sql.execution.QueryExecution:$anonfun$executePhase$1:QueryExecution.scala:134', 
'org.apache.spark.sql.SparkSession:withActive:SparkSession.scala:768', 
'org.apache.spark.sql.execution.QueryExecution:executePhase:QueryExecution.scala:134', 
'org.apache.spark.sql.execution.QueryExecution:analyzed$lzycompute:QueryExecution.scala:69', 
'org.apache.spark.sql.execution.QueryExecution:analyzed:QueryExecution.scala:67', 
'org.apache.spark.sql.execution.QueryExecution:assertAnalyzed:QueryExecution.scala:59', 
'org.apache.spark.sql.Dataset$:$anonfun$ofRows$2:Dataset.scala:99', 
'org.apache.spark.sql.SparkSession:withActive:SparkSession.scala:768', 
'org.apache.spark.sql.Dataset$:ofRows:Dataset.scala:97', 
'org.apache.spark.sql.ascend.SessionCatalog:dataframeFor:SessionCatalog.scala:162', 
'org.apache.spark.sql.SparkSession:$anonfun$sql$1:SparkSession.scala:609', 
'org.apache.spark.sql.SparkSession:withActive:SparkSession.scala:768', 
'org.apache.spark.sql.SparkSession:sql:SparkSession.scala:601', 
'org.apache.spark.sql.SQLContext:sql:SQLContext.scala:650', 
'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute:SparkExecuteStatementOperation.scala:280'], 
sqlState=None, 
errorCode=0, 
errorMessage='Error running query: org.apache.spark.sql.AnalysisException: The database name is not valid: MyProject.Zendesk;'), 
operationHandle=None)

However, if I escape the schema name, there is no issue:

SHOW TABLES IN `MyProject.Zendesk`

database           tableName                  isTemporary
-----------------  -------------------------  -----------
MyProject.Zendesk  Consumer_Bundle_Transform  false
MyProject.Zendesk  Zendesk_Webhook            false
DEBUG:impala.hiveserver2:GetResultSetMetadata: resp=TGetResultSetMetadataResp(status=TStatus(statusCode=0, 
infoMessages=None, 
sqlState=None, 
errorCode=None, 
errorMessage=None), 
schema=TTableSchema(columns=[TColumnDesc(columnName='database', 
typeDesc=TTypeDesc(types=[TTypeEntry(primitiveEntry=TPrimitiveTypeEntry(type=7, 
typeQualifiers=None), 
arrayEntry=None, 
mapEntry=None, 
structEntry=None, 
unionEntry=None, 
userDefinedTypeEntry=None)]), 
position=1, 
comment=''), 
TColumnDesc(columnName='tableName', 
typeDesc=TTypeDesc(types=[TTypeEntry(primitiveEntry=TPrimitiveTypeEntry(type=7, 
typeQualifiers=None), 
arrayEntry=None, 
mapEntry=None, 
structEntry=None, 
unionEntry=None, 
userDefinedTypeEntry=None)]), 
position=2, 
comment=''), 
TColumnDesc(columnName='isTemporary', 
typeDesc=TTypeDesc(types=[TTypeEntry(primitiveEntry=TPrimitiveTypeEntry(type=0, 
typeQualifiers=None), 
arrayEntry=None, 
mapEntry=None, 
structEntry=None, 
unionEntry=None, 
userDefinedTypeEntry=None)]), 
position=3, 
comment='')]))

DEBUG:impala.hiveserver2:get_result_schema: schema=[('database', 
'STRING', 
None, 
None, 
None, 
None, 
None), 
('tableName', 
'STRING', 
None, 
None, 
None, 
None, 
None), 
('isTemporary', 
'BOOLEAN', 
None, 
None, 
None, 
None, 
None)]

So, given the above, what part 1 does is escape the schema name to prevent the issue with unescaped schemas (my forthcoming change will strip any existing backticks and re-add them, so already-escaped names are handled too).

What part 2 does is check whether the SHOW TABLES result has more than one column. If it does, it grabs the second column; otherwise it works as it currently does and returns the first column.
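A standalone sketch of that column-selection logic (illustrative only, not the exact patch):

```python
def table_names_from_rows(rows):
    """Pick table names out of SHOW TABLES result rows.

    Plain Hive/Impala return one column (the table name); the Spark
    Thrift Server returns (database, tableName, isTemporary), in which
    case the name lives in the second column."""
    return [row[1] if len(row) > 1 else row[0] for row in rows]

# Rows as returned by the Spark Thrift Server:
spark_rows = [('MyProject.Zendesk', 'Consumer_Bundle_Transform', False),
              ('MyProject.Zendesk', 'Zendesk_Webhook', False)]
print(table_names_from_rows(spark_rows))
# -> ['Consumer_Bundle_Transform', 'Zendesk_Webhook']
```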

attilajeges commented 3 years ago

@danielewood Thanks for the explanation, it makes sense. I trust you tested the patch, right? Can I go ahead and merge it in?

danielewood commented 3 years ago

@attilajeges I have tested these patches in my environment, but I don't have a standard instance to test against, so I can't validate that there are no issues with non-Spark HiveServer2 or Impala.

That said, as long as HiveServer2/Impala accept an escaped (with backticks) database name, there should be no issue whatsoever.

If the following commands work on standard Hive2/Impala, then this patch will have no impact on them:

SHOW TABLES IN `database_name_here`
SELECT * FROM `database_name_here`.table_name_here LIMIT 1

superset_app             | 2021-05-18 16:53:14,581 INFO sqlalchemy.engine.base.Engine SHOW TABLES IN `MyProject.Zendesk`
superset_app             | 2021-05-18 16:53:14,580:DEBUG:impala.hiveserver2:Query finished
superset_app             | 2021-05-18 16:53:14,581:INFO:sqlalchemy.engine.base.Engine:SHOW TABLES IN `MyProject.Zendesk`
superset_app             | 2021-05-18 16:53:14,581:INFO:sqlalchemy.engine.base.Engine:{}
superset_app             | 2021-05-18 16:53:14,581 INFO sqlalchemy.engine.base.Engine {}
superset_app             | 2021-05-18 16:53:14,581:DEBUG:impala.hiveserver2:Executing query SHOW TABLES IN `MyProject.Zendesk`

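Such a compatibility check could also be scripted with impyla itself; a rough sketch (host, port, schema, and the `check_backtick_support` helper are placeholders, not part of impyla):

```python
def check_backtick_support(host, port=10000, schema='default'):
    """Confirm a HiveServer2 endpoint accepts backtick-quoted database
    names, which is all this patch relies on."""
    from impala.dbapi import connect  # impyla's DB API 2.0 entry point

    conn = connect(host=host, port=port)
    cur = conn.cursor()
    cur.execute('SHOW TABLES IN `{0}`'.format(schema))
    rows = cur.fetchall()
    if rows:
        # Single-column results hold the name directly; multi-column
        # (Spark Thrift Server) results hold it in the second column.
        table = rows[0][1] if len(rows[0]) > 1 else rows[0][0]
        cur.execute('SELECT * FROM `{0}`.{1} LIMIT 1'.format(schema, table))
        cur.fetchall()
    cur.close()
    conn.close()
```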
attilajeges commented 3 years ago

@danielewood Thanks for clarifying. Putting database names between backticks works in Impala and Hive, so I think the patch is good to go.