@danielewood Thanks for working on this!
It might be my ignorance about apache/superset, but I find the wording a bit confusing. What do you mean by "hive database partition"? Also, under what circumstances will SHOW TABLES return multiple columns in Hive?
This is mostly my own ignorance of Hive. I can retract this PR, as I have made the necessary fixes in the SQLAlchemy engine specs and will be submitting a PR to superset once we have confirmed that my changes cover all the needed functionality.
The code there is also cleaner and more intelligent, so I'll update this PR to reflect those changes as well.
So, to describe my problem fully: we use Ascend.io, which provides a Hive2 interface using the SparkSQL Thrift JDBC/ODBC Server.
SHOW SCHEMAS
returns a namespace column in the format "DataServiceName.DataFlowName".
For the purposes of discussion here, I will refer to "DataServiceName.DataFlowName" as "schema".
Example:
SHOW SCHEMAS
namespace |
---|
MyProject |
MyProject.Billing |
MyProject.Compliance |
MyProject.Mailchimp |
MyProject.Zendesk |
DEBUG:impala.hiveserver2:get_result_schema: schema=[('namespace', 'STRING', None, None, None, None, None)]
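For reference, this behavior can be reproduced with a few lines of impyla (host and port below are placeholders, not from this thread):

```python
from impala.dbapi import connect

# Placeholder connection details for the Hive2-compatible endpoint.
conn = connect(host="hive.example.com", port=10000)
cur = conn.cursor()

cur.execute("SHOW SCHEMAS")
# Each row comes back as a 1-tuple like ('MyProject.Zendesk',) --
# the dotted DataServiceName.DataFlowName "schema" described above.
for (namespace,) in cur.fetchall():
    print(namespace)
```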
If I then run 'SHOW TABLES IN MyProject.Zendesk', I get an error saying the database name is not valid:
DEBUG:impala.hiveserver2:ExecuteStatement: resp=TExecuteStatementResp(status=TStatus(statusCode=3,
infoMessages=['*org.apache.hive.service.cli.HiveSQLException:Error running query: org.apache.spark.sql.AnalysisException: The database name is not valid: MyProject.Zendesk;:25:24',
'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute:SparkExecuteStatementOperation.scala:321',
'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:runInternal:SparkExecuteStatementOperation.scala:202',
'org.apache.hive.service.cli.operation.Operation:run:Operation.java:278',
'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:org$apache$spark$sql$hive$thriftserver$SparkOperation$$super$run:SparkExecuteStatementOperation.scala:46',
'org.apache.spark.sql.hive.thriftserver.SparkOperation:$anonfun$run$1:SparkOperation.scala:44',
'scala.runtime.java8.JFunction0$mcV$sp:apply:JFunction0$mcV$sp.java:23',
'org.apache.spark.sql.hive.thriftserver.SparkOperation:withLocalProperties:SparkOperation.scala:78',
'org.apache.spark.sql.hive.thriftserver.SparkOperation:withLocalProperties$:SparkOperation.scala:62',
'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:withLocalProperties:SparkExecuteStatementOperation.scala:46',
'org.apache.spark.sql.hive.thriftserver.SparkOperation:run:SparkOperation.scala:44',
'org.apache.spark.sql.hive.thriftserver.SparkOperation:run$:SparkOperation.scala:42',
'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:run:SparkExecuteStatementOperation.scala:46',
'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:484',
'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementAsync:HiveSessionImpl.java:472',
'org.apache.hive.service.cli.CLIService:executeStatementAsync:CLIService.java:310',
'org.apache.hive.service.cli.thrift.ThriftCLIService:ExecuteStatement:ThriftCLIService.java:438',
'org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1437',
'org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1422',
'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39',
'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39',
'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:53',
'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286',
'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1128',
'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:628',
'java.lang.Thread:run:Thread.java:834',
'*org.apache.spark.sql.AnalysisException:The database name is not valid: MyProject.Zendesk;:69:44',
'org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog$$anonfun$apply$1:applyOrElse:ResolveSessionCatalog.scala:384',
'org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog$$anonfun$apply$1:applyOrElse:ResolveSessionCatalog.scala:48',
'org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper:$anonfun$resolveOperatorsUp$3:AnalysisHelper.scala:90',
'org.apache.spark.sql.catalyst.trees.CurrentOrigin$:withOrigin:TreeNode.scala:72',
'org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper:$anonfun$resolveOperatorsUp$1:AnalysisHelper.scala:90',
'org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$:allowInvokingTransformsInAnalyzer:AnalysisHelper.scala:194',
'org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper:resolveOperatorsUp:AnalysisHelper.scala:86',
'org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper:resolveOperatorsUp$:AnalysisHelper.scala:84',
'org.apache.spark.sql.catalyst.plans.logical.LogicalPlan:resolveOperatorsUp:LogicalPlan.scala:29',
'org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog:apply:ResolveSessionCatalog.scala:48',
'org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog:apply:ResolveSessionCatalog.scala:39',
'org.apache.spark.sql.catalyst.rules.RuleExecutor:$anonfun$execute$2:RuleExecutor.scala:149',
'scala.collection.LinearSeqOptimized:foldLeft:LinearSeqOptimized.scala:126',
'scala.collection.LinearSeqOptimized:foldLeft$:LinearSeqOptimized.scala:122',
'scala.collection.immutable.List:foldLeft:List.scala:89',
'org.apache.spark.sql.catalyst.rules.RuleExecutor:$anonfun$execute$1:RuleExecutor.scala:146',
'org.apache.spark.sql.catalyst.rules.RuleExecutor:$anonfun$execute$1$adapted:RuleExecutor.scala:138',
'scala.collection.immutable.List:foreach:List.scala:392',
'org.apache.spark.sql.catalyst.rules.RuleExecutor:execute:RuleExecutor.scala:138',
'org.apache.spark.sql.catalyst.analysis.Analyzer:org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext:Analyzer.scala:176',
'org.apache.spark.sql.catalyst.analysis.Analyzer:execute:Analyzer.scala:170',
'org.apache.spark.sql.catalyst.analysis.Analyzer:execute:Analyzer.scala:130',
'org.apache.spark.sql.catalyst.rules.RuleExecutor:$anonfun$executeAndTrack$1:RuleExecutor.scala:116',
'org.apache.spark.sql.catalyst.QueryPlanningTracker$:withTracker:QueryPlanningTracker.scala:88',
'org.apache.spark.sql.catalyst.rules.RuleExecutor:executeAndTrack:RuleExecutor.scala:116',
'org.apache.spark.sql.catalyst.analysis.Analyzer:$anonfun$executeAndCheck$1:Analyzer.scala:154',
'org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$:markInAnalyzer:AnalysisHelper.scala:201',
'org.apache.spark.sql.catalyst.analysis.Analyzer:executeAndCheck:Analyzer.scala:153',
'org.apache.spark.sql.execution.QueryExecution:$anonfun$analyzed$1:QueryExecution.scala:69',
'org.apache.spark.sql.catalyst.QueryPlanningTracker:measurePhase:QueryPlanningTracker.scala:111',
'org.apache.spark.sql.execution.QueryExecution:$anonfun$executePhase$1:QueryExecution.scala:134',
'org.apache.spark.sql.SparkSession:withActive:SparkSession.scala:768',
'org.apache.spark.sql.execution.QueryExecution:executePhase:QueryExecution.scala:134',
'org.apache.spark.sql.execution.QueryExecution:analyzed$lzycompute:QueryExecution.scala:69',
'org.apache.spark.sql.execution.QueryExecution:analyzed:QueryExecution.scala:67',
'org.apache.spark.sql.execution.QueryExecution:assertAnalyzed:QueryExecution.scala:59',
'org.apache.spark.sql.Dataset$:$anonfun$ofRows$2:Dataset.scala:99',
'org.apache.spark.sql.SparkSession:withActive:SparkSession.scala:768',
'org.apache.spark.sql.Dataset$:ofRows:Dataset.scala:97',
'org.apache.spark.sql.ascend.SessionCatalog:dataframeFor:SessionCatalog.scala:162',
'org.apache.spark.sql.SparkSession:$anonfun$sql$1:SparkSession.scala:609',
'org.apache.spark.sql.SparkSession:withActive:SparkSession.scala:768',
'org.apache.spark.sql.SparkSession:sql:SparkSession.scala:601',
'org.apache.spark.sql.SQLContext:sql:SQLContext.scala:650',
'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute:SparkExecuteStatementOperation.scala:280'],
sqlState=None,
errorCode=0,
errorMessage='Error running query: org.apache.spark.sql.AnalysisException: The database name is not valid: MyProject.Zendesk;'),
operationHandle=None)
However, if I escape the schema name, there is no issue:
SHOW TABLES IN `MyProject.Zendesk`
database | tableName | isTemporary |
---|---|---|
MyProject.Zendesk | Consumer_Bundle_Transform | false |
MyProject.Zendesk | Zendesk_Webhook | false |
DEBUG:impala.hiveserver2:GetResultSetMetadata: resp=TGetResultSetMetadataResp(status=TStatus(statusCode=0,
infoMessages=None,
sqlState=None,
errorCode=None,
errorMessage=None),
schema=TTableSchema(columns=[TColumnDesc(columnName='database',
typeDesc=TTypeDesc(types=[TTypeEntry(primitiveEntry=TPrimitiveTypeEntry(type=7,
typeQualifiers=None),
arrayEntry=None,
mapEntry=None,
structEntry=None,
unionEntry=None,
userDefinedTypeEntry=None)]),
position=1,
comment=''),
TColumnDesc(columnName='tableName',
typeDesc=TTypeDesc(types=[TTypeEntry(primitiveEntry=TPrimitiveTypeEntry(type=7,
typeQualifiers=None),
arrayEntry=None,
mapEntry=None,
structEntry=None,
unionEntry=None,
userDefinedTypeEntry=None)]),
position=2,
comment=''),
TColumnDesc(columnName='isTemporary',
typeDesc=TTypeDesc(types=[TTypeEntry(primitiveEntry=TPrimitiveTypeEntry(type=0,
typeQualifiers=None),
arrayEntry=None,
mapEntry=None,
structEntry=None,
unionEntry=None,
userDefinedTypeEntry=None)]),
position=3,
comment='')]))
DEBUG:impala.hiveserver2:get_result_schema: schema=[('database',
'STRING',
None,
None,
None,
None,
None),
('tableName',
'STRING',
None,
None,
None,
None,
None),
('isTemporary',
'BOOLEAN',
None,
None,
None,
None,
None)]
So, given the above: what change 1 does is escape the schema name to prevent the issue with unescaped schemas (my forthcoming change will strip and re-insert the backticks if they are already present). What change 2 does is check whether SHOW TABLES returns more than one column; if it does, it grabs the second column. Otherwise, it works as it currently does and returns the first column.
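For illustration, a minimal sketch of that combined logic against a plain DB-API cursor might look like this (the helper name and structure are hypothetical, not the exact patch):

```python
def get_table_names(cursor, schema):
    # Part 1: strip any existing backticks, then always re-quote the
    # schema so dotted names like MyProject.Zendesk are not parsed as
    # catalog.schema by Spark's analyzer.
    cursor.execute("SHOW TABLES IN `{0}`".format(schema.strip("`")))
    rows = cursor.fetchall()
    # Part 2: Spark's Thrift server returns three columns
    # (database, tableName, isTemporary), while classic Hive2/Impala
    # return a single column. With multiple columns, the table name
    # is in the second column; otherwise, take the first as before.
    if rows and len(rows[0]) > 1:
        return [row[1] for row in rows]
    return [row[0] for row in rows]
```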
@danielewood Thanks for the explanation, it makes sense. I trust you tested the patch, right? Can I go ahead and merge it in?
@attilajeges I have tested these patches in my environment, but I don't have a standard instance to test against to validate that there are no issues with non-Spark Hive2 or Impala.
That said, as long as Hive2/Impala accept a backtick-escaped database name, there should be no issue whatsoever.
If the following commands work on standard Hive2/Impala, then this patch will have no impact on them:
SHOW TABLES IN `database_name_here`
SELECT * FROM `database_name_here`.table_name_here LIMIT 1
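For anyone who wants to verify, a quick check with impyla might look like this (host, port, and table name are placeholders):

```python
from impala.dbapi import connect

# Placeholders: point these at a standard Hive2/Impala endpoint.
conn = connect(host="hive.example.com", port=10000)
cur = conn.cursor()

# If both statements succeed, backtick-quoted database names are
# accepted and this patch is a no-op for that backend.
cur.execute("SHOW TABLES IN `default`")
print(cur.fetchall())
cur.execute("SELECT * FROM `default`.my_table LIMIT 1")
print(cur.fetchone())
```

The superset log below shows the escaped query being issued in my environment: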
superset_app | 2021-05-18 16:53:14,581 INFO sqlalchemy.engine.base.Engine SHOW TABLES IN `MyProject.Zendesk`
superset_app | 2021-05-18 16:53:14,580:DEBUG:impala.hiveserver2:Query finished
superset_app | 2021-05-18 16:53:14,581:INFO:sqlalchemy.engine.base.Engine:SHOW TABLES IN `MyProject.Zendesk`
superset_app | 2021-05-18 16:53:14,581:INFO:sqlalchemy.engine.base.Engine:{}
superset_app | 2021-05-18 16:53:14,581 INFO sqlalchemy.engine.base.Engine {}
superset_app | 2021-05-18 16:53:14,581:DEBUG:impala.hiveserver2:Executing query SHOW TABLES IN `MyProject.Zendesk`
@danielewood Thanks for clarifying. Putting database names between backticks works in Impala and Hive, so I think the patch is good to go.
This change has two parts:
1. Always escape the schema name when running SHOW TABLES. This fixes an error in apache/superset when querying a Hive database (schema) whose name contains a dot.
2. If SHOW TABLES returns more than one column, assume Hive is returning the database name alongside the table name, and take the table name from the second column. This fixes https://github.com/apache/superset/issues/8027