cloudera / hue

Open source SQL Query Assistant service for Databases/Warehouses
https://cloudera.com
Apache License 2.0

Error showing tables when using Spark thriftserver org.apache.hive.service.cli.HiveSQLException: Error operating GET_SCHEMAS Dangling meta character '*' near index 0 * ^ #2252

Closed by simonvanderveldt 3 years ago

simonvanderveldt commented 3 years ago

When using the Spark thriftserver, Hue shows an error and the list of tables stays empty. The error message shown by Hue is:

Error operating GET_SCHEMAS Dangling meta character '*' near index 0 * ^

The Hue logs contain the following:

hue_1                 | [16/Jun/2021 03:10:09 -0700] api          ERROR    Autocomplete data fetching error
hue_1                 | Traceback (most recent call last):
hue_1                 |   File "/usr/share/hue/apps/beeswax/src/beeswax/api.py", line 120, in _autocomplete
hue_1                 |     response['databases'] = db.get_databases()
hue_1                 |   File "/usr/share/hue/apps/beeswax/src/beeswax/server/dbms.py", line 337, in get_databases
hue_1                 |     databases = self.client.get_databases(schemaName=database_names)
hue_1                 |   File "/usr/share/hue/apps/beeswax/src/beeswax/server/hive_server2_lib.py", line 1504, in get_databases
hue_1                 |     return [table[col] for table in self._client.get_databases(schemaName)]
hue_1                 |   File "/usr/share/hue/apps/beeswax/src/beeswax/server/hive_server2_lib.py", line 820, in get_databases
hue_1                 |     (res, session) = self.call(self._client.GetSchemas, req)
hue_1                 |   File "/usr/share/hue/apps/beeswax/src/beeswax/server/hive_server2_lib.py", line 732, in call
hue_1                 |     return self.call_return_result_and_session(fn, req, status, session=session)
hue_1                 |   File "/usr/share/hue/apps/beeswax/src/beeswax/server/hive_server2_lib.py", line 771, in call_return_result_and_session
hue_1                 |     return self._call_return_result_and_session(fn, req, status=status, session=session)
hue_1                 |   File "/usr/share/hue/apps/beeswax/src/beeswax/server/hive_server2_lib.py", line 794, in _call_return_result_and_session
hue_1                 |     raise QueryServerException(Exception(message), message=message)
hue_1                 | beeswax.server.dbms.QueryServerException: Error operating GET_SCHEMAS Dangling meta character '*' near index 0
hue_1                 | *
hue_1                 | ^

The Spark thriftserver logs show this:

spark-thriftserver_1  | 21/06/16 10:10:09 INFO SparkGetSchemasOperation: Listing databases 'catalog : null, schemaPattern : *' with dee70cc0-a309-4969-a0e0-94abc0da27aa
spark-thriftserver_1  | 21/06/16 10:10:09 WARN ShellBasedUnixGroupsMapping: got exception trying to get groups for user admin: id: ‘admin’: no such user
spark-thriftserver_1  | id: ‘admin’: no such user
spark-thriftserver_1  | 
spark-thriftserver_1  | 21/06/16 10:10:09 INFO SQLStdHiveAccessController: Created SQLStdHiveAccessController for session context : HiveAuthzSessionContext [sessionString=f082e719-2034-4a0f-b0d4-a6634f842a70, clientType=HIVESERVER2]
spark-thriftserver_1  | 21/06/16 10:10:09 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
spark-thriftserver_1  | 21/06/16 10:10:09 INFO metastore: Trying to connect to metastore with URI thrift://hive-metastore:9083
spark-thriftserver_1  | 21/06/16 10:10:09 INFO metastore: Opened a connection to metastore, current connections: 2
spark-thriftserver_1  | 21/06/16 10:10:09 INFO metastore: Connected to metastore.
spark-thriftserver_1  | 21/06/16 10:10:09 INFO metastore: Closed a connection to metastore, current connections: 1
spark-thriftserver_1  | 21/06/16 10:10:09 INFO metastore: Trying to connect to metastore with URI thrift://hive-metastore:9083
spark-thriftserver_1  | 21/06/16 10:10:09 INFO metastore: Opened a connection to metastore, current connections: 2
spark-thriftserver_1  | 21/06/16 10:10:09 INFO metastore: Connected to metastore.
spark-thriftserver_1  | 21/06/16 10:10:09 INFO metastore: Closed a connection to metastore, current connections: 1
spark-thriftserver_1  | 21/06/16 10:10:09 INFO metastore: Trying to connect to metastore with URI thrift://hive-metastore:9083
spark-thriftserver_1  | 21/06/16 10:10:09 INFO metastore: Opened a connection to metastore, current connections: 2
spark-thriftserver_1  | 21/06/16 10:10:09 INFO metastore: Connected to metastore.
spark-thriftserver_1  | 21/06/16 10:10:09 INFO metastore: Closed a connection to metastore, current connections: 1
spark-thriftserver_1  | 21/06/16 10:10:09 INFO metastore: Trying to connect to metastore with URI thrift://hive-metastore:9083
spark-thriftserver_1  | 21/06/16 10:10:09 INFO metastore: Opened a connection to metastore, current connections: 2
spark-thriftserver_1  | 21/06/16 10:10:09 INFO metastore: Connected to metastore.
spark-thriftserver_1  | 21/06/16 10:10:09 INFO metastore: Closed a connection to metastore, current connections: 1
spark-thriftserver_1  | 21/06/16 10:10:09 INFO metastore: Trying to connect to metastore with URI thrift://hive-metastore:9083
spark-thriftserver_1  | 21/06/16 10:10:09 INFO metastore: Opened a connection to metastore, current connections: 2
spark-thriftserver_1  | 21/06/16 10:10:09 INFO metastore: Connected to metastore.
spark-thriftserver_1  | 21/06/16 10:10:09 ERROR SparkGetSchemasOperation: Error operating GET_SCHEMAS with dee70cc0-a309-4969-a0e0-94abc0da27aa
spark-thriftserver_1  | java.util.regex.PatternSyntaxException: Dangling meta character '*' near index 0
spark-thriftserver_1  | *
spark-thriftserver_1  | ^
spark-thriftserver_1  |     at java.base/java.util.regex.Pattern.error(Unknown Source)
spark-thriftserver_1  |     at java.base/java.util.regex.Pattern.sequence(Unknown Source)
spark-thriftserver_1  |     at java.base/java.util.regex.Pattern.expr(Unknown Source)
spark-thriftserver_1  |     at java.base/java.util.regex.Pattern.compile(Unknown Source)
spark-thriftserver_1  |     at java.base/java.util.regex.Pattern.<init>(Unknown Source)
spark-thriftserver_1  |     at java.base/java.util.regex.Pattern.compile(Unknown Source)
spark-thriftserver_1  |     at org.apache.spark.sql.hive.thriftserver.SparkGetSchemasOperation.runInternal(SparkGetSchemasOperation.scala:76)
spark-thriftserver_1  |     at org.apache.hive.service.cli.operation.Operation.run(Operation.java:278)
spark-thriftserver_1  |     at org.apache.spark.sql.hive.thriftserver.SparkGetSchemasOperation.org$apache$spark$sql$hive$thriftserver$SparkOperation$$super$run(SparkGetSchemasOperation.scala:39)
spark-thriftserver_1  |     at org.apache.spark.sql.hive.thriftserver.SparkOperation.$anonfun$run$1(SparkOperation.scala:44)
spark-thriftserver_1  |     at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
spark-thriftserver_1  |     at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:78)
spark-thriftserver_1  |     at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:62)
spark-thriftserver_1  |     at org.apache.spark.sql.hive.thriftserver.SparkGetSchemasOperation.withLocalProperties(SparkGetSchemasOperation.scala:39)
spark-thriftserver_1  |     at org.apache.spark.sql.hive.thriftserver.SparkOperation.run(SparkOperation.scala:44)
spark-thriftserver_1  |     at org.apache.spark.sql.hive.thriftserver.SparkOperation.run$(SparkOperation.scala:42)
spark-thriftserver_1  |     at org.apache.spark.sql.hive.thriftserver.SparkGetSchemasOperation.run(SparkGetSchemasOperation.scala:39)
spark-thriftserver_1  |     at org.apache.hive.service.cli.session.HiveSessionImpl.getSchemas(HiveSessionImpl.java:548)
spark-thriftserver_1  |     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
spark-thriftserver_1  |     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
spark-thriftserver_1  |     at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
spark-thriftserver_1  |     at java.base/java.lang.reflect.Method.invoke(Unknown Source)
spark-thriftserver_1  |     at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
spark-thriftserver_1  |     at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
spark-thriftserver_1  |     at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
spark-thriftserver_1  |     at java.base/java.security.AccessController.doPrivileged(Native Method)
spark-thriftserver_1  |     at java.base/javax.security.auth.Subject.doAs(Unknown Source)
spark-thriftserver_1  |     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
spark-thriftserver_1  |     at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
spark-thriftserver_1  |     at com.sun.proxy.$Proxy21.getSchemas(Unknown Source)
spark-thriftserver_1  |     at org.apache.hive.service.cli.CLIService.getSchemas(CLIService.java:349)
spark-thriftserver_1  |     at org.apache.hive.service.cli.thrift.ThriftCLIService.GetSchemas(ThriftCLIService.java:499)
spark-thriftserver_1  |     at org.apache.hive.service.rpc.thrift.TCLIService$Processor$GetSchemas.getResult(TCLIService.java:1617)
spark-thriftserver_1  |     at org.apache.hive.service.rpc.thrift.TCLIService$Processor$GetSchemas.getResult(TCLIService.java:1602)
spark-thriftserver_1  |     at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
spark-thriftserver_1  |     at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
spark-thriftserver_1  |     at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53)
spark-thriftserver_1  |     at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310)
spark-thriftserver_1  |     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
spark-thriftserver_1  |     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
spark-thriftserver_1  |     at java.base/java.lang.Thread.run(Unknown Source)
spark-thriftserver_1  | 21/06/16 10:10:09 INFO SparkGetSchemasOperation: Close statement with dee70cc0-a309-4969-a0e0-94abc0da27aa
spark-thriftserver_1  | 21/06/16 10:10:09 WARN ThriftCLIService: Error getting schemas: 
spark-thriftserver_1  | org.apache.hive.service.cli.HiveSQLException: Error operating GET_SCHEMAS Dangling meta character '*' near index 0
spark-thriftserver_1  | *
spark-thriftserver_1  | ^
spark-thriftserver_1  |     at org.apache.spark.sql.hive.thriftserver.SparkOperation$$anonfun$onError$1.applyOrElse(SparkOperation.scala:105)
spark-thriftserver_1  |     at org.apache.spark.sql.hive.thriftserver.SparkOperation$$anonfun$onError$1.applyOrElse(SparkOperation.scala:97)
spark-thriftserver_1  |     at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
spark-thriftserver_1  |     at org.apache.spark.sql.hive.thriftserver.SparkGetSchemasOperation.runInternal(SparkGetSchemasOperation.scala:82)
spark-thriftserver_1  |     at org.apache.hive.service.cli.operation.Operation.run(Operation.java:278)
spark-thriftserver_1  |     at org.apache.spark.sql.hive.thriftserver.SparkGetSchemasOperation.org$apache$spark$sql$hive$thriftserver$SparkOperation$$super$run(SparkGetSchemasOperation.scala:39)
spark-thriftserver_1  |     at org.apache.spark.sql.hive.thriftserver.SparkOperation.$anonfun$run$1(SparkOperation.scala:44)
spark-thriftserver_1  |     at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
spark-thriftserver_1  |     at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:78)
spark-thriftserver_1  |     at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:62)
spark-thriftserver_1  |     at org.apache.spark.sql.hive.thriftserver.SparkGetSchemasOperation.withLocalProperties(SparkGetSchemasOperation.scala:39)
spark-thriftserver_1  |     at org.apache.spark.sql.hive.thriftserver.SparkOperation.run(SparkOperation.scala:44)
spark-thriftserver_1  |     at org.apache.spark.sql.hive.thriftserver.SparkOperation.run$(SparkOperation.scala:42)
spark-thriftserver_1  |     at org.apache.spark.sql.hive.thriftserver.SparkGetSchemasOperation.run(SparkGetSchemasOperation.scala:39)
spark-thriftserver_1  |     at org.apache.hive.service.cli.session.HiveSessionImpl.getSchemas(HiveSessionImpl.java:548)
spark-thriftserver_1  |     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
spark-thriftserver_1  |     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
spark-thriftserver_1  |     at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
spark-thriftserver_1  |     at java.base/java.lang.reflect.Method.invoke(Unknown Source)
spark-thriftserver_1  |     at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
spark-thriftserver_1  |     at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
spark-thriftserver_1  |     at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
spark-thriftserver_1  |     at java.base/java.security.AccessController.doPrivileged(Native Method)
spark-thriftserver_1  |     at java.base/javax.security.auth.Subject.doAs(Unknown Source)
spark-thriftserver_1  |     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
spark-thriftserver_1  |     at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
spark-thriftserver_1  |     at com.sun.proxy.$Proxy21.getSchemas(Unknown Source)
spark-thriftserver_1  |     at org.apache.hive.service.cli.CLIService.getSchemas(CLIService.java:349)
spark-thriftserver_1  |     at org.apache.hive.service.cli.thrift.ThriftCLIService.GetSchemas(ThriftCLIService.java:499)
spark-thriftserver_1  |     at org.apache.hive.service.rpc.thrift.TCLIService$Processor$GetSchemas.getResult(TCLIService.java:1617)
spark-thriftserver_1  |     at org.apache.hive.service.rpc.thrift.TCLIService$Processor$GetSchemas.getResult(TCLIService.java:1602)
spark-thriftserver_1  |     at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
spark-thriftserver_1  |     at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
spark-thriftserver_1  |     at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53)
spark-thriftserver_1  |     at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310)
spark-thriftserver_1  |     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
spark-thriftserver_1  |     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
spark-thriftserver_1  |     at java.base/java.lang.Thread.run(Unknown Source)
spark-thriftserver_1  | Caused by: java.util.regex.PatternSyntaxException: Dangling meta character '*' near index 0
spark-thriftserver_1  | *
spark-thriftserver_1  | ^
spark-thriftserver_1  |     at java.base/java.util.regex.Pattern.error(Unknown Source)
spark-thriftserver_1  |     at java.base/java.util.regex.Pattern.sequence(Unknown Source)
spark-thriftserver_1  |     at java.base/java.util.regex.Pattern.expr(Unknown Source)
spark-thriftserver_1  |     at java.base/java.util.regex.Pattern.compile(Unknown Source)
spark-thriftserver_1  |     at java.base/java.util.regex.Pattern.<init>(Unknown Source)
spark-thriftserver_1  |     at java.base/java.util.regex.Pattern.compile(Unknown Source)
spark-thriftserver_1  |     at org.apache.spark.sql.hive.thriftserver.SparkGetSchemasOperation.runInternal(SparkGetSchemasOperation.scala:76)
spark-thriftserver_1  |     ... 34 more

After some digging, it seems that Spark's thriftserver doesn't accept the * that's being passed as the schema name, so I made a change setting req.schemaName = None, similar to what's being done for Impala: https://github.com/cloudera/hue/blob/3e60d9fe893ffb7294716b206cec935c11888fe1/apps/beeswax/src/beeswax/server/hive_server2_lib.py#L818

After that it works fine.
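The failure itself is easy to reproduce outside of Spark. As a rough illustration in Python rather than Java (Python's re module is only an analogy for java.util.regex here, and the helper name compiles is made up for this example), a bare * is a quantifier with nothing in front of it and does not compile, whereas .* does:

```python
import re


def compiles(pattern):
    """Return True if `pattern` is accepted as a regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # Raised for invalid patterns, e.g. a quantifier with nothing to repeat.
        return False


# A bare '*' has nothing to repeat, so the regex engine rejects it.
print("'*' compiles:", compiles("*"))
# '.*' (any character, zero or more times) is a valid pattern.
print("'.*' compiles:", compiles(".*"))
```

This matches the stack trace above, where Pattern.compile is what fails on the pattern *.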

I'm not sure this is the correct fix, though. It could also be a bug in Spark's thriftserver, in that it's not fully compatible with hiveserver2. I don't have a regular Hive environment to check whether the * works there, so I can't tell.

Versions used:

I found this issue https://github.com/cloudera/hue/issues/850 and this PR, which seem to discuss similar problems, but the PR contains a lot more changes. Both the issue and the PR were also closed without a fix.

romainr commented 3 years ago

Thanks for the details!

Indeed, would you send a PR to add sparksql as a dialect where we should send None? (same line and same as Impala)
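A minimal sketch of what such a change could look like, assuming the dialect is available under the 'dialect' key of the query_server dict. The names DIALECTS_WITHOUT_SCHEMA_PATTERN and schema_name_for_request are hypothetical and not part of Hue; the real change would belong in get_databases in hive_server2_lib.py:

```python
# Hypothetical names for illustration only; this is not Hue's actual code.
# Dialects for which no schema pattern should be sent in the GetSchemas request.
DIALECTS_WITHOUT_SCHEMA_PATTERN = ("impala", "sparksql")


def schema_name_for_request(query_server, schema_name):
    """Return the value to put in req.schemaName for the given server config.

    For the dialects listed above, None is sent instead of a pattern such as '*'.
    """
    if query_server.get("dialect") in DIALECTS_WITHOUT_SCHEMA_PATTERN:
        return None
    return schema_name


# Spark SQL: the '*' pattern is dropped.
print(schema_name_for_request({"dialect": "sparksql"}, "*"))
# Any other dialect: the pattern is passed through unchanged.
print(schema_name_for_request({"dialect": "beeswax"}, "*"))
```

Keeping the dialects in one tuple means a further server with the same limitation would only need one more entry.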

FYI: there was also some good testing done with Hue's SqlAlchemy interface (rather than Hue's native Hive Thrift implementation): https://gethue.com/blog/querying-spark-sql-with-spark-thrift-server-and-hue-editor/

simonvanderveldt commented 3 years ago

@romainr Thanks for the quick reply. I can create a PR no problem.

Do you think/know if it's the correct fix? Or should I create a bug for Spark because the thriftserver isn't 100% compatible with hiveserver2?

Thanks for the link! I prefer the direct connection to the thriftserver because it's one less component and because of the issues with blocking/long running queries mentioned on the blog post. I'm hoping it's good enough or we can get it to good enough soon :)

One question: in this image from the blog post, a list of tables is shown. Were you using something other than Spark's thriftserver to populate it, or did you not run into this issue?

romainr commented 3 years ago

Good point, just by experience it is much simpler to just "workaround" it in Hue so +1 for the PR ;)

Indeed for the async queries in SqlAlchemy!

For the image, I don't remember. Are the INSERTs failing? Maybe there are some corner cases too.

simonvanderveldt commented 3 years ago

@romainr I'm not sure if I'm doing something wrong, but the dialect is shown as beeswax. This is the full dict from self.query_server

{'server_name': 'beeswax', 'server_host': 'spark-thriftserver', 'server_port': 10000, 'principal': None, 'http_url': 'http://spark-thriftserver:10001/cliservice', 'transport_mode': 'socket', 'auth_username': 'hue', 'auth_password': None, 'use_sasl': True, 'close_sessions': False, 'has_session_pool': False, 'max_number_of_sessions': 1, 'dialect': 'beeswax'}

For context, my hue.ini looks like this:

[desktop]
  time_zone=Europe/Amsterdam

  [[database]]
    engine=postgres
    host=postgres
    user=hue
    password=hue
    name=hue

[beeswax]
  hive_server_host=spark-thriftserver
  hive_server_port=10000
  hive_metastore_host=hive-metastore

[spark]
  sql_server_host=spark-thriftserver
  sql_server_port=10000

[notebook]
  [[interpreters]]
    [[[hive]]]
      name=Hive
      interface=hiveserver2
    [[[sparksql]]]
      name=Spark SQL
      interface=hiveserver2

> For the image, I don't remember. Are the INSERTs failing? Maybe there are some corner cases too.

It seems like everything is working fine after setting schemaName=None.

romainr commented 3 years ago

I'd recommend just getting a quick PR in; we can find a better solution when there is more bandwidth and we are working on the new connector system.

simonvanderveldt commented 3 years ago

@romainr I'm not really sure how to proceed. Nothing in the self.query_server dict tells us that we're talking to Spark's thriftserver, and I assume we don't want to set schemaName = None for beeswax, since that would also apply to Hive's hiveserver2. Fixing it only for sparksql would already be an improvement, but I don't see how to determine that we're running sparksql in this part of the code.

romainr commented 3 years ago

If you print the query_server dict, are you sure you don't see any sparksql? (i.e. don't use the 'hive' interpreter but the 'sparksql' one) https://github.com/cloudera/hue/blob/master/apps/beeswax/src/beeswax/server/dbms.py#L241

simonvanderveldt commented 3 years ago

> If you print the query_server dict, are you sure you don't see any sparksql? (i.e. don't use the 'hive' interpreter but the 'sparksql' one) https://github.com/cloudera/hue/blob/master/apps/beeswax/src/beeswax/server/dbms.py#L241

@romainr You're right: when using the sparksql interpreter I do see sparksql in self.query_server:

{'server_name': 'sparksql', 'server_host': 'spark-thriftserver', 'server_port': 10000, 'principal': None, 'http_url': 'http://spark-thriftserver:10001/cliservice', 'transport_mode': 'socket', 'auth_username': 'hue', 'auth_password': None, 'use_sasl': True, 'close_sessions': False, 'has_session_pool': False, 'max_number_of_sessions': 1, 'dialect': 'sparksql'}

I can create a PR for this.

But when using the Hive interpreter (which uses the same hiveserver2 interface to the same Spark thriftserver), there's no way to differentiate between hiveserver2 and Spark's thriftserver:

{'server_name': 'beeswax', 'server_host': 'spark-thriftserver', 'server_port': 10000, 'principal': None, 'http_url': 'http://spark-thriftserver:10001/cliservice', 'transport_mode': 'socket', 'auth_username': 'hue', 'auth_password': None, 'use_sasl': True, 'close_sessions': False, 'has_session_pool': False, 'max_number_of_sessions': 1, 'dialect': 'beeswax'}

We only need SparkSQL, so I'll simply remove the hive interpreter from our config. I don't know whether there's any reason to use the hive interpreter on top of Spark's thriftserver, or whether one should always use the sparksql interpreter instead.
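To make the comparison concrete, here is a small Python sketch that diffs the two query_server dicts printed above. The variable and function names are illustrative only; the values are copied from the output in this comment:

```python
# query_server as seen with the sparksql interpreter (copied from the output above).
sparksql_server = {
    'server_name': 'sparksql', 'server_host': 'spark-thriftserver', 'server_port': 10000,
    'principal': None, 'http_url': 'http://spark-thriftserver:10001/cliservice',
    'transport_mode': 'socket', 'auth_username': 'hue', 'auth_password': None,
    'use_sasl': True, 'close_sessions': False, 'has_session_pool': False,
    'max_number_of_sessions': 1, 'dialect': 'sparksql',
}

# query_server as seen with the hive interpreter (copied from the output above).
hive_server = {
    'server_name': 'beeswax', 'server_host': 'spark-thriftserver', 'server_port': 10000,
    'principal': None, 'http_url': 'http://spark-thriftserver:10001/cliservice',
    'transport_mode': 'socket', 'auth_username': 'hue', 'auth_password': None,
    'use_sasl': True, 'close_sessions': False, 'has_session_pool': False,
    'max_number_of_sessions': 1, 'dialect': 'beeswax',
}


def differing_keys(a, b):
    """Map each key whose value differs between a and b to the (a, b) value pair."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


print(differing_keys(sparksql_server, hive_server))
```

Only server_name and dialect differ; host, port and transport settings are identical, so the interpreter choice is the only signal available in this dict.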

romainr commented 3 years ago

No reason to have the Hive dialect in your case indeed!

simonvanderveldt commented 3 years ago

OK, then I guess we can close this. Thanks for the help and quick responses! :+1:

pratik4891 commented 1 year ago

Can anyone please let me know how to resolve this issue? I'm currently getting the same error for a Hive connection from Hue.