simonvanderveldt closed this issue 3 years ago
Thanks for the details!
Indeed, would you send a PR to add sparksql as a dialect where we should send None? (same line, and same as Impala)
FYI: there was also some good testing using the SqlAlchemy interface of Hue https://gethue.com/blog/querying-spark-sql-with-spark-thrift-server-and-hue-editor/ (and not Hue's native Hive Thrift implementation)
@romainr Thanks for the quick reply. I can create a PR no problem.
Do you think/know if it's the correct fix? Or should I create a bug for Spark because the thriftserver isn't 100% compatible with hiveserver2?
Thanks for the link! I prefer the direct connection to the thriftserver because it's one less component and because of the issues with blocking/long running queries mentioned on the blog post. I'm hoping it's good enough or we can get it to good enough soon :)
One question: in this image from the blog post it does show a list of tables. Were you using something other than Spark's thriftserver to populate this, or did you not run into this issue?
Good point, from experience it is much simpler to just "work around" it in Hue, so +1 for the PR ;)
Indeed for the async queries in SqlAlchemy!
For the image, I don't remember. Are the INSERTs failing? Maybe there are some corner cases too.
@romainr I'm not sure if I'm doing something wrong, but the dialect is shown as beeswax.
This is the full dict from self.query_server:
{'server_name': 'beeswax', 'server_host': 'spark-thriftserver', 'server_port': 10000, 'principal': None, 'http_url': 'http://spark-thriftserver:10001/cliservice', 'transport_mode': 'socket', 'auth_username': 'hue', 'auth_password': None, 'use_sasl': True, 'close_sessions': False, 'has_session_pool': False, 'max_number_of_sessions': 1, 'dialect': 'beeswax'}
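To make the problem concrete, here is a hypothetical debugging snippet that reproduces (as a literal) the dict printed above when the 'hive' interpreter is pointed at the Spark thriftserver. The key observation is that the dialect is 'beeswax', the same value as for a real HiveServer2, so nothing in the dict identifies the backend as Spark:

```python
# Literal copy of the (truncated) query_server dict from this comment;
# only the fields relevant to the discussion are kept.
query_server = {
    'server_name': 'beeswax',
    'server_host': 'spark-thriftserver',
    'server_port': 10000,
    'dialect': 'beeswax',  # same value as for a real HiveServer2
}
print(query_server['dialect'])  # prints: beeswax
```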
For context, my hue.ini looks like this:
[desktop]
time_zone=Europe/Amsterdam
[[database]]
engine=postgres
host=postgres
user=hue
password=hue
name=hue
[beeswax]
hive_server_host=spark-thriftserver
hive_server_port=10000
hive_metastore_host=hive-metastore
[spark]
sql_server_host=spark-thriftserver
sql_server_port=10000
[notebook]
[[interpreters]]
[[[hive]]]
name=Hive
interface=hiveserver2
[[[sparksql]]]
name=Spark SQL
interface=hiveserver2
It seems like everything is working fine after setting schemaName=None.
I would recommend just doing a quick PR; we can get a better solution when there's more bandwidth and we're working on the new connector system.
@romainr I'm not really sure how to proceed. There's nothing available in the self.query_server dict that informs us that we're using Spark's thriftserver, and I assume we don't want to set schemaName = None for beeswax, since that would also apply to Hive's hiveserver2.
If we could only fix it for sparksql it would already be an improvement, but I'm not sure how I can determine that we're running sparksql in this part of the code.
If you print the query_server dict, are you sure you don't see any sparksql? (i.e. don't use the 'hive' interpreter but the 'sparksql' one) https://github.com/cloudera/hue/blob/master/apps/beeswax/src/beeswax/server/dbms.py#L241
@romainr you're right, when using the sparksql interpreter I do see sparksql in self.query_server:
{'server_name': 'sparksql', 'server_host': 'spark-thriftserver', 'server_port': 10000, 'principal': None, 'http_url': 'http://spark-thriftserver:10001/cliservice', 'transport_mode': 'socket', 'auth_username': 'hue', 'auth_password': None, 'use_sasl': True, 'close_sessions': False, 'has_session_pool': False, 'max_number_of_sessions': 1, 'dialect': 'sparksql'}
I can create a PR for this.
But when using Hive (which uses the same hiveserver2 interface to the same Spark thriftserver) there's no way to differentiate between hiveserver2 and Spark's thriftserver:
{'server_name': 'beeswax', 'server_host': 'spark-thriftserver', 'server_port': 10000, 'principal': None, 'http_url': 'http://spark-thriftserver:10001/cliservice', 'transport_mode': 'socket', 'auth_username': 'hue', 'auth_password': None, 'use_sasl': True, 'close_sessions': False, 'has_session_pool': False, 'max_number_of_sessions': 1, 'dialect': 'beeswax'}
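The PR discussed above can be sketched as follows. This is a paraphrase, not the actual Hue code: hive_server2_lib.py already sets req.schemaName = None for Impala (see the #L818 link earlier in the thread), and the proposal is to extend that special case to the sparksql dialect:

```python
# Sketch of the proposed change (paraphrased; the real code mutates a
# Thrift GetTables request inside hive_server2_lib.py).
def schema_name_for_get_tables(dialect, requested_schema='*'):
    """Return the schemaName to send in the Thrift GetTables request."""
    # Spark's thriftserver rejects the '*' wildcard as a schema name,
    # so send None for it, exactly as is already done for Impala.
    if dialect in ('impala', 'sparksql'):
        return None
    return requested_schema
```

Note that, as the dicts above show, this check cannot help when the hive interpreter is pointed at the Spark thriftserver, because the dialect is then reported as 'beeswax'.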
We only need SparkSQL, so I'll simply remove the hive interpreter from our config. Is there any reason to use the hive interpreter on top of Spark's thriftserver, or does it make no sense and one should always use the sparksql interpreter instead?
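For reference, the trimmed hue.ini would then keep only the sparksql interpreter (this is a sketch reusing the host/port values quoted earlier in this thread):

```ini
[spark]
sql_server_host=spark-thriftserver
sql_server_port=10000

[notebook]
[[interpreters]]
[[[sparksql]]]
name=Spark SQL
interface=hiveserver2
```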
No reason to have the Hive dialect in your case indeed!
OK, then I guess we can close this. Thanks for the help and quick responses! :+1:
Can anyone please let me know how to resolve this issue? I'm currently getting the same error for a Hive connection from Hue.
When using the Spark thriftserver Hue shows an error and doesn't show any tables in the list of tables. The error message shown by Hue is
In the logs of Hue the following can be found
And in the logs of the Spark thriftserver it looks like this
After some digging it seems like Spark's thriftserver doesn't like the * that's being passed as schema name, so I made a change setting req.schemaName = None, similar to what's being done for Impala: https://github.com/cloudera/hue/blob/3e60d9fe893ffb7294716b206cec935c11888fe1/apps/beeswax/src/beeswax/server/hive_server2_lib.py#L818 After that it works fine. I'm not sure this is the correct fix though. I guess it could also be a bug in Spark's thriftserver because it's not fully compatible with hiveserver2? I don't have a regular Hive environment to check if it's actually working there, so I'm not sure.
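A minimal illustration of the workaround; the class here is a hypothetical stand-in for the generated Thrift request type, not Hue's actual code, and only models the schemaName field discussed in this issue:

```python
# Hypothetical stand-in for the generated TGetTablesReq Thrift class.
class GetTablesReq:
    def __init__(self, schemaName='*'):
        # Hue passes '*' by default, which Spark's thriftserver rejects.
        self.schemaName = schemaName

req = GetTablesReq()
req.schemaName = None  # the change that made table listing work
```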
Versions used:
I found this issue https://github.com/cloudera/hue/issues/850 and this PR, which seem to talk about similar issues, but the PR contains a lot more changes. Also, both the issue and the PR were closed without a fix.