cloudera / hue

Open source SQL Query Assistant service for Databases/Warehouses
https://cloudera.com
Apache License 2.0

Zookeeper service discovery for HS2 #945

Closed sandredd closed 3 years ago

sandredd commented 5 years ago

Hue connects to HS2 through ZooKeeper service discovery. But if that HS2 goes down, Hue doesn't connect to a different HS2 unless Hue is restarted.

I see the cache timeout is set to 60 by default. Is ZooKeeper expected to automatically reroute the HS2 connection, or does it only happen on an interval?

jdesjean commented 5 years ago

Looks like the HS2-ZK timeout is configurable via hive.zookeeper.session.timeout: https://community.hortonworks.com/content/supportkb/48980/how-to-configure-zookeeper-discovery-for-hiveserve.html

sandredd commented 5 years ago

We have the ZK settings enabled.

The problem is that Hue gets a connection to one HS2, and if that HS2 goes down it doesn't get another HS2 until we restart Hue. Until the restart, users see Beeswax/Hive timeouts.

Also, all users connecting through Hue go through the same HS2 until Hue resets the session with the existing HS2.

jdesjean commented 5 years ago

If you'd like to contribute, you could add user-specific handling here: https://github.com/cloudera/hue/blob/cbfeb03b7303a77f9cc330f3e9e6c52a9e9ec984/apps/beeswax/src/beeswax/server/dbms.py#L99 Handling the failures would be a bit more complicated.
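
As a rough illustration of what per-user handling could look like (not Hue's actual API: the cache key, timeout, and `resolve_hs2_from_zk` callable below are hypothetical):

```python
# Hypothetical sketch: cache the resolved HiveServer2 host per user instead of
# globally, so a single cached HS2 does not pin every Hue user to the same node.
from django.core.cache import cache

CACHE_TIMEOUT = 60  # seconds, mirroring the 60s cache timeout mentioned above

def get_query_server_for_user(user, resolve_hs2_from_zk):
    """Return a (host, port) for HS2, resolved via ZooKeeper and cached per user.

    `resolve_hs2_from_zk` stands in for Hue's real ZooKeeper resolution logic
    in dbms.py; it is assumed to return a (host, port) tuple.
    """
    key = 'hs2:query_server:%s' % user.username
    server = cache.get(key)
    if server is None:
        server = resolve_hs2_from_zk()  # pick an HS2 znode from ZooKeeper
        cache.set(key, server, CACHE_TIMEOUT)
    return server
```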

alericmckearn commented 5 years ago

So this was noted as an issue when I did the PR.

LLAP has an active-passive system that can be checked, so the cache has a timeout that allows validating the current active node. HiveServer2, though, is stateful, in that queries distributed to a HiveServer2 node are lost if you switch to another HiveServer2. So rechecking only fixes the issue if the HiveServer2 znode has been removed (which isn't what happens most of the time).

The proper place to fix this is to add error trapping off of the Thrift response that invalidates the cache, so the next try resets the HiveServer2. EXCEPT that the in-memory cache doesn't work this way, since it isn't shared between gunicorn workers. Basically we'd end up with a different HiveServer2 for each worker.

What we talked about was waiting for the Redis/Celery work to be done and then adding the ability to not only load balance the HiveServer2 instances but also mark them as healthy, etc.
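
Roughly the health-marking idea, as a sketch assuming a shared (e.g. Redis-backed) Django cache; the key names and the `candidates` list of (host, port) tuples from ZooKeeper are hypothetical, not Hue's actual code:

```python
# Sketch: mark failed HiveServer2 instances in a shared cache so every gunicorn
# worker sees the same view of which nodes are currently usable.
from django.core.cache import cache

UNHEALTHY_TTL = 120  # seconds to skip a failed HS2 before trying it again

def mark_unhealthy(host, port):
    cache.set('hs2:unhealthy:%s:%s' % (host, port), True, UNHEALTHY_TTL)

def pick_healthy_hs2(candidates):
    """Return the first (host, port) candidate not currently marked unhealthy."""
    for host, port in candidates:
        if not cache.get('hs2:unhealthy:%s:%s' % (host, port)):
            return (host, port)
    # Everything is marked unhealthy: fall back to the first candidate so
    # callers still get a target rather than an error.
    return candidates[0]
```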

sandredd commented 5 years ago

Hi @alericmckearn, thanks for the response.

Can you please point me to the code where this can be included:

The proper place to fix this is to add error trapping off of the Thrift response that invalidates the cache, so the next try resets the HiveServer2. EXCEPT that the in-memory cache doesn't work this way, since it isn't shared between gunicorn workers. Basically we'd end up with a different HiveServer2 for each worker.

Also, can you share some more information or point me to a wiki/guide on the Redis/Celery work?

alericmckearn commented 5 years ago

This goes over the issues with local-memory caching (the file-based cache broke Docker):

https://docs.djangoproject.com/en/2.2/topics/cache/#local-memory-caching

Note that each process will have its own private cache instance, which means no cross-process caching is possible. This obviously also means the local memory cache isn’t particularly memory-efficient, so it’s probably not a good choice for production environments. It’s nice for development.

Redis/Celery: https://issues.cloudera.org/browse/HUE-8743 It looks like Romain may have gotten the first pass in.
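
For illustration, a Django settings sketch that swaps the per-process local-memory cache for a shared Redis backend (this uses the third-party django-redis package; Hue's actual settings and the HUE-8743 work may differ):

```python
# settings.py sketch: a shared cache backend, so a cache invalidation made by
# one gunicorn worker is visible to all the others.
CACHES = {
    'default': {
        'BACKEND': 'django_redis.cache.RedisCache',
        'LOCATION': 'redis://localhost:6379/1',
    }
}
```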

The exceptions are passed to 2 different classes.

https://github.com/cloudera/hue/blob/cbfeb03b7303a77f9cc330f3e9e6c52a9e9ec984/apps/beeswax/src/beeswax/server/dbms.py#L222

Inside these classes, we need to add an exception checker that can tell system errors from user errors. Since there is no retry logic, it would all be based on a user clicking again. To make it seamless, I was thinking about adding the retrying library on all of these commands => https://pypi.org/project/retrying/

There are 2 main issues we'd be looking for => can't reach the server, and the server's session handle is different from what you have (which is what happens if HiveServer2 is restarted but is still reachable). At that point we'd invalidate the cache, re-get the query server, and then retry the command.
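
A sketch of how the retrying library could wrap those calls; the recoverability check, cache key, and `get_client` callable are hypothetical placeholders for logic that would live in the classes linked above:

```python
from django.core.cache import cache
from retrying import retry
from thrift.transport.TTransport import TTransportException

QUERY_SERVER_CACHE_KEY = 'hs2:query_server'  # hypothetical key, not Hue's actual one

def _is_recoverable(exc):
    # Retry only on transport failures (HS2 unreachable) or an invalid session
    # handle (HS2 restarted but reachable); user errors like bad SQL should not retry.
    return isinstance(exc, TTransportException) or 'Invalid SessionHandle' in str(exc)

def _invalidate_query_server():
    # Drop the cached HS2 so the next resolution goes back to ZooKeeper.
    cache.delete(QUERY_SERVER_CACHE_KEY)

@retry(retry_on_exception=_is_recoverable, stop_max_attempt_number=2, wait_fixed=1000)
def execute_with_failover(get_client, statement):
    """Execute `statement`, invalidating the cached HS2 before a retry."""
    try:
        return get_client().execute(statement)  # `get_client` re-resolves HS2 on each call
    except Exception as exc:
        if _is_recoverable(exc):
            _invalidate_query_server()
        raise
```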

alericmckearn commented 5 years ago

Be my guest.

github-actions[bot] commented 3 years ago

This issue is stale because it has been open 30 days with no activity and is not "roadmap" labeled or part of any milestone. Remove stale label or comment or this will be closed in 5 days.