ExpediaGroup / waggle-dance

Hive federation service. Enables disparate tables to be concurrently accessed across multiple Hive deployments.
Apache License 2.0

Performance degradation on tunneled connection #115

Closed: bAndie91 closed this 6 years ago

bAndie91 commented 6 years ago

I've experienced performance degradation after upgrading from 2.2.2 to 2.3.7. See the measurements in the attachment, which were made by a Spark application calling spark.catalog.listTables(). The newer WD is 3 times slower, impacting the ssh-tunneled connections (see highlighted rows) the most.

[screenshot: measurement table, tunneled rows highlighted]

How much of this can be eliminated?
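For reference, the measurement was essentially of this shape (a minimal sketch with a hypothetical app name and metastore endpoint; the real job's configuration differs):

```java
// Minimal sketch of the timing harness (hypothetical endpoint and app
// name). It times a single spark.catalog().listTables() call going
// through the federated (Waggle Dance) metastore endpoint.
import org.apache.spark.sql.SparkSession;

public class ListTablesTiming {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("wd-listtables-timing")
                // Hypothetical Waggle Dance thrift endpoint.
                .config("hive.metastore.uris", "thrift://waggle-dance-host:48869")
                .enableHiveSupport()
                .getOrCreate();

        long start = System.nanoTime();
        // count() forces the listing to be fully materialized.
        long tables = spark.catalog().listTables().count();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(tables + " tables listed in " + elapsedMs + " ms");

        spark.stop();
    }
}
```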

patduin commented 6 years ago

Any chance you can narrow down the version range? That would really help.

rambrus commented 6 years ago

@bAndie91, @patduin: I did some investigation and found this:

Test Case 1 - listTables on non-tunneled connection

| WD version | run 1 | run 2 |
|------------|-------|-------|
| 2.3.0      | 20 s  | 19 s  |
| 2.3.1      | 20 s  | 23 s  |
| 2.3.2      | 25 s  | 24 s  |
| 2.3.3      | 36 s  | 32 s  |
| 2.3.4      | 28 s  | 25 s  |
| 2.3.5      | 44 s  | 56 s  |
| 2.3.6      | 48 s  | 46 s  |

[chart: listTables duration per WD version, non-tunneled]

Test Case 2 - listTables on tunneled connection

| WD version | run 1     | run 2     |
|------------|-----------|-----------|
| 2.3.0      | 5 m 37 s  | 4 m 52 s  |
| 2.3.1      | 4 m 59 s  | 5 m 04 s  |
| 2.3.2      | 5 m 15 s  | 5 m 11 s  |
| 2.3.3      | 7 m 27 s  | 7 m 12 s  |
| 2.3.4      | 5 m 14 s  | 5 m 15 s  |
| 2.3.5      | 13 m 02 s | 13 m 27 s |
| 2.3.6      | 12 m 33 s | 13 m 07 s |

[chart: listTables duration per WD version, tunneled]

Summary: I did two runs for both test cases. The durations are fairly consistent between runs, and we can observe a ~150% performance degradation between the 2.3.4 and 2.3.5 releases (tunneled: ~5 m 15 s → ~13 m 15 s on average, roughly 2.5x).

IMPORTANT: The performance degradation does not seem to be specific to tunneled connections; the same trend can be observed in both cases.

patduin commented 6 years ago

Excellent work! One more question: in your WD configuration, are all metastores reachable and responding, or is one of them down?

rambrus commented 6 years ago

@bAndie91 can you please answer the question above?

I also checked the Spark logs for any unusual errors. I have seen this error many times in the logs starting from v2.3.3:

[screenshot: recurring error in the Spark logs]

This might also be interesting.

patduin commented 6 years ago

I think I know what is going on: a change I made related to #73. It does an extra call to verify the connection is open.
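To illustrate why that hurts (a hypothetical sketch only, not the actual Waggle Dance code): if every proxied metastore call is preceded by a liveness check, each logical call costs two round trips, and over an SSH tunnel the extra round trip dominates:

```java
// Hypothetical sketch only, not the Waggle Dance implementation.
// A reflective proxy that pings the metastore before every delegated
// call turns N logical calls into 2N round trips; over a high-latency
// SSH tunnel that roughly doubles the wall-clock time per call.
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

interface MetastoreClient {
    void ping() throws Exception;                  // hypothetical cheap liveness RPC
    String[] listTables(String database) throws Exception;
}

class EagerlyCheckedHandler implements InvocationHandler {
    private final MetastoreClient delegate;

    EagerlyCheckedHandler(MetastoreClient delegate) {
        this.delegate = delegate;
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        delegate.ping();                           // extra round trip on EVERY call
        return method.invoke(delegate, args);
    }

    static MetastoreClient wrap(MetastoreClient raw) {
        return (MetastoreClient) Proxy.newProxyInstance(
                MetastoreClient.class.getClassLoader(),
                new Class<?>[] {MetastoreClient.class},
                new EagerlyCheckedHandler(raw));
    }
}
```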

patduin commented 6 years ago

Can I ask you to try to build and run this branch: https://github.com/HotelsDotCom/waggle-dance/tree/issue-115? I suspect the changed line is what is causing the issue, or at least I want to rule it out. I'll also make an internal ticket for us to set up performance tests, as we should be catching these issues. Apologies for that.

rambrus commented 6 years ago

@patduin Sure, will check that branch and get back to you with the results.

bAndie91 commented 6 years ago

@patduin all the metastore connections are AVAILABLE during the test runs.

rambrus commented 6 years ago

@bAndie91 , @patduin

I re-ran the test cases on the version built from the issue-115 branch and added the results to the charts.

Test Case 1 - listTables on non-tunneled connection

[chart: non-tunneled listTables durations, including the issue-115 build]

Test Case 2 - listTables on tunneled connection

[chart: tunneled listTables durations, including the issue-115 build]

I can confirm that the fix resolves the performance degradation issue.

Also, the `RetryingMetaStoreClient:184 - MetaStoreClient lost connection. Attempting to reconnect.` warning disappeared from the logs.

patduin commented 6 years ago

OK, thanks, really helpful! We'll need to find some other way to fix #73 without introducing the performance hit. Not sure how yet, but at least we know what is going on :)
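One possible direction (a sketch only, reusing the hypothetical MetastoreClient interface from the sketch above, not the actual change): drop the eager per-call check and reconnect lazily only when a call fails on a stale connection, so the happy path stays at one round trip. Hive's own RetryingMetaStoreClient, which shows up in the logs above, follows the same retry-and-reconnect idea:

```java
// Sketch of a lazy reconnect-on-failure strategy (hypothetical names,
// reusing the MetastoreClient interface from the earlier sketch). The
// happy path costs exactly one round trip; the reconnect only happens
// when a call actually fails on a dead or stale connection.
import java.util.function.Supplier;
import org.apache.thrift.transport.TTransportException;

class LazilyReconnectingClient {
    private final Supplier<MetastoreClient> factory;
    private MetastoreClient delegate;

    LazilyReconnectingClient(Supplier<MetastoreClient> factory) {
        this.factory = factory;
        this.delegate = factory.get();
    }

    String[] listTables(String database) throws Exception {
        try {
            return delegate.listTables(database);
        } catch (TTransportException stale) {
            delegate = factory.get();   // reconnect once, then retry the call
            return delegate.listTables(database);
        }
    }
}
```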

rambrus commented 6 years ago

Cheers. Let us know when the fix is available; we are happy to take a quick look at the performance.

patduin commented 6 years ago

Will do and thanks!

patduin commented 6 years ago

@rambrus I've updated the branch. I've managed to avoid the issue for normal connections, but you'll still see the degradation on tunneled connections; I haven't found a way to work around that without sacrificing functionality. It would be great if you could test this. We could at least release it and, if the performance is a big issue, focus on that in a future PR.

rambrus commented 6 years ago

@patduin sure, will take a look and get back to you with results.

rambrus commented 6 years ago

@patduin: I executed the tests on 4791baf and added the results to the charts.

[chart: non-tunneled listTables durations, including 4791baf]

[chart: tunneled listTables durations, including 4791baf]

I can see some performance degradation in both cases, but it's not as critical as in v2.3.5.

patduin commented 6 years ago

Yeah, I can't really account for that. We merged the PR with the changes and will try to make a release this week.

patduin commented 6 years ago

This is addressed in the 2.4.2 release. If the performance is still an issue, please reopen or open a new ticket. Closing this.