Closed bAndie91 closed 6 years ago
Any chance you can narrow down the version range? That would really help.
@bAndie91 , @patduin: I did some investigation and found this:
Test Case 1 - listTables on non-tunneled connection
wd version | run-1 | run-2 |
---|---|---|
2.3.0 | 20 s | 19 s |
2.3.1 | 20 s | 23 s |
2.3.2 | 25 s | 24 s |
2.3.3 | 36 s | 32 s |
2.3.4 | 28 s | 25 s |
2.3.5 | 44 s | 56 s |
2.3.6 | 48 s | 46 s |
Test Case 2 - listTables on tunneled connection
wd version | run-1 | run-2 |
---|---|---|
2.3.0 | 5 m 37 s | 4 m 52 s |
2.3.1 | 4 m 59 s | 5 m 04 s |
2.3.2 | 5 m 15 s | 5 m 11 s |
2.3.3 | 7 m 27 s | 7 m 12 s |
2.3.4 | 5 m 14 s | 5 m 15 s |
2.3.5 | 13 m 02 s | 13 m 27 s |
2.3.6 | 12 m 33 s | 13 m 07 s |
Summary: I did 2 runs for each test case. The durations are fairly consistent between runs, and we can observe up to ~150% performance degradation between the 2.3.4 and 2.3.5 releases.
IMPORTANT: The performance degradation does not seem to be specific to tunneled connections; the same trend can be observed in both cases.
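As a quick sanity check of the figures above, the slowdown between 2.3.4 and 2.3.5 can be recomputed from the two tables. This is only a small arithmetic sketch: the durations are transcribed from the tables, and the helper names (`mean`, `slowdown_percent`) are made up for illustration. It suggests the jump is larger on the tunneled connection than on the non-tunneled one, although both show the same trend.

```python
# Durations in seconds, transcribed from the two tables above as (run-1, run-2).
NON_TUNNELED = {
    "2.3.4": (28, 25),
    "2.3.5": (44, 56),
}
TUNNELED = {
    "2.3.4": (5 * 60 + 14, 5 * 60 + 15),
    "2.3.5": (13 * 60 + 2, 13 * 60 + 27),
}


def mean(values):
    """Arithmetic mean of a sequence of numbers."""
    return sum(values) / len(values)


def slowdown_percent(table, old, new):
    """Percentage increase in mean duration going from version `old` to `new`."""
    before = mean(table[old])
    after = mean(table[new])
    return (after - before) / before * 100.0


if __name__ == "__main__":
    for name, table in (("non-tunneled", NON_TUNNELED), ("tunneled", TUNNELED)):
        pct = slowdown_percent(table, "2.3.4", "2.3.5")
        print(f"{name}: {pct:.0f}% slower in 2.3.5 than in 2.3.4")
```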
Excellent work! One more question: in your WD configuration, are all metastores reachable and responding, or is one of them down?
@bAndie91 can you please answer the question above?
I also checked the Spark logs for any unusual errors. I have seen this error many times in the logs, starting from v2.3.3:
This might also be interesting.
I think I know what is going on: a change I made related to #73 does an extra call to verify that the connection is open.
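To illustrate why an extra connection check on every call could matter, especially over a high-latency tunnel, here is a rough sketch. `FakeMetastoreClient` and `CheckingClient` are hypothetical stand-ins, not Waggle Dance or Hive classes, and the sketch assumes the check costs one network round trip; it only counts round trips, it does not measure anything.

```python
class FakeMetastoreClient:
    """Hypothetical stand-in for a remote metastore client; counts round trips."""

    def __init__(self):
        self.round_trips = 0

    def is_open(self):
        # Assumption: verifying the connection costs one network round trip.
        self.round_trips += 1
        return True

    def get_table(self, name):
        # One round trip for the actual metadata request.
        self.round_trips += 1
        return {"name": name}


class CheckingClient:
    """Hypothetical wrapper that verifies the connection before every call."""

    def __init__(self, delegate):
        self.delegate = delegate

    def get_table(self, name):
        if not self.delegate.is_open():
            raise ConnectionError("connection lost")
        return self.delegate.get_table(name)


def round_trips_for(client, delegate, n_tables):
    """Fetch `n_tables` tables through `client`; return round trips seen by `delegate`."""
    for i in range(n_tables):
        client.get_table(f"table_{i}")
    return delegate.round_trips


if __name__ == "__main__":
    plain = FakeMetastoreClient()
    checked = FakeMetastoreClient()
    print("without check:", round_trips_for(plain, plain, 100))
    print("with check:   ", round_trips_for(CheckingClient(checked), checked, 100))
```

Under these assumptions the number of round trips doubles, so the absolute cost grows with the latency of each round trip, which would be consistent with the tunneled case suffering the most in absolute terms.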
Can I ask you to try building and running the issue-115 branch (https://github.com/HotelsDotCom/waggle-dance/tree/issue-115)? I suspect the changed line is what is causing the issue, or at least I want to rule it out. I'll also make an internal ticket for us to set up performance tests, as we should be catching these issues. Apologies for that.
@patduin Sure, will check that branch and get back to you with the results.
@patduin all the metastore connections were AVAILABLE during the test runs.
@bAndie91 , @patduin
I re-ran the test cases on the version built from the issue-115 branch and added the results to the charts.
Test Case 1 - listTables on non-tunneled connection
Test Case 2 - listTables on tunneled connection
I can confirm that the fix resolves the performance degradation issue.
Also, the `RetryingMetaStoreClient:184 - MetaStoreClient lost connection. Attempting to reconnect.` warning disappeared from the logs.
OK, thanks, really helpful! We'll need to find some other way to fix #73 without introducing the performance hit. Not sure how yet, but at least we know what is going on :)
cheers, let us know when the fix is available, we are happy to take a quick look at the performance.
Will do and thanks!
@rambrus I've updated the branch. I've managed to avoid the issue for normal connections, but you'll still see the degradation in tunneled connections. I haven't found a way to work around this without sacrificing functionality. It would be great if you could test this. We could at least release this and, if the performance is a big issue, focus on that in some future PR.
@patduin sure, will take a look and get back to you with results.
@patduin : executed the tests on 4791baf and added the result to the chart.
I can see some performance degradation in both cases, but it's not as critical as in v2.3.5.
Yeah, I can't really account for that. We merged the PR with the changes and will try to make a release this week.
This is addressed in the 2.4.2 release. If the performance is still an issue, please reopen or open a new ticket. Closing this.
I've experienced performance degradation after upgrading from 2.2.2 to 2.3.7. See the measurements in the attachment, which were made by a Spark application calling `spark.catalog.listTables()`. The newer WD is 3 times slower, impacting the SSH-tunneled connections (see highlighted rows) the most. How much of this can be eliminated?
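For anyone wanting to reproduce this kind of measurement, a minimal timing helper could look like the sketch below. The `time_call` helper is an illustrative name, not part of Spark or Waggle Dance, and the PySpark usage in the trailing comment assumes an already configured `SparkSession` named `spark` that points at the WD endpoint.

```python
import time


def time_call(fn, runs=2):
    """Call `fn` `runs` times; return the wall-clock duration of each call in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


# Hypothetical usage with PySpark (needs a configured SparkSession named `spark`):
#   durations = time_call(lambda: spark.catalog.listTables())
#   print([f"{d:.1f} s" for d in durations])
```

Running the same helper against each WD version, with everything else unchanged, gives per-version numbers that can be compared directly.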