Massive performance hit

BastiKiel commented 9 years ago

We are currently running the OpenNTF API in the version released as the Runtime Essentials back in December (version 1.0.0.201312231409). When enabling the latest version of the API (1.5.0.201409241744) we are experiencing a massive performance hit for our XPage applications. In production where our servers perform smoothly, enabling the new version leads to the server not keeping up with incoming users in the morning rendering it unresponsive (most of the users are configured to open a specific application). A simple load test we perform verifies the behavior also for our test servers. Here the server takes roughly twice the amount of time to deliver the same amount of requests to the application compared to the older version. The load test simply starts a new thread every 100ms which reads the same URL (entry Page for our main application). For a fixed amount of threads we watch the time it takes to complete those threads. Using the RC2 version of the API the load test roughly runs twice as long. We can toggle the used version and the server speeds up or slows respectively. We've tested it against different applications and different XPages, getting the same result every time. We don't use the plugin yet. We are just in the stage of getting it out so we can utilize it in the future. Currently we are running Domino 9.0.1FP1 on Windows 2003/2008/64bit servers. We have installed the ExtensionLibrary version 9.0.1.v00_08_20140729-0805 as well as OpenNTF Essentials Runtime modules as of version 901v2.

the-ntf commented 9 years ago

I'll look into it this week. Any chance I could get you to compile and test with this version from March? https://github.com/OpenNTF/org.openntf.domino/releases/tag/RPD-Ephesus

On Mon, Oct 20, 2014 at 3:34 AM, BastiKiel notifications@github.com wrote:

We are currently running the OpenNTF API in the version released as the Runtime Essentials back in December (version 1.0.0.201312231409). When enabling the latest version of the API (1.5.0.201409241744) we are experiencing a massive performance hit for our XPage applications. In production where our servers perform smoothly, enabling the new version leads to the server not keeping up with incoming users in the morning rendering it unresponsive (most of the users are configured to open a specific application). A simple load test we perform verifies the behavior also for our test servers. Here the server takes roughly twice the amount of time to deliver the same amount of requests to the application compared to the older version. The load test simply starts a new thread every 100ms which reads the same URL (entry Page for our main application). For a fixed amount of threads we watch the time it takes to complete those threads. Using the RC2 version of the API the load test roughly runs twice as long. We can toggle the used version and the server speeds up or slows respectively. We've tested it against different applications and different XPages, getting the same result every time. We don't use the plugin yet. We are just in the stage of getting it out so we can utilize it in the future. Currently we are running Domino 9.0.1FP1 on Windows 2003/2008/64bit servers. We have installed the ExtensionLibrary version 9.0.1.v00_08_20140729-0805 as well as OpenNTF Essentials Runtime modules as of version 901v2.

— Reply to this email directly or view it on GitHub https://github.com/OpenNTF/org.openntf.domino/issues/128.

jesse-gallagher commented 9 years ago

To help track it down, could you try using the XPages Toolbox backend profile before and after? That would at least indicate whether it's a case of ODA making noticeably more or different backend calls to slow it down vs. all in-ODA code.

the-ntf commented 9 years ago

Also it would be helpful to know what kind of data sources are attached to that view page.

On Mon, Oct 20, 2014 at 9:35 AM, Jesse Gallagher notifications@github.com wrote:

To help track it down, could you try using the XPages Toolbox backend profile before and after? That would at least indicate whether it's a case of ODA making noticeably more or different backend calls to slow it down vs. all in-ODA code.

— Reply to this email directly or view it on GitHub https://github.com/OpenNTF/org.openntf.domino/issues/128#issuecomment-59754321 .

BastiKiel commented 9 years ago

@the-ntf: sure... I'll compile with that version, bring that to the test server and rerun the test. I'll also go through the page and list all datasources used there.

@jesse-gallagher: I'll look into that as well. But I have first dig into how to do that.

Thanks so far :-)

the-ntf commented 9 years ago

Reading back, I feel that I may have misunderstood the problem. Are you saying that you haven't yet enabled the ODA XSP library in your application, yet having the plugin simply INSTALLED on the server is slowing down regular XPages requests?

If so, it drastically narrows the potential cause -- because there's only one class involved then. And I can see why the threadlocal init/term process could affect you like this. (And I already have the solution written in my head.)

So if you can confirm this, then we should be able to fast-track a fix for it.

On Mon, Oct 20, 2014 at 10:46 AM, BastiKiel notifications@github.com wrote:

@the-ntf https://github.com/the-ntf: sure... I'll compile with that version, bring that to the test server and rerun the test. I'll also go through the page and list all datasources used there.

@jesse-gallagher https://github.com/jesse-gallagher: I'll look into that as well. But I have first dig into how to do that.

Thanks so far :-)

— Reply to this email directly or view it on GitHub https://github.com/OpenNTF/org.openntf.domino/issues/128#issuecomment-59768898 .

BastiKiel commented 9 years ago

That's absolutely right. We have the plugin simply installed on the server (part of an update site and enabled). Within the application the plugin is not enabled. The only plugin enabled within the application is the extension library. Just having the current version of the plugin enabled within the update site slows down XPage requests. Great to hear there might be a quick solution :-)

BastiKiel commented 9 years ago

Just to make sure I'm not delaying anything...is there any information I'm expected to provide?

I've gone through the page in question to gather what datasources are used there:

one "custom" java object
two different View datasources, each with a column filter value set

the-ntf commented 9 years ago

After considerable effort, I believe I've cleaned up the XSP plugin to avoid consuming resources unnecessarily. I've put a test version on Dropbox for people affected by this issue to try.

https://www.dropbox.com/s/ruja4en132dpfke/ODAUpdateSite03Nov2014.zip?dl=0

Any feedback on whether this fixes the problem would be much appreciated. I'd be happy to build & push a new version at that time.

Please note that after RC2.1, we did some significant restructuring to provide finer resolution on some projects by separating them as additional plugins, so this may require a recompile of existing consuming code in some cases.

BastiKiel commented 9 years ago

I've put the new version to test on our test server this morning. Unfortunately I don't experience any performance increase. Run time sticks to RC2.1 level or a tad slower. I've set out and ran my little test program against the different versions of the API. Here are the results (all averages from 9 runs, performing a get request against the main page of the application):

Opening 100 threads at 5ms interval OpenNTF API completely disabled: 9000ms December Version: 9300ms RC2.1 API: 15700ms Nov 3 API: 17300ms
Opening 100 threads at a 100ms interval No API: 11200ms December API: 10900ms RC2.1 API: 15300ms Nov 3 API: 16300ms
Opening 200 threads at a 100ms interval No API: 21500ms December API: 21500ms RC2.1 API: 31300ms Nov 3 API: 30100ms

The original report of the slowdown to half the speed was actually inaccurate, as the test server had Java debugging enabled (tests above were done with debugging disabled).

the-ntf commented 9 years ago

I'm trying to figure out what these times represent. When you say that opening 100 threads at 5ms intervals takes 9000ms, are you saying that the time from the start of the first request to the completion of the last request was 9 seconds? That would mean that it took 8.5 seconds to service the 100th request, yes?

If that's the case, then in the second version of the test, the average response time would be dramatically improved when you space out the requests. The 100th request takes 1.12 seconds.

Is my understanding correct?

the-ntf commented 9 years ago

Here's a new build in which the OpenntfHttpService is completely disabled.

https://www.dropbox.com/s/g2e8jrzpprsg2th/ODAUpdateSite04Nov2014.zip?dl=0

BastiKiel commented 9 years ago

You are absolutely right. I'm taking the time from the very beginning. Then I spawn a new thread (which performs the http get) every x ms. When finished starting all threads I start watching all those threads to find out when every one has gone inactive. This is the end time I use to get the time it took to finish completely. There's one addition though: on the test machine we have configured the default 40 worker threads. From my understanding that means that the last threads created probably will have to wait considerably before actually being processed.

I put the Nov 4 version onto the server going through those tests again, but I can't see any significant change. So it seems the OpenntfHttpService doesn't cause the behavior.

the-ntf commented 9 years ago

Are all of the request threads hitting the same url?

Sent from the desert of the real.

On Nov 5, 2014, at 08:41, Bastian Lotsch notifications@github.com wrote:

You are absolutely right. I'm taking the time from the very beginning. Then I spawn a new thread (which performs the http get) every x ms. When finished starting all threads I start watching all those threads to find out when every one has gone inactive. This is the end time I use to get the time it took to finish completely. There's one addition though: on the test machine we have configured the default 40 worker threads. From my understanding that means that the last threads created probably will have to wait considerably before actually being processed.

I put the Nov 4 version onto the server going through those tests again, but I can't see any significant change. So it seems the OpenntfHttpService doesn't cause the behavior.

— Reply to this email directly or view it on GitHub.

BastiKiel commented 9 years ago

Yep...always the same url.

BastiKiel commented 9 years ago

I realize it's probably early morning...so Good Morning! Btw.: to avoid memory problems the xsp properties of the application have been set to: xsp.persistence.mode=file

the-ntf commented 9 years ago

Could you add a test case to your set? Create 100 requests at intervals of 200ms.

I'd just like to confirm that the basic problem is an increase in initial response time.

It would also be useful if, once the initial request is responded, if there was a second request using the existing HttpSession. I would expect that processing the second request for the same user would take less time.

Sent from the desert of the real.

On Nov 5, 2014, at 09:56, Bastian Lotsch notifications@github.com wrote:

I realize it's probably early morning...so Good Morning! Btw.: to avoid memory problems the xsp properties of the application have been set to: xsp.persistence.mode=file

— Reply to this email directly or view it on GitHub.

BastiKiel commented 9 years ago

Sure thing - everything I can do to help resolve the issue I'll try to do :-)

As expected when spaced out like this, the run time is nearly the same whether the latest API is enabled or no API at all. The difference is something between 100ms to 200ms. Next thing I did was tracking each request duration to look for average, min and max times. For the new test case I got: December API: average 81ms, minimum 62ms, maximum 438ms Nov. 4 API: average 147ms, minimum 109ms, maximum 375ms On different runs results varied a bit for minimum and average times but were pretty close to the times above. Maximum times varied a lot more. Seemingly some threads spiked.

I also added another procedure to execute the get requests using an HttpClient object to perform two consecutive http get with the same user and session (at least that's what I hope to have achieved). For each of the two requests I clocked the duration. After spacing out the threads so they don't exceed the available workers on the server, I'd say your expectation was spot on. The first request usually was slower than the second, mostly reading 156ms vs. 110ms. As always there were exceptions to this rule, but overall the times are pretty close. When run against the December API the picture is the same with different times. For most threads the first request takes longer than the second. A combination seen often is 94ms vs 78ms.

BastiKiel commented 9 years ago

I'll be out of the office for next two weeks. I'll be picking up as soon as I'm back. Thanks already for your help! Bastian

the-ntf commented 9 years ago

New version to try out if you would...

https://www.dropbox.com/s/lg2yeg6hu9eb9qs/ODA%20Nov%2014%20Update.zip?dl=0

BastiKiel commented 9 years ago

Just back from my holiday with a full desk. But at least I managed to squeeze in a very short test for the version above. This looks very promising :-) I'll hope to dive deeper tomorrow morning.

BastiKiel commented 9 years ago

Today I repeated various tests against the last version. Results are absolutely on par with what I measured in previous tests. I also ran a couple of tests against the oldest version just to make sure there hasn't been a major shift in general server performance - which wasn't the case.

For me that looks very good. With those results I'd dare proposing it for taking into production with our administrators without fear of my head being ripped off ;-)

the-ntf commented 9 years ago

Yay!!!!!!

Sent from the desert of the real.

On Dec 5, 2014, at 08:45, Bastian Lotsch notifications@github.com wrote:

Today I repeated various tests against the last version. Results are absolutely on par with what I measured in previous tests. I also ran a couple of tests against the oldest version just to make sure there hasn't been a major shift in general server performance - which wasn't the case.

For me that looks very good. With those results I'd dare proposing it for taking into production with our administrators without fear of my head being ripped off ;-)

— Reply to this email directly or view it on GitHub.

BastiKiel commented 9 years ago

OK. The change request is submitted. Next Wednesday we'll know more...

BastiKiel commented 9 years ago

Not quite there yet :-(

We've enabled the new module for the last two days. Unfortunately we still get have the server getting into big trouble, albeit the picture looks a tiny bit different this time (at least to me that is).

At first the Http.CurrentConnections statistics looked absolutely like every other day. With the old versions connections started to go up increasingly fast between 8.45am and 9am. With the latest version nothing suspiciously happened at that time. Connections went out of control shortly after 10am yesterday and at 11am today. Looking at today, until 11am connections kept at a pretty base curve swinging gently between 5 and 50. Then it went up to around 125 kept there for three minutes before skyrocketing until the server was finally brought down by an admin as the server couldn't be quit ordinarily. It looked like http processing had stalled for some reason, no connections finished processing, there were only connections coming in an stacking up. Yesterday we had the curve bump up to 150 but it quickly settled down to the level before for another 20 minutes - also swinging between 5 and 50. Then connections went up to more than 300. Then it looked like it started to establish a new "base curve" at around 325 connections - meaning that there was change both up as well as down. So for me that feels a bit like processing had stalled for a big amount of connections but some were still responsive so incoming traffic could be served to some extend. We didn't disabled the plugin until an hour before, so the server yesterday was and today is still running with the newest version. Both days after Domino crashed or was finally killed nothing else happend for the rest of the day. The usually base curve was/is swinging forth happily between 5 and 50 connections for as long as I stayed in the office.

I'll try to dig into the http access logs starting tomorrow. But what I could see from miscellaneous events in log.nsf there were loads of messages "Error connecting to server XXXX: Operation stopped at your request" after the "http quit" command had been issued. The main load-factor-application is indeed (also) displaying and otherwise accessing a view and documents from another database on the mentioned server.

Beside studying the logs I'll also going to get an equivalent setup in our test environment to run the load test against it.

Cheers and Merry Christmas!

the-ntf commented 9 years ago

So just to clarify again: this is with the plugin INSTALLED but without the library ENABLED on your application?

Sent from the desert of the real.

On Dec 18, 2014, at 10:47, Bastian Lotsch notifications@github.com wrote:

Not quite there yet :-(

We've enabled the new module for the last two days. Unfortunately we still get have the server getting into big trouble, albeit the picture looks a tiny bit different this time (at least to me that is).

At first the Http.CurrentConnections statistics looked absolutely like every other day. With the old versions connections started to go up increasingly fast between 8.45am and 9am. With the latest version nothing suspiciously happened at that time. Connections went out of control shortly after 10am yesterday and at 11am today. Looking at today, until 11am connections kept at a pretty base curve swinging gently between 5 and 50. Then it went up to around 125 kept there for three minutes before skyrocketing until the server was finally brought down by an admin as the server couldn't be quit ordinarily. It looked like http processing had stalled for some reason, no connections finished processing, there were only connections coming in an stacking up. Yesterday we had the curve bump up to 150 but it quickly settled down to the level before for another 20 minutes - also swinging between 5 and 50. Then connections went up to more than 300. Then it looked like it started to establish a new "base curve" at around 325 connections - meaning that there was change both up as well as down. So for me that feels a bit like processing had stalled for a big amount of connections but some were still responsive so incoming traffic could be served to some extend. We didn't disabled the plugin until an hour before, so the server yesterday was and today is still running with the newest version. Both days after Domino crashed or was finally killed nothing else happend for the rest of the day. The usually base curve was/is swinging forth happily between 5 and 50 connections for as long as I stayed in the office.

I'll try to dig into the http access logs starting tomorrow. But what I could see from miscellaneous events in log.nsf there were loads of messages "Error connecting to server XXXX: Operation stopped at your request" after the "http quit" command had been issued. The main load-factor-application is indeed (also) displaying and otherwise accessing a view and documents from another database on the mentioned server.

Beside studying the logs I'll also going to get an equivalent setup in our test environment to run the load test against it.

Cheers and Merry Christmas!

— Reply to this email directly or view it on GitHub.

BastiKiel commented 9 years ago

The answer would have been yes, but I meticulously checked each of our XPage applications on that production server after interviewing my colleagues. I'm sorry to say, but there was one I didn't know of and that has the org.openntf.domino.xsp.XspLibrary plugin enabled. It's not being used as far as I can tell, but it is enabled. Knowing that I checked the http access logs for requests to the respective URL, but didn't find a single request for those two days.

BastiKiel commented 9 years ago

Meanwhile the application in question has been updated in production. So now we really shouldn't have any application having the plugin enabled running. We even had two tests against this configuration in production (not quite planned ;-). As long as the OpenNTF API was enabled the server kept crashing as witnessed before. After disabling the plugin / feature the server again runs stable.

While looking through the logs I noticed something strange though. One of the admins had created a program document to periodically (every 30mins!) execute a dbcache flush on the server in question. I don't know if that has something to do with the problem but the execution of the flush is at least close to the time the server probably has started to run into problems and surely allows for additional workload on the server.

The program document has been disabled now and I'll hope to get our administrators to undergo another test against the reenabled API.

OpenNTF / org.openntf.domino

Massive performance hit #128