Bilal-S / iis2tomcat

AJP Connector between Internet Information Services (IIS) and Apache Tomcat
http://www.boncode.net/boncode-connector

Thread behavior difference between Windows Server 2008 and 2016 #80

Closed xouqoa closed 5 years ago

xouqoa commented 5 years ago

We've been having an issue with our production servers where they randomly become unresponsive. I've been trying to help troubleshoot it, but I've come up with very little conclusive evidence for the cause.

The symptom is that a usage/request spike will come in on a server and the thread count (seen in Fusion Reactor) will spike. Usually it will not recover from this. IIS will still serve static pages, but any requests to CFML pages will hang and eventually get killed by IIS after a period of time. The only way to recover from this state is to restart the Lucee service.

On a hunch, we had our data center team set up a Windows 2008 server and we put it into load. The thread behavior (again, in Fusion Reactor) is completely different.

Windows Server 2008 / Boncode 1.0.0.16 (pretty constant thread count)

image

Windows Server 2016 / Boncode 1.0.0.41 (max thread count increases over time)

image

My theory is that when the thread count gets too high, and a large spike of requests comes in, Lucee goes kaput. What I'm not sure of is where the issue lives. I do know that we have two different production servers right now with full load running the same code, and we're seeing very different thread behavior which seems strange at least.

The W2008 server currently only has 10 ajp-nio-8009-exec-x threads created, but the W2016 one(s) always have many more than that.

I'm happy to provide more detailed information as far as versions and such, but I wasn't sure what would be useful. Both servers are running Lucee 5.2.9.31.

Any ideas?

Bilal-S commented 5 years ago

Jason: In general the connection thread count should come down. The only reason it would not is that the connections remain active for some reason. You can try to introduce a timeout for connections on the Tomcat and IIS sides.

What configuration are you using on the connections? server.xml and BonCodeAJP13.settings (please remove modcfml security keys and IPs before posting).

A quick test is to set the MaxConnections setting in BonCode to zero. This will force each connection to close. It has a little extra overhead, but you should be able to see a difference.
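
As a rough illustration of that test, a minimal BonCodeAJP13.settings might look like the sketch below. Only the `MaxConnections` element is confirmed in this thread; the `Settings` root element and the `Server` and `Port` element names are assumptions based on a typical BonCode settings file, so check them against your installed version before copying anything.

```xml
<!-- Sketch of BonCodeAJP13.settings; element names other than
     MaxConnections are assumptions, not taken from this thread. -->
<Settings>
  <Server>localhost</Server>
  <Port>8009</Port>
  <!-- 0 forces each connection to close after use instead of being reused,
       at the cost of a little extra overhead per request. -->
  <MaxConnections>0</MaxConnections>
</Settings>
```

If the thread count stays flat with this in place, lingering reused connections are the likely reason for the step-ups.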

xouqoa commented 5 years ago

> In general the connection thread count should come down. The only reason it would not is that the connections remain active for some reason.

It does come down eventually (if the server doesn't crash during the day) but it takes ~8 hours or so before we start seeing the overall thread count start to reduce.

> You can try to introduce a timeout for connections on the Tomcat and IIS sides.

How would this be done? Our data center guys might know, but I am not sure what you mean. (maybe this is the MaxConnections=0 you mentioned)

> A quick test is to set the MaxConnections setting in BonCode to zero. This will force each connection to close. It has a little extra overhead, but you should be able to see a difference.

We haven't tried MaxConnections=0 yet, but we will, and I'll report back once we can observe it for a day or so.

> What configuration are you using on the connections? server.xml and BonCodeAJP13.settings (please remove modcfml security keys and IPs before posting).

We have a few different versions of our setup right now in order to try to figure out what is going on, so I'll give a short description of each environment and then the relevant settings.

Windows Server 2016 / Lucee 5.2.9.31 / Boncode 1.0.0.41

BonCodeAJP13.settings looks like this:

`localhost 8009 True False False False HTTP_X_FORWARDED_FOR 5000 65536`

server.xml

Windows Server 2016 / Lucee 5.2.9.31 / Boncode 1.0.0.16

We tried an older version of Boncode (which was on the 2008 server) as a test, but connections were still stepping up. The server.xml file is the same as above, but it does have a BonCodeAJP13.settings file that looks like this:

`8009 localhost 200 0 c:\temp 0 False`

Bilal-S commented 5 years ago

Jason: The settings file for BonCode 1.0.0.16 seems off. It does not have a PacketSize declaration, so it would not match the non-default packet size of 65536 in server.xml.
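
To make the mismatch concrete, here is a hedged sketch of the two values that need to agree. The `PacketSize` setting and the 65536 value come from this thread; the `Settings` root element and the exact shape of the `Connector` element are assumptions, since the actual server.xml was not posted.

```xml
<!-- Fragment 1: BonCodeAJP13.settings (the Settings root is assumed) -->
<Settings>
  <!-- Should equal packetSize on the Tomcat AJP connector below. -->
  <PacketSize>65536</PacketSize>
</Settings>

<!-- Fragment 2: Tomcat server.xml, AJP connector (hypothetical; other
     attributes omitted) -->
<Connector port="8009" protocol="AJP/1.3" packetSize="65536" />
```

These are two separate files shown together for comparison, not one document.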

Once you have tried <MaxConnections>0</MaxConnections> we can look at other factors.

For example, you can have both IIS and Tomcat shut down errant connections. You do this by setting the Application Pool Idle Timeout on the IIS side and connectionTimeout and keepAliveTimeout on the Tomcat side.
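
On the Tomcat side, that could look roughly like the connector below. `connectionTimeout` and `keepAliveTimeout` are standard AJP connector attributes given in milliseconds; the 60000 values are placeholders chosen for illustration, not recommendations, and the rest of the element is an assumption since the real server.xml is not shown here.

```xml
<!-- Tomcat server.xml: AJP connector with idle timeouts (sketch) -->
<Connector port="8009" protocol="AJP/1.3"
           packetSize="65536"
           connectionTimeout="60000"
           keepAliveTimeout="60000" />
<!-- connectionTimeout: how long Tomcat waits on an accepted connection
     for a request to arrive.
     keepAliveTimeout: how long Tomcat waits for another AJP request
     before closing the connection. -->
```

On the IIS side, the idle timeout is set per application pool under its advanced (process model) settings.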

Best, Bilal

xouqoa commented 5 years ago

Hey Bilal,

Sorry, I completely forgot to follow up with you about this. (Busy, busy.. you know how it goes, I'm sure!)

We tried different versions of Boncode and didn't notice any improvement, so I think we've mostly eliminated it as a possible cause. We're still stable on Windows 2008, and I believe also on 2012.

I believe we are currently exploring FusionReactor as a potential culprit.

Bilal-S commented 5 years ago

OK, no problem. Closing for now.