High Availability (HA) webapps slow, corrupted sessions, and java.util.concurrent.TimeoutException

glassfishrobot commented 12 years ago

See thread: http://www.java.net/forum/topic/glassfish/glassfish/glassfish-31-final-high-availability-web-apps-slow-and-loses-session-state?force=856#comment-819381

I set up a cluster, and deployed my JSP application onto it. It works great until I turn on high-availability for this application via the Admin console. Once I do that, it becomes very slow, and session state gets lost every 2 requests or so. Disabling high-availability cures the problem.

I ran validate-multicast, GMS is running, cluster health is good, I followed the documentation, and I didn't do anything 'weird' or 'customized'. I also have <distributable/> in my web application's web.xml.

Environment

Linux, 8GB RAM, 8-core Intel CPU, Cluster of 2 machines.
Both session loss and slowness coincide directly with the TimeoutException.
The app used is our internal app; we are having trouble reproducing this with cluster.jsp directly.
Aside from the <distributable/> directive, there is no tuning in web.xml, and there is no glassfish-web.xml at all.
The application is deployed from the Admin GUI, with no changes to any of the checkboxes aside from 'Availability'.
Availability is set at deployment time, not after.
There is no relaxCacheVersionSemantics property.
Session loss occurs frequently but not always; whenever it occurs, there is a corresponding Shoal TimeoutException in the logs.
Session size is around 50k.
The cluster has 2 nodes, both are full (not virtual) machines.
There is no traffic (test server), just one person using the app with one browser.
The issue happens whether or not you use a load balancer, even when hitting the server directly, although it's much easier to reproduce with a sticky-session load balancer (apache/mod-proxy-ajp).

Affected Versions

[3.1.1]

glassfishrobot commented 5 years ago

Issue Imported From: https://github.com/javaee/glassfish/issues/17504
Original Issue Raised By: @glassfishrobot
Original Issue Assigned To: @yaminikb
Original Issue Closed By: @glassfishrobot

glassfishrobot commented 12 years ago

@glassfishrobot Commented shreedhar_ganapathy said: A couple of questions: 1. Have you tried using any of the supported LBs with sticky sessions enabled, i.e. Apache with mod_jk (we don't support mod_proxy_ajp yet, although I suspect this may not be contributing to the issue), or Oracle iPlanet Web Server/Apache/IIS with the GlassFish LB Plugin?

2. Does your app employ Ajax calls? Ajax-based requests may cause the request version number within a session to be incremented before a given request has completed replication and returned. As a result, the session may not be found for that incremented version number, causing a new session to be created. To work around this, you have to place the relaxCacheVersionSemantics property in the glassfish-web.xml descriptor.

Here's a snippet
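As a rough illustration (not the snippet originally posted), the property is usually placed under manager-properties in glassfish-web.xml. The Java sketch below embeds such a descriptor as a string and checks that the property is set. The descriptor layout, the persistence-type value, and the class name RelaxCacheVersionCheck are assumptions to verify against the DTD for your GlassFish version.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class RelaxCacheVersionCheck {

    // Assumed glassfish-web.xml layout; check it against your GlassFish version.
    static final String DESCRIPTOR =
          "<glassfish-web-app>\n"
        + "  <session-config>\n"
        + "    <session-manager persistence-type=\"replicated\">\n"
        + "      <manager-properties>\n"
        + "        <property name=\"relaxCacheVersionSemantics\" value=\"true\"/>\n"
        + "      </manager-properties>\n"
        + "    </session-manager>\n"
        + "  </session-config>\n"
        + "</glassfish-web-app>\n";

    /** Returns true if a manager-properties property named relaxCacheVersionSemantics is "true". */
    static boolean relaxEnabled(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList managers = doc.getElementsByTagName("manager-properties");
            for (int i = 0; i < managers.getLength(); i++) {
                NodeList props = ((Element) managers.item(i)).getElementsByTagName("property");
                for (int j = 0; j < props.getLength(); j++) {
                    Element p = (Element) props.item(j);
                    if ("relaxCacheVersionSemantics".equals(p.getAttribute("name"))) {
                        return "true".equalsIgnoreCase(p.getAttribute("value"));
                    }
                }
            }
            return false;
        } catch (Exception e) {
            // Malformed XML or parser setup failure.
            throw new IllegalStateException("Could not parse descriptor", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("relaxCacheVersionSemantics enabled: " + relaxEnabled(DESCRIPTOR));
    }
}
```

A check like this could run as part of a build to catch a descriptor that was packaged without the property.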

Let us know if any of the above resolves/reproduces the issue with more information. At our end we are trying to reproduce with our apps but cannot reproduce.

If it's possible to share your app, that would also help.

glassfishrobot commented 12 years ago

@glassfishrobot Commented lprimak said: I tried the relaxCacheVersionSemantics=true setting, but it did not work.

I did not try the load balancer, but that isn't even involved. When I hit Glassfish server directly without any load balancer, the issue still exists. I cannot share the application, because it is very database-driven and I can't give access to that for obvious reasons.

glassfishrobot commented 12 years ago

@glassfishrobot Commented mk111283 said: You said that there are two machines involved. When you use a browser, typically, the cookies (for example JSESSIONID) will NOT be sent to the second machine. This could be the reason why session failover may not be working for you. (However, this won't be an issue if you use an LB.)

I suggest you set up an LB and try your app, OR you can create a cluster of two instances that run on the same machine. If you have the instances running on the same machine, you can use a browser to jump from one instance to another.

Hope this helps.

glassfishrobot commented 12 years ago

@glassfishrobot Commented lprimak said: Believe me, the load balancer is not the issue. This is in no way related to the load balancer. I tried all kinds of setups, with and without a load balancer, with no results. Trying to isolate the problem is how I arrived at the test without a load balancer.

In my tests without a load balancer I hit only one instance, so replication wasn't even used, but the session loss was still there, and the timeout exception too. The timeout exception is the key here and needs to be found and fixed.

This issue is going on in multiple environments and reported by multiple users. This is not an invalid issue. This is not an environmental issue. This is a bug in glassfish and shoal in particular. This is not an operator error. This has been going for more than a year with lots of people trying lots of things to fix it with no results.

glassfishrobot commented 12 years ago

@glassfishrobot Commented tushuu said: I observed the same behavior with logs indicating LoadRequestCommand timed out while waiting for result java.util.concurrent.TimeoutException. I have deployed a 2 instance cluster spanning two separate physical hosts. Instances keep on logging such WARNING messages. I have fronted the cluster with Apache mod_jk LB.

glassfishrobot commented 12 years ago

@glassfishrobot Commented michalkurtak said: Hello. We are observing the same problem. We have a 2-node cluster with 4 instances (2 instances per node). The cluster is very, very slow. This is obvious when static content (e.g. images) is served in parallel from the GlassFish servers: 120B images are served in 3-4 seconds. We have haproxy as a sticky-session load balancer in front of the cluster, so requests arrive on the same instance, and the session is still lost.

glassfishrobot commented 12 years ago

@glassfishrobot Commented mk111283 said: The Shoal replication module currently doesn't handle the case when there are no replication partners. It just attempts to replicate data even if there are no replica instances running. I have a patch that fixes the issue. It will be available in the next Shoal promotion.

Regarding the TimeoutException, it is thrown only when a load request (to load a session from a replica) fails to complete within a reasonable time. It is not the root cause itself. The real task is to identify why the sessions are not found.
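To make the mechanism above concrete, here is a minimal sketch of a bounded wait on a load request. It is not Shoal's code: the class and method names are hypothetical, and the only real API it relies on is Future.get(timeout, unit), which throws java.util.concurrent.TimeoutException when the result does not arrive in time.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class LoadRequestTimeoutSketch {

    /**
     * Waits up to timeoutMillis for the (simulated) replica to answer.
     * Returns null when no answer arrives in time, which a caller would
     * treat as "session not found".
     */
    static String loadFromReplica(Callable<String> replicaCall, long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<String> pending = pool.submit(replicaCall);
            try {
                return pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // The symptom seen in the logs: the wait expired, not the root cause.
                pending.cancel(true);
                return null;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return null;
            } catch (ExecutionException e) {
                return null;
            }
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(loadFromReplica(() -> "state", 1000));
        System.out.println(loadFromReplica(() -> { Thread.sleep(5000); return "late"; }, 50));
    }
}
```

In this model every lookup that misses costs the caller the full timeout, which is consistent with the slowness and the session loss appearing together.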

I had a discussion with the web container team, and it looks like there is a race condition when AJAX calls are involved. I am working on the fix for this as well.

glassfishrobot commented 12 years ago

@glassfishrobot Commented lprimak said: Can you elaborate a bit further on this? I have two instances in the cluster, shouldn't they be the required replication partners? Also, I have no AJAX calls, but the web site can trigger a race condition by just retrieving many URLs in parallel with the same session ID cookie.

glassfishrobot commented 12 years ago

@glassfishrobot Commented mk111283 said:

"Can you elaborate a bit further on this? I have two instances in the cluster, shouldn't they be the required replication partners? Also, I have no AJAX calls, but the web site can trigger a race condition by just retrieving many URLs in parallel with the same session ID cookie."

If you have two instances, then they are discovered and one will act as a replica for the other. You mentioned that there could be parallel threads accessing the same session. This is exactly what AJAX-type applications do. In this case, the web container will issue a bunch of save (or updateTimeStamp) calls to the replication module in parallel for the same session ID. Either the web container or the replication module (or both) needs to handle concurrent saving of the same session properly. This issue is exactly the same as 17344.
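One possible way to handle concurrent saves of the same session is to guard each write with a version number, so that a late-arriving older save cannot overwrite a newer one. The sketch below is an illustration of that idea only; GuardedSessionStore and its methods are hypothetical names, and nothing here is taken from the GlassFish or Shoal sources.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class GuardedSessionStore {

    /** Immutable snapshot of a session's state at a given version. */
    static final class Snapshot {
        final long version;
        final String state;

        Snapshot(long version, String state) {
            this.version = version;
            this.state = state;
        }
    }

    private final ConcurrentMap<String, Snapshot> replicas = new ConcurrentHashMap<>();

    /** Stores the snapshot unless a newer version for the same session ID is already present. */
    void save(String sessionId, long version, String state) {
        Snapshot incoming = new Snapshot(version, state);
        // merge() runs atomically per key, so parallel saves cannot interleave.
        replicas.merge(sessionId, incoming,
            (current, candidate) -> candidate.version > current.version ? candidate : current);
    }

    /** Returns the stored state, or null if the session is unknown. */
    String load(String sessionId) {
        Snapshot s = replicas.get(sessionId);
        return s == null ? null : s.state;
    }

    public static void main(String[] args) {
        GuardedSessionStore store = new GuardedSessionStore();
        store.save("s1", 2, "cart=2"); // newer save arrives first
        store.save("s1", 1, "cart=1"); // older save arrives late and is ignored
        System.out.println(store.load("s1"));
    }
}
```

The design choice here is to resolve the race at the store rather than in the callers, so the order in which parallel request threads finish no longer matters.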

glassfishrobot commented 12 years ago

@glassfishrobot Commented lprimak said: Is there a Glassfish plugin or a version that I can test with now to see if this is really true? I remember that the Shoal team cannot reproduce the problem right now, and I would love to confirm that this indeed is the cause of the problem. Thanks!

glassfishrobot commented 12 years ago

@glassfishrobot Commented mk111283 said: There is no plugin to test this. We are trying to provide a fix for this issue in 3.1.2. Can you please reproduce the issue using your app on 3.1.2 (using the latest nightly build)?

I am currently testing a patch that fixes this issue. Once the patch is ready and integrated, you can pick up the next available promoted build of 3.1.2 to test it.

glassfishrobot commented 12 years ago

@glassfishrobot Commented mk111283 said: I have the patch ready (using GlassFish 3.1.2 trunk).

If you have your tests set up on 3.1.2, I can post the patch.

I will check it in after code review.

glassfishrobot commented 12 years ago

@glassfishrobot Commented lprimak said: Can you post a binary release somewhere so I don't have to compile, apply patch and download? I never built from source and would like to avoid it if possible. Thanks

glassfishrobot commented 12 years ago

@glassfishrobot Commented jjackb said: I am also interested in a binary release to test the patch, because this might be the solution to this bug: http://java.net/jira/browse/GLASSFISH-15575. It was reported in early 2011 during GF 3.1 beta testing and describes the same problem and log entries. The problem still exists with GF 3.1.1 in a production environment.

glassfishrobot commented 12 years ago

@glassfishrobot Commented mk111283 said: The promoted builds are available at:

http://dlc.sun.com.edgesuite.net/glassfish/3.1.2/promoted/

Please wait for the next promotion

glassfishrobot commented 12 years ago

@glassfishrobot Commented mk111283 said: You also need to update your sun-web.xml with the following content:

Note that the manager-properties contains: relaxCacheVersionSemantics

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE sun-web-app PUBLIC "-//Sun Microsystems, Inc.//DTD Application Server 9.0 Servlet 2.5//EN" "http://www.sun.com/software/appserver/dtds/sun-web-app_2_5-0.dtd">

/ctestservlet

Keep a copy of the generated servlet class java code.

glassfishrobot commented 12 years ago

@glassfishrobot Commented mk111283 said: I am going to close this issue. If you update the sun-web.xml with the relaxCacheVersionSemantics property, you should see a considerable increase in performance.

Without relaxCacheVersionSemantics, there were too many requests to load the session from the replica instance, causing a considerable delay in loading the page.

I will keep issue number 17344 open though (http://java.net/jira/browse/GLASSFISH-17344)
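The effect described above can be pictured with a toy model of a version-checked local cache. It assumes that, with the property enabled, the container accepts whatever version of the session is in its local cache, while strict mode treats a cached copy older than the requested version as a miss and asks the replica. The class and counter names are hypothetical, and the real container logic is more involved.

```java
import java.util.HashMap;
import java.util.Map;

final class VersionedCacheSketch {

    private final Map<String, Long> versions = new HashMap<>();
    private final Map<String, String> states = new HashMap<>();
    private int replicaLoads = 0;

    /** Places a session copy with the given version into the local cache. */
    void cache(String sessionId, long version, String state) {
        versions.put(sessionId, version);
        states.put(sessionId, state);
    }

    /**
     * Returns the cached state if it is acceptable. Otherwise counts a
     * (simulated) load request to the replica and returns null.
     */
    String find(String sessionId, long requestedVersion, boolean relaxed) {
        Long cached = versions.get(sessionId);
        if (cached != null && (relaxed || cached >= requestedVersion)) {
            return states.get(sessionId);
        }
        replicaLoads++; // strict mode: cached copy is too old, go to the replica
        return null;
    }

    int replicaLoads() {
        return replicaLoads;
    }

    public static void main(String[] args) {
        VersionedCacheSketch strict = new VersionedCacheSketch();
        strict.cache("s1", 3, "cart=2");
        // Parallel requests for one page each carry a version ahead of the cached copy.
        for (long v = 4; v <= 10; v++) {
            strict.find("s1", v, false);
        }
        System.out.println("strict replica loads: " + strict.replicaLoads());
    }
}
```

In this model a page that triggers several parallel requests turns every one of them into a replica lookup under strict semantics and none under relaxed semantics.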

glassfishrobot commented 12 years ago

@glassfishrobot Commented @barchetta said: From Mahesh:

What is the impact on the customer of the bug? Moderate.
How likely is it that a customer will see the bug and how serious is the bug? Customers who are using AJAX will face this issue.
Is it a regression? Does it meet other bug fix criteria (security, performance, etc.)? Yes. 2.x handled AJAX related calls well
What is the cost/risk of fixing the bug? Moderate

How risky is the fix? How much work is the fix? Is the fix complicated? Moderate. The fix is straightforward, but I had to touch 9 files (all in the failover/replication module).

Is there an impact on documentation or message strings? No changes to docs required, since the 2.x documentation already talks about the AJAX-related settings that must be specified in glassfish-web.xml.
Which tests should QA (re)run to verify the fix did not destabilize GlassFish? SQE HA tests. These tests have been run with the patch and all passed.
Which is the targeted build of 3.1.2 for this fix? Next build

glassfishrobot commented 12 years ago

@glassfishrobot Commented mk111283 said: Tested with the ctestservlet mentioned in 15575. The real issue here is that the app uses multiple GIFs/JPEGs, which causes the browser to make concurrent requests to the server. Due to the absence of relaxCacheVersionSemantics in sun-web.xml, the web container makes approximately 7 load_requests to the replication layer for every page access!

Some of the load_requests were lost because we do batching (using a map) based on sessionid.

I have fixed the loss of load_requests with a fix to Shoal (commit version 1732). After adding relaxCacheVersionSemantics to the app, there was no session loss.
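The loss of load_requests under map-based batching can be illustrated with a small sketch: when pending requests are keyed only by session ID, a later request for the same session replaces the earlier one. This is a hypothetical reconstruction of the failure mode, not Shoal's code, and grouping requests per key is shown only as one possible remedy; the actual fix in commit 1732 is not described in this thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LoadRequestBatchSketch {

    /** Each request is a pair {sessionId, requestId}. Keeps one request per session ID. */
    static Map<String, String> batchOverwriting(List<String[]> requests) {
        Map<String, String> batch = new LinkedHashMap<>();
        for (String[] r : requests) {
            batch.put(r[0], r[1]); // a later request silently replaces the earlier one
        }
        return batch;
    }

    /** Keeps every request by grouping them under their session ID. */
    static Map<String, List<String>> batchGrouping(List<String[]> requests) {
        Map<String, List<String>> batch = new LinkedHashMap<>();
        for (String[] r : requests) {
            batch.computeIfAbsent(r[0], k -> new ArrayList<>()).add(r[1]);
        }
        return batch;
    }

    public static void main(String[] args) {
        List<String[]> requests = new ArrayList<>();
        for (int i = 0; i < 7; i++) {
            requests.add(new String[] {"s1", "req-" + i});
        }
        System.out.println("overwriting batch size: " + batchOverwriting(requests).size());
        System.out.println("grouped requests for s1: " + batchGrouping(requests).get("s1").size());
    }
}
```

With seven parallel requests for one session, the overwriting batch answers only one of them, and the other six callers would wait until their timeout expires.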

"cluster has 2 nodes, both are full (not virtual) machines. There is no traffic (test server) just sitting trying to use the app with one browser"

I would like to add that if there are multiple physical machines, you have to use a load balancer; otherwise the JSESSIONID cookie will not be automatically sent by the browser. This has nothing to do with replication or the web container. This is how browsers work.

glassfishrobot commented 12 years ago

@glassfishrobot Commented @jfialli said: Shoal 1.6.17 integrated into bg trunk as part of svn version 52009 on January 10, 2012. Fix should be in next promoted build which is 4.0 b19.

glassfishrobot commented 12 years ago

@glassfishrobot Commented lprimak said: Looks like this is confirmed fixed now. Thanks a lot for your efforts. I didn't even need to do this: and it still works great!

glassfishrobot commented 12 years ago

@glassfishrobot Commented lprimak said: Looks like the replication problems are not fixed in 3.1.2b20. Some session attributes are getting lost, seemingly being overwritten by another node in the cluster with older data. The TimeoutExceptions and slow performance are fixed, though.

I opened another issue regarding this: http://java.net/jira/browse/GLASSFISH-18322

glassfishrobot commented 12 years ago

@glassfishrobot Commented File: server_logs.zip Attached By: lprimak

glassfishrobot commented 12 years ago

@glassfishrobot Commented Was assigned to mk111283

glassfishrobot commented 7 years ago

@glassfishrobot Commented This issue was imported from java.net JIRA GLASSFISH-17504

glassfishrobot commented 12 years ago

@glassfishrobot Commented Reported by lprimak

glassfishrobot commented 12 years ago

@glassfishrobot Commented Marked as fixed on Friday, January 6th 2012, 8:53:27 am
