eclipse-ee4j / glassfish

Eclipse GlassFish
https://eclipse-ee4j.github.io/glassfish/

High Availability (HA) webapps slow, corrupted sessions, and java.util.concurrent.TimeoutException #17504

Closed glassfishrobot closed 12 years ago

glassfishrobot commented 12 years ago

I set up a cluster, and deployed my JSP application onto it. It works great until I turn on high-availability for this application via the Admin console. Once I do that, it becomes very slow, and session state gets lost every 2 requests or so. Disabling high-availability cures the problem.

I did run validate-multicast, GMS is running, cluster health is good, I followed the documentation, and didn't do anything 'weird' or 'customized'. I also have in my web application.
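For context, GlassFish only replicates HTTP sessions for web applications that are marked as distributable in web.xml. A minimal sketch of such a descriptor follows; it assumes a Servlet 3.0 application and is not the reporter's actual file, which is not shown in this thread:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical minimal web.xml; only the distributable element matters here -->
<web-app xmlns="http://java.sun.com/xml/ns/javaee"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_3_0.xsd"
         version="3.0">
  <!-- Declares that the application is safe to run in a cluster,
       which the container requires before it replicates sessions -->
  <distributable/>
</web-app>
```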

There are no errors in the log files. When I turn on high-availability, I do get this error very frequently: [#|2011-03-06T02:13:00.297-0500|WARNING|glassfish3.1|org.shoal.ha.cache.command.load_request|_ThreadID=27;_ThreadName=Thread-1;|LoadRequestCommand timed out while waiting for result java.util.concurrent.TimeoutException|#]

Environment

although it's much easier to reproduce with a sticky-session load balancer (apache/mod-proxy-ajp)

Affected Versions

[3.1.1]

glassfishrobot commented 12 years ago

@glassfishrobot Commented shreedhar_ganapathy said: Couple of questions: 1. Have you tried using any of the supported LBs with sticky sessions enabled, i.e. Apache with mod_jk (we don't support mod_proxy_ajp yet, although I suspect this may not be contributing to the issue), or Oracle iPlanet Web Server/Apache/IIS with the GlassFish LB Plugin?

2. Does your app employ Ajax calls? Ajax-based requests may cause the request version number within a session to be incremented before a given request has completed replication and returned. As a result, the session may not be found for that incremented version number, causing a new session to be created. To work around this, you have to place the relaxCacheVersionSemantics property in the glassfish-web.xml descriptor.

Here's a snippet
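The snippet itself is missing from this copy of the thread. As a rough sketch of what such a configuration typically looks like, assuming the property is set under manager-properties of the session manager (as a later comment in this thread states) and that the replicated persistence type is in use:

```xml
<!-- Fragment of glassfish-web.xml; a sketch, not the commenter's original snippet -->
<glassfish-web-app>
  <session-config>
    <!-- persistence-type="replicated" is an assumption for an HA-enabled app -->
    <session-manager persistence-type="replicated">
      <manager-properties>
        <!-- Lets each requesting thread use whatever session version is in the
             active cache instead of insisting on an exact version number -->
        <property name="relaxCacheVersionSemantics" value="true"/>
      </manager-properties>
    </session-manager>
  </session-config>
</glassfish-web-app>
```

The application has to be redeployed for a descriptor change like this to take effect.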

Let us know if any of the above resolves/reproduces the issue with more information. At our end we are trying to reproduce with our apps but cannot reproduce.

If its possible to share your app, that would also help.

glassfishrobot commented 12 years ago

@glassfishrobot Commented lprimak said: I tried relaxCacheVersionSemantics=true code, but it did not work.

I did not try the load balancer, but that isn't even involved. When I hit Glassfish server directly without any load balancer, the issue still exists. I cannot share the application, because it is very database-driven and I can't give access to that for obvious reasons.

glassfishrobot commented 12 years ago

@glassfishrobot Commented mk111283 said: You said that there are two machines involved. When you use a browser, typically, the cookies (for example JSESSIONID) will NOT be sent to the second machine. This could be the reason why session failover may not be working for you. (However, this won't be an issue if you use an LB.)

I suggest you set up an LB and try your app. Or you can create a cluster of two instances that run on the same machine. If you have the instances running on the same machine, you can use a browser to jump from one instance to another.

Hope this helps.

glassfishrobot commented 12 years ago

@glassfishrobot Commented lprimak said: Believe me, the load balancer is not the issue. This is in no way related to the load balancer. I tried all kinds of setups with no results, both with and without a load balancer. Trying to isolate the problem is how I came up with the test without a load balancer.

In my tests without a load balancer I hit only one instance, so replication wasn't even used, but the session loss was still there and the timeout exception too. The timeout exception is the key here and needs to be found and fixed.

This issue is going on in multiple environments and reported by multiple users. This is not an invalid issue. This is not an environmental issue. This is a bug in glassfish and shoal in particular. This is not an operator error. This has been going for more than a year with lots of people trying lots of things to fix it with no results.

glassfishrobot commented 12 years ago

@glassfishrobot Commented tushuu said: I observed the same behavior with logs indicating LoadRequestCommand timed out while waiting for result java.util.concurrent.TimeoutException. I have deployed a 2 instance cluster spanning two separate physical hosts. Instances keep on logging such WARNING messages. I have fronted the cluster with Apache mod_jk LB.

glassfishrobot commented 12 years ago

@glassfishrobot Commented michalkurtak said: Hello. We are observing the same problem. We have a 2 node cluster with 4 instances (2 instances per node). The cluster is very, very slow. This is obvious when static content (e.g. images) is served in parallel from the GlassFish servers: 120 B images are served in 3-4 seconds. We have haproxy as a sticky-session load balancer in front of the cluster, so requests arrive on the same instance, and the session is lost.

We have this message in logs: [#|2011-11-15T16:12:37.526+0100|WARNING|glassfish3.1|org.shoal.ha.cache.command.load_request|_ThreadID=47;_ThreadName=Thread-2;|LoadRequestCommand timed out while waiting for result java.util.concurrent.TimeoutException|#]

glassfishrobot commented 12 years ago

@glassfishrobot Commented mk111283 said: The Shoal replication module currently doesn't handle the case when there are no replication partners. It just attempts to replicate data even if there are no replica instances running. I have a patch that fixes the issue. It will be available in the next Shoal promotion.

Regarding the TimeoutException, it is thrown only when a load request (to load a session from a replica) fails to complete within a reasonable time. This is not the root cause itself. We still need to identify why the sessions are not found.

Had a discussion with the web container team and it looks like there is a race condition when Ajax calls are involved. I am working on the fix for this as well.

glassfishrobot commented 12 years ago

@glassfishrobot Commented lprimak said: Can you elaborate a bit further on this? I have two instances in the cluster, shouldn't they be the required replication partners? Also, I have no AJAX calls, but the web site can trigger a race condition by just retrieving many URLs in parallel with the same session ID cookie.

glassfishrobot commented 12 years ago

@glassfishrobot Commented mk111283 said:

"Can you elaborate a bit further on this? I have two instances in the cluster, shouldn't they be the required replication partners? Also, I have no AJAX calls, but the web site can trigger a race condition by just retrieving many URLs in parallel with the same session ID cookie."

If you have two instances, then they are discovered and one will act as a replica for the other. You mentioned that there could be parallel threads accessing the same session. This is exactly what AJAX type applications do. In this case, the web container will issue a bunch of save (or updateTimeStamp) calls to the replication module in parallel for the same session ID. Either the web container or the replication module (or both) needs to handle concurrent saving of the same session properly. This issue is exactly the same as 17344.

glassfishrobot commented 12 years ago

@glassfishrobot Commented lprimak said: Is there a Glassfish plugin or a version that I can test with now to see if this is really true? I remember that the Shoal team cannot reproduce the problem right now, and I would love to confirm that this indeed is the cause of the problem. Thanks!

glassfishrobot commented 12 years ago

@glassfishrobot Commented mk111283 said: There is no plugin to test this. We are trying to provide a fix for this issue in 3.1.2. Can you please reproduce the issue using your app on 3.1.2 (using the latest nightly build)?

I am currently testing a patch that fixes this issue. Once the patch is ready and integrated, you can pick up the next available promoted build of 3.1.2 to test it.

glassfishrobot commented 12 years ago

@glassfishrobot Commented mk111283 said: I have the patch ready (using GlassFish 3.1.2 trunk).

If you have your tests set up on 3.1.2, I can post the patch.

Will check in after code review.

glassfishrobot commented 12 years ago

@glassfishrobot Commented lprimak said: Can you post a binary release somewhere so I don't have to compile, apply patch and download? I never built from source and would like to avoid it if possible. Thanks

glassfishrobot commented 12 years ago

@glassfishrobot Commented jjackb said: I am also interested in a binary release to test the patch because this might be the solution to this bug: http://java.net/jira/browse/GLASSFISH-15575 -> reported in early 2011 during gf 3.1 beta-testing and describes the same problem and log entries -> problem still exists with gf 3.1.1 in production environment

glassfishrobot commented 12 years ago

@glassfishrobot Commented mk111283 said: The promoted builds are available at:

http://dlc.sun.com.edgesuite.net/glassfish/3.1.2/promoted/

Please wait for the next promotion

glassfishrobot commented 12 years ago

@glassfishrobot Commented mk111283 said: You also need to update your sun-web.xml with the following content:

Note that the manager-properties contains: relaxCacheVersionSemantics

<?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE sun-web-app PUBLIC "-//Sun Microsystems, Inc.//DTD Application Server 9.0 Servlet 2.5//EN" "http://www.sun.com/software/appserver/dtds/sun-web-app_2_5-0.dtd">

/ctestservlet Keep a copy of the generated servlet class java code.
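The markup of the descriptor above appears to have been lost when the issue was imported. Below is a sketch of what a sun-web.xml carrying this property could look like. It assumes the /ctestservlet context root and the stock keepgenerated JSP setting that the surviving text hints at. The element layout is a reconstruction, not the commenter's original file:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE sun-web-app PUBLIC "-//Sun Microsystems, Inc.//DTD Application Server 9.0 Servlet 2.5//EN" "http://www.sun.com/software/appserver/dtds/sun-web-app_2_5-0.dtd">
<sun-web-app>
  <!-- Context root taken from the surviving text of the original snippet -->
  <context-root>/ctestservlet</context-root>
  <session-config>
    <session-manager>
      <manager-properties>
        <!-- The property this comment asks users to add -->
        <property name="relaxCacheVersionSemantics" value="true"/>
      </manager-properties>
    </session-manager>
  </session-config>
  <jsp-config>
    <!-- Assumed to be the usual IDE-generated setting matching the surviving description -->
    <property name="keepgenerated" value="true">
      <description>Keep a copy of the generated servlet class java code.</description>
    </property>
  </jsp-config>
</sun-web-app>
```
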
glassfishrobot commented 12 years ago

@glassfishrobot Commented mk111283 said: I am going to close this issue. If you update the sun-web.xml with the relaxCacheVersionSemantics, you should see a considerable increase in performance.

Without the relaxCacheVersionSemantics, there were too many load requests to load the session from replica instance causing a considerable delay in loading the page.

I will keep issue number 17344 open though (http://java.net/jira/browse/GLASSFISH-17344)

glassfishrobot commented 12 years ago

@glassfishrobot Commented @barchetta said: From Mahesh:

How risky is the fix? How much work is the fix? Is the fix complicated? Moderate. The fix is straightforward, but I had to touch 9 files (all in the failover / replication module).

glassfishrobot commented 12 years ago

@glassfishrobot Commented mk111283 said: Tested with the ctestservlet mentioned in 15575. The real issue here is that the app uses multiple gifs/jpegs, which causes the browser to make concurrent requests to the server. Due to the absence of relaxCacheVersionSemantics in sun-web.xml, the web container makes approximately 7 load_requests to the replication layer for every page access!

Some of the load_requests were lost because we do batching (using a map) based on sessionid.

I have fixed the loss of load_requests with a fix to Shoal (commit version 1732). After adding relaxCacheVersionSemantics to the app, there was no session loss.

"cluster has 2 nodes, both are full (not virtual) machines. There is no traffic (test server) just sitting trying to use the app with one browser"

I would like to add that if there are multiple physical machines, you have to use a load balancer, otherwise the jsessionid cookie will not be automatically sent by the browser. This has nothing to do with replication or the web container. This is how browsers work.

glassfishrobot commented 12 years ago

@glassfishrobot Commented @jfialli said: Shoal 1.6.17 integrated into bg trunk as part of svn version 52009 on January 10, 2012. Fix should be in next promoted build which is 4.0 b19.

glassfishrobot commented 12 years ago

@glassfishrobot Commented lprimak said: Looks like this is confirmed fixed now. Thanks a lot for your efforts. I didn't even need to do this: and it still works great!

glassfishrobot commented 12 years ago

@glassfishrobot Commented lprimak said: Looks like the replication problems are not fixed in 3.1.2b20. Some session attributes are getting lost, seemingly being overwritten by another node in the cluster with older data. The TimeoutExceptions and slow performance are fixed though.

I opened another issue regarding this: http://java.net/jira/browse/GLASSFISH-18322

glassfishrobot commented 12 years ago

@glassfishrobot Commented File: server_logs.zip Attached By: lprimak

glassfishrobot commented 12 years ago

@glassfishrobot Commented Was assigned to mk111283

glassfishrobot commented 7 years ago

@glassfishrobot Commented This issue was imported from java.net JIRA GLASSFISH-17504

glassfishrobot commented 12 years ago

@glassfishrobot Commented Reported by lprimak

glassfishrobot commented 12 years ago

@glassfishrobot Commented Marked as fixed on Friday, January 6th 2012, 8:53:27 am