eclipse-ee4j / glassfish

Eclipse GlassFish
https://eclipse-ee4j.github.io/glassfish/
371 stars 138 forks source link

occasional session lost introduced by fixing stale data bug #15016 #15575

Closed glassfishrobot closed 12 years ago

glassfishrobot commented 13 years ago

Hello .. I think GF 3.1 has a critical bug somewhere near ReplicationManagerBase. Because there is no fixed way to reproduce the bug, I will do my best to describe the conditions under which the bug occurs.

Description: The test environment consists of a loadbalancer which is balancing requests to two gf31 instances created in a gf31 cluster. The requests are round robin balanced. When doing round robin requests to the two gf31 cluster instances (SimpleSession.war deployed), the session sometimes gets lost.

If you are doing slow requests to the loadbalancer (eg. one each second), mostly everything works fine but sometimes the session is lost. If you fire up two requests with an offset of some milliseconds (eg hammering the reload button of your browser), most times the session will be lost. The session also gets lost, by holding down the reload button to fire up lots of requests. But the session isn't lost for every request nor does it happen more frequently the more requests you send. If you hold down the browsers reload button, sometimes it happens after 10 requests - sometimes after 500 - there is no real pattern. Some tests with enabled relaxCacheVersionSemantics ( in sun-web.xml) shows that the session is not dropped but sometimes an old session will be loaded as the session version is not evaluated. When setting the property back to false, I guess, because of these old sessions the new session are created. But with enabled relaxCacheVersionSemantics our web application won't work as expected because it relies on proper session data.

Having the same test environment but switching from gf3.1 cluster to latest gf2.1 cluster everything works like expected (without the relaxCacheVersionSemantics; not sure if gf2.1 actually uses it). Requests can be fired at the cluster with nearly 0ms offset and you will have the same session (with correct session data) for every request.

This bug/behaviour was introduced with the fix of issue #15016. Before stale data checking, the session was never lost (but also session data was never syncronized when found in local cache). The bug/behaviour exists from glassfish-3.1-b34-12_23_2010.zip to glassfish-3.1-b37-01_11_2011.zip.

The problem seems to be a performance issue or a concurrency problem of the backingstore in gf31.

Since gf21 is able to handle a high load of sequential requests and gf31 doesn't, I think this should be fixed before any production recommendations of gf31. Thus this bug is marked as blocker.

If something is unclear, please ask and I'll do my best do describe further details. If you like, I can also provide you temporarly access to our test environment via email.

I'll attach two log files:

Output from SimpleSession

Executed From Instance: gf01-instance01

Sessions Example

Session ID: a6e320ea2027287900aff441a87a Session is new: true Executed From Instance: gf01-instance01 Created: Wed Jan 12 14:33:18 CET 2011 Last Accessed: Wed Jan 12 14:33:18 CET 2011 Requested Session ID: a6d938138d1e5f754a4538d56d31 Requested Session ID is valid: false Requested Session ID is from a cookie: true Requested Session ID is from a URL: false The following data is in your session: jreplicaLocation = gf02-instance01

Affected Versions

[3.1_dev]

glassfishrobot commented 5 years ago
glassfishrobot commented 13 years ago

@glassfishrobot Commented jjackb said: instance 01 gets request, could not load requested session id and creates a new

instance 02 is jreplicaLocation

glassfishrobot commented 13 years ago

@glassfishrobot Commented jjackb said: SimpleSession.war attached

glassfishrobot commented 13 years ago

@glassfishrobot Commented mk111283 said: Hi jjackb, Can you post the URLs that you used to access the attached (SimpleSession) app? What's the 'think time' between two consecutive requests for the same sessionID (is it zero?)

I will see if I can reproduce this

glassfishrobot commented 13 years ago

@glassfishrobot Commented jjackb said: Hi, if you deployed the app to /SimpleSession, you can access the Servlet with the following URL:

http:///SimpleSession/SimpleSession

And for the 'think time': There is no rule, when the bug occours. For testing, you can begin with a large 'think time' and decrease it to nearly zero. Sometimes the bug occours with up to a second 'think time'.

Please write a private mail, if you need/like to get access to our test-environment.

Thank you.

glassfishrobot commented 13 years ago

@glassfishrobot Commented mk111283 said: Hi jjackb: Can you apply the attached shoal-cache.jar to the modules directory and set the system property:

-Dorg.shoal.cache.transmitter.frequency.in.millis=10

Let me know if the above solves this issue.

glassfishrobot commented 13 years ago

@glassfishrobot Commented jjackb said: Hi,

I replaced the jar in the modules directory on both instances and updated the cluster configuration. But it does not resolve the problem.

To clarify what happens here, I made two screencasts:

GF31_SimpleSession.m4v: Shows the SimpleSession servlet running on an gf31 cluster (with your updated jar and jvm option)

GF21_SimpleSession.m4v: Shows the SimpleSession servlet running on an gf21 cluster (default configuration)

In both casts the servlet is accessed through Safari and reloading is done by manually hitting the reload button.

As you can see, the gf21 cluster never drops the session (although requests are more frequent than to the gf31 cluster).

The gf31 instances are started with the following command:

/usr/lib/jvm/java-6-sun-1.6.0.22/bin/java cp /home/ccadmin/glassfish3/glassfish/modules/glassfish.jar -XX:+UnlockDiagnosticVMOptions -XX:MaxPermSize=192m -XX:NewRatio=2 -Xmx512m -javaagent:/home/ccadmin/glassfish3/glassfish/lib/monitor/btrace-agent.jar=unsafe=true,noServer=true -server -Dosgi.shell.telnet.maxconn=1 -Djdbc.drivers=org.apache.derby.jdbc.ClientDriver -Dfelix.fileinstall.dir=/home/ccadmin/glassfish3/glassfish/modules/autostart/ -Djavax.net.ssl.keyStore=/home/ccadmin/glassfish3/glassfish/nodes/gf02-node/gf02-instance01/config/keystore.jks -Dosgi.shell.telnet.port=26666 -Djava.security.policy=/home/ccadmin/glassfish3/glassfish/nodes/gf02-node/gf02-instance01/config/server.policy -Dfelix.fileinstall.log.level=3 -Dfelix.fileinstall.poll=5000 -Dcom.sun.aas.instanceRoot=/home/ccadmin/glassfish3/glassfish/nodes/gf02-node/gf02-instance01 -Dosgi.shell.telnet.ip=127.0.0.1 -Dcom.sun.enterprise.config.config_environment_factory_class=com.sun.enterprise.config.serverbeans.AppserverConfigEnvironmentFactory -Djava.endorsed.dirs=/home/ccadmin/glassfish3/glassfish/modules/endorsed:/home/ccadmin/glassfish3/glassfish/lib/endorsed -Dcom.sun.aas.installRoot=/home/ccadmin/glassfish3/glassfish -Djava.ext.dirs=/usr/lib/jvm/java-6-sun-1.6.0.22/lib/ext:/usr/lib/jvm/java-6-sun-1.6.0.22/jre/lib/ext:/home/ccadmin/glassfish3/glassfish/nodes/gf02-node/gf02-instance01/lib/ext -Dfelix.fileinstall.bundles.startTransient=true -Dfelix.fileinstall.bundles.new.start=true -Djavax.net.ssl.trustStore=/home/ccadmin/glassfish3/glassfish/nodes/gf02-node/gf02-instance01/config/cacerts.jks -Dorg.shoal.cache.transmitter.frequency.in.millis=10 -Dcom.sun.enterprise.security.httpsOutboundKeyAlias=s1as -DANTLR_USE_DIRECT_CLASS_LOADING=true -Djava.security.auth.login.config=/home/ccadmin/glassfish3/glassfish/nodes/gf02-node/gf02-instance01/config/login.conf -Dgosh.args=noshutdown -c noop=true -Djava.library.path=/home/ccadmin/glassfish3/glassfish/lib:/usr/lib/jvm/java-6-sun-1.6.0.22/jre/lib/amd64/server:/usr/lib/jvm/java-6-sun-1.6.0.22/jre/lib/amd64:/usr/lib/jvm/java-6-sun-1.6.0.22/lib/amd64:/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib com.sun.enterprise.glassfish.bootstrap.ASMain -asadmin-args --host,,,gf01,,,port,,,4848,,,secure=false,,,terse=false,,,echo=false,,,interactive=false,,,start-local-instance,,,verbose=false,,,debug=false,,,-node,,,gf02-node,,,gf02-instance01 -instancename gf02-instance01 -verbose false -debug false -asadmin-classpath /home/ccadmin/glassfish3/glassfish/modules/admin-cli.jar -asadmin-classname com.sun.enterprise.admin.cli.AsadminMain -upgrade false -type INSTANCE -instancedir /home/ccadmin/glassfish3/glassfish/nodes/gf02-node/gf02-instance01 -read-stdin true

glassfishrobot commented 13 years ago

@glassfishrobot Commented mk111283 said: Trying to reproduce this with a non sticky round robin LB.

BTW, are you using HADB in V2.1?

glassfishrobot commented 13 years ago

@glassfishrobot Commented ai109478 said: Excluding from 3.1 count. For high availability support with 3.1, we will require a sticky load balancing.

glassfishrobot commented 13 years ago

@glassfishrobot Commented jjackb said: Hi,

I do not use HADB in v2.1. I use the default GF2.1 configuration (in memory replication).

Also I did some further investigation of the problem and here are my new findings:

First of all I used a newer nightly build (glassfish-3.1-b39-01_24_2011) but the problem still exists with SimpleSession.war. Then in addition to nginx as a load balancer I also tried HAproxy (http://haproxy.1wt.eu/) and I tested both load balancer with round robin AND sticky requests to our two glassfish instances (each one on a different physical host) in our test cluster.

So with two different load balancers and regardless of using round robin or sticky requests I occasionally get new sessions. Using the provided shoal-cache.jar with -Dorg.shoal.cache.transmitter.frequency.in.millis=10 results in the same behaviour (but with higher cpu load because of the higher frequency setting).

Then I recognized the two little images in the top right corner when accessing http://server-ip/SimpleSession/SimpleSession servlet. So each page reload effectivly fires three request to glassfish (urls are: http://server-ip/SimpleSession/images/code.gif and http://server-ip/SimpleSession/images/return.gif). Altough these additional requests will go through the load balancer they will land on the same glassfish cluster instance (regardless of round robin or sticky configuration). This is because the browser does a HTTP 1.1 request to load the page and thus using only one TCP connection for all three requests. The load balancer will then redirect all HTTP requests within the same TCP connection to the same host. But for testing I turned off image loading in my browsers (Safari and Firefox) and surprisingly the session stays the same.

So to be clear: Using SimpleSession.war and allowing the browser to download the two little icons you will loose your session occasionally regardless of using round robin requests or sticky requests and regardless of using nginx or HAproxy as load balancer. You could even access one of the two glassfish instances directly via its ip:port and you will loose your session sometimes on page reloads. So its not a load balancer issue at all. As soon as you turn off image loading in your browser you won't loose you session (maybe you will if it is possible to generate a high frequency of requests. I just emulated this by holding down the browsers refresh button which is maybe "to slow").

Next I stopped one of the two glassfish instances in the cluster. This should disable session replication on the running instance and when accessing SimpleSession.war on the remaining instance I do not loose my session. In this scenario its irrelevant if the browser loads these two little icons or if it doesn't.

During my testing I also encountered the following exception:

[#|2011-01-26T10:35:44.013+0100|WARNING|glassfish3.1|org.shoal.ha.cache.command.load_request|_ThreadID=24;_ThreadName=Thread-1;|LoadRequestCommand timed out while waiting for result java.util.concurrent.TimeoutException|#]

This exception will raise from time to time if you fire some more requests to the cluster. Even if the session stays the same this exception will occure sometimes.

I attached an additional video (gf31_session_issue.m4v) to demonstrate my findings. This is what I did in the video: 1.) Accessing instance01 on host gf01 (gf01-instance01) directly through ip:port to eliminate load balancer. So its basically sticky requests. Cluster contains 2 hosts/nodes gf01 and gf02 with one instance on each node (gf01-instance01 and gf02-instance01) 2.) Browser image loading enabled, both instances running => occasional new sessions 3.) Disable browser image loading, both instances running => session stays the same 4.) Enable browser images again, both instances running => occasional new sessions 5.) Stopping gf02-instance01 to disable replication. Image loading still enabled => session stays the same 6.) Starting gf02-instance01. Image loading still enabled => occasional new sessions

Please don't take me wrong but from my investigation I think it's wrong to tag this issue with 3_1-exclude and say that GF3.1 requires sticky load balancing. It's just wrong. This issue even occures with sticky load balancing and even if you do not use a load balancer at all and just access one instance of the cluster directly as long as two instances are running in the cluster.

I hope this helps you to better understand whats going on.

glassfishrobot commented 13 years ago

@glassfishrobot Commented mk111283 said: Hi Jjacb, I tried SimpleSession.war on my setup and used apache proxy balancer. I configured the lb to be non sticky and also set the system property -Dorg.shoal.cache.transmitter.frequency.in.millis=10

With the above setup, I just could not reproduce the issue.

Could you try with -Dorg.shoal.cache.transmitter.frequency.in.millis set to 5 or lower?

If you still see the isssue, could you do the following? try deploying SimpleSession.war as follows

asadmin deploy --target= --availabilityenabled=true --asyncreplication=false SimpleSession.war

I will attaching a new shoal-cache.jar patch shortly. Please try the new shoal-cache.jar after deploying the app the above mentioned manner.

One more thing: Please dont think that by excluding this from V3.1 doesnt mean that this bug wont get any attention. Try the patch and post your response

glassfishrobot commented 13 years ago

@glassfishrobot Commented @shingwaichan said: I see the same behavior.

In the fix of http://java.net/jira/browse/GLASSFISH-1057 , StandardHostValve.preInvoke() call getSession(false) which trigger the version increment.

glassfishrobot commented 13 years ago

@glassfishrobot Commented mk111283 said: shoal-cache.jar that fixes synchronous save. Use with asadmin deploy --availabilityenabled=true --asyncreplication=false --target=

glassfishrobot commented 13 years ago

@glassfishrobot Commented mk111283 said: I tested by accessing the gif in addition to accessing the servlet iself. Everything works fine.

Looks like you need to set -Dorg.shoal.cache.transmitter.frequency.in.millis to a lowver value. Or use adadmin deploy --asyncreplication=false --availabilityenabed=true --target= SimpleSession.war

glassfishrobot commented 13 years ago

@glassfishrobot Commented jjackb said: Hi,

even with -Dorg.shoal.cache.transmitter.frequency.in.millis set to 5 or 2, the session drop still exists for me. I also broadened my tests and deployed the SimpleSession app to a cluster on Windows 7 Pro and Mac OS X 10.6.6 (both only two local instance and without a loadbalancer). On both systems, I was able to reproduce the session drop, simply by reloading the SimpleSession servlet on the first instance.

I also tried the --asyncreplication=false option during deployment (with your updated shoal-chache.jar), but the result stays the same.

I will give you another update tomorrow with more details ..

edit: did you recognize a slowdown of the images included in the SimpleSession servlet?

glassfishrobot commented 13 years ago

@glassfishrobot Commented jjackb said: Hi,

sorry for the late response. I have been busy. I tried many things to find some instructions I can give to you to reproduce, but it still occurs randomly.

Here a short summary of my state:

Setup:

Currently I am testing on the web profile (glassfish-3.1-web-b41-02_06_2011), to exclude as many components as possible.

First thing to mention is: In the server.log of the instance the request goes to, there are still the following exceptions during a request:

[#|2011-02-07T13:38:31.239+0100|WARNING|glassfish3.1|org.shoal.ha.cache.command.load_request|_ThreadID=37;_ThreadName=Thread-1;|LoadRequestCommand timed out while waiting for result java.util.concurrent.TimeoutException|#]

Maybe you can get an idea, where the TimeoutException comes from. I have absolutely no idea. The test environment is connected directly with switching hardware. My guess is some kind of a concurrency "deadlock" in the code.

I am a bit curious why you cannot reproduce the session drop on your own. I did not do any special at all (even the load-balancer is not necessary). To assist you, I attached another small app (ctestservlet.war), which does basically the same as SimpleSession, but adds some more images. So requests are send faster and more frequently to an instance. For me, the session drop occurs mostly on every second to third reload of the ctestservlet.war servlet. When I reload the servlet, I wait for all the images to load and then wait about one to two seconds to do the next request.

You can access the ctestservlet.war servlet by calling http://gf-instance/ctestservlet/CTestServlet (or whatever your context root is)

Additionally, I was wondering how the JSESSIONIDVERSION is calculated. In my understanding it should be increased on every request. In the case of the CTestServlet there are 13 requests (12 images + servlet). But as you might see, the JSESSIONIDVERSION is not increased by at least 13 on each request. It is increased in a random manner.

Thank you! Hope we will find the reason for this soon ..

glassfishrobot commented 13 years ago

@glassfishrobot Commented sfordin said: With all the back-and-forth in the comments on this issue, it is unclear to me what actually needs to be said in the docs here, and even in which doc this issue would be most appropriately discussed. Can someone please advise me here?

glassfishrobot commented 13 years ago

@glassfishrobot Commented mk111283 said: 3_1_1-scrubbed

glassfishrobot commented 12 years ago

@glassfishrobot Commented mk111283 said: specify relaxCacheVersionSemantics in your ctestservlet.war

<?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE sun-web-app PUBLIC "-//Sun Microsystems, Inc.//DTD Application Server 9.0 Servlet 2.5//EN" "http://www.sun.com/software/appserver/dtds/sun-web-app_2_5-0.dtd">

/ctestservlet Keep a copy of the generated servlet class java code.
glassfishrobot commented 12 years ago

@glassfishrobot Commented mk111283 said: Tested with the attached ctestservlet app after modifying the sun-web.xml. Everything ran fine (and fast too). There were no session loss Fixed in shoal. Committed revision 1732

glassfishrobot commented 12 years ago

@glassfishrobot Commented jjackb said: Tested on Mac OS X Lion

java version "1.6.0_29" Java(TM) SE Runtime Environment (build 1.6.0_29-b11-402-11M3527) Java HotSpot(TM) 64-Bit Server VM (build 20.4-b02-402, mixed mode)

with promoted build: glassfish-3.1.2-b17.zip

There is no session loss with relaxCacheVersionSemantics-true during reloading ctestservlet.war in a local two instance cluster.

BUT:

Did you try to failover to another instance? On my setup, session will be/is lost.

What I did:

With gf 3.1.1, this emulated failover by changing Ports works well without session lost.

Is this a newly introduced bug or intended changeset? Please clarify. Thanks.

glassfishrobot commented 13 years ago

@glassfishrobot Commented File: ctestservlet.war Attached By: jjackb

glassfishrobot commented 13 years ago

@glassfishrobot Commented File: GF21_SimpleSession.m4v Attached By: jjackb

glassfishrobot commented 13 years ago

@glassfishrobot Commented File: gf31_session_issue.m4v Attached By: jjackb

glassfishrobot commented 13 years ago

@glassfishrobot Commented File: GF31_SimpleSession.m4v Attached By: jjackb

glassfishrobot commented 13 years ago

@glassfishrobot Commented File: [instance 01 log.txt](http://download.oracle.com/glassfish/issue_attachment/27606/instance 01 log.txt) Attached By: jjackb

glassfishrobot commented 13 years ago

@glassfishrobot Commented File: [instance 02 log.txt](http://download.oracle.com/glassfish/issue_attachment/27607/instance 02 log.txt) Attached By: jjackb

glassfishrobot commented 13 years ago

@glassfishrobot Commented File: shoal-cache.jar Attached By: mk111283

glassfishrobot commented 13 years ago

@glassfishrobot Commented File: shoal-cache.jar Attached By: mk111283

glassfishrobot commented 13 years ago

@glassfishrobot Commented File: SimpleSession.war Attached By: jjackb

glassfishrobot commented 13 years ago

@glassfishrobot Commented Was assigned to mk111283