[JRockit-intermittent] ClassCastException observed in instance logs when running lb tests (com.sun.grizzly.config.ContextRootInfo cannot be cast to org.apache.catalina.Context)

glassfishrobot commented 12 years ago

GF build 14 Setup: Cluster with 10 instances Platform: OEL6 with JRockit Has this passed before: OEL + JRockit combination was last used in GF 3.1 where the LB tests did pass and the ClassCastExceptions were not seen.

When running the HA - LB tests we found that some of the tests failed with the response - "400 Bad Request". All of the tests that failed were found to have one of the instance's containing the below ClassCastException. The exception does not seem to appear on the passed tests (checked on a few of the passed tests). No other information could be obtained from the logs to indicate any other cause for the response. This is still only a guess that the below exception was causing the bad response.

The exception appears about 10 to 20 seconds after the app is successfully deployed. It appeared only on instance107 (see attached logs from one of the failed tests - 10 instance set up). Now, curiously, mostly of the tests that failed showed successful requests served by instance 102, 103, 104 & 105, and then would throw the bad response when LB was to forward the subsequent request to instance106.

[#|2011-12-17T02:45:48.500-0800|SEVERE|glassfish3.1.2|org.apache.catalina.connector.CoyoteAdapter|_ThreadID=56;_ThreadName=Thread-2;|PWC3989: An exception or error occurred in the container during the request processing java.lang.ClassCastException: com.sun.grizzly.config.ContextRootInfo cannot be cast to org.apache.catalina.Context at org.apache.catalina.connector.CoyoteAdapter.postParseRequest(CoyoteAdapter.java:534) at org.apache.catalina.connector.CoyoteAdapter.doService(CoyoteAdapter.java:271) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:231) at com.sun.enterprise.v3.services.impl.ContainerMapper.service(ContainerMapper.java:174) at com.sun.grizzly.http.ProcessorTask.invokeAdapter(ProcessorTask.java:849) at com.sun.grizzly.http.ProcessorTask.doProcess(ProcessorTask.java:746) at com.sun.grizzly.http.ProcessorTask.process(ProcessorTask.java:1045) at com.sun.grizzly.http.DefaultProtocolFilter.execute(DefaultProtocolFilter.java:228) at com.sun.grizzly.DefaultProtocolChain.executeProtocolFilter(DefaultProtocolChain.java:137) at com.sun.grizzly.DefaultProtocolChain.execute(DefaultProtocolChain.java:104) at com.sun.grizzly.DefaultProtocolChain.execute(DefaultProtocolChain.java:90) at com.sun.grizzly.http.HttpProtocolChain.execute(HttpProtocolChain.java:79) at com.sun.grizzly.ProtocolChainContextTask.doCall(ProtocolChainContextTask.java:54) at com.sun.grizzly.SelectionKeyContextTask.call(SelectionKeyContextTask.java:59) at com.sun.grizzly.ContextTask.run(ContextTask.java:71) at com.sun.grizzly.util.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:532) at com.sun.grizzly.util.AbstractThreadPool$Worker.run(AbstractThreadPool.java:513) at java.lang.Thread.run(Thread.java:662)

| #] |

Please send a mail if you like to view logs from other failed tests.

Affected Versions

[3.1.2_dev]

glassfishrobot commented 6 years ago

Issue Imported From: https://github.com/javaee/glassfish/issues/18060
Original Issue Raised By:@glassfishrobot
Original Issue Assigned To: @yaminikb
Original Issue Closed By:@glassfishrobot

glassfishrobot commented 12 years ago

@glassfishrobot Commented @shingwaichan said: The corresponding code in CoyoteAdapter is as follows:

Object context = request.getMappingData().context;
        if (context instanceof ContextRootInfo) {
            // this block of code will be invoked when an AJP request is intended
            // for an Adapter other than the CoyoteAdapter
            final Adapter toInvoke = ((ContextRootInfo) context).getAdapter();
            // Ensure the Adapter isn't the ContainerMapper.  It could be there
            // is only one container/adapter currently active.  If this is the
            // case, it could cause recursion and blow the stack.
            if (!"com.sun.enterprise.v3.services.impl.ContainerMapper".equals(toInvoke.getClass().getName())) {
toInvoke.service(req, res);
toInvoke.afterService(req, res);
return false;
            }    
        }

        Context ctx = (Context) context;

This is introduced in rev 40454 for fixing http://java.net/jira/browse/GLASSFISH-12458 ("EJB Webservices not available via mod_jk listener").

glassfishrobot commented 12 years ago

@glassfishrobot Commented @rlubke said:

What is the impact on the customer of the bug? Potential issue with load balancing.

How likely is it that a customer will see the bug and how serious is the bug? Is it a regression? Does it meet other bug fix criteria (security, performance, etc.)? Given the test scenario, most likely will be seen.

What is the cost/risk of fixing the bug? Cost is low. The code in question may be removed now that grizzly provides AJP services. Hack is no longer needed.

How risky is the fix? How much work is the fix? Is the fix complicated? Low. Revert the code back to the state before the change was introduced. Not complex - it's a single block.

Is there an impact on documentation or message strings? No.
Which tests should QA (re)run to verify the fix did not destabilize GlassFish? The tests that found the issue would be fine.
Which is the targeted build of 3.1.2 for this fix? 17

glassfishrobot commented 12 years ago

@glassfishrobot Commented @rlubke said: Changes applied (3.1.2: 51860; 4.0: 51862).

glassfishrobot commented 12 years ago

@glassfishrobot Commented varunrupela said: Issue exists with build 17. Tests were run on OEL6 with JRockit.

The stack trace has one difference - the line of code from which the exception is thrown:

[#|2012-01-11T09:49:24.039-0800|SEVERE|glassfish3.1.2|org.apache.catalina.connector.CoyoteAdapter|_ThreadID=70;_ThreadName=Thread-2;|PWC3989: An exception or error occurred in the container during the request processing java.lang.ClassCastException: com.sun.grizzly.config.ContextRootInfo at org.apache.catalina.connector.CoyoteAdapter.postParseRequest(CoyoteAdapter.java:519) at org.apache.catalina.connector.CoyoteAdapter.doService(CoyoteAdapter.java:271) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:231) at com.sun.enterprise.v3.services.impl.ContainerMapper.service(ContainerMapper.java:174) at com.sun.grizzly.http.ProcessorTask.invokeAdapter(ProcessorTask.java:849) at com.sun.grizzly.http.ProcessorTask.doProcess(ProcessorTask.java:747) at com.sun.grizzly.http.ProcessorTask.process(ProcessorTask.java:1046) at com.sun.grizzly.http.DefaultProtocolFilter.execute(DefaultProtocolFilter.java:228) at com.sun.grizzly.DefaultProtocolChain.executeProtocolFilter(DefaultProtocolChain.java:137) at com.sun.grizzly.DefaultProtocolChain.execute(DefaultProtocolChain.java:105) at com.sun.grizzly.DefaultProtocolChain.execute(DefaultProtocolChain.java:91) at com.sun.grizzly.http.HttpProtocolChain.execute(HttpProtocolChain.java:79) at com.sun.grizzly.ProtocolChainContextTask.doCall(ProtocolChainContextTask.java:56) at com.sun.grizzly.SelectionKeyContextTask.call(SelectionKeyContextTask.java:59) at com.sun.grizzly.ContextTask.run(ContextTask.java:71) at com.sun.grizzly.util.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:532) at com.sun.grizzly.util.AbstractThreadPool$Worker.run(AbstractThreadPool.java:513) at java.lang.Thread.run(Thread.java:662)

| #] |

glassfishrobot commented 12 years ago

@glassfishrobot Commented @rlubke said: Shing-Wai, I removed the code that was added as a workaround for the old ajp listener implementation. It appears that it wasn't responsible for the issue.

Passing back to you as you're more familiar with the web container.

glassfishrobot commented 12 years ago

@glassfishrobot Commented @rlubke said: Is it possible to get the full server log for the instance that is failing? Looking through what was originally attached, the log was truncated (i.e., all activity prior to application deployment isn't present).

glassfishrobot commented 12 years ago

@glassfishrobot Commented @rlubke said: Working though the code now, however, if it is possible to provide a scaled-down reproduction, it would help us nail this down.

glassfishrobot commented 12 years ago

@glassfishrobot Commented @rlubke said: Instrumented Jars

glassfishrobot commented 12 years ago

@glassfishrobot Commented @rlubke said: I've attached instrumented jars to aid debugging in the QE environment. These JARs are specifically for b17. Please install them into the modules directory of the server(s) and run the tests again.

I'll need the full log of the server that fails the test.

glassfishrobot commented 12 years ago

@glassfishrobot Commented varunrupela said: In the attached logs, the ClassCastException appears in instances instance102 & instance109. The logs for those can be seen under 18160-logs-b17/st-cluster

glassfishrobot commented 12 years ago

@glassfishrobot Commented varunrupela said: Attached logs obtained from one of the failed test cases.

Since the issue is intermittent and we haven't been able to identify a fixed set of steps to reproduce the issue yet, we will need to re-run a number of LB tests to be able to reproduce the issue using the instrumented jars.

Will provide logs obtained while using the instrumented jars after running the tests again.

glassfishrobot commented 12 years ago

@glassfishrobot Commented @rlubke said: Yep, understand the difficulties. But if you could reduce the cluster size to say two and still reproduce it, it would be much simpler on this end to deal with.

Will keep an eye out for the instrumented logs.

glassfishrobot commented 12 years ago

@glassfishrobot Commented varunrupela said: Attaching logs from a test run that saw the ClassCastException. These logs have been generated from an instrumented build 17. Please see the logs for instances: instance102, instance105 or instance107 under

Ryan: Do we still need logs from a 2 instance cluster ?

glassfishrobot commented 12 years ago

@glassfishrobot Commented varunrupela said: Contrary to what was reported in the description of this bug, this exception has now been observed to appear on some of the test that passed. QE needs to further analyze the tests to check if the failed tests have a different or additional cause. Its also possible that the tests that passed did not get affected by the exception due to the nature of the scenarios contained in those tests.

QE will continue to analyze tests that show this exception.

Ryan: Could you help explain the effect that this exception may have on new/subsequent requests that land on an affected instance ? Thanks.

glassfishrobot commented 12 years ago

@glassfishrobot Commented @rlubke said: No need for logs for a two instance cluster. I was more curious if the issue could be resolved in such a scenario. If it was the case, then it would be likely I could repeat on my machine.

I'll review the logs and follow up with the question regarding the impact of the CCE.

glassfishrobot commented 12 years ago

@glassfishrobot Commented @rlubke said: Ok, in the cast of the CCE, the response will be a 200 with no content. So the CCE isn't causing the 400.

glassfishrobot commented 12 years ago

@glassfishrobot Commented @rlubke said: So looking through the code, there are several cases where a 400 may be send as a response to a request.

if the request is HTTP/1.1 but there is no Host header.
invalid characters in the host header value
if the runtime is unable to parse the request URI
if the header is so large that it overflows the buffer (in this case, severe messages are written to the log, so this isn't the case here).

It would be useful to use something like ngrep (or whatever your tool of choice is) to capture the traffic between the loadbalancer and the cluster instances so we can see the actual request content that is triggering the issue.

glassfishrobot commented 12 years ago

@glassfishrobot Commented @rlubke said: Added further instrumentation to web-core.jar. Please run the tests with the new jar.

Thanks, -rl

glassfishrobot commented 12 years ago

@glassfishrobot Commented varunrupela said: Thanks for the details Ryan.

Working on this now. Will update the bug once we have reproduced the issue with the new jar. Will also try and capture more information regarding the 400 Bad Request and file a separate bug.

glassfishrobot commented 12 years ago

@glassfishrobot Commented @rlubke said: Please use this web-core.jar instead.

glassfishrobot commented 12 years ago

@glassfishrobot Commented varunrupela said: Attaching logs with the latest instrumented jars. Please see st-cluster/instance101/logs/server.log for the ClassCastException.

Also, updated Summary as the 400 Bad Request does not seem to appear due to the ClassCastException. Will investigate more and file a separate bug for that. The ClassCastException appears in the logs of tests that have passed and 400 Bad Request seems to show up after the setup has been up and running for some hours and then the lb tests are run.

glassfishrobot commented 12 years ago

@glassfishrobot Commented @rlubke said: Ok, narrowed further. Further instrumentation to follow. Thanks for the help!

glassfishrobot commented 12 years ago

@glassfishrobot Commented @rlubke said: Ok, there's a total of three jars now. All go in the modules directory.

For the server that exhibits the issue, I will need the full log (including startup).

glassfishrobot commented 12 years ago

@glassfishrobot Commented varunrupela said: Attached instance startup logs and those of a test (CCE appears in the logs of instance102 and instance107)

glassfishrobot commented 12 years ago

@glassfishrobot Commented varunrupela said: The CCE reported in issue 17006 is similar to what is seen in 18060.

glassfishrobot commented 12 years ago

@glassfishrobot Commented @rlubke said: Would it be possible to include the log segment where deployment of the test application occurs on instance 102?

glassfishrobot commented 12 years ago

@glassfishrobot Commented varunrupela said: Instance102's logs are at test-logs/st-cluster/instance102/logs/server.log

The client logs in ant.output show the following deploy message at time 1:15:32: [testng] Jan 19, 2012 1:15:32 AM com.sun.dft.glassfish.utils.Utility logCommandOutput [testng] INFO: Command Executed at agent machine agent1: /space/gf-ha/glassfish3/bin/asadmin --user admin deploy --availabilityenabled=true --force=true --target st-cluster /space/gf-ha/agent-repository/glassfish-samples/clusterjsp.ear [testng] Output : Application deployed with name clusterjsp. [testng] Command deploy executed successfully.

Ryan: Can you help provide some details on the information we are trying to gather through the instrumented code ? Would it help if the setup is made available ?

glassfishrobot commented 12 years ago

@glassfishrobot Commented @rlubke said: Trying to trace the creation of particular mapper instances (at which point in time, etc) to try to narrow down why/when the incorrect mapper is used.

If you can give me access to the setup, that would be useful, as well as details on how to reproduce.

glassfishrobot commented 12 years ago

@glassfishrobot Commented @rlubke said: If it would be possible to gain access to the setup over the weekend, that would be great.

glassfishrobot commented 12 years ago

@glassfishrobot Commented varunrupela said: Hi Ryan:

I just sent a mail to you with details of the setup and how to reproduce the issue.

good luck !!

eclipse-ee4j / glassfish

[JRockit-intermittent] ClassCastException observed in instance logs when running lb tests (com.sun.grizzly.config.ContextRootInfo cannot be cast to org.apache.catalina.Context) #18060

Affected Versions