eclipse-ee4j / glassfish

Eclipse GlassFish
https://eclipse-ee4j.github.io/glassfish/
383 stars 143 forks source link

<BT6597311>AS91EE:Start-instance faied because of connection to mq broker times out #3662

Closed glassfishrobot closed 12 years ago

glassfishrobot commented 17 years ago

*READ-ONLY Data from Bugtraq***** Inside SWAN :http://swsblweb1.central.sun.com:8080/CrPrint?id=6597311 Outside SWAN :http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6597311 ****READ-ONLY Data from Bugtraq Ends** *READ-ONLY Data from Bugtraq***** Description After a clean shutdownstartup of a remote instance failed Build : as91EE rc3 build Attached instance logs and broker log

****READ-ONLY Data from Bugtraq Ends** *READ-ONLY Data from Bugtraq***** Evaluation The appserver log tries to connect to broker at 05:27:21 [#|2007-08-24T05:27:21.656-0700|WARNING|sun-appserver9.1|javax.jms|_ThreadID=10;_ThreadName=main;_RequestID=09f1c981-1253-451d-b5d1-624f93bb6168;|[C4003]: Error occurred on connection creation [localhost:37677]. - cause: java.net.ConnectException: Connection refused|#]

The broker starts up at 05:27:29 [24/Aug/2007:05:27:29 PDT] WARNING [B1262]: Restricted JMS service until communication with master broker is established.

After this the ra does not throw java.net.ConnectException: Connection refused but throws an exception while reading from the imqChannel

[#|2007-08-24T05:27:29.249-0700|WARNING|sun-appserver9.1|javax.jms|_ThreadID=15;_ThreadName=iMQReadChannel-7;_RequestID=b6d86803-9165-43d1-b27f-97be72085a8c;|[I500]: Caught JVM Exception: java.io.EOFException|#]

But the MQ logs does not indicate that an Admin connection has been accepted. Usually when an admin connection is established with the broker there is a log message like this, "[B1065]: Accepting: admin@127.0.0.1:58292->admin:58271. Count: service=1 broker=1" But this is never printed in the MQ log file, I think, though the broker log file shows that the broker portmapper has started, it could still be in a position where it is unable to accept connections.

The best way to ascertain this would be to connect to the broker during this phase through a jms client and check the response, but QE does not have this setup and the issue is not reproducible consistently, and there is no obvious info in the log which indicates a problem., it is quite difficult to debug the problem right now with no setup and with the existing log file alone.

Creating an MQ CR 6597988 for this, Requesting MQ team to evaluate this, and give their inputs on the state of the broker at the time when the ra tries to connect to it.

****READ-ONLY Data from Bugtraq Ends** *READ-ONLY Data from Bugtraq***** Evaluation From the log files - it appears that the broker may not be completely up at the point the error occured. The "Restricted JMS" service happens at the same second as the client side exception occured. My guess is that the admin connection failed because the jms service was not yet active (e.g. the connection occured a couple of milliseconds before the service went active) From looking @ the other entries in the broker log -

This means that the admin connection is created immediately after the broker is started and anything that causes the broker to take slightly longer to start will cause a failure.

If possible, the admin connection should either delay a second or so or retry once in the case of a failure.

We can not verify that this is the cause without some sort of reproducable test case.


1. This is about AS instance starting up a 'LOCAL' broker (not remote) 2. 'Restricted JMS service' does not affect admin connection

Since the problem occurred during AS starting up a local broker, unlikely the 'EOFException's were resulted from AS/RA operated any restricted operations on a NORMAL connection (#3).

In an isolated test using MQ/RA LocalBrokerRunner.start()/stop() which repeated start/stop a 9-broker cluster didn't reproduce the EOFException problem in more than 50 iterations.

Run a simple test program (on iasqalx4) that start/stop 1 broker instance repeatly (20 sec sleep in between) for ~2 days and reproduced the problem. The test program (SimpleAdmin.java, attached) uses same code to start/stop broker as MQ/RA LocalBrokerRunner.start()/stop().

****READ-ONLY Data from Bugtraq Ends** *READ-ONLY Data from Bugtraq***** Evaluation To help debugging the issue, please configure appserver and broker with the following configurations and re-run the test. Thanks. 1. Configure app server to include the following JVM properties in app server's domain.xml file:

-Dimq.debug -Dimq.packetType=10|11

The above options should be added to the app server that failed to re-start MQ RA.

2. Please add the following imqbrokerd args options (not JVN options) when starting up MQ broker:

-debug pkt

The args options should be added in the /imq/bin/imqbrokerd script.

For example, in imqbrokerd script, append -debug pkt at the following statement:

  1. The command we use to run the broker _brokercmd="$imq_javahome/bin/java $imq_servervm -cp $_classpath $_jvm_args $_ma inclass $_args -debug pkt"

3. Please increase the jms-service start timeout to 5 minutes (default 1 minute):

See http://docs.sun.com/source/819-0079/dgjms.html#wp19009

use asadmin set/get to set the start-timeout

e.g.

./asadmin get server.jms-service.init-timeout-in-seconds

will show its current value ( = 60 )

./asadmin set server.jms-service.init-timeout-in-seconds=300

to set it to 5 minutes

****READ-ONLY Data from Bugtraq Ends** *READ-ONLY Data from Bugtraq***** Evaluation Vivek reported the problem occured today. The broker instance that couldn't connectto is running on iasqalx4. Once the EOFException occurred when RA making the ADMIN connection to the broker, the problem will persist until broker shutdown or restarts.

The EOFException occurred while client runtime read fixed header of hello-reply packet. The EOFException only occurs when making ADMIN connection to the broker on iasqalx4 where the broker is running and only occurs when using 'localhost' to connect to the broker, imqBrokerHostName=localhost

imqcmd can connect to the broker instance if specify the real hostname (iasqalx4) or imqcmd run on a different machine (using same version of Java 1.5.0_12 on a similar SunOS 5.10 x86 machine).

see detailed information in Comments.

This looks like a problem below 'application' layer - e.g. TCP stack or OS. I'v suggested to check the patches on the machine first. Please try the workaround by using the real hostname instead of 'localhost' when connecting to broker.

****READ-ONLY Data from Bugtraq Ends** *READ-ONLY Data from Bugtraq***** Evaluation The simple test program (SimpleAdmin.java attached) that start/stop 1 broker instancerepeatly (25sec sleep in between) ran on iasqalx4 2 times both reproduced the problem (in ~2 days and in ~1 day respectively); The same program ran on a similar system mqlx4 for ~5 days without problem (use same Java version jdk1.5.0_12).

Broker debug shows that for the EOFException, ServerSocket.accept didn't even has a return. The problem lines under broker layer on iasqalx4 specifically. Please check iasqalx4 patches and also comparing with mqlx4 (in same domain as iasqalx4).

****READ-ONLY Data from Bugtraq Ends** *READ-ONLY Data from Bugtraq***** Justification Priority changed from [] to [3-Medium]instance startup XXXXXX 2007-08-26 00:34:43 GMT

Priority changed from [3-Medium] to [4-Low] Discussed in the 4:00 pm HA meeting and it was agreed that this can be lowered to a P4 since it is very hard to reproduce. Will be investigated for the update XXXXXX 2007-08-30 19:04:43 GMT

****READ-ONLY Data from Bugtraq Ends**

Environment

Operating System: All Platform: All

Affected Versions

[9.1peur1]

glassfishrobot commented 6 years ago
glassfishrobot commented 17 years ago

@glassfishrobot Commented gfbugbridge said: *READ-ONLY Data from Bugtraq***** Inside SWAN :http://swsblweb1.central.sun.com:8080/CrPrint?id=6597311 Outside SWAN :http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6597311 ****READ-ONLY Data from Bugtraq Ends** *READ-ONLY Data from Bugtraq***** Description After a clean shutdownstartup of a remote instance failed Build : as91EE rc3 build Attached instance logs and broker log

****READ-ONLY Data from Bugtraq Ends** *READ-ONLY Data from Bugtraq***** Evaluation The appserver log tries to connect to broker at 05:27:21 [#|2007-08-24T05:27:21.656-0700|WARNING|sun-appserver9.1|javax.jms|_ThreadID=10;_ThreadName=main;_RequestID=09f1c981-1253-451d-b5d1-624f93bb6168;|[C4003]: Error occurred on connection creation [localhost:37677]. - cause: java.net.ConnectException: Connection refused|#]

The broker starts up at 05:27:29 [24/Aug/2007:05:27:29 PDT] WARNING [B1262]: Restricted JMS service until communication with master broker is established.

After this the ra does not throw java.net.ConnectException: Connection refused but throws an exception while reading from the imqChannel

[#|2007-08-24T05:27:29.249-0700|WARNING|sun-appserver9.1|javax.jms|_ThreadID=15;_ThreadName=iMQReadChannel-7;_RequestID=b6d86803-9165-43d1-b27f-97be72085a8c;|[I500]: Caught JVM Exception: java.io.EOFException|#]

But the MQ logs does not indicate that an Admin connection has been accepted. Usually when an admin connection is established with the broker there is a log message like this, "[B1065]: Accepting: admin@127.0.0.1:58292->admin:58271. Count: service=1 broker=1" But this is never printed in the MQ log file, I think, though the broker log file shows that the broker portmapper has started, it could still be in a position where it is unable to accept connections.

The best way to ascertain this would be to connect to the broker during this phase through a jms client and check the response, but QE does not have this setup and the issue is not reproducible consistently, and there is no obvious info in the log which indicates a problem., it is quite difficult to debug the problem right now with no setup and with the existing log file alone.

Creating an MQ CR 6597988 for this, Requesting MQ team to evaluate this, and give their inputs on the state of the broker at the time when the ra tries to connect to it.

****READ-ONLY Data from Bugtraq Ends** *READ-ONLY Data from Bugtraq***** Evaluation From the log files - it appears that the broker may not be completely up at the point the error occured. The "Restricted JMS" service happens at the same second as the client side exception occured. My guess is that the admin connection failed because the jms service was not yet active (e.g. the connection occured a couple of milliseconds before the service went active) From looking @ the other entries in the broker log -

This means that the admin connection is created immediately after the broker is started and anything that causes the broker to take slightly longer to start will cause a failure.

If possible, the admin connection should either delay a second or so or retry once in the case of a failure.

We can not verify that this is the cause without some sort of reproducable test case.


1. This is about AS instance starting up a 'LOCAL' broker (not remote) 2. 'Restricted JMS service' does not affect admin connection

Since the problem occurred during AS starting up a local broker, unlikely the 'EOFException's were resulted from AS/RA operated any restricted operations on a NORMAL connection (#3).

In an isolated test using MQ/RA LocalBrokerRunner.start()/stop() which repeated start/stop a 9-broker cluster didn't reproduce the EOFException problem in more than 50 iterations.

Run a simple test program (on iasqalx4) that start/stop 1 broker instance repeatly (20 sec sleep in between) for ~2 days and reproduced the problem. The test program (SimpleAdmin.java, attached) uses same code to start/stop broker as MQ/RA LocalBrokerRunner.start()/stop().

****READ-ONLY Data from Bugtraq Ends** *READ-ONLY Data from Bugtraq***** Evaluation To help debugging the issue, please configure appserver and broker with the following configurations and re-run the test. Thanks. 1. Configure app server to include the following JVM properties in app server's domain.xml file:

-Dimq.debug -Dimq.packetType=10|11

The above options should be added to the app server that failed to re-start MQ RA.

2. Please add the following imqbrokerd args options (not JVN options) when starting up MQ broker:

-debug pkt

The args options should be added in the /imq/bin/imqbrokerd script.

For example, in imqbrokerd script, append -debug pkt at the following statement:

  1. The command we use to run the broker _brokercmd="$imq_javahome/bin/java $imq_servervm -cp $_classpath $_jvm_args $_ma inclass $_args -debug pkt"

3. Please increase the jms-service start timeout to 5 minutes (default 1 minute):

See http://docs.sun.com/source/819-0079/dgjms.html#wp19009

use asadmin set/get to set the start-timeout

e.g.

./asadmin get server.jms-service.init-timeout-in-seconds

will show its current value ( = 60 )

./asadmin set server.jms-service.init-timeout-in-seconds=300

to set it to 5 minutes

****READ-ONLY Data from Bugtraq Ends** *READ-ONLY Data from Bugtraq***** Evaluation Vivek reported the problem occured today. The broker instance that couldn't connectto is running on iasqalx4. Once the EOFException occurred when RA making the ADMIN connection to the broker, the problem will persist until broker shutdown or restarts.

The EOFException occurred while client runtime read fixed header of hello-reply packet. The EOFException only occurs when making ADMIN connection to the broker on iasqalx4 where the broker is running and only occurs when using 'localhost' to connect to the broker, imqBrokerHostName=localhost

imqcmd can connect to the broker instance if specify the real hostname (iasqalx4) or imqcmd run on a different machine (using same version of Java 1.5.0_12 on a similar SunOS 5.10 x86 machine).

see detailed information in Comments.

This looks like a problem below 'application' layer - e.g. TCP stack or OS. I'v suggested to check the patches on the machine first. Please try the workaround by using the real hostname instead of 'localhost' when connecting to broker.

****READ-ONLY Data from Bugtraq Ends** *READ-ONLY Data from Bugtraq***** Evaluation The simple test program (SimpleAdmin.java attached) that start/stop 1 broker instancerepeatly (25sec sleep in between) ran on iasqalx4 2 times both reproduced the problem (in ~2 days and in ~1 day respectively); The same program ran on a similar system mqlx4 for ~5 days without problem (use same Java version jdk1.5.0_12).

Broker debug shows that for the EOFException, ServerSocket.accept didn't even has a return. The problem lines under broker layer on iasqalx4 specifically. Please check iasqalx4 patches and also comparing with mqlx4 (in same domain as iasqalx4).

****READ-ONLY Data from Bugtraq Ends** *READ-ONLY Data from Bugtraq***** Justification Priority changed from [] to [3-Medium]instance startup XXXXXX 2007-08-26 00:34:43 GMT

Priority changed from [3-Medium] to [4-Low] Discussed in the 4:00 pm HA meeting and it was agreed that this can be lowered to a P4 since it is very hard to reproduce. Will be investigated for the update XXXXXX 2007-08-30 19:04:43 GMT

****READ-ONLY Data from Bugtraq Ends**

glassfishrobot commented 17 years ago

@glassfishrobot Commented @sivakumart said: Assigning to Ramesh as the source issue is assigned to him

glassfishrobot commented 16 years ago

@glassfishrobot Commented rampsarathy said: This is an MQ bug and no work required from glassfish, Assigning to Satish to follow-up and close

glassfishrobot commented 12 years ago

@glassfishrobot Commented tmueller said: Bulk update to change fix version to "not determined" for all issues still open but with a fix version for a released version.

glassfishrobot commented 12 years ago

@glassfishrobot Commented liang.x.zhao said: I can not reproduce it against GlassFish 3.1.2.

glassfishrobot commented 17 years ago

@glassfishrobot Commented Was assigned to liang.x.zhao

glassfishrobot commented 7 years ago

@glassfishrobot Commented This issue was imported from java.net JIRA GLASSFISH-3662

glassfishrobot commented 17 years ago

@glassfishrobot Commented Reported by gfbugbridge

glassfishrobot commented 12 years ago

@glassfishrobot Commented Marked as fixed on Wednesday, March 7th 2012, 11:07:46 pm