RestComm / jain-sip.ha

JAIN-SIP-HA : Provides extensions done by TeleStax for high availability and fault tolerance through replication of various states of the stack. It supports Call Established Failover or Early Dialog Failover.
http://www.restcomm.com/
GNU Affero General Public License v3.0
8 stars 21 forks source link

Retransmitted OK responses generate responseEvent objects with null dialog on the server causing IllegalStateExceptions #15

Open posfaig opened 8 years ago

posfaig commented 8 years ago

When running the custom-b2bua test of Sip Servlets with enough number of calls (~50 on my machine), sometimes IllegalStateExceptions occur in the logs, and subsequent requests are not forwarded by the server in certain calls.

Based on my investigation the following happens:

If the receiver leg retransmits an OK response, and both the original and retransmitted OK responses arrive successfully to the server, then the created responseEvent object of the retransmitted OK reponse will contain null dialog value (in this method: org.mobicents.servlet.sip.core.SipApplicationDispatcherImpl.processResponse(ResponseEvent responseEvent)), which results in removing the dialog on the server and that results in the failing of forwarding subsequent requests within that dialog. In particular, IllegalStateExceptions start to occur in the logs. (Like e.g.: "java.lang.IllegalStateException: Couldn't create the subsequent request INVITE for this session...")

I came across the problem first when testing in HA mode, but with a single node and with load balancer. This test is referred to as 01. I also reproduced the error with the simple JAIN SIP (not HA, that is, in the standalone xml I set the following attribute for the sip-servlets subsystem: path-name="org.mobicents.ha.balancing.only"), but still with using the load balancer. This test is referred to as 02. Then I also reproduced with no HA functionality involved (path-name="org.mobicents.ext") and without using the LB. This test is referred to as 03.

Logs are too big to include here, so they can be downloaded from here: https://drive.google.com/open?id=0B9PoF7Up2ChSWTFwc204MXdGTzQ

Note: I added a few log messages on my local environment for debugging that are not present in any committed version. (E.g. simple java.lang.Exceptions with stack traces etc.)

( Tip: If you want to look for the problematic responses in the node's log that have null dialog value, you can look for the following text: CCCCCCC-isdialognull=true,isretrans=false I added this message in the org.mobicents.servlet.sip.core.SipApplicationDispatcherImpl.processResponse(ResponseEvent responseEvent) method for debugging. Above these messages there is another log message for debugging in the same method that writes out the whole response message starting with: "AAAAAAA error - processResponse - responseEventExt.getResponse()=". So when looking for these kind of responses one should first look for the string "CCCCCCC-isdialognull=true,isretrans=false" and then start searching backwards from there looking for the "AAAAAAA error - processResponse - responseEventExt.getResponse()=" string )

Here is the detailed description of test 01:

Components:

sender
LB ---------server node

receiver

As can be seen in the sipp result, there is one failed call. The corresponding Call-ID is ee491fb754a85e20a544f7c57e3be599@127.0.0.1

If you take a look at the trace with the following filter, you can see that the first OK gets retransmitted, and after the retransmission there are no subsequent OK messages for the failed call: sip && sip.Status-Code==200 && sip.Call-ID=="ee491fb754a85e20a544f7c57e3be599@127.0.0.1"

(If you want to look at the complete call from the sender to the receiver use the following filter: sip && (sip.Call-ID=="32-21924@127.0.0.1" || sip.Call-ID=="ee491fb754a85e20a544f7c57e3be599@127.0.0.1") Here you can also see that 500 Server Interval Error responses arrive for the subsequent requests.)

There are a few log messages in the SipApplicationDispatcherImpl.processResponse event that might be interesting: The first and the retransmitted OK message can be seen in the 443613th and 463960th lines, respectively.

Also, I added a log message in the same method, to see the value of a few things, which can be found for the two OK messages of the failed call in the 443720th and 464093rd lines.

It can be seen that the two OK responses are completely identical, however, in the second event, the retransmission flag is not set. Also when the first OK arrives, the dialog is not null, however, it is null in the responseEvent of the second OK message.

It is also interesting that there are other successful calls for which the first OK is retransmitted, like e.g. Call-ID: a72849d8b356b632fef8cc95d8e01a65@127.0.0.1 (Wireshark filter: sip && sip.Status-Code==200 && sip.Call-ID=="a72849d8b356b632fef8cc95d8e01a65@127.0.0.1") However, for these calls the first OK does not appear in the log, only the retransmitted does. That is, if I look for the OK messages with the Call-ID above, I only find 3 messages in the log (lines: 385981, 859175, 1148419), instead of 4 as it is seen in the trace. And the missing OK message is the first one (not the retransmitted.)