RestComm / ussdgateway

RestComm USSD Gateway
http://www.restcomm.com/
GNU Affero General Public License v3.0
90 stars 99 forks source link

When user abort ussd session then http connection get stuck #28

Closed Rouanvdw closed 7 years ago

Rouanvdw commented 8 years ago

When the user abort the ussd session it seems like the connection to http RA stays open and this counts as a concurrent connection. I have set the , MAX_CONNECTIONS_FOR_ROUTES to 500 and it fills up in 2 hours. Then any ussd short codes that use same url does not work until I restart the ussd gateway.

vetss commented 8 years ago

Hello @Rouanvdw

please provide HTTP payload by which "user abort the ussd session"

Rouanvdw commented 8 years ago

Hello Sergey,

So what I mean by "user abort the ussd session" when the user cancel's a ussd session the ussd gw send an abort message to the application routed to. I then on my side notice it is and abort and invalidate the http session as done in the example http servlet.

So the http payload is what the ussd gateway generates and sends.

So what I did to solves this was to use the MAX_CONNECTIONS_FOR_ROUTES as a Total_connection_for_route. Meaning I set the value to some big value like 500000. This is obviously not correct and how it should be used. That is why I think the connections get stuck because if the user uses the app and goes to a menu that generate a close dialog and invalidates the http session.

vetss commented 8 years ago

Hello @Rouanvdw

ok, I understand now that a mobile subscribe breakes a TCAP Dialog by sending on TC-ABORT (I guess it us TC-USER-ABORT, correct) message and you use only USSD PULL mode.

Which USSD GW version do you use ? How does it looks like from your USSD application that connects to USSD GW. Does your USSD application also have a lot of opened connections ?

Rouanvdw commented 8 years ago

Hello Sergey,

I download the version that was build on cloudbees. Build #32

https://mobicents.ci.cloudbees.com/job/RestComm-USSD-Gateway/lastSuccessfulBuild/artifact/

The USSD application that the GW connects to is a servlet very much in the same manner as the example HTTP servlet. See attached code.

I am running the USSD application in glashfish. I will have to have a look if I can see how many connection are open.

ussd_application.txt

Rouanvdw commented 8 years ago

Hi @vetss, I have updated the code in ussd application. see attached code.

When I load test using https://github.com/RestComm/jss7/tree/master/map/load everything runs smooth on my vm(which is my dev-enviroment) I am able to handle up to 200 dialogs-per-sec(just because my laptop can't crank it up more.

On our production enviroment that is connected to an actual ss7 network. It struggles to handle 8 dialogs-per-sec and after about 20min start throwing the the following error. java.net.SocketException: Too many open files at java.net.PlainSocketImpl.socketAccept(Native Method) at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398) at java.net.ServerSocket.implAccept(ServerSocket.java:530) at java.net.ServerSocket.accept(ServerSocket.java:498) at org.apache.tomcat.util.net.DefaultServerSocketFactory.acceptSocket(DefaultServerSocketFactory.java:61) at org.apache.tomcat.util.net.JIoEndpoint$Acceptor.run(JIoEndpoint.java:310) at java.lang.Thread.run(Thread.java:745)

The difference between my dev and prod enviroments are Ubunut vs CENTOS

AS Mtp3UserPart_m3ua1.xml DEV/PROD AS functionality - SGW/IPSP exchangeType - SE/DE ipspType - CLIENT/SERVER

I have also attached a image of the server traffic per second. This is not very busy but I have noticed all the failed dialogs are when users-abort. Could it be keeping connections open?

image

ussd_applicatoin_v2.txt

vetss commented 8 years ago

Hello @Rouanvdw

thanks for your updates.

Sorry I confused of the message flow that you use. I was sure that you use USSD PULL, but I can see that you report of "at java.net.ServerSocket.accept(ServerSocket.java:498)" that more looks like that this a server role that means PUSH scenario. Please explain do you use PULL (a mobile subscriber initiate a MAP dialog) or PUSH (USSD GW initiates a MAP dialog) ?

"When the user abort the ussd session it seems like the connection to http RA stays open" - why do you mean that aborts from mobile subscribers make "connection to http RA stays open". I mean may be normal dialog closing may also make such mulfunction.

Rouanvdw commented 8 years ago

Hi @vetss

Thanks for coming back to me. The user initiate the ussd session. The user then receives a menu with options "1.Airtime,2.Balance. Where the dialog was return to the customer as messageType continue. The user then aborts the sesssion by not responding and pressing exit/cancel. The ussd gateway then send a "Abort" message at which time the following message is displayed. " Wanted to fire ResponseEvent for HttpClientActivity b326a0be-f4f7-48e2-9251-70e826effe4b but activity is already ended, droping event"

The reason I said it seems the http connection stay open after this is because I had to increate my MAX_CONNECTION_PER_ROUTE when the users started aborting the sessions. And as far as I know the MAX_CONNECTION_PER_ROUTE is used the limit the max concurrrent connection on a route and it thus led me to believe that the abort is causing the sessions to remain

Rouanvdw commented 8 years ago

Hi @vetss

I have run netstat -n on the server and noticed alot of connections in the state CLOSE_WAIT. Have you experienced this issue before?

At the moment we are receiving the following error.

2016-06-19 13:30:26,811 ERROR javax.slee.RAEntityNotification[entity=HttpClientResourceAdaptor].HttpClientResourceAdaptor executeMethod failed in AsyncExecuteHttpMethodHandler with IOException java.net.NoRouteToHostException: Cannot assign requested address at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:579) at org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:127) at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:180) at org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:151) at org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:125) at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:640) at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:479) at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906) at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805) at org.mobicents.client.slee.resource.http.HttpClientResourceAdaptor$AsyncExecuteMethodHandler.run(HttpClientResourceAdaptor.java:658) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2016-06-19 13:30:26,812 ERROR javax.slee.SbbNotification[service=ServiceID[name=mobicents-ussdgateway,vendor=org.mobicents,version=1.0],sbb=SbbID[name=HttpClientSbb,vendor=org.mobicents,version=1.0]].USSD-Child-org.mobicents.ussdgateway.slee.http.HttpClientSbbImpl Error while processing RESPONSE event java.lang.Exception: Exception received for HTTP request sent. See traces above at org.mobicents.ussdgateway.slee.http.HttpClientSbb.onResponseEvent(HttpClientSbb.java:94) at sun.reflect.GeneratedMethodAccessor280.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.mobicents.slee.runtime.sbbentity.SbbEntityImpl.invokeEventHandler(SbbEntityImpl.java:479) at org.mobicents.slee.runtime.eventrouter.routingtask.EventRoutingTaskImpl.routeQueuedEvent(EventRoutingTaskImpl.java:379) at org.mobicents.slee.runtime.eventrouter.routingtask.EventRoutingTaskImpl.run(EventRoutingTaskImpl.java:126) at org.mobicents.slee.runtime.eventrouter.EventRouterExecutorImpl$EventRoutingTaskStatsCollector.run(EventRouterExecutorImpl.java:72) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745)

vetss commented 8 years ago

Hello @Rouanvdw

sorry for delayed response.

"java.net.NoRouteToHostException: Cannot assign requested address" - this looks like USSD GW can not connect to some remote host to reach a USSD application. Althow you reported that mobile subscribers have responses from USSD GW... May be you can check it ?

Can you please also comment this:

"The user then aborts the sesssion by not responding and pressing exit/cancel. 
The ussd gateway then send a "Abort" message at which time the following message is displayed.
" Wanted to fire ResponseEvent for HttpClientActivity b326a0be-f4f7-48e2-9251-70e826effe4b
but activity is already ended, droping event""

May be you have some captured trace example for it. I mean payload for both SS7 and HTTP parts for the case "when the users started aborting the sessions". It is importand to understand what happens. "The ussd gateway then send a "Abort" message" - do you mean YSSD GW sends an HTTP request to USSD application ?

"the following message is displayed" - where can you find it ? If it is present in some log, can you provide the whole log message ?

"state CLOSE_WAIT" may be leaded of some allpication wrong behaviour.

"The reason I said it seems the http connection stay open after this is because I had to increate my MAX_CONNECTION_PER_ROUTE when the users started aborting the sessions." - ok, I see your point.

Rouanvdw commented 8 years ago

Hi @vetss

I have verified. This only happens when the customer aborts the USSD session from the mobile handset. Please see attached .cap file with the packet send from the USSD gw to the USSD application when the user aborts using the handset.

This causes the socket not to close completely causing the socket to go into a CLOSE_WAIT state. If I netstat -nat | grep CLOSE_WAIT on the server I notice alot of sockets in that state. Which happends when the socket is not closed correctly. This happens with our USSD application and the example servlet code.

I have been able to reproduce it on my development environment using the simulator. When the ussd dialog is opened and the example http servlet return "Hello world 1.Balance 2.Texts Remaining" If you then click on the "Close current dialog" button it closes that dialog but not the socket. If I then run the command netstat -nat | grep CLOSE_WAIT in the terminal I can see the connection is in "CLOSE_WAIT" state and never gets closed.

Have you noticed this issue before?

userabort.zip

vetss commented 8 years ago

Hello @Rouanvdw

Have you noticed this issue before?

not noticed this issue so far.

One question - have you updated some parameters in http-client-ra-DU-*****.jar ?

Rouanvdw commented 8 years ago

Hi @vetss

I had to change 2 parameter because of this issue in the http-client-ra-DU-2.8.5.jar

MAX_CONNECTIONS_FOR_ROUTES and MAX_CONNECTIONS_TOTAL

At first I did not change this. And did not notice this problem because while developing and testing aborting was not tested. When started load testing the USSD gateway we noticed the test users aborting and this leaving connecctions open.

vetss commented 8 years ago

Hello @Rouanvdw

thanks for your update and provided info. You have said that you can reproduce the behaviour with Simulator, that is good. "CLOSE_WAIT" if I understand it properly means that TCP connection that carries HTTP trafic is not closed properly for some reasons. At the provided by you pcap trace data I can see HTTP traffic by I can not see how TCP connection was estanlished and dropped. This I guess was outside the captured time. Also you said that in the normal dialog case (a mobile subscriber does not send TC-USER-ABORT but USSD GW finishes a dialog properly by sending of processUnstructuredUSSD Reqsponse).

Is it possible to simulate both cases with tcap traces capturing ?

Rouanvdw commented 8 years ago

Hello @vetss Yes, that is what I understand of the "CLOSE_WAIT" but it only happens when the simulator sends send TC-USER-ABORT.

Please see attached the TCAP file from simulator connect to user abort. userabort.zip

vetss commented 8 years ago

Hello @Rouanvdw

I am thinking of who must close TCP connection in your case (TCP client == USSD GW or TCP server == your server). From the pcap file I can see that nobody does it. The connection has a style "Connection: Keep-Alive" and it id not broken inside the session. But after an end of a session it should be broken I guess. Should it make a server ? I guess it should either break a connection or have some "Keep-Alive" timeout for breaking it in short time.

Rouanvdw commented 8 years ago

Hello @vetss Well at the moment when the Dialog gets to state END. Meaning the user followed a path that closed the dialog correctly then the USSD gw closes the connection. I would imagine that when the customer aborts then the USSD gw will close the connection to as application server?

Could the gateway not dispose of the HTTP resource if an abort happens. THe same that it does when the path follows to a END state?

vetss commented 8 years ago

Hello @Rouanvdw

I can not understand the reason of what is happenning (: I have not found signifucant differenses between the behaviour in USSD GW side (HTTP client RA) between normal end behaviur and TC-USER-ABORT case. HTTP client RA does not break TCP connections as I understand it but keep them and reuse them for future connections and AFAIK server side must drop connections when needed. But I do not understand why we have such bad behaviour.

1) what is a TCP connection break policy at your HTTP server ?

2) if you do not uncrease MAX_CONNECTIONS_FOR_ROUTES and it become a bad behaviour, which messages do you see in logs at USSD GW (if we think that a problem is in USSD GW side we must have some bad log messages there). Logs that you have provided: a) java.net.SocketException: Too many open files at java.net.PlainSocketImpl.socketAccept(Native Method) at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398) at java.net.ServerSocket.implAccept(ServerSocket.java:530) means a HTTP server side (ServerSocket) and may not be a USSD GW part event b) executeMethod failed in AsyncExecuteHttpMethodHandler with IOException java.net.NoRouteToHostException: Cannot assign requested address at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) looks like more routing issue, not the issue that can be removed by encreasing of MAX_CONNECTIONS_FOR_ROUTES parameter

3) may be you can capture pcap trace at HTTP part for more time that just one dialog and we will be able to see how TCP connections are established / destroyed and what is wrong in destroying part...

Rouanvdw commented 8 years ago

Hi @vetss

We have done further test and noticed if we use the simulators functionality to test multiple USSD message using the automatic sending of the message it aborts the message 100% and all connections are closed successful. Thus it can't be the Webserver we use because it would then always have kept the connections in "CLOSE_WAIT" state.

When we use the simulator to test USSD and we set client mode to "Auto sending ProcessUnstructeredRequests" then it begin's and aborts the connections correctly.Before we did the test of 200 concurrent dialogs. We ran the following command to check the amounT of connections in "CLOSE_WAIT" STATE. "netstat -nat | grep CLOSE_WAIT | wc -l" we got the value of "0". After running the automatic USSD simulator it was still "0".

But when it is set to "Manual Operation" then it begin's the connection but upon clicking on the "Close current Dialog" button it closes the dialog but keeps the connection open. We ran the command again and the connections were up to "2" in "CLOSE_WAIT" state and continued the increase after each use of the simulator in "Manual operation" mode where we begin and abort a session.

It seems the Simulator is aborting the session correctly because all connection started and aborted when using the automatic mode. But when using the manual mode in the simulator it does not close the connections correctly.

Maybe this is the first place we must check? What is the difference when aborting the dialog using the "Automatic" versus "Manual" mode.

vetss commented 8 years ago

Hello @Rouanvdw

we need to differentiate SCTP (SS7, m3ua) connections and TCP (HTTP) connections. You are reporting that MAX_CONNECTIONS_FOR_ROUTES parameter was helping you and this parameter affects to TCP/HTTP part. When you are reporting this: "upon clicking on the "Close current Dialog" button it closes the dialog but keeps the connection open" - which connection do you mean SCTP/m3ua or HTTP/TCP ?

I think the differense between "Automatic" and "Manual" modes are that "Automatic" mode does not send TC-USER-ABORT but manually you can simulate TC-USER-ABORT dialog ending. So this comes back to your theory that only TC-USER-ABORT initiate bad behaviur of HTTP part (if I understnd you properly). Can you please check my prevous message for my quieries ?

Rouanvdw commented 8 years ago

Hi @vetss

Its all HTTP/TCP connections that get stuck.

I will give you the feedback on the other stuff you requested. ASAP

Thanks for all you assistants thus far.

vetss commented 7 years ago

Hello @Rouanvdw

I can not still reproduce it locally. If you have any idea hwo to fix your issue please share it.

nhanth87 commented 7 years ago

hi @vetss I think I can reproduce this problem.

01:35:16,891 WARN  [HttpClientSbbImpl] (SLEE-EventRouterExecutor-5-thread-1) Application didn't revert in 25000 milliseconds. Sending back dialogtimeouterrmssg for MAPDialog MAPDialogSupplementaryWrapper [wrappedDialog=MAPDialog: LocalDialogId=79 RemoteDialogId=79 MAPDialogState=INITIAL_RECEIVED MAPApplicationContext=MAPApplicationContext [Name=networkUnstructuredSsContext, Version=version2, Oid=0, 4, 0, 0, 1, 0, 19, 2, ] TCAPDialogState=InitialReceived]

Br, TN

sleemov commented 7 years ago

Hello Dears I think I found the real cause of the issue. It is the http-client not releasing HTTP connections. This behavior is due to a bug in Apache HC 4.3. It has already been fixed in HC 4.4a1. As of 4.4 CloseableHttpClient#close should automatically shut down the connection pool only if exclusively owned by the client

Please check this link for more details.

The solution is to update the dependency version of the Apache HC in the http-client-ra-DU-2.8.5.jar file or change the dependency to the latest http-client-ra (which is now 7.something)

nhanth87 commented 7 years ago

Hi @Mohammed-Sleem, We dont have the exception connection pool is shutdown (because we dont use CloseableHttpClient), so it may be another problem.

We have a fix here: https://github.com/RestComm/jain-slee.http/commit/a4d77a2c755c808335ba3b4d50940e1ad323549e

Some people are testing it and it works great but we need more times to prove (problem happen after long run 1 month).

Can you please help to test?

Br, TN

sleemov commented 7 years ago

Hello nhanth87

Thanks for your quick response. I will check the solution and get back to you. Regarding the exception on pool shutdown, I have seen it in my case when I stop the server. We don't use the CloseableHttpClient in the RA but you can see it referenced in the stack trace. Please check the attached for details on the exception ST. log.txt

nhanth87 commented 7 years ago

Hi @Mohammed-Sleem , Thanks for your log, can you open a new ticket for http-RA?

btw I think difficult to move to 4.4 because of some of codes are incompatible (some method are Deprecated)

Br, TN

sleemov commented 7 years ago

Hi @nhanth87 I have checked out the branch of the fix but when I mvn install I got jar files with version 7!! and the files deployed in JBoss have the version 2.8.14 (http-client-ra-DU-2.8.14.jar)

Note: I use the latest release of ussd gateway (restcomm-ussd-7.0.41)

BR Sleem

nhanth87 commented 7 years ago

Hi @Mohammed-Sleem , It's OK. when Telestax release new Restcomm version, they will tag it. Just backup the old http-client and replace with new one

Br, TN

sleemov commented 7 years ago

Hello @nhanth87 I have been trying the fix you mentioned, when I clone/mvn install I got the v7 jar files what I did:

  1. undeployed the file http-client-ra-DU-2.8.14.jar from JBOSS_HOME/server/default/deploy
  2. deployed the new RA by copying the file http-client-ra-DU-7.0.0-SNAPSHOT.jar to the same deploy directory and it was deployed fine.

But after that I think the jss7 couldn't communicate with the http RA.

My questions:

  1. Is there any difference between the 2 files? I discovered inside JARs and coudn't figure it out
  2. how can I change debug level of the ussd gateway HTTP client invoker? as I am not seeing anything in the logs, only the time-out CDR.
sleemov commented 7 years ago

I just tried RestComm/jain-slee.http/releases/tag/2.9.129 and the issue looks disappeared. the connection is closing gracefully and doesn't stuck in CLOSE_WAIT state anymore. I will test this for some days and feedback. Thanks very much

vetss commented 7 years ago

Hello Mohammed,

When @nhanth87 was saying for "when Telestax release new Restcomm version" he means this binaries. You can try them to be sure. https://mobicents.ci.cloudbees.com/view/JSLEE/job/Restcomm-JAIN-SLEE-Release/93/ It is needed to replace only http-client-ra-DU-****.jar to a new one.

Can we close the issue ?

sleemov commented 7 years ago

Hello @nhanth87

For me it apparently solved my issues. +1 to close the issue. Thanks

vetss commented 7 years ago

I am closing the issue now.