dvoina / imsdroid

Automatically exported from code.google.com/p/imsdroid
0 stars 0 forks source link

intermittent socket failures #166

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. w/ latest IMSDroid code, I toggle registration off & on
2. intermittently I get socket errors, see debug trace below

What is the expected output? 

No socket errors

What do you see instead?

Just toggling registration, no changes in cellular or WLAN data coverage (WiFi, 
3G or HSPA+). The problem can occur both on WiFi or cellular data. Then 
inexplicably register/deregister will start working again.

What version of the product are you using? On what operating system?

Latest version IMSDroid, latest version of Doubango compiled with NEON=yes (I 
am using my own build of Doubango stack, will go back and try the public 
version checked into SVN)

Please provide any additional information below.

Here are 2 traces, I am happy to provide my tinyWrap.so build, if that would 
help isolate. It seems like the network connectivity state is related.

01-25 12:01:18.676: ERROR/tinyWRAP(5327): ***ERROR: function: 
"tnet_get_ip_n_port()" 
01-25 12:01:18.676: ERROR/tinyWRAP(5327): file: "src/tnet_utils.c" 
01-25 12:01:18.676: ERROR/tinyWRAP(5327): line: "920" 
01-25 12:01:18.676: ERROR/tinyWRAP(5327): MSG: TNET_GET_SOCKADDR has failed 
with status code: -1
01-25 12:01:18.676: ERROR/tinyWRAP(5327): ***ERROR: function: 
"tnet_transport_sendto()" 
01-25 12:01:18.676: ERROR/tinyWRAP(5327): file: "src/tnet_transport_poll.c" 
01-25 12:01:18.676: ERROR/tinyWRAP(5327): line: "241" 
01-25 12:01:18.676: ERROR/tinyWRAP(5327): MSG: sendto have failed.
01-25 12:01:18.676: ERROR/tinyWRAP(5327): ***ERROR: function: 
"tnet_transport_sendto()" 
01-25 12:01:18.676: ERROR/tinyWRAP(5327): file: "src/tnet_transport_poll.c" 
01-25 12:01:18.676: ERROR/tinyWRAP(5327): line: "241" 
01-25 12:01:18.676: ERROR/tinyWRAP(5327): MSG: (SYSTEM)NETWORK ERROR ==>Socket 
operation on non-socket
01-25 12:01:18.725: ERROR/tinyWRAP(5327): ***ERROR: function: 
"tnet_get_ip_n_port()" 
01-25 12:01:18.725: ERROR/tinyWRAP(5327): file: "src/tnet_utils.c" 
01-25 12:01:18.725: ERROR/tinyWRAP(5327): line: "920" 
01-25 12:01:18.725: ERROR/tinyWRAP(5327): MSG: TNET_GET_SOCKADDR has failed 
with status code: -1
01-25 12:01:18.725: ERROR/tinyWRAP(5327): ***ERROR: function: 
"tnet_transport_sendto()" 
01-25 12:01:18.725: ERROR/tinyWRAP(5327): file: "src/tnet_transport_poll.c" 
01-25 12:01:18.725: ERROR/tinyWRAP(5327): line: "241" 
01-25 12:01:18.725: ERROR/tinyWRAP(5327): MSG: sendto have failed.
01-25 12:01:18.725: ERROR/tinyWRAP(5327): ***ERROR: function: 
"tnet_transport_sendto()" 
01-25 12:01:18.725: ERROR/tinyWRAP(5327): file: "src/tnet_transport_poll.c" 
01-25 12:01:18.725: ERROR/tinyWRAP(5327): line: "241" 
01-25 12:01:18.725: ERROR/tinyWRAP(5327): MSG: (SYSTEM)NETWORK ERROR ==>Socket 
operation on non-socket

I did add some more trace to get the value of socket file descriptor. Here is 
what the trace outputs now:

02-04 02:09:01.478: ERROR/tinyWRAP(3084): ***ERROR: function: 
"tnet_get_ip_n_port()" 
02-04 02:09:01.478: ERROR/tinyWRAP(3084): file: "src/tnet_utils.c" 
02-04 02:09:01.478: ERROR/tinyWRAP(3084): line: "924" 
02-04 02:09:01.478: ERROR/tinyWRAP(3084): MSG: TNET_GET_SOCKADDR has failed 
with status code: -1, fd = 45
02-04 02:09:01.478: ERROR/tinyWRAP(3084): ***ERROR: function: 
"tnet_transport_sendto()" 
02-04 02:09:01.478: ERROR/tinyWRAP(3084): file: "src/tnet_transport_poll.c" 
02-04 02:09:01.478: ERROR/tinyWRAP(3084): line: "241" 
02-04 02:09:01.478: ERROR/tinyWRAP(3084): MSG: sendto have failed.
02-04 02:09:01.478: ERROR/tinyWRAP(3084): ***ERROR: function: 
"tnet_transport_sendto()" 
02-04 02:09:01.478: ERROR/tinyWRAP(3084): file: "src/tnet_transport_poll.c" 
02-04 02:09:01.478: ERROR/tinyWRAP(3084): line: "241" 
02-04 02:09:01.478: ERROR/tinyWRAP(3084): MSG: (SYSTEM)NETWORK ERROR ==>Bad 
file number
02-04 02:09:01.521: ERROR/tinyWRAP(3084): ***ERROR: function: 
"tnet_get_ip_n_port()" 
02-04 02:09:01.521: ERROR/tinyWRAP(3084): file: "src/tnet_utils.c" 
02-04 02:09:01.521: ERROR/tinyWRAP(3084): line: "924" 
02-04 02:09:01.521: ERROR/tinyWRAP(3084): MSG: TNET_GET_SOCKADDR has failed 
with status code: -1, fd = 45
02-04 02:09:01.521: ERROR/tinyWRAP(3084): ***ERROR: function: 
"tnet_transport_sendto()" 
02-04 02:09:01.521: ERROR/tinyWRAP(3084): file: "src/tnet_transport_poll.c" 
02-04 02:09:01.521: ERROR/tinyWRAP(3084): line: "241" 
02-04 02:09:01.521: ERROR/tinyWRAP(3084): MSG: sendto have failed.
02-04 02:09:01.523: ERROR/tinyWRAP(3084): ***ERROR: function: 
"tnet_transport_sendto()" 
02-04 02:09:01.523: ERROR/tinyWRAP(3084): file: "src/tnet_transport_poll.c" 
02-04 02:09:01.523: ERROR/tinyWRAP(3084): line: "241" 
02-04 02:09:01.523: ERROR/tinyWRAP(3084): MSG: (SYSTEM)NETWORK ERROR ==>Bad 
file number

Original issue reported on code.google.com by rich.ho...@gmail.com on 4 Feb 2011 at 7:27

GoogleCodeExporter commented 9 years ago
I should point out a couple other testing scenarios I tried prior to reporting:

1. I tested on Samsung Galaxy S (GT-I9000T) running Eclair 2.1 update 1, occurs 
equally on WiFi and 3G

2. I tested on MyTouch 4G (HTC Glacier) running Froyo 2.2, occurs less 
frequently on 3G and difficult to reproduce borken sockets on WiFi

Original comment by rich.ho...@gmail.com on 4 Feb 2011 at 7:35

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
Using EITHER the precompiled build or a custom Doubango stack I found the 
problem. It's Doubango's STUN implementation breaking the Android native 
sockets intermittently.

Leave STUN off, sockets are stable...

Turn STUN on, and eventually you will get the trace below...

When will Doubango have a STUN implementation that has a NAT pinhole keepalive 
& not break the Android native sockets like below? (let me know if I am asking 
the wrong forum, be happy to help fix/test corrections...)

02-04 21:29:03.871: DEBUG/org.doubango.imsdroid.Services.Impl.SipService(1961): 
Stack stoped
02-04 21:29:04.541: INFO/org.doubango.imsdroid.Services.Impl.SipService(1961): 
realm='sip:share.aylus.com', impu='sip:#####@foo.bar', impi='#####@foo.bar'
02-04 21:29:04.541: INFO/org.doubango.imsdroid.Services.Impl.SipService(1961): 
STUN=yes
02-04 21:29:04.541: INFO/org.doubango.imsdroid.Services.Impl.SipService(1961): 
STUN2 - server=numb.viagenie.ca and port=3478
02-04 21:29:04.541: INFO/org.doubango.imsdroid.Services.Impl.SipService(1961): 
pcscf-host='#.#.#.#', pcscf-port='5060', transport='UDP', ipversion='ipv4'
02-04 21:29:04.551: 
DEBUG/org.doubango.imsdroid.Services.Impl.NetworkService(1961): 26.180.59.20
02-04 21:29:04.551: INFO/org.doubango.imsdroid.Services.Impl.SipService(1961): 
Local IP='26.180.59.20'
02-04 21:29:04.551: 
DEBUG/org.doubango.imsdroid.Services.Impl.NetworkService(1961): netType=0 and 
netSubType=8
02-04 21:29:04.561: DEBUG/org.doubango.imsdroid.Services.Impl.SipService(1961): 
Stack started
02-04 21:29:04.561: DEBUG/org.doubango.imsdroid.Services.Impl.SipService(1961): 
OnDialogEvent (Dialog connecting)
02-04 21:29:04.561: INFO/org.doubango.imsdroid.Main(1961): onRegistrationEvent
02-04 21:29:04.561: INFO/org.doubango.imsdroid.Screens.ScreenHome(1961): 
onRegistrationEvent
02-04 21:29:04.571: 
INFO/org.doubango.imsdroid.Services.Impl.ServiceManager(1961): 
onRegistrationEvent
02-04 21:29:04.571: 
DEBUG/org.doubango.imsdroid.Services.Impl.ServiceManager(1961): 
showRegistartionNotif
02-04 21:29:04.671: DEBUG/org.doubango.imsdroid.Services.Impl.SipService(1961): 
OnDialogEvent ((un)REGISTER request successfully sent.)
02-04 21:29:04.741: DEBUG/skia(1961): ===== ashmem pin_region(43) returned -1, 
treating as error 9
02-04 21:29:05.171: ERROR/tinyWRAP(1961): ***ERROR: function: 
"tnet_get_ip_n_port()" 
02-04 21:29:05.171: ERROR/tinyWRAP(1961): file: "src/tnet_utils.c" 
02-04 21:29:05.171: ERROR/tinyWRAP(1961): line: "920" 
02-04 21:29:05.171: ERROR/tinyWRAP(1961): MSG: TNET_GET_SOCKADDR has failed 
with status code: -1
02-04 21:29:05.171: ERROR/tinyWRAP(1961): ***ERROR: function: 
"tnet_transport_sendto()" 
02-04 21:29:05.171: ERROR/tinyWRAP(1961): file: "src/tnet_transport_poll.c" 
02-04 21:29:05.171: ERROR/tinyWRAP(1961): line: "241" 
02-04 21:29:05.171: ERROR/tinyWRAP(1961): MSG: sendto have failed.
02-04 21:29:05.171: ERROR/tinyWRAP(1961): ***ERROR: function: 
"tnet_transport_sendto()" 
02-04 21:29:05.171: ERROR/tinyWRAP(1961): file: "src/tnet_transport_poll.c" 
02-04 21:29:05.171: ERROR/tinyWRAP(1961): line: "241" 
02-04 21:29:05.171: ERROR/tinyWRAP(1961): MSG: (SYSTEM)NETWORK ERROR ==>Bad 
file number

Original comment by rich.ho...@gmail.com on 5 Feb 2011 at 2:42

GoogleCodeExporter commented 9 years ago
I found the problem in Doubango 'src/tnet_transport_poll.c, after line 671'. 
The answer to your FIXME question is 'NO', if ioctl for FIONREAD returns 0 or 
greater but the &len is set to '0' bytes, an Android native socket decriptor 
has NOT been closed. It's still a viable socket descriptor and SHOULD NOT be 
closed, even if it needed to be closed, Doubango should fire off an EVENT_ERROR 
for the IMSDroid to decide what to do. EVENT_CLOSE confuses IMSDroid and keeps 
trying to register, creating later socket errors on a dead socket. For an 
interim fix since this error is slowing me down here is what I modified in the 
Doubango tree for you to consider. My tweak sends the warning still, disables 
the event_close and leaves the native Android socket open still. I will let you 
know if I have anymore problems with this approach but it appears to be 
working...

==== src/tnet_transport_poll.c, around line 671 =====
    if(!len){
        TSK_DEBUG_WARN("IOCTLT returned '0' bytes for fd=%d", active_socket->fd);
/*FIXME: On Android/MAC OS X this mean that the socket has been closed?  */
//#if defined(ANDROID) || 1 
//      TSK_RUNNABLE_ENQUEUE(transport, event_error, transport->callback_data, 
active_socket->fd);
//      removeSocket(i, context);
//#else
        recv(active_socket->fd, 0, 0, 0);
//#endif
        continue;

Original comment by rich.ho...@gmail.com on 5 Feb 2011 at 8:13

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
One other concern you may want to consider adding. Since Doubango is polling 
and pausing sockets, a thread optimization in the polling section to sleep 
150ms (or less) while the socket is paused, saves battery, CPU and uneeded 
debug trace lines. I used to see "Socket is paused" for up to 50 cycles waiting 
for STUN, now I see "Socket is paused, sleeping for 150ms" just once:

================
    /* check whether the socket is paused or not */
    if(active_socket->paused){
        TSK_DEBUG_INFO("Socket is paused, sleeping for 150ms");
        tsk_thread_sleep(150);
        goto TNET_POLLIN_DONE;
    }

Original comment by rich.ho...@gmail.com on 6 Feb 2011 at 12:55

GoogleCodeExporter commented 9 years ago
You have provided a great feedback but I think there are many confusions.

We are not polling and pausing the sockets as you may think. The framework uses 
poll() [http://linux.about.com/library/cmd/blcmdl2_poll.htm] function which is 
blocking and works like select(). If you prefer using select() instead of 
poll() then, set "TNET_HAVE_POLL" macro value to zero. Please see: 
http://code.google.com/p/doubango/source/browse/trunk/tinyNET/src/tnet_poll.c

"Pausing" a socket doesn't mean what you think. A socket is only paused when 
STUN is enabled. 
Explication:
a) When a RTP socket is created then, we add it to the map managed by the main 
transport thread (T1) using poll() or select(). 
b) When STUN is enabled then, a new NAT context is created to retrieve the 
public ip:port. This context will create a new thread (T2) to send/receive and 
parse STUN (or TURN) messages.
c) As T1 and T2 use the same socket then, both will receive notifications when 
there is data to read. As you cannot read the socket two times then, we mark T1 
as paused. In your case this means that you have received 50 RTP/STUN packets 
before STUN process complete.
It's really not a good idea to add the sleep(150).

Original comment by boss...@yahoo.fr on 6 Feb 2011 at 3:51