Open GoogleCodeExporter opened 9 years ago
I should point out a couple other testing scenarios I tried prior to reporting:
1. I tested on Samsung Galaxy S (GT-I9000T) running Eclair 2.1 update 1, occurs
equally on WiFi and 3G
2. I tested on MyTouch 4G (HTC Glacier) running Froyo 2.2, occurs less
frequently on 3G and difficult to reproduce borken sockets on WiFi
Original comment by rich.ho...@gmail.com
on 4 Feb 2011 at 7:35
[deleted comment]
[deleted comment]
Using EITHER the precompiled build or a custom Doubango stack I found the
problem. It's Doubango's STUN implementation breaking the Android native
sockets intermittently.
Leave STUN off, sockets are stable...
Turn STUN on, and eventually you will get the trace below...
When will Doubango have a STUN implementation that has a NAT pinhole keepalive
& not break the Android native sockets like below? (let me know if I am asking
the wrong forum, be happy to help fix/test corrections...)
02-04 21:29:03.871: DEBUG/org.doubango.imsdroid.Services.Impl.SipService(1961):
Stack stoped
02-04 21:29:04.541: INFO/org.doubango.imsdroid.Services.Impl.SipService(1961):
realm='sip:share.aylus.com', impu='sip:#####@foo.bar', impi='#####@foo.bar'
02-04 21:29:04.541: INFO/org.doubango.imsdroid.Services.Impl.SipService(1961):
STUN=yes
02-04 21:29:04.541: INFO/org.doubango.imsdroid.Services.Impl.SipService(1961):
STUN2 - server=numb.viagenie.ca and port=3478
02-04 21:29:04.541: INFO/org.doubango.imsdroid.Services.Impl.SipService(1961):
pcscf-host='#.#.#.#', pcscf-port='5060', transport='UDP', ipversion='ipv4'
02-04 21:29:04.551:
DEBUG/org.doubango.imsdroid.Services.Impl.NetworkService(1961): 26.180.59.20
02-04 21:29:04.551: INFO/org.doubango.imsdroid.Services.Impl.SipService(1961):
Local IP='26.180.59.20'
02-04 21:29:04.551:
DEBUG/org.doubango.imsdroid.Services.Impl.NetworkService(1961): netType=0 and
netSubType=8
02-04 21:29:04.561: DEBUG/org.doubango.imsdroid.Services.Impl.SipService(1961):
Stack started
02-04 21:29:04.561: DEBUG/org.doubango.imsdroid.Services.Impl.SipService(1961):
OnDialogEvent (Dialog connecting)
02-04 21:29:04.561: INFO/org.doubango.imsdroid.Main(1961): onRegistrationEvent
02-04 21:29:04.561: INFO/org.doubango.imsdroid.Screens.ScreenHome(1961):
onRegistrationEvent
02-04 21:29:04.571:
INFO/org.doubango.imsdroid.Services.Impl.ServiceManager(1961):
onRegistrationEvent
02-04 21:29:04.571:
DEBUG/org.doubango.imsdroid.Services.Impl.ServiceManager(1961):
showRegistartionNotif
02-04 21:29:04.671: DEBUG/org.doubango.imsdroid.Services.Impl.SipService(1961):
OnDialogEvent ((un)REGISTER request successfully sent.)
02-04 21:29:04.741: DEBUG/skia(1961): ===== ashmem pin_region(43) returned -1,
treating as error 9
02-04 21:29:05.171: ERROR/tinyWRAP(1961): ***ERROR: function:
"tnet_get_ip_n_port()"
02-04 21:29:05.171: ERROR/tinyWRAP(1961): file: "src/tnet_utils.c"
02-04 21:29:05.171: ERROR/tinyWRAP(1961): line: "920"
02-04 21:29:05.171: ERROR/tinyWRAP(1961): MSG: TNET_GET_SOCKADDR has failed
with status code: -1
02-04 21:29:05.171: ERROR/tinyWRAP(1961): ***ERROR: function:
"tnet_transport_sendto()"
02-04 21:29:05.171: ERROR/tinyWRAP(1961): file: "src/tnet_transport_poll.c"
02-04 21:29:05.171: ERROR/tinyWRAP(1961): line: "241"
02-04 21:29:05.171: ERROR/tinyWRAP(1961): MSG: sendto have failed.
02-04 21:29:05.171: ERROR/tinyWRAP(1961): ***ERROR: function:
"tnet_transport_sendto()"
02-04 21:29:05.171: ERROR/tinyWRAP(1961): file: "src/tnet_transport_poll.c"
02-04 21:29:05.171: ERROR/tinyWRAP(1961): line: "241"
02-04 21:29:05.171: ERROR/tinyWRAP(1961): MSG: (SYSTEM)NETWORK ERROR ==>Bad
file number
Original comment by rich.ho...@gmail.com
on 5 Feb 2011 at 2:42
I found the problem in Doubango 'src/tnet_transport_poll.c, after line 671'.
The answer to your FIXME question is 'NO', if ioctl for FIONREAD returns 0 or
greater but the &len is set to '0' bytes, an Android native socket decriptor
has NOT been closed. It's still a viable socket descriptor and SHOULD NOT be
closed, even if it needed to be closed, Doubango should fire off an EVENT_ERROR
for the IMSDroid to decide what to do. EVENT_CLOSE confuses IMSDroid and keeps
trying to register, creating later socket errors on a dead socket. For an
interim fix since this error is slowing me down here is what I modified in the
Doubango tree for you to consider. My tweak sends the warning still, disables
the event_close and leaves the native Android socket open still. I will let you
know if I have anymore problems with this approach but it appears to be
working...
==== src/tnet_transport_poll.c, around line 671 =====
if(!len){
TSK_DEBUG_WARN("IOCTLT returned '0' bytes for fd=%d", active_socket->fd);
/*FIXME: On Android/MAC OS X this mean that the socket has been closed? */
//#if defined(ANDROID) || 1
// TSK_RUNNABLE_ENQUEUE(transport, event_error, transport->callback_data,
active_socket->fd);
// removeSocket(i, context);
//#else
recv(active_socket->fd, 0, 0, 0);
//#endif
continue;
Original comment by rich.ho...@gmail.com
on 5 Feb 2011 at 8:13
[deleted comment]
[deleted comment]
One other concern you may want to consider adding. Since Doubango is polling
and pausing sockets, a thread optimization in the polling section to sleep
150ms (or less) while the socket is paused, saves battery, CPU and uneeded
debug trace lines. I used to see "Socket is paused" for up to 50 cycles waiting
for STUN, now I see "Socket is paused, sleeping for 150ms" just once:
================
/* check whether the socket is paused or not */
if(active_socket->paused){
TSK_DEBUG_INFO("Socket is paused, sleeping for 150ms");
tsk_thread_sleep(150);
goto TNET_POLLIN_DONE;
}
Original comment by rich.ho...@gmail.com
on 6 Feb 2011 at 12:55
You have provided a great feedback but I think there are many confusions.
We are not polling and pausing the sockets as you may think. The framework uses
poll() [http://linux.about.com/library/cmd/blcmdl2_poll.htm] function which is
blocking and works like select(). If you prefer using select() instead of
poll() then, set "TNET_HAVE_POLL" macro value to zero. Please see:
http://code.google.com/p/doubango/source/browse/trunk/tinyNET/src/tnet_poll.c
"Pausing" a socket doesn't mean what you think. A socket is only paused when
STUN is enabled.
Explication:
a) When a RTP socket is created then, we add it to the map managed by the main
transport thread (T1) using poll() or select().
b) When STUN is enabled then, a new NAT context is created to retrieve the
public ip:port. This context will create a new thread (T2) to send/receive and
parse STUN (or TURN) messages.
c) As T1 and T2 use the same socket then, both will receive notifications when
there is data to read. As you cannot read the socket two times then, we mark T1
as paused. In your case this means that you have received 50 RTP/STUN packets
before STUN process complete.
It's really not a good idea to add the sleep(150).
Original comment by boss...@yahoo.fr
on 6 Feb 2011 at 3:51
Agreed I did look into the poll function and see the standard 'select() on
managed file descriptors' and agree that is the correct approach (didn't mean
to imply Doubango was using a 'polling' non-yielding thread approach).
More specifically what I am pointing out is that while a socket is 'paused' by
STUN, Doubango gets about 50 or so 'INFO' messages for select() alerts. That
was the slight inefficieny I was referring to - understand it only happens
while waiting for STUN. I was suggesting that a yield or sleep of 150 msec
(similar to what is done while waiting for DNS) might remove the unhandled
select() alerts.
I also went ahead and implemented a periodic STUN refresh for Win32 and
Android. I believe it's reasonably portable to iOS and Linux. I will attach it
kindly for your review/consideration later today, cheers.
Original comment by rich.ho...@gmail.com
on 6 Feb 2011 at 2:55
Any news about the patch?
Original comment by boss...@yahoo.fr
on 10 Feb 2011 at 12:05
My apologies been on the road for 3 weeks and traveling again next week...
Attached is a proposal to consider for STUN refresh on just the SIP transport.
For carriers that NAT the mobile devices on 3G, this keeps the NAT UDP pinholes
open to keep the device reachable by the serving CSCF.
You can ignore the function:
TINYNET_API unsigned long tnet_transport_gettick();
I am resuing the existing tsk_time_epoch() which is already portable across O/S.
Original comment by rich.ho...@gmail.com
on 12 Feb 2011 at 2:47
Attachments:
@rich.ho...@gmail.com
Is it possible to have the changes as patches. It's hard to find the modified
code.
Original comment by boss...@yahoo.fr
on 19 May 2011 at 5:17
Original comment by boss...@yahoo.fr
on 19 May 2011 at 8:58
I'm seeing this even when STUN is off.
Original comment by kjeremy@gmail.com
on 30 May 2013 at 8:17
Original issue reported on code.google.com by
rich.ho...@gmail.com
on 4 Feb 2011 at 7:27