NSPR should be able to handle error -3155 [ifixoldmacs.com, tenfourfox.com] (GoDaddy)

GoogleCodeExporter commented 9 years ago

There are multiple people complaining about this issue, which I was finally 
able to replicate on my folks' wireless. This seems to be socket related, which 
is problematic because the proxy on the Floodgap internal network hides the 
problem and it works normally (so the workaround is to use a proxy), meaning 
debugging will be difficult. However, we are hosted on GoDaddy, so this is now 
Critical for 9.3.0 (9.2.3 if I find a fix sooner). The problem is intermittent, 
occuring about 5 or 6 times out of 10.

This problem also affects WaMCom and iCab, but NOT IE 5. It does not affect 
later versions of Mozilla, of course.

In NSPR logging, when the dropped connection manifests, we get this:

3[8043cd4]: nsHttpConnection::OnSocketReadable [this=857c9dc]
3[8043cd4]: recv: fd=83957f0 osfd=170830740 buf=851256c amount=4096 flags=0
3[8043cd4]: recv -> -1, error = -5994, os error = -3155
3[8043cd4]: nsHttpConnection::CloseTransaction[this=857c9dc trans=84a71f0 
reason=80004005]
3[8043cd4]: nsHttpTransaction::Close [this=84a71f0 reason=80004005]
3[8043cd4]: nsHttpConnection::Close [this=857c9dc reason=80004005]
3[8043cd4]: nsHttpConnectionMgr::ReclaimConnection [conn=857c9dc]
3[8043cd4]:   connection cannot be reused; closing connection
3[8043cd4]: nsHttpConnection::Close [this=857c9dc reason=80004004]

When the connection works, we get this:

3[8043cd4]: nsHttpConnection::OnSocketReadable [this=857b1ac]
3[8043cd4]: recv: fd=8396104 osfd=170830740 buf=85053cc amount=4096 flags=0
3[8043cd4]: recv -> 469, error = -5934, os error = 77
3[8043cd4]: nsHttpTransaction::OnSocketStatus [this=845f910 status=804b0006 
progress=469]
3[8043cd4]: nsHttpTransaction::ProcessData [this=845f910 count=469]
3[8043cd4]: nsHttpTransaction::ParseHead [count=469]
3[8043cd4]: nsHttpTransaction::ParseLine [HTTP/1.1 302 Found]
etc.

Perhaps we can change nsHttpConnection to deal with this situation, but this is 
a big change, so we need to test thoroughly.

Original issue reported on code.google.com by classi...@floodgap.com on 24 Apr 2011 at 3:56

Blocked on: #162

GoogleCodeExporter commented 9 years ago

Attached logs.

Original comment by classi...@floodgap.com on 24 Apr 2011 at 3:57

Attachments:

[all copy.log](https://storage.googleapis.com/google-code-attachments/classilla/issue-161/comment-1/all copy.log)
all.log

GoogleCodeExporter commented 9 years ago

Error -3155 appears to be a Open Transport socket connection error (meaning no 
likely fix in Bugzilla :( ). There are lots of reports in reference to it with 
other software:

OE 5, IE 5: http://forums.macnn.com/82/applications/64749/network-error-3155-a/

Entourage: http://support.microsoft.com/kb/883725
There is an intriguing note about MTUs in it. Unfortunately, I don't think we 
can reasonably expect users to fiddle with that.

Per http://support.apple.com/kb/TA35484?viewlocale=en_US it is kOTOutStateErr.
http://serenity.uncc.edu/web/ADC/2005/Developer_DVD_Series/April/ADC%20Reference
%20Library/documentation/Carbon/Reference/Open_Transport/open_transport_ref/cons
tant_190.html :

kOTProviderIsClosed
    The provider has closed. The reason for being closed can be found in the OTResult value passed to your notifier. The reasons typically are kOTPortHasDiedErr, kOTPortWasEjectedErr, or kOTPortLostConnectionErr. At this point, any calls other than OTCloseProvider will fail with a kOTOutStateErr error.

This looks like NSPR or even GUSI-level.

Original comment by classi...@floodgap.com on 24 Apr 2011 at 4:12

Changed title: NSPR should be able to handle error -3155 ifixoldmacs.com, classilla.org

GoogleCodeExporter commented 9 years ago

Probably not related, but useful for background: 
https://bugzilla.mozilla.org/show_bug.cgi?id=2133

Original comment by classi...@floodgap.com on 24 Apr 2011 at 4:17

GoogleCodeExporter commented 9 years ago

More background: https://bugzilla.mozilla.org/show_bug.cgi?id=28906

Original comment by classi...@floodgap.com on 24 Apr 2011 at 4:38

GoogleCodeExporter commented 9 years ago

Meta: https://bugzilla.mozilla.org/show_bug.cgi?id=19119

Since this is timing related, we should fix issue 162 first, and see what 
difference that makes.

Original comment by classi...@floodgap.com on 24 Apr 2011 at 4:48

GoogleCodeExporter commented 9 years ago

In the meantime, altered the DNS to point directly to floodgap.com and let 
HTTPi sort it out. This wallpapers the problem at least for us. tenfourfox.com 
will now become our test case.

Original comment by classi...@floodgap.com on 30 Apr 2011 at 5:21

Changed title: NSPR should be able to handle error -3155 ifixoldmacs.com, tenfourfox.com

GoogleCodeExporter commented 9 years ago

Theory, aided in part by 
http://www.mactech.com/articles/develop/issue_27/macqa.html

This is a fairly subtle race condition in NSPR. Looking at 
nsprpub/pr/src/md/mac/macsockotpt.c ; our OpenTransport notifier is in 
NotifierRoutine().

1. We connect, transmit our string.
2. Other side transmits data, immediately disconnects (sends FIN).
3. We get first a T_DATA (and data is available), then a T_ORDREL, in the 
notifier.
4. In the state where this works, the T_ORDREL is sufficiently delayed to allow 
the asynchronous read to get the data first with OTRcv(). T_ORDREL comes along 
and the notifier calls OTRcvOrderlyDisconnect(), closing the connection. 
Further reads on that stream will give us the hated -3155, but this is okay 
because we have the data.
5. In the state where this fails, T_ORDREL is handled before the calling thread 
has a chance to read the data. Since the connection is disconnected at the 
notifier level, then T_ORDREL effectively has a higher priority than T_DATA, 
and effectively masks the data by closing the stream before the thread gets to 
it.

We will only fix this for reads, since writes don't seem to have an issue, 
though theoretically they might have a similar phenomenon.

This *should* work:
- When the notifier gets T_ORDREL, it should check OTCountDataBytes(endpoint, 
&count); to see if all data was read.
- If it was, then it will disconnect like currently and life goes on.
- If it was not, it will mark the socket as in a closing state (probably 
fd->secret->something, which by default will be false) and will not disconnect 
it. It then wakes up the read thread.
- The read thread then calls SendReceiveStream() as usual. It calls OTRcv() and 
presumably gets all the data that is available (we might have to write a 
special tight blocking read for this). At the end of SendReceiveStream(), the 
call looks at the state of the socket. If fd->secret->something is true, then 
it calls OTRcvOrderlyDisconnect() itself, discarding the error since we don't 
care.

Original comment by classi...@floodgap.com on 28 Jun 2011 at 12:09

GoogleCodeExporter commented 9 years ago

I was hoping Copperweb could let me simulate this on my internal network but no 
dice.

Original comment by classi...@floodgap.com on 1 Jul 2011 at 4:34

GoogleCodeExporter commented 9 years ago

Well, did some initial work on this. First, there is already an 
fd->secret->orderlyDisconnect and we will simply (ab)use that as our signal to 
close the socket within SendReceiveStream(). So far this has not broken 
anything and there are no unusual increases in leakage. If everything else is 
good then I will pop out a test build and try to replicate this on the iBook.

Original comment by classi...@floodgap.com on 1 Jul 2011 at 5:10

Changed state: Started

GoogleCodeExporter commented 9 years ago

Original comment by classi...@floodgap.com on 1 Jul 2011 at 5:20

Added labels: Milestone-Release9.2.3

GoogleCodeExporter commented 9 years ago

It's not T_ORDREL. It's T_DISCONNECT, with a disconnect error of 54. This is 
ECONNRESET = 54;   { Connection reset by peer          } (which makes sense).

We know IE can deal with this, so there must be a workaround.

Original comment by classi...@floodgap.com on 3 Jul 2011 at 11:58

GoogleCodeExporter commented 9 years ago

So the current approach hasn't worked at all; we still have a serious race 
condition.

Realistically all I can think of is to build a queue of network states and have 
NSPR access it. This is a big hack but may work.

Original comment by classi...@floodgap.com on 30 Aug 2011 at 3:39

GoogleCodeExporter commented 9 years ago

For the queue, it should consist of the event type (T_ORDREL, T_DISCONNECT, 
etc.), any useful info, a bookmark and a pointer to a buffer. The queue should 
be circular and probably should include at least 15 entries. There should be a 
"thread is at this point" and a "queue is at this point".

In the notifier, an event occurs, the notifier stuffs the state and grabs the 
buffer with OTRcv, if any, and the buffer is created with an OTAllocMem so that 
we can put it in temporary memory. The bookmark is set to zero. We then wake up 
the NSPR read thread like usual. This only needs to be done for read states.

In the read thread, the thread then goes through all of the pending states that 
occurred since it was last awakened. If it needs to read data and there is MORE 
data available than it wanted, it drops a bookmark saying where it left off in 
the buffer, and goes back to sleep after returning the data. Otherwise, it 
copies the buffer, OTFreeMems it to give it back, and returns that. If there 
are more states, it keeps going, including to our T_DISCONNECT. This *should* 
ensure that no data is dropped because the notifier will always get a chance. 
If the notifier is still not fast enough, then I have no idea how to solve this.

Original comment by classi...@floodgap.com on 30 Aug 2011 at 3:48

GoogleCodeExporter commented 9 years ago

er, copies from the bookmark to the end of the buffer, but OTFreeMems the whole 
buffer

Original comment by classi...@floodgap.com on 30 Aug 2011 at 3:51

GoogleCodeExporter commented 9 years ago

Scheduling for 9.3.0 instead since 9.2.3 is picking up other critical issues

Original comment by classi...@floodgap.com on 10 Oct 2011 at 9:03

Added labels: Milestone-Release9.3.0
Removed labels: Milestone-Release9.2.3

GoogleCodeExporter commented 9 years ago

We can create a test that creates just such a queue but doesn't allocate memory 
to it (this is the tricky part). All the notifier would do is stuff events. If 
the read thread sees that it is missing events, then we know the notifier is 
fast enough to queue things. We can then add the alloc and read logic.

The real trick is simulating this internally.

Current algorithm thought:

thread_is_at = 0
queue_is_at = 0
notifier:
get status
stuff status in new queue entry
#if FOR REALSIE
if read, { OTAllocMem and OTRcv into that buffer
save length of buffer
set buffer bookmark to 0 }
#endif
increment queue_is_at
wake up read thread
do again for all statuses

read thread (when it awakens)
if queue_is_at < thread_is_at {
get current queue entry
if read event, { 
#if FOR REALSIE
if length-bookmark > bytes desired { bcopy from bookmark to bookmark + bytes 
desired, bookmark += bytes desired }
else { bcopy from bookmark to length, OTFreeMem buffer, increment thread_is_at }
#else
notify, do OTRcv ourselves as we do now, increment thread_is_at
#endif
}
if other event, { standard handler, increment thread_is_at }
ONLY DO ONE SUCH QUEUE ENTRY to maintain NSPR calling convention
return
}

Original comment by classi...@floodgap.com on 3 Nov 2011 at 1:28

GoogleCodeExporter commented 9 years ago

all increments would be (x++) & 15

Original comment by classi...@floodgap.com on 3 Nov 2011 at 1:29

GoogleCodeExporter commented 9 years ago

We could make this simpler, maybe. Stuff the last T_DATA event only (instead of 
a rotating queue). When T_DISCONNECT comes around, or T_LOOK, have it check the 
last T_DATA event and salvage what it can. This might be "good enough."

Original comment by classi...@floodgap.com on 3 Dec 2011 at 3:28

GoogleCodeExporter commented 9 years ago

This didn't work, so we'll have to go back to the old idea (in 
macsockotpt.c.9.2.3). However, we don't need the queue, just deferredDi, as 
before.

Original comment by classi...@floodgap.com on 3 Dec 2011 at 4:43

GoogleCodeExporter commented 9 years ago

This is holding up 9.3.0 and I can't get it to not be flaky. I need to rethink 
this. Restoring the 9.2.3 code and we'll keep this rolling.

Original comment by classi...@floodgap.com on 31 Dec 2011 at 10:41

Added labels: Priority-High
Removed labels: Milestone-Release9.3.0, Priority-Critical

GoogleCodeExporter commented 9 years ago

ifixoldmacs now works ... !

Original comment by classi...@floodgap.com on 12 Jan 2012 at 3:00

GoogleCodeExporter commented 9 years ago

Well, it *did*. Must have been a blip.

Original comment by classi...@floodgap.com on 17 Jan 2012 at 3:43

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

I wonder if using OTMP instead of notifiers would be a good idea, though this 
would require a bump of the system requirements up to 9.0.

Original comment by yuhongba...@hotmail.com on 21 Jan 2015 at 9:06

KawaiiBASIC / classilla

NSPR should be able to handle error -3155 [ifixoldmacs.com, tenfourfox.com] (GoDaddy) #161