Loop of Death on Broken Pipe.

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. After running for a few days, the FTP server spikes to 100% CPU usage.
2. FTP still works, so the server does not crash, it just goes to 100% CPU 
until restarted.

What is the expected output? What do you see instead?
strace shows the following loop:

--- SIGPIPE (Broken pipe) @ 0 (0) ---
gettimeofday({1322682848, 73400}, NULL) = 0
select(10, [3 5 9], [4 9], [3 4 5 9], {1, 0}) = 1 (in [9], left {1, 0})
write(9, "", 0)                         = 0
write(9, "\25\3\1\0\22\215\341\377\240^]\30\216\32!\305\00204\5\337&6", 23) = 
-1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
gettimeofday({1322682848, 73900}, NULL) = 0
select(10, [3 5 9], [4 9], [3 4 5 9], {1, 0}) = 1 (in [9], left {1, 0})
write(9, "", 0)                         = 0
write(9, "\25\3\1\0\22\215\341\377\240^]\30\216\32!\305\00204\5\337&6", 23) = 
-1 EPIPE (Broken pipe)
...

lsof shows me the following information about fd 9.

COMMAND    PID      USER   FD   TYPE             DEVICE      SIZE      NODE NAME
python26 26974 smartfile    9u  sock                0,5            38167901 
can't identify protocol

I am not sure where in the code this loop is and why the server cannot break 
out of it.

I am running a recent SVN version (revision: 908)

# $Id: ftpserver.py 908 2011-11-19 18:15:52Z g.rodola $

Original issue reported on code.google.com by btimby@gmail.com on 30 Nov 2011 at 7:58

GoogleCodeExporter commented 9 years ago

Are you at least able to figure out whether this was an FTPS session?

Original comment by g.rodola on 30 Nov 2011 at 8:01

GoogleCodeExporter commented 9 years ago

The data being written to the fd does not contain any FTP protocol markers. 
Therefore my guess is that it is crypted.

Other than that, I don't know how to tell.

Original comment by btimby@gmail.com on 30 Nov 2011 at 8:03

GoogleCodeExporter commented 9 years ago

Please try r923.
This shouldn't fix the EPIPE error but at least the code should no longer end 
up being stucked into an infinite loop between calls (_do_ssl_shutdown() -> 
handle_error() -> close()).

Obviosuly, I'm just assuming the problem is there (but it's likely).

Original comment by g.rodola on 30 Nov 2011 at 8:15

GoogleCodeExporter commented 9 years ago

Indeed. Broken pipe is not the issue. But trying it over and over is. I will 
try this out.

Just your quick opinion, are the changes between r908 and r923 "safe". Would 
you suggest simply updating or patching in this change?

Original comment by btimby@gmail.com on 30 Nov 2011 at 8:33

GoogleCodeExporter commented 9 years ago

I would suggest updating.
The changes introduced in that range are mostly bugfixes and a new "feature" 
(the faster scheduler), both of which are not supposed to introduce retro 
compatibility issues.

Original comment by g.rodola on 30 Nov 2011 at 8:39

GoogleCodeExporter commented 9 years ago

I have updated the servers and restarted them. I will report back in a week if 
I don't see this issue again. It happens every couple days and I am not sure 
how it is triggered (by a user). So after a week, I would assume this change 
resolved it.

Original comment by btimby@gmail.com on 30 Nov 2011 at 9:02

GoogleCodeExporter commented 9 years ago

Even if the infinite loop is fixed please report EPIPE with the complete 
traceback.

Original comment by g.rodola on 30 Nov 2011 at 9:26

GoogleCodeExporter commented 9 years ago

With this change in place, I am having errors connecting via FTPES or FTPS 
(With FileZilla).

--
Status: Server did not properly shut down TLS connection
Error:  Could not read from transfer socket: ECONNABORTED - Connection aborted
--

This is something that happened in older proftpd versions,
here is the fix:

http://proftp.cvs.sourceforge.net/viewvc/proftp/proftpd/contrib/mod_tls.c?r1=1.1
38&r2=1.139

The explanation is:

http://marc.info/?l=proftpd-users&m=121736627908173&w=2

If I roll back to r908, I am able to connect once again.

Original comment by btimby@gmail.com on 1 Dec 2011 at 3:47

GoogleCodeExporter commented 9 years ago

This should be treated as a separate issue.
Can you open one please?

Original comment by g.rodola on 1 Dec 2011 at 8:35

GoogleCodeExporter commented 9 years ago

I committed two revisions to complete the changes made in r923.

r925 - added super() calls to SSLConnection.__init__() and 
TLS_FTPHandler.__init__(). This allows the _error attribute to be added when a 
TLS_FTPHandler instance is created. Without this, the handle_read() method will 
die with an AttributeError.

r926 - added super() call to TLS_DTPHandler as well. For the same reason as 
above.

Original comment by btimby@gmail.com on 2 Dec 2011 at 11:16

GoogleCodeExporter commented 9 years ago

I don't know if you want a separate issue for this, but here is another 100% 
CPU usage scenario, but with a different strace loop (no broken pipe, but 
EAGAIN).

strace:
--
select(9, [3 5 8], [8], [3 5 8], {1, 0}) = 1 (out [8], left {1, 0})
time(NULL)                              = 1323108657
read(8, 0x4d9b290, 11)                  = -1 EAGAIN (Resource temporarily 
unavailable)
gettimeofday({1323108657, 587397}, NULL) = 0
select(9, [3 5 8], [8], [3 5 8], {1, 0}) = 1 (out [8], left {1, 0})
time(NULL)                              = 1323108657
read(8, 0x4d9b290, 11)                  = -1 EAGAIN (Resource temporarily 
unavailable)
gettimeofday({1323108657, 587629}, NULL) = 0
select(9, [3 5 8], [8], [3 5 8], {1, 0}) = 1 (out [8], left {1, 0})
time(NULL)                              = 1323108657
read(8, 0x4d9b290, 11)                  = -1 EAGAIN (Resource temporarily 
unavailable)
gettimeofday({1323108657, 587861}, NULL) = 0
select(9, [3 5 8], [8], [3 5 8], {1, 0}) = 1 (out [8], left {1, 0})
time(NULL)                              = 1323108657
read(8, 0x4d9b290, 11)                  = -1 EAGAIN (Resource temporarily 
unavailable)
gettimeofday({1323108657, 588092}, NULL) = 0
select(9, [3 5 8], [8], [3 5 8], {1, 0}) = 1 (out [8], left {1, 0})
time(NULL)                              = 1323108657
read(8, 0x4d9b290, 11)                  = -1 EAGAIN (Resource temporarily 
unavailable)
gettimeofday({1323108657, 588324}, NULL) = 0
--

and lsof shows a connection from an FTP client:

--
python26 1593 smartfile    8u  IPv4           59132471               TCP 
10.19.78.11:ftp->mustafar:38337 (ESTABLISHED)
--

Mustafar is our proxy/load balancer used for FTP connections.

Original comment by btimby@gmail.com on 5 Dec 2011 at 6:13

GoogleCodeExporter commented 9 years ago

It is possible that r923 fixes this too, but I can't run that due to issue 193.

Original comment by btimby@gmail.com on 5 Dec 2011 at 6:15

GoogleCodeExporter commented 9 years ago

Here is a traceback showing a Broken Pipe error.

Probably one similar to what was causing the death loop (note the past tense).

--
Traceback (most recent call last):

  File "/usr/lib64/python2.6/asynchat.py", line 235, in initiate_send
    num_sent = self.send(data)

  File "/usr/lib/python2.6/site-packages/pyftpdlib/contrib/handlers.py", line 171, in send
    return super(SSLConnection, self).send(data)

  File "/usr/lib/python2.6/site-packages/pyftpdlib/ftpserver.py", line 1065, in send
    result = asyncore.dispatcher.send(self, data)

  File "/usr/lib64/python2.6/asyncore.py", line 352, in send
    result = self.socket.send(data)

error: [Errno 32] Broken pipe
--

Original comment by btimby@gmail.com on 7 Dec 2011 at 3:28

GoogleCodeExporter commented 9 years ago

Ok, I think I'm going to remove CCC command feature (issue 172).
ssl_shutdown() handling was already complicated and risky.
With CCC it became unmaintainable.

Original comment by g.rodola on 7 Dec 2011 at 7:23

Added labels: SSL
Removed labels: OpSys-Linux

GoogleCodeExporter commented 9 years ago

I removed CCC command implementation as of r934 where I tried to handle all the 
SSL-related error conditions.
Please try this new rev one more time and let me know if you experience any 
problem.

Original comment by g.rodola on 7 Dec 2011 at 8:19

GoogleCodeExporter commented 9 years ago

I will install this in a couple days. I have restarted the FTP services too 
many times this week :-). I will update over the weekend.

I will install on my test environment right away.

Original comment by btimby@gmail.com on 7 Dec 2011 at 8:22

GoogleCodeExporter commented 9 years ago

Original comment by g.rodola on 7 Dec 2011 at 8:37

Added labels: Priority-High
Removed labels: Priority-Medium

GoogleCodeExporter commented 9 years ago

I put this into production today. I was seeing hundreds of SSL related errors 
(like below).

--
Traceback (most recent call last):

  File "/usr/lib64/python2.6/asyncore.py", line 76, in read
    obj.handle_read_event()

  File "/var/www/smartfile/cli/pyftpd/smartfile.py", line 150, in handle_read_event
    super(ProxyProtocolHandler, self).handle_read_event()

  File "/usr/lib/python2.6/site-packages/pyftpdlib/contrib/handlers.py", line 146, in handle_read_event
    super(SSLConnection, self).handle_read_event()

  File "/usr/lib64/python2.6/asyncore.py", line 416, in handle_read_event
    self.handle_read()

  File "/usr/lib64/python2.6/asynchat.py", line 110, in handle_read
    data = self.recv (self.ac_in_buffer_size)

  File "/usr/lib/python2.6/site-packages/pyftpdlib/contrib/handlers.py", line 188, in recv
    return super(SSLConnection, self).recv(buffer_size)

  File "/usr/lib64/python2.6/asyncore.py", line 365, in recv
    data = self.socket.recv(buffer_size)

Error: [('SSL routines', 'SSL3_WRITE_PENDING', 'bad write retry')]
--

I will let you know if this latest change (removal of CCC) resolves them.

Original comment by btimby@gmail.com on 19 Dec 2011 at 10:28

GoogleCodeExporter commented 9 years ago

This update eradicated most of the SSL errors I was receiving. There are still 
a few, but it is one or two a day rather than hundreds.

I have not seen any server go to 100% CPU usage in about two weeks now. So I 
would deem this issue resolved.

Original comment by btimby@gmail.com on 21 Dec 2011 at 6:17

GoogleCodeExporter commented 9 years ago

Good. Please file a separate issue in case SSL3_WRITE_PENDING comes up again.

Original comment by g.rodola on 22 Dec 2011 at 10:56

Changed state: FixedInSVN
Added labels: Milestone-0.6.1, Version-0.6.0
Removed labels: Python-2.6

GoogleCodeExporter commented 9 years ago

Original comment by g.rodola on 3 Jan 2012 at 11:40

Added labels: Milestone-0.7.0

GoogleCodeExporter commented 9 years ago

I am still occasionally seeing the same broken pipe error:

gettimeofday({1326146598, 123463}, NULL) = 0
select(20, [3 5 11], [19], [3 5 11 19], {1, 0}) = 1 (in [11], left {1, 0})
write(11, "", 0)                        = 0
write(11, 
"\25\3\1\0\30\334K\217D\n0tYm\333\355\273]e!\230R\207\254\0J\343E\354", 29) = 
-1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
gettimeofday({1326146598, 123721}, NULL) = 0
select(20, [3 5 11], [19], [3 5 11 19], {1, 0}) = 1 (in [11], left {1, 0})
write(11, "", 0)                        = 0
write(11, 
"\25\3\1\0\30\334K\217D\n0tYm\333\355\273]e!\230R\207\254\0J\343E\354", 29) = 
-1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---

And the same lsof output:

python26 1759 user   11u     sock                0,5           59896469 can't 
identify protocol

Original comment by btimby@gmail.com on 9 Jan 2012 at 10:06

GoogleCodeExporter commented 9 years ago

Does this come from send() method and inherits from socket.error?
Please paste the complete traceback message.

Original comment by g.rodola on 9 Jan 2012 at 11:05

GoogleCodeExporter commented 9 years ago

Also, is it just an exception or you are getting into the endless loop again?

Original comment by g.rodola on 9 Jan 2012 at 11:16

GoogleCodeExporter commented 9 years ago

Original comment by g.rodola on 9 Jan 2012 at 11:23

Changed state: Reopened

GoogleCodeExporter commented 9 years ago

I don't have any error reports that occurred before the CPU went to 100%.

Original comment by btimby@gmail.com on 17 Jan 2012 at 2:30

GoogleCodeExporter commented 9 years ago

Ok, it seems we're blind here.
The only thing I can recommend is adding debugging logs to figure out what's 
going on.
Something like the diff below.
Can you do that?

Index: pyftpdlib/contrib/handlers.py
===================================================================
--- pyftpdlib/contrib/handlers.py   (revisione 967)
+++ pyftpdlib/contrib/handlers.py   (copia locale)
@@ -159,17 +159,23 @@
             try:
                 return super(SSLConnection, self).send(data)
             except (SSL.WantReadError, SSL.WantWriteError):
+                log("SSL want read/write")
                 return 0
             except SSL.ZeroReturnError:
+                log("SSL zero return")
                 super(SSLConnection, self).handle_close()
                 return 0
             except SSL.SysCallError, (errnum, errstr):
+                log("SSL sys call error %s" % errstr)
                 if errstr == 'Unexpected EOF' or errnum == errno.EWOULDBLOCK:
+                    log("EOF or EWOULDBLOCK")
                     return 0
                 elif errnum in _DISCONNECTED:
+                    log("disconnected")
                     super(SSLConnection, self).handle_close()
                     return 0
                 else:
+                    log("some other error")
                     raise

Original comment by g.rodola on 17 Jan 2012 at 2:47

GoogleCodeExporter commented 9 years ago

Oh, hang on. I noticed an error in the code. On send(), we were treating EOF 
error as an alias for "retry" rather than "connection lost". 
That looks like a good candidate able to create an infinite loop.
Please try r969.

Original comment by g.rodola on 17 Jan 2012 at 5:43

GoogleCodeExporter commented 9 years ago

Ben can I close this one?

Original comment by g.rodola on 12 May 2012 at 11:44

GoogleCodeExporter commented 9 years ago

Yes. I have not had a crash issue in months.

Original comment by btimby@gmail.com on 14 May 2012 at 1:06

GoogleCodeExporter commented 9 years ago

Original comment by g.rodola on 15 May 2012 at 2:30

Changed state: Fixed

4udak / pyftpdlib

Loop of Death on Broken Pipe. #191