Clozure / ccl

Clozure Common Lisp
http://ccl.clozure.com
Apache License 2.0
855 stars 103 forks source link

TCP sockets stop working after a few writes, "Bad file descriptor (error #9) during write" #504

Closed SuperDisk closed 4 months ago

SuperDisk commented 4 months ago

I'm messing around with the echo server example from websocket-driver, and I found that it has issues with CCL but not with other implementations.

After writing to a socket a couple times, the connection would inexplicably close-- I removed the handler which suppresses the error, and found that it's producing this:

on #<BASIC-TCP-STREAM ISO-8859-1 (NIL/4) #x302001DAEFFD> : 
Bad file descriptor (error #9) during write

The first few writes to the WebSocket work, and then after about 4 or 5 writes, this happens. The error comes from here in the call to force-output:

https://github.com/Clozure/ccl/blob/4c6c0fc12b44575ac27cfa23172fcf70b5d38ff2/level-1/l1-streams.lisp#L5749

The call is here in websocket-driver: https://github.com/fukamachi/websocket-driver/blob/ff606c1ff373b6b07380978f848cf6b6aa192977/src/ws/client.lisp#L207

Any idea what it could be?

Also, I just wanted to express my appreciation for your hard work maintaining this. I've been daily driving CCL while building a game and it's been quite nice.

xrme commented 4 months ago

What is the output of (lisp-implementation-version) on your system, please? Maybe I could try to duplicate your issue.

The (NIL/4) in the print-object output means that CCL can't tell what kind of thing that file descriptor 4 is, which is not surprising given the EBADF error code. Normally, you'd expect to see something like #<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/7) #x3020005BF03D>.

I'm guessing wildly, but maybe you could trace some relevant close function to see if there's something closing the stream unexpectedly? Is there possibly a race condition where a thread closes the socket while some other thread is still trying to access it?

SuperDisk commented 4 months ago
CL-USER> (lisp-implementation-version)
"Version 1.12.2 (v1.12.2) LinuxX8664"

I ran this code:

 (ql:quickload :websocket-driver-client)

 (progn
   (trace (:package :websocket-driver.ws.client))
   (trace (:package :websocket-driver.ws.base)))

 (defparameter *client* (wsd:make-client "ws://localhost:5000/echo"))
 (wsd:start-connection *client*)

which should trace everything that happens on the websocket side.

The result is here: https://gist.github.com/SuperDisk/cbbab59d3b4966b2561b9acf0cde7b88

I'm not sure if there's any useful information though-- the TCP stream does look like it gets created successfully at first (it shows SOCKET like you said but then becomes NIL). If I don't trace, I can usually get in 5 or 6 writes before it fails, but when tracing, it never works.

The echo-server code is here: https://github.com/fukamachi/websocket-driver/?tab=readme-ov-file#server-side-with-clack

I'm currently running that with SBCL for testing. Please let me know if there's anything else I can do.

SuperDisk commented 4 months ago

I just updated to the latest on master, and still getting the same issue.

CL-USER> (lisp-implementation-version)
"Version 1.12.2 (v1.12.2-83-g4c6c0fc1) LinuxX8664"
SuperDisk commented 4 months ago

All the logic seems to be here, which is the read thread for the socket. I added some format statements just to see what's happening:

      (setf (read-thread client)
            (bt2:make-thread
             (lambda ()
               (unwind-protect
                    (loop until (eq (ready-state client) :closed)
                          do
                             (format t "~a~%" (socket client))
                             (when (listen (socket client))
                               (format t "successfully listened to socket~%")
                               (parse client (read-websocket-frame (socket client))))
                             (sleep 0.01))
                 (close-connection client)))
             :name "websocket client read thread"
             :initial-bindings `((*standard-output* . ,*standard-output*)
                                 (*error-output* . ,*error-output*))))
      client)))

The output is like this, and happens after about 5 seconds. Nothing else is closing the stream or socket I don't think.

#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x30200209C40D>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x30200209C40D>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x30200209C40D>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x30200209C40D>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x30200209C40D>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x30200209C40D>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x30200209C40D>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x30200209C40D>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302001F596AD>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302001F596AD>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302001F596AD>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302001F596AD>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302001F596AD>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302001F596AD>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302001F596AD>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302001F596AD>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302001F596AD>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302001F596AD>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302001F596AD>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302001F596AD>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302001F596AD>
#<BASIC-TCP-STREAM ISO-8859-1 (NIL/4) #x302001F596AD>
successfully listened to socket
#<BASIC-TCP-STREAM ISO-8859-1 (NIL/4) #x302001F596AD>
successfully listened to socket
#<BASIC-TCP-STREAM ISO-8859-1 (NIL/4) #x302001F596AD>
successfully listened to socket
xrme commented 4 months ago

I'm trying to duplicate your issue, but it seems to be working for me.

I've tried it on my Mac, and on a virtual machine running Ubuntu 22.04.4.

My steps:

Linux Install sbcl 2.4.6 binary distribution apt install libuv1-dev apt install libssl-dev install quicklisp

create file issue-504.lisp containing the following:

(in-package :cl-user)

(ql:quickload '(:websocket-driver-server :clack))

(use-package :websocket-driver)

(defvar *echo-server*
  (lambda (env)
    (let ((ws (make-server env)))
      (on :message ws
          (lambda (message)
            (send ws message)))
      (lambda (responder)
        (declare (ignore responder))
        (start-connection ws)))))

;; Start Wookie server
(clack:clackup *echo-server* :server :wookie :port 5000)

Start server as follows:

rme@jammy:~$ sbcl
This is SBCL 2.4.6, an implementation of ANSI Common Lisp.
More information about SBCL is available at <http://www.sbcl.org/>.

SBCL is free software, provided as is, with absolutely no warranty.
It is mostly in the public domain; some portions are provided under
BSD-style licenses.  See the CREDITS and COPYING files in the
distribution for more information.
* (load "~/quicklisp/setup")
T
* (load "issue-504")
To load "websocket-driver-server":
  Load 1 ASDF system:
    websocket-driver-server
; Loading "websocket-driver-server"
............
To load "clack":
  Load 1 ASDF system:
    clack
; Loading "clack"
.
Wookie server is started.
Listening on 127.0.0.1:5000.
T
* 

As for the client side, I install the latest CCL, and do the following:

rme@jammy:~$ ccl
Clozure Common Lisp Version 1.12.2 (v1.12.2-83-g4c6c0fc1) LinuxX8664

For more information about CCL, please see https://ccl.clozure.com.

CCL is free software.
It is distributed under the terms of the Apache License, Version 2.0.
? (load "home:quicklisp;setup")
? (ql:quickload :websocket-driver-client)
[...]
? (defvar *client* (wsd:make-client "ws://localhost:5000/echo"))
*CLIENT*
? (wsd:start-connection *client*)
#<WEBSOCKET-DRIVER.WS.CLIENT:CLIENT #x302001A81FCD>
? (wsd:on :message *client*
        (lambda (message)
          (format t "~&Got: ~A~%" message)))
#(#S(EVENT-EMITTER::LISTENER :FUNCTION #<Anonymous Function #x302001AAFE7F> :ONCE NIL))
? (wsd:send *client* "Hi")
NIL
? 
Got: Hi
(wsd:send *client* "Dark is the water of Kheled-zâram, and cold are the waters of Kibil-nâla, and fair were the many-pillared halls of Khazad-dum in elder days before the fall of mighty kings beneath the stone")
NIL
? 
Got: Dark is the water of Kheled-zâram, and cold are the waters of Kibil-nâla, and fair were the many-pillared halls of Khazad-dum in elder days before the fall of mighty kings beneath the stone
(loop repeat 10 do (wsd:send *client* "There are older and fouler things than Orcs in the deep places of the world."))
NIL
? 
Got: There are older and fouler things than Orcs in the deep places of the world.
Got: There are older and fouler things than Orcs in the deep places of the world.
Got: There are older and fouler things than Orcs in the deep places of the world.
Got: There are older and fouler things than Orcs in the deep places of the world.
Got: There are older and fouler things than Orcs in the deep places of the world.
Got: There are older and fouler things than Orcs in the deep places of the world.
Got: There are older and fouler things than Orcs in the deep places of the world.
Got: There are older and fouler things than Orcs in the deep places of the world.
Got: There are older and fouler things than Orcs in the deep places of the world.
Got: There are older and fouler things than Orcs in the deep places of the world.
xrme commented 4 months ago

Update: I just saw something like your reported issue.

To start out, I traced ccl::fd-close. Then:

? (wsd:close-connection *client*)
NIL
? (setq *client* (wsd:make-client "ws://localhost:5000/echo"))
#<WEBSOCKET-DRIVER.WS.CLIENT:CLIENT #x302001B5EC3D>
? (wsd:start-connection *client*)
0> Calling (CCL::FD-CLOSE 4) 
<0 CCL::FD-CLOSE returned 0
0> Calling (CCL::FD-CLOSE 8) 
<0 CCL::FD-CLOSE returned 0
#<WEBSOCKET-DRIVER.WS.CLIENT:CLIENT #x302001A43A4D>
? 
0> Calling (CCL::FD-CLOSE 4) 
<0 CCL::FD-CLOSE returned 0
(wsd:send *client* "hi")
0> Calling (CCL::FD-CLOSE 4) 
<0 CCL::FD-CLOSE returned -9
NIL
? 

So, here we see that right after the call to wsd:start-connection returned, something closed file descriptor 4. When we later try to close file descriptor 4 again, we get EBADF.

Let's try to get a little more context:

? (untrace)
(CCL::FD-CLOSE)
? (trace :backtrace t ccl::fd-close)

This will show a backtrace on entry to the traced function. Now, if I close and restart the client, I see this:

? (wsd:close-connection *client*)
NIL
? (setq *client* (wsd:make-client "ws://localhost:5000/echo"))
#<WEBSOCKET-DRIVER.WS.CLIENT:CLIENT #x302001C9EC3D>
? (wsd:start-connection *client*)
[...]
#<WEBSOCKET-DRIVER.WS.CLIENT:CLIENT #x302001A3C98D>
? 
0> Calling (CCL::FD-CLOSE 4) 
 (7F25DE38DB50) : 0 (PRINT-CALL-HISTORY :CONTEXT NIL :PROCESS NIL :ORIGIN NIL :DETAILED-P NIL :COUNT 1152921504606846975 :START-FRAME-NUMBER 0 :STREAM #<SYNONYM-STREAM to *TERMINAL-IO* #x3020004424BD> :PRINT-LEVEL 2 :PRINT-LENGTH 5 :PRINT-STRING-LENGTH :DEFAULT :SHOW-INTERNAL-FRAMES NIL :FORMAT :TRADITIONAL) 965
 (7F25DE38DCE8) : 1 (FUNCALL #'#<(CCL::TRACED CCL::FD-CLOSE)> 4) 1269
 (7F25DE38DD60) : 2 (%%IOBLOCK-CLOSE #S(CCL::IOBLOCK :STREAM #<BASIC-TCP-STREAM :CLOSED #x302001A4987D> :UNTYI-CHAR NIL ...)) 253
 (7F25DE38DD88) : 3 (%%BEFORE-AND-AFTER-COMBINED-METHOD-DCODE (NIL #<CCL::STANDARD-KERNEL-METHOD CLOSE :AFTER #> . 17475077348298)) 1261
 (7F25DE38DDF8) : 4 (%%STANDARD-COMBINED-METHOD-DCODE ((#<#>) (#<#>) #<CCL::STANDARD-KERNEL-METHOD CLOSE #>) 17475077348298) 261
 (7F25DE38DE70) : 5 (%%CHECK-KEYWORDS #(1 #(:ABORT) #<Combined-Method CLOSE #x30200044706F>) 17475077348311) 261
 (7F25DE38DED8) : 6 (DRAIN-TERMINATION-QUEUE) 517
 (7F25DE38DF48) : 7 (FUNCALL #'#<(:INTERNAL CCL::ADD-GC-HOOK)>) 85
 (7F25DE38DF68) : 8 (HOUSEKEEPING) 37
 (7F25DE38DF78) : 9 (HOUSEKEEPING-LOOP) 517
 (7F25DE38DFB8) : 10 (FUNCALL #'#<(:INTERNAL (TOPLEVEL-FUNCTION (CCL::LISP-DEVELOPMENT-SYSTEM T)))>) 269
<0 CCL::FD-CLOSE returned 0

Notice the call to drain-termination-queue. By default, sockets in CCL are created with :auto-close t. This means that when the GC can prove that the steam is unreachable, it will be closed automatically via a termination function. This is telling me that CCL thinks that the stream associated with file descriptor 4 has become unreachable more or less immediately after calling wsd:start-connection.

xrme commented 4 months ago

By the way, you can invoke the gc manually to trigger the automatic draining of the termination queue. So, after doing (wsd:start-connection *client*), you can type (gc), and this will probably cause fd-close to be called on the apparently now-unreachable stream.

When tracing is involved, there's probably enough extra consing to cause the gc to be invoked often. If tracing is off, then you probably have enough headroom to do the 4 or 5 writes you talk about before things start failing.

xrme commented 4 months ago

OK, one last thing today...

If you go to the usocket sources (e.g., .../quicklisp/dists/quicklisp/software/usocket-0.8.6) and edit backend/clozure.lisp to add :auto-close nil to the openmcl-socket:make-socket call, I wonder if that will solve the problem.

https://github.com/usocket/usocket/blob/deed84252d0eb3d200e06a67015e63813eca4dec/backend/clozure.lisp#L37-L51

diff --git a/backend/clozure.lisp b/backend/clozure.lisp
index 006f563..ae8b5c6 100644
--- a/backend/clozure.lisp
+++ b/backend/clozure.lisp
@@ -46,6 +46,7 @@
                                       :format ,(to-format element-type protocol)
                                       :external-format ,ccl:*default-external-format*
                                       :deadline ,deadline
+                                      :auto-close nil
                                       :nodelay ,nodelay
                                       :connect-timeout ,timeout
                                       :input-timeout ,timeout))))

I still don't understand what's causing CCL to think that the fd-stream in question has become unreachable, but I'm reluctant to point the finger at CCL's gc (or the termination mechanism) just yet.

SuperDisk commented 4 months ago

Ah, great investigation and thank you for the tracing tips! That's pretty curious and the GC certainly explains the tracing/not tracing behavior.

metayan commented 4 months ago

If you go to the usocket sources (e.g., .../quicklisp/dists/quicklisp/software/usocket-0.8.6) and edit backend/clozure.lisp to add :auto-close nil to the openmcl-socket:make-socket call, I wonder if that will solve the problem.

https://github.com/usocket/usocket/blob/deed84252d0eb3d200e06a67015e63813eca4dec/backend/clozure.lisp#L37-L51

diff --git a/backend/clozure.lisp b/backend/clozure.lisp
index 006f563..ae8b5c6 100644
--- a/backend/clozure.lisp
+++ b/backend/clozure.lisp
@@ -46,6 +46,7 @@
                                       :format ,(to-format element-type protocol)
                                       :external-format ,ccl:*default-external-format*
                                       :deadline ,deadline
+                                      :auto-close nil
                                       :nodelay ,nodelay
                                       :connect-timeout ,timeout
                                       :input-timeout ,timeout))))

Tested the issue and found it to be present on both macOS and Linux. (x8664) Can confirm that the propsed solution works on both systems.

Many thanks to @SuperDisk for finding and @xrme for fixing.

xrme commented 4 months ago

I can see how it would be reasonable for usocket to use :auto-close nil since that might better match other implementations.

While it's certainly possible for CCL to have bugs (I see them all the time), I think it is more likely in this case that there is something in the application software code that is causing this behavior. I'm willing to be proved wrong, though.

SuperDisk commented 4 months ago

FWIW, it seems that the issue isn't specific to websocket-driver. I made a very simple "echo server" and connected to it directly with usocket and it's exhibiting the same problem:

CL-USER> 
(ql:quickload :usocket)
To load "usocket":
  Load 1 ASDF system:
    usocket
; Loading "usocket"

(:USOCKET)
CL-USER> 
(defvar stream (usocket:socket-stream (usocket:socket-connect "localhost" 12345)))
STREAM
CL-USER> stream
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x30200105D6CD>
CL-USER> (gc)
NIL
CL-USER> stream
#<BASIC-TCP-STREAM ISO-8859-1 (NIL/4) #x302000DC6E8D>
CL-USER> 
SuperDisk commented 4 months ago

Interestingly, when I use CCL's socket API directly, it doesn't have the problem. So it must be something in usocket? Just by a quick glance around the usocket code though, I don't see anything out of the ordinary. This function is basically the only thing that gets called, and it just opens a socket and puts it in a thin wrapper object.... so I'm a bit stumped.

CL-USER> (defparameter sock (make-socket :remote-host "localhost" :remote-port 12345))
SOCK
CL-USER> sock
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302000DFE8ED>
CL-USER> (gc)
NIL
CL-USER> sock
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302000D0365D>
CL-USER> (format sock "yeah")(force-output sock)
NIL
CL-USER> sock
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302000D0365D>
CL-USER> (gc)
NIL
CL-USER> sock
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302000CFA73D>
CL-USER>
SuperDisk commented 4 months ago

OK, I think I've tracked down what causes the bug.

usocket's socket-connect-internal function tries first to open an IPv6 socket. If this fails, ccl:socket-creation-error is raised, which usocket handles, then tries opening an IPv4 socket, which succeeds.

The problem happens when the failing IPv6 make-socket call is allowed to happen. When it does, the successfully created socket will be have the same fd as the failed one, and when the old stream is GC'd, the fd is closed. If you take out the :internet6 attempt by modifying this line:

:for address-family :in '(:internet6 :internet)

to

:for address-family :in '(:internet)

everything works fine. Just to verify even further, I tried an alternative method, where I modified the socket creation to be this:

(if (eq address-family :internet6)
  (error 'ccl:socket-creation-error)
  (apply #'openmcl-socket:make-socket
    ...

and it fixes the problem.

So I think this must be a CCL bug. I don't know any implementation details so this is pure speculation, but I'm thinking maybe the failed stream gets put in the drain termination queue you mentioned earlier, then the (doomed because it has the same fd as the old one) successful stream is created.

xrme commented 4 months ago

So some problem in the case where ccl:socket-creation-error is signaled, maybe. I'll investigate.

xrme commented 4 months ago

CCL is creating a stream before it calls connect(2) on the socket file descriptor. When connect(2) fails, ccl::make-tcp-socket closes the file descriptor, but the the file descriptor (which is just a fixnum) remains in the stream's data structures.

Suppose we call socket(2), and get file descriptor 4. We make a ccl::fd-stream with that file descriptor. Now, we try to call connect on file descriptor 4. If the connect fails, the fd is now useless and must be closed, which we do in ccl::make-tcp-socket. However, we've left that dangling file descriptor 4 in the fd-stream.

We try again, calling socket(2), and getting file descriptor 4. (We just closed it, remember, so the system gives us the next available file descriptor, which is again number 4). This time, assume connect(2) works. We now have a working connection.

However, the stream we created before the failing call to connect(2) is still around, and it still contains a dangling reference to file descriptor 4. Because :auto-close is t by default on socket streams, when that earlier stream becomes unreachable, the GC will close that first stream. So, file descriptor 4 gets closed. But that pulls the rug out from under the second stream from the time when connect(2) worked, because it is using file descriptor 4.

Would you like to review/try the change in #505? If that change works, I think that usocket can be left alone (i.e., it wouldn't need to pass :auto-close nil to ccl:make-socket).

xrme commented 4 months ago

@SuperDisk, I like the fix in #505 enough that I merged it. If you can update your ccl and try it, I'll be interested to hear if this fixes the issue for you.

@metayan, it's now not necessary to change usocket to use :auto-close nil to fix this issue. I think I'd prefer it to leave :auto-close to its default value.

SuperDisk commented 4 months ago

Just updated to latest and it works. I really appreciate investigation and work put into this, thanks.