Closed SuperDisk closed 4 months ago
What is the output of (lisp-implementation-version)
on your system, please? Maybe I could try to duplicate your issue.
The (NIL/4)
in the print-object output means that CCL can't tell what kind of thing that file descriptor 4 is, which is not surprising given the EBADF
error code. Normally, you'd expect to see something like #<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/7) #x3020005BF03D>
.
I'm guessing wildly, but maybe you could trace
some relevant close function to see if there's something closing the stream unexpectedly? Is there possibly a race condition where a thread closes the socket while some other thread is still trying to access it?
CL-USER> (lisp-implementation-version)
"Version 1.12.2 (v1.12.2) LinuxX8664"
I ran this code:
(ql:quickload :websocket-driver-client)
(progn
(trace (:package :websocket-driver.ws.client))
(trace (:package :websocket-driver.ws.base)))
(defparameter *client* (wsd:make-client "ws://localhost:5000/echo"))
(wsd:start-connection *client*)
which should trace everything that happens on the websocket side.
The result is here: https://gist.github.com/SuperDisk/cbbab59d3b4966b2561b9acf0cde7b88
I'm not sure if there's any useful information though-- the TCP stream does look like it gets created successfully at first (it shows SOCKET
like you said but then becomes NIL
). If I don't trace, I can usually get in 5 or 6 writes before it fails, but when tracing, it never works.
The echo-server code is here: https://github.com/fukamachi/websocket-driver/?tab=readme-ov-file#server-side-with-clack
I'm currently running that with SBCL for testing. Please let me know if there's anything else I can do.
I just updated to the latest on master, and still getting the same issue.
CL-USER> (lisp-implementation-version)
"Version 1.12.2 (v1.12.2-83-g4c6c0fc1) LinuxX8664"
All the logic seems to be here, which is the read thread for the socket. I added some format
statements just to see what's happening:
(setf (read-thread client)
(bt2:make-thread
(lambda ()
(unwind-protect
(loop until (eq (ready-state client) :closed)
do
(format t "~a~%" (socket client))
(when (listen (socket client))
(format t "successfully listened to socket~%")
(parse client (read-websocket-frame (socket client))))
(sleep 0.01))
(close-connection client)))
:name "websocket client read thread"
:initial-bindings `((*standard-output* . ,*standard-output*)
(*error-output* . ,*error-output*))))
client)))
The output is like this, and happens after about 5 seconds. Nothing else is closing the stream or socket I don't think.
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x30200209C40D>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x30200209C40D>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x30200209C40D>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x30200209C40D>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x30200209C40D>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x30200209C40D>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x30200209C40D>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x30200209C40D>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302001F596AD>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302001F596AD>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302001F596AD>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302001F596AD>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302001F596AD>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302001F596AD>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302001F596AD>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302001F596AD>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302001F596AD>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302001F596AD>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302001F596AD>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302001F596AD>
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302001F596AD>
#<BASIC-TCP-STREAM ISO-8859-1 (NIL/4) #x302001F596AD>
successfully listened to socket
#<BASIC-TCP-STREAM ISO-8859-1 (NIL/4) #x302001F596AD>
successfully listened to socket
#<BASIC-TCP-STREAM ISO-8859-1 (NIL/4) #x302001F596AD>
successfully listened to socket
I'm trying to duplicate your issue, but it seems to be working for me.
I've tried it on my Mac, and on a virtual machine running Ubuntu 22.04.4.
My steps:
Linux Install sbcl 2.4.6 binary distribution apt install libuv1-dev apt install libssl-dev install quicklisp
create file issue-504.lisp containing the following:
(in-package :cl-user)
(ql:quickload '(:websocket-driver-server :clack))
(use-package :websocket-driver)
(defvar *echo-server*
(lambda (env)
(let ((ws (make-server env)))
(on :message ws
(lambda (message)
(send ws message)))
(lambda (responder)
(declare (ignore responder))
(start-connection ws)))))
;; Start Wookie server
(clack:clackup *echo-server* :server :wookie :port 5000)
Start server as follows:
rme@jammy:~$ sbcl
This is SBCL 2.4.6, an implementation of ANSI Common Lisp.
More information about SBCL is available at <http://www.sbcl.org/>.
SBCL is free software, provided as is, with absolutely no warranty.
It is mostly in the public domain; some portions are provided under
BSD-style licenses. See the CREDITS and COPYING files in the
distribution for more information.
* (load "~/quicklisp/setup")
T
* (load "issue-504")
To load "websocket-driver-server":
Load 1 ASDF system:
websocket-driver-server
; Loading "websocket-driver-server"
............
To load "clack":
Load 1 ASDF system:
clack
; Loading "clack"
.
Wookie server is started.
Listening on 127.0.0.1:5000.
T
*
As for the client side, I install the latest CCL, and do the following:
rme@jammy:~$ ccl
Clozure Common Lisp Version 1.12.2 (v1.12.2-83-g4c6c0fc1) LinuxX8664
For more information about CCL, please see https://ccl.clozure.com.
CCL is free software.
It is distributed under the terms of the Apache License, Version 2.0.
? (load "home:quicklisp;setup")
? (ql:quickload :websocket-driver-client)
[...]
? (defvar *client* (wsd:make-client "ws://localhost:5000/echo"))
*CLIENT*
? (wsd:start-connection *client*)
#<WEBSOCKET-DRIVER.WS.CLIENT:CLIENT #x302001A81FCD>
? (wsd:on :message *client*
(lambda (message)
(format t "~&Got: ~A~%" message)))
#(#S(EVENT-EMITTER::LISTENER :FUNCTION #<Anonymous Function #x302001AAFE7F> :ONCE NIL))
? (wsd:send *client* "Hi")
NIL
?
Got: Hi
(wsd:send *client* "Dark is the water of Kheled-zâram, and cold are the waters of Kibil-nâla, and fair were the many-pillared halls of Khazad-dum in elder days before the fall of mighty kings beneath the stone")
NIL
?
Got: Dark is the water of Kheled-zâram, and cold are the waters of Kibil-nâla, and fair were the many-pillared halls of Khazad-dum in elder days before the fall of mighty kings beneath the stone
(loop repeat 10 do (wsd:send *client* "There are older and fouler things than Orcs in the deep places of the world."))
NIL
?
Got: There are older and fouler things than Orcs in the deep places of the world.
Got: There are older and fouler things than Orcs in the deep places of the world.
Got: There are older and fouler things than Orcs in the deep places of the world.
Got: There are older and fouler things than Orcs in the deep places of the world.
Got: There are older and fouler things than Orcs in the deep places of the world.
Got: There are older and fouler things than Orcs in the deep places of the world.
Got: There are older and fouler things than Orcs in the deep places of the world.
Got: There are older and fouler things than Orcs in the deep places of the world.
Got: There are older and fouler things than Orcs in the deep places of the world.
Got: There are older and fouler things than Orcs in the deep places of the world.
Update: I just saw something like your reported issue.
To start out, I traced ccl::fd-close
. Then:
? (wsd:close-connection *client*)
NIL
? (setq *client* (wsd:make-client "ws://localhost:5000/echo"))
#<WEBSOCKET-DRIVER.WS.CLIENT:CLIENT #x302001B5EC3D>
? (wsd:start-connection *client*)
0> Calling (CCL::FD-CLOSE 4)
<0 CCL::FD-CLOSE returned 0
0> Calling (CCL::FD-CLOSE 8)
<0 CCL::FD-CLOSE returned 0
#<WEBSOCKET-DRIVER.WS.CLIENT:CLIENT #x302001A43A4D>
?
0> Calling (CCL::FD-CLOSE 4)
<0 CCL::FD-CLOSE returned 0
(wsd:send *client* "hi")
0> Calling (CCL::FD-CLOSE 4)
<0 CCL::FD-CLOSE returned -9
NIL
?
So, here we see that right after the call to wsd:start-connection
returned, something closed file descriptor 4. When we later try to close file descriptor 4 again, we get EBADF
.
Let's try to get a little more context:
? (untrace)
(CCL::FD-CLOSE)
? (trace :backtrace t ccl::fd-close)
This will show a backtrace on entry to the traced function. Now, if I close and restart the client, I see this:
? (wsd:close-connection *client*)
NIL
? (setq *client* (wsd:make-client "ws://localhost:5000/echo"))
#<WEBSOCKET-DRIVER.WS.CLIENT:CLIENT #x302001C9EC3D>
? (wsd:start-connection *client*)
[...]
#<WEBSOCKET-DRIVER.WS.CLIENT:CLIENT #x302001A3C98D>
?
0> Calling (CCL::FD-CLOSE 4)
(7F25DE38DB50) : 0 (PRINT-CALL-HISTORY :CONTEXT NIL :PROCESS NIL :ORIGIN NIL :DETAILED-P NIL :COUNT 1152921504606846975 :START-FRAME-NUMBER 0 :STREAM #<SYNONYM-STREAM to *TERMINAL-IO* #x3020004424BD> :PRINT-LEVEL 2 :PRINT-LENGTH 5 :PRINT-STRING-LENGTH :DEFAULT :SHOW-INTERNAL-FRAMES NIL :FORMAT :TRADITIONAL) 965
(7F25DE38DCE8) : 1 (FUNCALL #'#<(CCL::TRACED CCL::FD-CLOSE)> 4) 1269
(7F25DE38DD60) : 2 (%%IOBLOCK-CLOSE #S(CCL::IOBLOCK :STREAM #<BASIC-TCP-STREAM :CLOSED #x302001A4987D> :UNTYI-CHAR NIL ...)) 253
(7F25DE38DD88) : 3 (%%BEFORE-AND-AFTER-COMBINED-METHOD-DCODE (NIL #<CCL::STANDARD-KERNEL-METHOD CLOSE :AFTER #> . 17475077348298)) 1261
(7F25DE38DDF8) : 4 (%%STANDARD-COMBINED-METHOD-DCODE ((#<#>) (#<#>) #<CCL::STANDARD-KERNEL-METHOD CLOSE #>) 17475077348298) 261
(7F25DE38DE70) : 5 (%%CHECK-KEYWORDS #(1 #(:ABORT) #<Combined-Method CLOSE #x30200044706F>) 17475077348311) 261
(7F25DE38DED8) : 6 (DRAIN-TERMINATION-QUEUE) 517
(7F25DE38DF48) : 7 (FUNCALL #'#<(:INTERNAL CCL::ADD-GC-HOOK)>) 85
(7F25DE38DF68) : 8 (HOUSEKEEPING) 37
(7F25DE38DF78) : 9 (HOUSEKEEPING-LOOP) 517
(7F25DE38DFB8) : 10 (FUNCALL #'#<(:INTERNAL (TOPLEVEL-FUNCTION (CCL::LISP-DEVELOPMENT-SYSTEM T)))>) 269
<0 CCL::FD-CLOSE returned 0
Notice the call to drain-termination-queue
. By default, sockets in CCL are created with :auto-close t
. This means that when the GC can prove that the steam is unreachable, it will be closed automatically via a termination function. This is telling me that CCL thinks that the stream associated with file descriptor 4 has become unreachable more or less immediately after calling wsd:start-connection
.
By the way, you can invoke the gc manually to trigger the automatic draining of the termination queue. So, after doing (wsd:start-connection *client*)
, you can type (gc)
, and this will probably cause fd-close
to be called on the apparently now-unreachable stream.
When tracing is involved, there's probably enough extra consing to cause the gc to be invoked often. If tracing is off, then you probably have enough headroom to do the 4 or 5 writes you talk about before things start failing.
OK, one last thing today...
If you go to the usocket sources (e.g., .../quicklisp/dists/quicklisp/software/usocket-0.8.6) and edit backend/clozure.lisp to add :auto-close nil
to the openmcl-socket:make-socket call, I wonder if that will solve the problem.
diff --git a/backend/clozure.lisp b/backend/clozure.lisp
index 006f563..ae8b5c6 100644
--- a/backend/clozure.lisp
+++ b/backend/clozure.lisp
@@ -46,6 +46,7 @@
:format ,(to-format element-type protocol)
:external-format ,ccl:*default-external-format*
:deadline ,deadline
+ :auto-close nil
:nodelay ,nodelay
:connect-timeout ,timeout
:input-timeout ,timeout))))
I still don't understand what's causing CCL to think that the fd-stream in question has become unreachable, but I'm reluctant to point the finger at CCL's gc (or the termination mechanism) just yet.
Ah, great investigation and thank you for the tracing tips! That's pretty curious and the GC certainly explains the tracing/not tracing behavior.
If you go to the usocket sources (e.g., .../quicklisp/dists/quicklisp/software/usocket-0.8.6) and edit backend/clozure.lisp to add
:auto-close nil
to the openmcl-socket:make-socket call, I wonder if that will solve the problem.diff --git a/backend/clozure.lisp b/backend/clozure.lisp index 006f563..ae8b5c6 100644 --- a/backend/clozure.lisp +++ b/backend/clozure.lisp @@ -46,6 +46,7 @@ :format ,(to-format element-type protocol) :external-format ,ccl:*default-external-format* :deadline ,deadline + :auto-close nil :nodelay ,nodelay :connect-timeout ,timeout :input-timeout ,timeout))))
Tested the issue and found it to be present on both macOS and Linux. (x8664) Can confirm that the propsed solution works on both systems.
Many thanks to @SuperDisk for finding and @xrme for fixing.
I can see how it would be reasonable for usocket to use :auto-close nil since that might better match other implementations.
While it's certainly possible for CCL to have bugs (I see them all the time), I think it is more likely in this case that there is something in the application software code that is causing this behavior. I'm willing to be proved wrong, though.
FWIW, it seems that the issue isn't specific to websocket-driver
. I made a very simple "echo server" and connected to it directly with usocket
and it's exhibiting the same problem:
CL-USER>
(ql:quickload :usocket)
To load "usocket":
Load 1 ASDF system:
usocket
; Loading "usocket"
(:USOCKET)
CL-USER>
(defvar stream (usocket:socket-stream (usocket:socket-connect "localhost" 12345)))
STREAM
CL-USER> stream
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x30200105D6CD>
CL-USER> (gc)
NIL
CL-USER> stream
#<BASIC-TCP-STREAM ISO-8859-1 (NIL/4) #x302000DC6E8D>
CL-USER>
Interestingly, when I use CCL's socket API directly, it doesn't have the problem. So it must be something in usocket
? Just by a quick glance around the usocket
code though, I don't see anything out of the ordinary. This function is basically the only thing that gets called, and it just opens a socket and puts it in a thin wrapper object.... so I'm a bit stumped.
CL-USER> (defparameter sock (make-socket :remote-host "localhost" :remote-port 12345))
SOCK
CL-USER> sock
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302000DFE8ED>
CL-USER> (gc)
NIL
CL-USER> sock
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302000D0365D>
CL-USER> (format sock "yeah")(force-output sock)
NIL
CL-USER> sock
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302000D0365D>
CL-USER> (gc)
NIL
CL-USER> sock
#<BASIC-TCP-STREAM ISO-8859-1 (SOCKET/4) #x302000CFA73D>
CL-USER>
OK, I think I've tracked down what causes the bug.
usocket
's socket-connect-internal
function tries first to open an IPv6 socket. If this fails, ccl:socket-creation-error
is raised, which usocket
handles, then tries opening an IPv4 socket, which succeeds.
The problem happens when the failing IPv6 make-socket
call is allowed to happen. When it does, the successfully created socket will be have the same fd as the failed one, and when the old stream is GC'd, the fd is closed. If you take out the :internet6
attempt by modifying this line:
:for address-family :in '(:internet6 :internet)
to
:for address-family :in '(:internet)
everything works fine. Just to verify even further, I tried an alternative method, where I modified the socket creation to be this:
(if (eq address-family :internet6)
(error 'ccl:socket-creation-error)
(apply #'openmcl-socket:make-socket
...
and it fixes the problem.
So I think this must be a CCL bug. I don't know any implementation details so this is pure speculation, but I'm thinking maybe the failed stream gets put in the drain termination queue you mentioned earlier, then the (doomed because it has the same fd as the old one) successful stream is created.
So some problem in the case where ccl:socket-creation-error
is signaled, maybe. I'll investigate.
CCL is creating a stream before it calls connect(2) on the socket file descriptor. When connect(2) fails, ccl::make-tcp-socket
closes the file descriptor, but the the file descriptor (which is just a fixnum) remains in the stream's data structures.
Suppose we call socket(2), and get file descriptor 4. We make a ccl::fd-stream with that file descriptor. Now, we try to call connect on file descriptor 4. If the connect fails, the fd is now useless and must be closed, which we do in ccl::make-tcp-socket. However, we've left that dangling file descriptor 4 in the fd-stream.
We try again, calling socket(2), and getting file descriptor 4. (We just closed it, remember, so the system gives us the next available file descriptor, which is again number 4). This time, assume connect(2) works. We now have a working connection.
However, the stream we created before the failing call to connect(2) is still around, and it still contains a dangling reference to file descriptor 4. Because :auto-close
is t
by default on socket streams, when that earlier stream becomes unreachable, the GC will close that first stream. So, file descriptor 4 gets closed. But that pulls the rug out from under the second stream from the time when connect(2) worked, because it is using file descriptor 4.
Would you like to review/try the change in #505? If that change works, I think that usocket can be left alone (i.e., it wouldn't need to pass :auto-close nil
to ccl:make-socket
).
@SuperDisk, I like the fix in #505 enough that I merged it. If you can update your ccl and try it, I'll be interested to hear if this fixes the issue for you.
@metayan, it's now not necessary to change usocket to use :auto-close nil
to fix this issue. I think I'd prefer it to leave :auto-close
to its default value.
Just updated to latest and it works. I really appreciate investigation and work put into this, thanks.
I'm messing around with the echo server example from
websocket-driver
, and I found that it has issues with CCL but not with other implementations.After writing to a socket a couple times, the connection would inexplicably close-- I removed the handler which suppresses the error, and found that it's producing this:
The first few writes to the WebSocket work, and then after about 4 or 5 writes, this happens. The error comes from here in the call to
force-output
:https://github.com/Clozure/ccl/blob/4c6c0fc12b44575ac27cfa23172fcf70b5d38ff2/level-1/l1-streams.lisp#L5749
The call is here in
websocket-driver
: https://github.com/fukamachi/websocket-driver/blob/ff606c1ff373b6b07380978f848cf6b6aa192977/src/ws/client.lisp#L207Any idea what it could be?
Also, I just wanted to express my appreciation for your hard work maintaining this. I've been daily driving CCL while building a game and it's been quite nice.