Open josephmturner opened 3 months ago
Thanks, Joseph!
For our future reference, what seems to be happening is that this call to accept-process-output
is sometimes not resulting in the process sentinel being called before the plz
function returns.
When that is so, the "hack" workaround starting in this form calls the sentinel first (currently, with the "finished\n"
value hard-coded), and after the plz
function returns, Emacs finally calls the sentinel with the real status (which, in our testing, happened when curl exited with code 7, a connection failure (as it so happens, an immediate one, since it was trying to connect to a port on localhost that was not being listened on)).
Then, since the sentinel was already called, and the process's buffer was already killed (by plz--respond
, which the sentinel calls), when the sentinel calls plz--respond
again for the same process, the with-current-buffer
form, which calls set-buffer
on the killed buffer, signals an error.
In our testing, this was only reproducible on Joseph's system, and then only 5-10% of the time. So it appears to be a race condition in Emacs's process-handling code.
With respect to plz
, this seems to indicate that the aforementioned hack is incomplete, i.e. plz--sentinel
does not check whether the process's result property has already been set, and it ends up calling plz--respond
a second time.
Also, the hack of calling plz--sentinel
with the hard-coded "finished\n"
value presents a problem if the process did not finish successfully.
All together, the fixes are twofold:
plz--respond
, to check the process-status
and process-exit-status
rather than the status
string (which may be misleading). Since, when the hack is necessary, we don't know what status string Emacs would send to the sentinel, we can't rely on it to know the process's status. Luckily, the combination of process-status
and process-exit-status
seem equivalent to it, at least for our purposes.plz--sentinel
, to check whether the process's :plz-result
property has already been set; and if so, to not call plz--respond
again (because we've already called the sentinel and plz--respond
for that request's process).As well, this should probably be reported in an Emacs bug report (perhaps the one made a few years ago about this problem, which the hack was devised to work around). Unfortunately we still don't have a way to reproduce it consistently, but it seems that we can at least add to the problem description (i.e. that not only may it be that the process sentinel may not be called after the process has exited and all of its output has been accepted, but that it may be called after the top-level Elisp function returns--which may be fine for async requests, but not for synchronous ones (which may explain why we noticed and were able to reproduce the problem specifically with a synchronous request)).
@josephmturner The tests pass! https://github.com/alphapapa/plz.el/actions/runs/9233649184/job/25406336668
FYI, the changes: https://github.com/alphapapa/plz.el/compare/wip/53-sentinel-race-condition-workaround
Thank you!! I'm no longer able to reproduce the issue on my machine in normal hyperdrive.el
usage or when letting Emacs hang for a few seconds with
(while t
(ignore-errors (plz 'get "http://localhost:1337/")))
No error signaled!
The ":PLZ-RESULT ALREADY CHANGED"
warning cropped up three times in that span, each time like so:
⛔ Debug (plz--sentinel [interpreted]): (FLOAT-TIME):1716621664.5090535 ":PLZ-RESULT ALREADY CHANGED" PROCESS:#<process plz-request-curl> STATUS:"exited abnormally with code 7
" (PROCESS-GET...):#s(plz-error (7 . "Failed to connect to host.") nil nil)
Each time, it appears that the :plz-result
process variable is being set to the plz-error
struct.
This is great!! Thank you!!!
That's great news! Really, it's thanks to you for your help debugging it, and your persistence in wanting it fixed, despite the difficulty in reproducing it.
Quick question, to be sure I didn't make a mistake: that warning appeared because you bound warning-minimum-log-level
to :debug
and then re-evaluated the plz.el
file, right?
Also, had you used the (while t (ignore-errors (plz 'get "http://localhost:1337/")))
form to reproduce the problem before?
Quick question, to be sure I didn't make a mistake: that warning appeared because you bound
warning-minimum-log-level
to:debug
and then re-evaluated theplz.el
file, right?
Yes. Please also see https://github.com/alphapapa/plz.el/pull/54. I'm not sure that submitting a PR from my remote's wip/53-sentinel-race-condition-workaround
branch to your remote's wip/53-sentinel-race-condition-workaround
branch was the correct action. Feel free to close it.
Also, had you used the (while t (ignore-errors (plz 'get "http://localhost:1337/"))) form to reproduce the problem before?
No. Previously, I had repeatedly run eval-last-sexp
manually. After applying the workaround, I was not able to trigger the warning by running eval-last-sexp
manually many times, so I pulled out all the stops. ;)
Just now, I evaluted plz.el
at the master branch (without the workaround) and then ran the same while
loop. Emacs hung for ~20s and then became responsive again without my having to press C-g
. The *Messages*
buffer displayed the following message 14 times, where NNN
was a different number each time:
Buffer " *plz-request-curl*-NNN" has a running process; kill it? (y or n) n
Also, had you used the (while t (ignore-errors (plz 'get "http://localhost:1337/"))) form to reproduce the problem before?
No. Previously, I had repeatedly run
eval-last-sexp
manually. After applying the workaround, I was not able to trigger the warning by runningeval-last-sexp
manually many times, so I pulled out all the stops. ;)Just now, I evaluted
plz.el
at the master branch (without the workaround) and then ran the samewhile
loop. Emacs hung for ~20s and then became responsive again without my having to pressC-g
. The*Messages*
buffer displayed the following message 14 times, whereNNN
was a different number each time:Buffer " *plz-request-curl*-NNN" has a running process; kill it? (y or n) n
Ok, so when you run that while loop with this branch's code, does Emacs stay in the loop until you press C-g
? (Not that it necessarily matters, but I am curious to see how it performs; ISTM that it should stay in the loop until manually interrupted.)
Ok, so when you run that while loop with this branch's code, does Emacs stay in the loop until you press
C-g
? (Not that it necessarily matters, but I am curious to see how it performs; ISTM that it should stay in the loop until manually interrupted.)
Yes, I have to manually stop the loop with C-g
.
I'll close this issue now, since the workaround is doing the right thing!
Ok, so when you run that while loop with this branch's code, does Emacs stay in the loop until you press
C-g
? (Not that it necessarily matters, but I am curious to see how it performs; ISTM that it should stay in the loop until manually interrupted.)Yes, I have to manually stop the loop with
C-g
.
That's good news!
Leaving the issue open to remind me to do something about the Emacs bug report. :)
plz-sentinel-race-condition-debug.txt