alphapapa / plz.el

An HTTP library for Emacs
GNU General Public License v3.0
195 stars 10 forks source link

Workaround for sentinel-related race condition #53

Open josephmturner opened 3 months ago

josephmturner commented 3 months ago

plz-sentinel-race-condition-debug.txt

alphapapa commented 3 months ago

Thanks, Joseph!

For our future reference, what seems to be happening is that this call to accept-process-output is sometimes not resulting in the process sentinel being called before the plz function returns.

When that is so, the "hack" workaround starting in this form calls the sentinel first (currently, with the "finished\n" value hard-coded), and after the plz function returns, Emacs finally calls the sentinel with the real status (which, in our testing, happened when curl exited with code 7, a connection failure (as it so happens, an immediate one, since it was trying to connect to a port on localhost that was not being listened on)).

Then, since the sentinel was already called, and the process's buffer was already killed (by plz--respond, which the sentinel calls), when the sentinel calls plz--respond again for the same process, the with-current-buffer form, which calls set-buffer on the killed buffer, signals an error.

In our testing, this was only reproducible on Joseph's system, and then only 5-10% of the time. So it appears to be a race condition in Emacs's process-handling code.

With respect to plz, this seems to indicate that the aforementioned hack is incomplete, i.e. plz--sentinel does not check whether the process's result property has already been set, and it ends up calling plz--respond a second time.

Also, the hack of calling plz--sentinel with the hard-coded "finished\n" value presents a problem if the process did not finish successfully.

All together, the fixes are twofold:

  1. In plz--respond, to check the process-status and process-exit-status rather than the status string (which may be misleading). Since, when the hack is necessary, we don't know what status string Emacs would send to the sentinel, we can't rely on it to know the process's status. Luckily, the combination of process-status and process-exit-status seem equivalent to it, at least for our purposes.
  2. In plz--sentinel, to check whether the process's :plz-result property has already been set; and if so, to not call plz--respond again (because we've already called the sentinel and plz--respond for that request's process).

As well, this should probably be reported in an Emacs bug report (perhaps the one made a few years ago about this problem, which the hack was devised to work around). Unfortunately we still don't have a way to reproduce it consistently, but it seems that we can at least add to the problem description (i.e. that not only may it be that the process sentinel may not be called after the process has exited and all of its output has been accepted, but that it may be called after the top-level Elisp function returns--which may be fine for async requests, but not for synchronous ones (which may explain why we noticed and were able to reproduce the problem specifically with a synchronous request)).

alphapapa commented 3 months ago

@josephmturner The tests pass! https://github.com/alphapapa/plz.el/actions/runs/9233649184/job/25406336668

FYI, the changes: https://github.com/alphapapa/plz.el/compare/wip/53-sentinel-race-condition-workaround

josephmturner commented 3 months ago

Thank you!! I'm no longer able to reproduce the issue on my machine in normal hyperdrive.el usage or when letting Emacs hang for a few seconds with

(while t
  (ignore-errors (plz 'get "http://localhost:1337/")))

No error signaled!

The ":PLZ-RESULT ALREADY CHANGED" warning cropped up three times in that span, each time like so:

⛔ Debug (plz--sentinel [interpreted]): (FLOAT-TIME):1716621664.5090535 ":PLZ-RESULT ALREADY CHANGED" PROCESS:#<process plz-request-curl> STATUS:"exited abnormally with code 7
" (PROCESS-GET...):#s(plz-error (7 . "Failed to connect to host.") nil nil) 

Each time, it appears that the :plz-result process variable is being set to the plz-error struct.

This is great!! Thank you!!!

alphapapa commented 3 months ago

That's great news! Really, it's thanks to you for your help debugging it, and your persistence in wanting it fixed, despite the difficulty in reproducing it.

Quick question, to be sure I didn't make a mistake: that warning appeared because you bound warning-minimum-log-level to :debug and then re-evaluated the plz.el file, right?

Also, had you used the (while t (ignore-errors (plz 'get "http://localhost:1337/"))) form to reproduce the problem before?

josephmturner commented 3 months ago

Quick question, to be sure I didn't make a mistake: that warning appeared because you bound warning-minimum-log-level to :debug and then re-evaluated the plz.el file, right?

Yes. Please also see https://github.com/alphapapa/plz.el/pull/54. I'm not sure that submitting a PR from my remote's wip/53-sentinel-race-condition-workaround branch to your remote's wip/53-sentinel-race-condition-workaround branch was the correct action. Feel free to close it.

Also, had you used the (while t (ignore-errors (plz 'get "http://localhost:1337/"))) form to reproduce the problem before?

No. Previously, I had repeatedly run eval-last-sexp manually. After applying the workaround, I was not able to trigger the warning by running eval-last-sexp manually many times, so I pulled out all the stops. ;)

Just now, I evaluted plz.el at the master branch (without the workaround) and then ran the same while loop. Emacs hung for ~20s and then became responsive again without my having to press C-g. The *Messages* buffer displayed the following message 14 times, where NNN was a different number each time:

Buffer " *plz-request-curl*-NNN" has a running process; kill it? (y or n) n
alphapapa commented 3 months ago

Also, had you used the (while t (ignore-errors (plz 'get "http://localhost:1337/"))) form to reproduce the problem before?

No. Previously, I had repeatedly run eval-last-sexp manually. After applying the workaround, I was not able to trigger the warning by running eval-last-sexp manually many times, so I pulled out all the stops. ;)

Just now, I evaluted plz.el at the master branch (without the workaround) and then ran the same while loop. Emacs hung for ~20s and then became responsive again without my having to press C-g. The *Messages* buffer displayed the following message 14 times, where NNN was a different number each time:

Buffer " *plz-request-curl*-NNN" has a running process; kill it? (y or n) n

Ok, so when you run that while loop with this branch's code, does Emacs stay in the loop until you press C-g? (Not that it necessarily matters, but I am curious to see how it performs; ISTM that it should stay in the loop until manually interrupted.)

josephmturner commented 3 months ago

Ok, so when you run that while loop with this branch's code, does Emacs stay in the loop until you press C-g? (Not that it necessarily matters, but I am curious to see how it performs; ISTM that it should stay in the loop until manually interrupted.)

Yes, I have to manually stop the loop with C-g.

josephmturner commented 3 months ago

I'll close this issue now, since the workaround is doing the right thing!

alphapapa commented 3 months ago

Ok, so when you run that while loop with this branch's code, does Emacs stay in the loop until you press C-g? (Not that it necessarily matters, but I am curious to see how it performs; ISTM that it should stay in the loop until manually interrupted.)

Yes, I have to manually stop the loop with C-g.

That's good news!

Leaving the issue open to remind me to do something about the Emacs bug report. :)