MichaelChirico / r-bugs

A ⚠️read-only⚠️mirror of https://bugs.r-project.org/
20 stars 0 forks source link

[BUGZILLA #17773] html help cause high CPU usage #6947

Closed MichaelChirico closed 4 years ago

MichaelChirico commented 4 years ago

I have found the abnormal high CPU usage when opening html help by calling help.start() or from menu of Rgui on Windows 10. I have tested the same operations with R 3.6.3 and R 4.0.0, and the problem only be found in R 4.0.0.

1) Open the fresh installed R 4.0.0 by clicking the icon of Rgui on start menu. At this stage, the CPU used by Rgui is below about 1%.

2) Input help.start(), and a browser window appears. At this stage, the CPU used by Rgui increased to 25% (on my CORE i5 cpu with 4 cores).

3) Click the ``Packages'' link in the html help page. At this stage, the CPU used by Rgui increase to nearly 75%, and sometimes, 80% or more.

4) Close browser. The CPU used by Rgui does not change.

Here is the CPU usage by typeperf of Windows:

C:\Users\Jinso>typeperf "\Process(Rgui)\% Processor Time" -si 5 .... "04/28/2020 13:31:05.895","-1" "04/28/2020 13:31:10.915","0.622494" "04/28/2020 13:31:15.925","0.000000" "04/28/2020 13:31:20.935","0.000000" "04/28/2020 13:31:25.945","0.000000" "04/28/2020 13:31:30.965","2.801177" "04/28/2020 13:31:35.975","37.418638" "04/28/2020 13:31:40.982","100.159072" "04/28/2020 13:31:45.987","99.887614" "04/28/2020 13:31:50.992","99.899896" "04/28/2020 13:31:55.999","119.270784" "04/28/2020 13:32:01.002","299.014789" "04/28/2020 13:32:06.007","299.666184" "04/28/2020 13:32:11.015","294.905707" "04/28/2020 13:32:16.019","299.755921" "04/28/2020 13:32:21.025","299.924392" "04/28/2020 13:32:26.030","296.594191" "04/28/2020 13:32:31.048","298.195791" "04/28/2020 13:32:36.054","292.362986" "04/28/2020 13:32:41.064","292.503832" "04/28/2020 13:32:46.068","301.650976" ... ...

At about 10", I start Rgui, the CPU usage is about 0.∼1% (The second column number may be divided by 4).

At about 35", I call help.start() from menu of Rgui, you may notice that CPU usage increased to about 25%.

At about 55", I click on the ``Packages'' link in the browser. The CPU usage increased to about 70-75%, and closing browser does not have any effects on the CPU usage of R.

I have done the same thing with R 3.6.3.

The CPU usage does not change dramatically.

C:\Users\Jinso>typeperf "\Process(Rgui)\% Processor Time" -si 5 ... "04/28/2020 13:42:41.007","-1" "04/28/2020 13:42:46.017","-1" "04/28/2020 13:42:51.037","0.622240" "04/28/2020 13:42:56.048","0.000000" "04/28/2020 13:43:01.058","1.871148" "04/28/2020 13:43:06.069","3.436832" "04/28/2020 13:43:11.079","0.000000" "04/28/2020 13:43:16.099","0.000000" "04/28/2020 13:43:21.109","0.000000" "04/28/2020 13:43:26.119","3.432397" "04/28/2020 13:43:31.129","0.000000" "04/28/2020 13:43:36.139","0.935587" "04/28/2020 13:43:41.149","0.933886" "04/28/2020 13:43:46.168","1.244979" "04/28/2020 13:43:51.188","4.054443" "04/28/2020 13:43:56.209","0.000000" "04/28/2020 13:44:01.239","0.310505" ...

At 51", start Rgui, and 06" call help.start() from menu of Rgui, click on the ``Packages'' link, close browser, and so on. You may notice that there are no high CPU usage occurred.

Seems a bug in R 4.0.0...


METADATA

MichaelChirico commented 4 years ago

On R 4.0.0, if I get ``barplot'' html help by: ?barplot

It does not cause high CPU usage.

However, if I get ``plot'' html help by: ?plot

And in the browser to select one page, or click on the link of package, high CPU usage will appear.


METADATA

MichaelChirico commented 4 years ago
sessionInfo()

R version 4.0.0 (2020-04-24) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale: [1] LC_COLLATE=Chinese (Simplified)_China.936 [2] LC_CTYPE=Chinese (Simplified)_China.936 [3] LC_MONETARY=Chinese (Simplified)_China.936 [4] LC_NUMERIC=C [5] LC_TIME=Chinese (Simplified)_China.936

attached base packages: [1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached): [1] compiler_4.0.0 tools_4.0.0


METADATA

MichaelChirico commented 4 years ago

You can reproduce this with R 4.0.0 on Windows 10 with html help:

Open a help page, eg, with ?mean, scroll to foot of page and click on Index. The index page (for base) appears to open normally (in Chrome and Firefox) but the browser reports "Waiting for 127.0.0.1"; the R instance is using 2 threads. This persists for at least 24hrs or until R is closed.

This is classified as "Component: Documentation" but is clearly more serious than a typo in help page.


METADATA

MichaelChirico commented 4 years ago

Created attachment 2588 [details] Demonstration of permanent cache incoherence


METADATA

INCLUDED PATCH

MichaelChirico commented 4 years ago

This issue is causing serious problems for several htmlwidget packages, not just spiking CPU usage but frequently failing to respond to HTTP requests:

https://github.com/ropensci/plotly/issues/1756 https://github.com/rstudio/leaflet/issues/678 https://github.com/rstudio/rstudio/issues/6786

We originally thought this was a problem in RStudio, but I now think it's due to an issue in Rhttpd.c. Our investigation is documented here:

https://github.com/rstudio/rstudio/issues/6737

The short version is that Rhttpd.c contains a static int in_process flag that is used to guard against reentrancy on non-Windows platforms (where Rhttpd is run from an input handler on the main thread). This is set to 1 before Rhttpd calls into R, and set back to 0 when it returns; and the Rhttpd input handler will simply return if in_process is on.

I believe this isn't needed on Windows, as a threaded implementation is used instead; but the in_process logic remains, and causes a race condition. The in_process flag is set on the main thread, but checked from the worker threads. The workers and main thread can have incoherent views of the value of in_process, and with -O3 optimizations on, I was surprised to see that this incoherence appears to be permanent--one or more of the worker threads can get stuck thinking in_process is 1, even minutes after the main thread has set it back to 0. See incoherence.c for a minimal repro; if compiled with -O3, this program hangs.

In this case, with in_process stuck at 1, the worker spins in a tight loop waiting for it to change, causing the high CPU, and whatever HTTP request caused the worker to run may be dropped.

I currently have two proposed fixes, either of which completely fixes the problem on my machine.

One is to declare the in_process variable as volatile, at least on Windows. This is a nice clean fix, the only slight downside is that the worker threads will still spin unnecessarily whenever in_process is 1. These should be for short durations though.

The other is to modify worker_input_handler so that if (in_process) return; is ifdef-ed away on _WIN32. My only concern here is whether it's possible for win32 messages to be pumped while tools:::httpd() is doing its work; I doubt it, but if so, then reentrancy is possible even on Windows.


METADATA

MichaelChirico commented 4 years ago

Created attachment 2589 [details] Repro of htmlwidget failing to load

This repro demonstrates an htmlwidget (usually) failing to load when served over Rhttpd. (This code is similar to a codepath used in RStudio's Viewer pane.)


METADATA

INCLUDED PATCH

MichaelChirico commented 4 years ago

Forgot to add that I don't know why this bug only surfaced with R 4.0.0, the code hasn't changed in Rhttpd.c for years and years. I thought maybe it was the new R 4.0 compiler toolchain, but the incoherence.c program repros with Rtools 3.5 as well.


METADATA

MichaelChirico commented 4 years ago

Thanks for analysing this!

The 'in_process' semaphore is pretty much the definition of a volatile variable, so declaring it as such sounds like the obvious solution.

Would it be possible to get the workers to spin less tightly? E.g., using a short wait instruction?


METADATA

MichaelChirico commented 4 years ago

[I sent this a a reply to r-bugs@<::CENSORED -- SEE ORIGINAL ON BUGZILLA::>-project.org but that bounced.]

(In reply to Peter Dalgaard from comment #8)

Thanks for analysing this!

Yes -- nice job!


The 'in_process' semaphore is pretty much the _definition_ of a volatile
variable, so declaring it as such sounds like the obvious solution. 

I would really want to read the fine print on volatile in a threaded context before committing to this.

As I recall, Windows does provide an atomic test/set integer and it might be better to look in that direction.

I haven't looked at this code but I am a bit nervous about what I've seen here. Using threads is OK, but it is critical that only one thread ever be in R internals at one time, and that jumps never try to go from one thread to another. It would be good to very carefully review for these issues before declaring this closed.

Would it be possible to get the workers to spin less tightly? E.g., using a
short wait instruction?

METADATA

MichaelChirico commented 4 years ago

Thanks, I can reproduce the bug and confirm the analysis, it is a race condition and in_process is the problem. I don't think a spinlock here makes sense, we should I think just use a regular blocking lock, and an existing one instead of our own.


METADATA

MichaelChirico commented 4 years ago
I haven't looked at this code but I am a bit nervous about what I've
seen here. Using threads is OK, but it is critical that only one thread
ever be in R internals at one time, and that jumps never try to go
from one thread to another. It would be good to very carefully review
for these issues before declaring this closed.

FYI, because it took me a while to understand how Rhttpd.c deals with this: if you look for SendMessage in that file, you'll see that there's a "wrapper" function for process_request that ensures execution happens only on the R thread. (The confusing part for me is that this wrapper function is also called process_request; a #define is used to rename the "real" process_request to process_request_main_thread.)


METADATA

MichaelChirico commented 4 years ago

Created attachment 2591 [details] Proposed fix

The attached patch uses two separate mechanisms:

[0] https://gist.github.com/jcheng5/dae9ad798bfb1b65c4ce21b1e6357cf9


METADATA

INCLUDED PATCH

MichaelChirico commented 4 years ago

Thanks for the patch and thanks again for spotting this in the first place. Now in R-devel (with minor modifications). It seems to be working in a trivial test but it would be great if you could run your tests on it and let us know if there were any issues.

"in_process", and now the mutex, is used in this code to avoid running too many reentrant eval()s inside R on behalf of the http server. Independently of this, these eval()s are always run by the main R thread on both platforms, on Windows because the window procedure is run by the same thread that creates the corresponding window, on Unix because there is only one thread.

While "volatile" fixed the observed failures, I confirm Luke's warning that it is not reliable for inter-process synchronization because it is not guaranteed to include a memory fence (a compiler might insert one, but does not have to and some compilers don't, on some hardware it won't be needed, but on some it would be), so it is not guaranteed the changes will be propagated properly by the hardware, so we've used the mutex patch, instead.

Re Unix, it seems "in_process" does not have to be "volatile". It is not set in a signal handler, it is not in a landing frame of a long jump, it is indeed not on memory mapped hardware, so we've left it non-volatile. Please let us know if we have overlooked something. Thanks!


METADATA

MichaelChirico commented 4 years ago

Thanks, Tomas, that's great news! I'm not sure why I thought input handlers could be invoked by signal handlers, I'm glad to hear they can't.

I've tested on Windows i386 and x64 with a couple of htmlwidgets and with a load tester, and no sign of the bug nor any reentrancy.


METADATA

MichaelChirico commented 4 years ago

Thanks, Joe, for the testing. I've ported the fix to R-patched, so it is ready for 4.0.1.

Input handlers must not be run from a signal handler, that would be a serious bug. They are run from the console/repl, from sleep, and primarily around calls to select(), neither of which can be in a signal handler. There is really little one can do safely in a signal handler.


METADATA