Sarcasm / irony-mode

A C/C++ minor mode for Emacs powered by libclang
GNU General Public License v3.0

Any way to speed up Boost headers completion? #131

Open fspeech opened 9 years ago

fspeech commented 9 years ago

Once I included Boost, completion seems to slow down to a crawl. I am actually amazed that results eventually show up if you wait long enough, but this is not going to be very useful at the current speed. Is there any way to speed up the completion, say with some precompiled headers? I have seen this talked about, but how is it actually done?

Sarcasm commented 9 years ago

Just a thought: once the completion seems to work fine on Windows, you may want to byte-compile the file. That might, very slightly, improve performance.

I think you are mostly okay, since as you can see Emacs only calls the process filter if there are enough bytes or the pipe is flushed.

Apparently I don't force a flush when sending the EOT marker. I will do it just in case.

fspeech commented 9 years ago

Apparently I don't force a flush when sending the EOT marker. I will do it just in case.

That is interesting. The resolution isn't there with the current time logging, but as far as one can tell the small transfers are nearly instantaneous. If the stream is not flushed, how did Emacs manage to get those small transfers?

fspeech commented 9 years ago

This may be why the transfer is slow: apparently the server is running in line-buffered mode? I see you are using start-process-shell-command in irony.el, which is why we can receive these small transfers. We should instead explicitly flush from the server and not wrap it in a shell so the subprocess can use a true pipe.

fspeech commented 9 years ago

you may want to byte-compile the file

Byte-compiling would only help the initial loading of the file. Emacs always compiles your functions into byte code before executing them. So once a function is loaded you won't see any difference. (Edit: Reasonable yet embarrassingly wrong!) And anyway, ELPA already byte-compiled all my irony code so it doesn't have to be recompiled every time it is loaded.

fspeech commented 9 years ago

so the subprocess can use a true pipe

Right now we have Emacs <- pipe -> shell <- tty -> server. The tty mode is line buffered. And presumably the shell would flush its pipe in a line buffered fashion as well which is why we see small transfers from the Emacs side.

Sarcasm commented 9 years ago

We should instead explicitly flush from the server and not wrap it in a shell so the subprocess can use a true pipe.

Will do, I will use start-process directly. I'm wondering if start-file-process could make irony-mode work over ssh, but that's not the point right now.

Another thing is that, apparently, undo is not disabled for the buffer, and keeping an undo list may be costly. I will make sure to disable undo when the process is created, see C-h f buffer-disable-undo RET.
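Roughly, something like this (just a sketch; the program name, arguments, and buffer name here are placeholders, not irony-mode's actual values):

(let* ((buffer (get-buffer-create "*irony-process*"))
       (process (start-process "Irony" buffer "irony-server")))
  ;; no shell in between, and no undo list for the process buffer
  (buffer-disable-undo buffer)
  process)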

Byte-compiling would only help the initial loading of the file. Emacs always compiles your functions into byte code before executing them. So once a function is loaded you won't see any difference. And anyway, ELPA already byte-compiled all the irony code so it doesn't have to be recompiled every time it is loaded.

It would make sense, but that's not how it works in Emacs. Byte-compiling makes a difference from what I have experienced on very elisp-intensive code, before and after byte-compiling.

Also the doc about byte compilation seems to agree: http://www.gnu.org/software/emacs/manual/html_node/elisp/Speed-of-Byte_002dCode.html#Speed-of-Byte_002dCode
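For reference, byte-compiling by hand looks roughly like this (the paths are only illustrative; ELPA-installed packages are normally compiled at install time):

;; compile a single file, or recompile a whole directory of .el files
(byte-compile-file "~/path/to/irony.el")
(byte-recompile-directory "~/.emacs.d/elpa/" 0)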

fspeech commented 9 years ago

Also the doc about byte compilation seems to agree

Ah, thanks for the reference. Live and learn! I am surprised that these are two very different paths and that regular elisp is not byte-compiled on the fly. Apparently lisp/emacs-lisp/bytecomp.el does more structural optimization, while eval-defun through lisp/emacs-lisp/lisp-mode.el and src/eval.c just turns the function into a call tree of builtins.

Will do.

I am definitely hoping for good news here.

Sarcasm commented 9 years ago

It's a little disappointing, but I made some progress on #132 and it doesn't seem to make any noticeable difference when using start-process instead of start-process-shell-command.

The timings are basically the same between both versions. I will continue with some other tweaking next week, test that I haven't broken anything on Linux, and finally push the changes if everything is okay.

One thing that bothers me is that quite often, at the point where completion is triggered, Emacs will just freeze. Not eating memory or CPU, it just stops responding. Not sure yet how to debug this but it's quite annoying. Did you notice anything like that?

By the way, what Emacs version are you using?

fspeech commented 9 years ago

I am using 24.3. Freezing happens to me when the completion takes a long time to transfer while I keep on typing. Killing the server has always unfrozen Emacs for me, so for me it is a deadlock somewhere, not a crash. Last night company-mode stopped working even after the server restarted; I assume the state fell out of sync between company and irony. I haven't killed the session yet, so let me probe some more and see if I can find something. I don't know any better way to debug other than logging and reading the code :).

I am sorry to hear that the fix didn't work. But in principle we should be able to cut this xfer delay to very little. Is this delay present on Linux as well?

fspeech commented 9 years ago

If I understand it correctly

(defun irony--process-server-response (process response)
  (let ((sexp (read response))
        (callback (irony--server-process-pop-callback process)))
    (apply (car callback) sexp (cdr callback))))

we only pop the stack once for each response we get. So if we missed one response due to a server crash, the stack won't get emptied. I don't know if this affects the rest of the operations. In my case, still-completing-p eventually always evaluates to true.

Anyway, this is only related to post-crash recovery. It doesn't tell us why the server crashed in the first place. As I recall, when crashes happen the process buffer is midway through getting the response from the server. So it is possible that this is an Emacs internal bug:

  1. apparently the process filter is no longer called, and it is in an internal loop of Emacs not controlled by company/irony
  2. if the server for some reason stopped sending stuff (as opposed to being blocked by the pipe buffer being full), the Emacs UI should not freeze
  3. the UI unfreezes when the server is killed, indicating Emacs is busy waiting somewhere in the subprocess processing region

One possible workaround is to use temporary files instead of the pipe. It would be nice to get the pipe working, but on Windows at least it was apparently not designed to handle this kind of large transfer. Cutting the size of the transfer could also reduce the chance of triggering the freeze.

Not sure yet how to debug this but it's quite annoying.

If it is an emacs internal bug it may be difficult to debug without being able to touch Emacs internals. I am not set up to build emacs (I use pre-built binaries). Does this only happen to you on Windows?

fspeech commented 9 years ago

if the server for some reason stopped sending stuff (as opposed to being blocked by the pipe buffer being full), the Emacs UI should not freeze

My assumption is wrong here. Since you allow more file requests to be sent while still receiving the results from previous ones, process-send will eventually block when the write buffer is full. Presumably when the server is busy sending, it does not read its input, so if the server for some reason stopped, the Emacs UI would eventually freeze on process-send. So this does not have to be an Emacs internal bug. I am trying to understand what could cause this to happen. The server could stop if its write buffer gets full, which means Emacs stopped reading from its side. That would mean the process filter is stopped from running for some reason (supposedly it can still run even while process-send blocks). Reading the process.c code, I don't think process-put would block the filter either. So I am still scratching my head here.

fspeech commented 9 years ago

When you call process-send-* in lisp, Emacs eventually calls send_process in process.c. If the send can't complete, it calls wait_reading_process_output, which calls read_and_dispose_of_process_output, which, if there is new text received, calls the lisp process filter. Now, in the process-filter code we also call the callbacks if EOT is detected. If for some reason a callback triggers process-send again, which is blocked, a deadlock could eventually occur? But this doesn't seem to be what is going on, since when Emacs freezes there is a lot of stuff in the process buffer, indicating the filter function stopped getting called midway through a transfer, not at the end. However, process-send remains the primary suspect in causing the UI to freeze. The logic in wait_reading_process_output is complicated.

fspeech commented 9 years ago

One reason why the transfer is slow could be this: in the filter function we repeatedly search for EOT from (point-min). For a 2MB transfer, for which the filter will be called about 2MB/4KB = 500 times, we will on average search through 500MB of text in total. That could be slow. I will experiment with accumulating the text in a string instead of leaving it in the process buffer and see if that helps.

fspeech commented 9 years ago

I will experiment with accumulating the text in a string instead of leaving it in the process buffer and see if that helps.

Unfortunately that seems to hurt rather than help, by a significant margin. Next I will stay with the buffer but save the mark so we don't always search from the (point-min).
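Roughly the idea (a sketch with made-up names, not the actual irony filter):

(defvar my-eot-marker ";;EOT")

(defun my-process-filter (process output)
  (with-current-buffer (process-buffer process)
    (goto-char (point-max))
    (insert output)
    ;; resume the EOT search where the previous call stopped instead of
    ;; rescanning the whole buffer from (point-min)
    (goto-char (or (process-get process :search-start) (point-min)))
    (if (search-forward my-eot-marker nil t)
        ;; a complete response is in the buffer; a real filter would
        ;; dispatch it here and advance :search-start past it
        (message "response ready, %d chars" (- (point) (point-min)))
      ;; not found yet: next time start a little before the new end, in
      ;; case the marker was split across two output chunks
      (process-put process :search-start
                   (max (point-min)
                        (- (point-max) (length my-eot-marker)))))))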

fspeech commented 9 years ago

When does this get run?

(defun irony-completion-post-command ()
  (when (and (memq this-command irony-completion-trigger-commands)
             (irony-completion--update-context)
             (irony-completion-at-trigger-point-p))
    (irony-completion--send-request)))

It does try to send a request, which, if blocked, could cause a deadlock if it is triggered from the process filter? Last time my session froze I had this after killing the server:

Error in post-command-hook (irony-completion-post-command): (file-error "writing to process" "invalid argument" #<process Irony>)
Quit
irony process stopped!
Still completing Sat Oct 18 15:22:24 2014 [2 times]

Now, I don't know if this was always the case, and I was running a modified filter experimenting with the search markers. I will pay more attention to see if this happens again.

fspeech commented 9 years ago

Next I will stay with the buffer but save the mark so we don't always search from the (point-min).

The change was trivial to make, but unfortunately it also didn't help. Emacs must be either very good at searching or secretly caching search results.

fspeech commented 9 years ago

I encountered the same sequence with freezing:

Sending file request Sat Oct 18 17:13:57 2014
Received first 4038 Sat Oct 18 17:13:57 2014
Sending file request Sat Oct 18 17:13:59 2014
Error in post-command-hook (irony-completion-post-command): (file-error "writing to process" "invalid argument" #<process Irony>)
irony process stopped!

It does look like it has something to do with the post-command-hook (why is it needed here?) that calls irony-completion-post-command, which calls process-send. process-send is easily blocked since the write buffer is only a few hundred bytes. It should let the read output continue, but somehow that didn't work, so now we have a deadlock.

fspeech commented 9 years ago

(Forgot to press the submit button)

I got rid of irony-completion-post-command from the post-command-hook; now I get freezing on company auto-begin. Progress :-( The real progress is that I can very reliably trigger the freeze with the following conditions:

  1. The response is large enough that it takes some time to finish
  2. The file being edited is big enough that process-send-region will block
  3. When typing, (auto-begin) causes 2 to happen before 1 is done, and we have a deadlock

Sending file request Sat Oct 18 18:09:19 2014
Received first 4038 Sat Oct 18 18:09:20 2014
Sending file request Sat Oct 18 18:09:22 2014
Company: An error occurred in auto-begin
writing to process: invalid argument, Irony
irony process stopped!

My guess is that this would happen on Linux as well, but there the transfer takes less time so condition 1 is hard to meet. I don't know if this is an Emacs bug or a known feature. I will post a question to the user group. For now we really need to lock out the sending of more requests until we are done receiving.

fspeech commented 9 years ago

I changed one line in the following:

(defun irony-completion-candidates-async (callback)
  "Call CALLBACK when asynchronous completion is available.

Note that:
 - If the candidates are already available, CALLBACK is called
   immediately.
 - In some circumstances, CALLBACK may not be called. i.e:
   irony-server crashes, ..."
  (irony-completion--update-context)
  (if (irony-completion-candidates-available-p)
      (progn
        (message (concat "Candidate available " (current-time-string)))
        (funcall callback))
    (when irony-completion--context
      (if (> (irony--server-process-callback-count (irony--get-server-process)) 0)
          (message (concat "Still completing " (current-time-string)))
        (irony-completion--send-request))
      (setq irony-completion--request-callbacks
            (cons callback irony-completion--request-callbacks)))))

Instead of calling irony-completion--still-completing-p, I lock out further requests based on irony--server-process-callback-count, and I no longer get the lock-up. I may be losing some completion opportunities, but at least now it is usable instead of locking up all the time. I think the post-command hooks can be added back as well if that is useful.

fspeech commented 9 years ago

There is bug#18420, which describes this problem for the Emacs Windows port. It is supposedly fixed in Emacs 24.4.

Sarcasm commented 9 years ago

First, thanks for your very, very useful debugging, much appreciated!

I'm a bit busy right now, I will not rush the code into master but I can push the progress I made on a separate branch if you want to try them out.

I am sorry to hear that the fix didn't work. But in principle we should be able to cut this xfer delay to very little. Is this delay present on Linux as well?

I don't know. When I'm done with the "Windows support" fixes I would like to take some measurements on Linux as well, to make sure performance hasn't been degraded (hopefully the opposite will happen).

If it is an emacs internal bug it may be difficult to debug without being able to touch Emacs internals. I am not set up to build emacs (I use pre-built binaries). Does this only happen to you on Windows?

Yes it looks like a Windows-only bug, on Linux I do not have the issue.

When does this get run?

(defun irony-completion-post-command () ...)

I guess you may have already found out, but the function is on the post-command-hook, see C-h v post-command-hook RET for a description. It's called roughly after each keystroke, and there I check whether the completion context has changed, whether or not a new completion request should be made for the new context, ...
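For example, roughly (the function name is just for illustration):

(defun my-after-command ()
  ;; runs after (almost) every command in buffers where the hook is added
  (message "last command was: %s" this-command))

;; the final t makes the hook buffer-local, which is what a minor mode
;; typically wants
(add-hook 'post-command-hook #'my-after-command nil t)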

It does try to send a request, which, if blocked, could cause a deadlock if it is triggered from the process filter? Last time my session froze I had this after killing the server:

This is not something that a process filter will call, but it may be a good thing to keep in mind.

There is bug#18420, which describes this problem for the Emacs Windows port. It is supposedly fixed in Emacs 24.4.

I tested with Emacs 24.4 and it seems to work fine now. I just wish the official GNU Emacs FTP offered a Windows release, instead of my having to take one from an unknown third party.

I would be fine with just putting a warning for older Emacs versions on Windows, saying that they are not really supported, and making Emacs 24.4 the minimum required version on Windows. Sadly, the w32-pipe-read-delay issue hasn't been fixed in this release, so irony-mode will still not work correctly out of the box.

Sarcasm commented 9 years ago

For information, testing with process-adaptive-read-buffering set to nil seems to have significantly better results for me, at least on big outputs.
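That is, with something like this in the init file, set before the irony process is started:

(setq process-adaptive-read-buffering nil)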

With the following test file, numbers in parentheses are the places where completion was triggered.

#include <cstdlib>
#include <iostream>

#include <boost/asio.hpp>
#include <boost/bind.hpp>
#include <boost/array.hpp>
#include <boost/signal.hpp>
#include <boost/function.hpp>
#include <boost/shared_ptr.hpp>
#include <boost/noncopyable.hpp>
#include <boost/aligned_storage.hpp>
#include <boost/enable_shared_from_this.hpp>

int main() {
  // boo| (1) ::| (2) asio::| (3) io;
  boost::asio::io_service io;

  // boo| (4) ::| (5) system::| (6) ec;
  boost::system::error_code ec;

  // io.| (7) run(ec);
  io.run(ec);

  if (ec) {
    // std| (8) ::| (9) cout << ec.|(10)
    std::cout << ec.message();
  }

  return 0;
}

* process-adaptive-read-buffering

  |    | t (default) |       nil |  delta |
  |----+-------------+-----------+--------|
  |  1 |   17.391000 | 15.867000 |  1.524 |
  |  2 |    0.842000 |  0.442000 |    0.4 |
  |  3 |    0.569000 |  0.370000 |  0.199 |
  |  4 |    3.645000 |  2.086000 |  1.559 |
  |  5 |    0.884000 |  0.442000 |  0.442 |
  |  6 |    0.312000 |  0.322000 |  -0.01 |
  |  7 |    0.320000 |  0.332000 | -0.012 |
  |  8 |    3.772000 |  2.090000 |  1.682 |
  |  9 |    1.182000 |  0.532000 |   0.65 |
  | 10 |    0.312000 |  0.312000 |     0. |
  |  1 |   17.495000 | 15.867000 |  1.628 |
  |  2 |    0.822000 |  0.444000 |  0.378 |
  |  3 |    0.725000 |  0.362000 |  0.363 |
  |  4 |    3.864000 |  2.130000 |  1.734 |
  |  5 |    0.820000 |  0.430000 |   0.39 |
  |  6 |    0.340000 |  0.322000 |  0.018 |
  |  7 |    0.342000 |  0.310000 |  0.032 |
  |  8 |    3.837000 |  2.090000 |  1.747 |
  |  9 |    1.182000 |  0.552000 |   0.63 |
  | 10 |    0.340000 |  0.322000 |  0.018 |
  |----+-------------+-----------+--------|
  |    |             |           | 13.372 |
  #+TBLFM: $4=$2-$3::$LR4=vsum(@2$4..@21$4)

fspeech commented 9 years ago

I'm a bit busy right now, I will not rush the code into master but I can push the progress I made on a separate branch if you want to try them out.

No problem. I think I would like to see server flushing at the end of a transfer (should be just a << endl instead of <<"\n" at the very end?) pushed into the master. That seems a safe thing to do.

I tested with Emacs 24.4 and it seems to work fine now. I just wish the official GNU Emacs FTP offered a Windows release, instead of my having to take one from an unknown third party.

Good to know. It should be out in a few days and I am waiting for that :-(

I am still perplexed by the transfer speed. Maybe there is a rate-limiting issue: as Emacs polls other UI events, it only reads one block at a time. Based on the 24.4 code, it looks like when the write is blocked (so there is no need to poll other events, as the UI is blocked anyway) it will try to read all the outstanding transfers at once. I wonder if you notice any speed variations from the timing logs with the new 24.4 Windows version: you should see the transfer time reduced when you are waiting for it (with overlapping requests allowed, where it used to freeze). If you don't have time for this that is okay; I should be getting the official new release in a few days.

Sarcasm commented 9 years ago

I wonder if you notice any speed variations from the timing logs with the new 24.4 Windows version

I haven't noticed any difference between Emacs 24.3 and 24.4 for the case where it didn't freeze. For the case where it froze, I do not have timing with Emacs 24.3 ^^.

fspeech commented 9 years ago

For information, testing with process-adaptive-read-buffering set to nil seems to have significantly better results for me, at least on big outputs.

That is good to know but I already had it set to nil :-(

For the timing log, the initial 17-second wait looks painful. Does that only happen once at warm-up (when the preamble is built)? The other times look similar to mine. Clang seems very fast once the headers are compiled (with Asio the precompiled header clocked in at over 100MB for me). If only we could get Emacs to read faster!

Sarcasm commented 9 years ago

No problem. I think I would like to see server flushing at the end of a transfer (should be just a << endl instead of <<"\n" at the very end?) pushed into the master. That seems a safe thing to do.

Yes, you just need to flush, see the diff of my commit (local, not pushed yet).

diff --git a/server/src/main.cpp b/server/src/main.cpp
index 1eff2ef..aa4104f 100644
--- a/server/src/main.cpp
+++ b/server/src/main.cpp
@@ -168,7 +168,8 @@ int main(int ac, const char *av[]) {
       assert(0 && "unreacheable code...reached!");
       break;
     }
-    std::cout << "\n;;EOT\n";
+    std::cout << "\n;;EOT\n" << std::flush;
   }

   return 1;

For the timing log, the initial 17-second wait looks painful. Does that only happen once at warm-up (when the preamble is built)?

Yes, it's the preamble building that takes this much time.

The other times look similar to mine. Clang seems very fast once the headers are compiled (with Asio the precompiled header clocked in at over 100MB for me). If only we could get Emacs to read faster!

Yes, please note that the times I have given are a bit different from yours, I think. I'm measuring the whole time the request took (sending the request, processing the request with libclang, and finally handling the response once all the results have been transferred).

Here is my commit if you want:

diff --git a/irony-completion.el b/irony-completion.el
index 86113b0..b716509 100644
--- a/irony-completion.el
+++ b/irony-completion.el
@@ -228,11 +228,15 @@ Return t if the context has been updated, nil otherwise."
           irony-completion--request-tick irony-completion--context-tick)
     (irony--send-file-request
      "complete"
-     (list 'irony-completion--request-handler irony-completion--context-tick)
+     (list 'irony-completion--request-handler irony-completion--context-tick (current-time))
      (number-to-string line)
      (number-to-string column))))

-(defun irony-completion--request-handler (candidates tick)
+(defun irony-completion--request-handler (candidates tick start-time)
+  (with-current-buffer (get-buffer-create "*Irony Completion Times*")
+    (insert (format "Request %d completed in %f seconds\n"
+                    tick
+                    (float-time (time-subtract (current-time) start-time)))))
   (when (eq tick irony-completion--context-tick)
     (setq
      irony-completion--candidates-tick tick

To be thorough I should probably separate the timings more but that seems enough to measure relative improvements.

fspeech commented 9 years ago

Thanks for the diffs!

BTW your new timings look very different from the ones earlier, where you had about 10 seconds for transferring 2MB (I have been seeing about 2 seconds). Are these tests on the same machine? If so that is a major improvement! If I can get the same improvement it would be sub-second for me!

Sarcasm commented 9 years ago

I published a branch with my changes for Windows. You can give it a try. You still have to manually set the w32-pipe-read-delay to 0.
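That is, something like this, which only has an effect on the native Windows build:

(setq w32-pipe-read-delay 0)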

The branch is windows-support, you can find it here: https://github.com/Sarcasm/irony-mode/tree/windows-support

I will test that nothing has been broken on Linux, add some documentation and publish it after that.

BTW your new timings look very different from the ones earlier, where you had about 10 seconds for transferring 2MB (I have been seeing about 2 seconds). Are these tests on the same machine?

Yes this is the same machine.

If so that is a major improvement! If I can get the same improvement it would be sub-second for me!

It would be nice, I'm not sure if that will happen, let me know. =)

Right now, this branch creates a buffer with some debug timings, you can check the buffer after some completions have been performed with M-x switch-to-buffer RET *Irony Completion Times* RET.

fspeech commented 9 years ago

I am using the new server branch, with my irony.el modified to match, and the new start-process without a shell. The first shock was that my transfer time actually went up from 2 to 3 seconds. Granted, that was not very precise due to rounding, but it was clear that it was going up. It was at first very perplexing that the delay went up after we removed a middleman. After pondering it, I figured that the shell must use a fairly large internal buffer, so the server side didn't block much at all. Now that the server is facing Emacs directly, and both have fairly small internal buffers (we know Emacs's is only 4K), we are getting process-switching delays on both sides. So I added the following to the very beginning of main:

diff --git a/server/src/main.cpp b/server/src/main.cpp
index 098806d..6ed804d 100644
--- a/server/src/main.cpp
+++ b/server/src/main.cpp
@@ -93,6 +93,9 @@ struct InteractiveCommandProvider : CommandProviderInterface {
 };

 int main(int ac, const char *av[]) {
+  char cout_buffer[65536];
+  std::cout.rdbuf()->pubsetbuf(cout_buffer, 65536);
+
   std::vector<std::string> argv(&av[1], &av[ac]);

   // stick to STL streams, no mix of C and C++ for IO operations

and now I am getting back the 2-second delay for 2MB transfers (so pathetic :-). We can use even larger buffers but the gain will likely get smaller. I think if you do the same you should see some gain in your numbers as well, and you have more precise measurements. EDIT: See the next comment, this was a mistake.

I am still waiting for 24.4. Maybe I should build my own since it looks like a large input buffer will help on the Emacs side as well. Anyway I will wait and see if you get some more precise measurements.

fspeech commented 9 years ago

We can use even larger buffers but the gain will likely get smaller.

That turned out to be a measurement mistake! Now that I use your more precise end-to-end timing, I am getting readings that show the perceived difference was a mirage. As the buffer size goes up I actually get the first transfer later, so even though the transfer itself appears faster, the total elapsed time didn't really change. Now I suspect that the server is not actually filling its buffer as fast as I thought, which would explain why the buffer size and other things we did on the Emacs side didn't matter (after the initial w32-pipe-read-delay issue was resolved). So it may turn out that the server is the bottleneck now. We need data to see how fast the server really is without Emacs and compare that to the numbers with Emacs.

EDIT: numbers invalid. See next comment.

These are all transfer sizes of around 2.2MB, with cout buffer set to 1M, 64K and default (no change).

fspeech commented 9 years ago

I made a mistake when I hand-edited irony.el, so that I was starting a new process each time. After fixing that, the numbers are better (warmup time is taken out of every request but the first), but the conclusion does not change. I am still on 24.3.

default
Request 2 completed in 2.956000 seconds
Request 3 completed in 2.953000 seconds
64K
Request 2 completed in 2.947000 seconds
Request 3 completed in 2.861000 seconds

Sarcasm commented 9 years ago

I won't have access to my personal computer until Sunday but I'm curious, what are you testing with? A completion in the "global scope"?

Do you have acceptable timings when using completion in a "scoped context" (member access, namespace members, ...)? Because, yes, it's slow to complete at the global scope, and maybe even to transfer; comparatively, member accesses are relatively fast.

I will make separate timings, one for the Clang time and one for the transfer time, but in the meantime it is interesting to know whether you are testing the worst-case scenario or not.

For example, it would be nice to have your times on the same example as me: https://github.com/Sarcasm/irony-mode/issues/131#issuecomment-60008803

Just for the record, I tested my changes on Linux (but on another computer); I have the following timings:

Request 1 completed in 3.786876 seconds
Request 2 completed in 0.969953 seconds
Request 3 completed in 0.460809 seconds
Request 5 completed in 0.100374 seconds
Request 8 completed in 1.027903 seconds
Request 9 completed in 0.110647 seconds
Request 11 completed in 0.024847 seconds
Request 15 completed in 0.046195 seconds
Request 22 completed in 1.126866 seconds
Request 23 completed in 0.105979 seconds
Request 27 completed in 0.026710 seconds

fspeech commented 9 years ago

Yes, the numbers are for the worst cases. About half of the time the completions are much smaller, as you said. I chose the largest transfers as they were consistently around 2.2MB in transfer size and easy to compare. The 1 second on Linux sounds a lot better! Okay, that is a good data point indicating the problem may still be with the transfer between Emacs and the irony server, as I don't think clang would be that much slower on Windows.