codemirror / codemirror5

In-browser code editor (version 5, legacy)
http://codemirror.net/5/
MIT License

Performance slowdown with Tokenizer #1847

Closed redmunds closed 10 years ago

redmunds commented 10 years ago

Brackets uses the Tokenizer to parse an HTML doc and extract JS from <script> tags for Code Hints. We recently tracked down a slowdown in the Tokenizer caused by this change: https://github.com/marijnh/CodeMirror/commit/ba21985bfa7c2b8ad6b1ecd564a41b06083d2475

Can you help me understand what that change fixes so we can try to fine tune that change to get the performance back?

Or maybe you can suggest a better method for extracting content from a tag in an htmlmixed page? Note that we're also seeing this slowdown for <style> tags.

redmunds commented 10 years ago

Note that the original bug in the referenced issue is different from what I described above. Start with this comment if you care to look at the Brackets issue.

marijnh commented 10 years ago

Could you set up a simple demo that shows A) what you're doing, and B) the bad performance.

redmunds commented 10 years ago

The demo is simple, but not small -- the bug is in loading JS for code hinting in Brackets. This is the test I am using in Brackets on Win7:

  1. Unzip this Reveal.js presentation: https://dl.dropboxusercontent.com/u/28581497/reveal.js-2.0_test.zip (447KB)
  2. Open file: reveal.js-2.0/hint-test.html
  3. On line 27, you'll see: //window
  4. Remove slashes to uncomment line, put cursor following window
  5. Type "." and wait for code hints to appear.

Note: Worst performance is on initial load, then much better on subsequent hinting due to caching. To run test a second time, switch to another file & back to this one, or close & re-open file.

Using the Chrome Dev Tools profiler, the time in CodeMirror.getTokenAt() doubled between Brackets Sprint 29 and 30:

  * Sprint 29: 717ms
  * Sprint 30: 1497ms

You can get Brackets downloads by going to http://download.brackets.io/ and clicking on "All Downloads" button.

marijnh commented 10 years ago

Okay, but that is a big project loaded into another big project, not a small demo. I need to see how you are using getTokenAt. Can you reproduce the issue in a simple static HTML page with a small script that does something equivalent to what you are doing in Brackets? Failing that, can you at least point me at the Brackets code that is making the calls to getTokenAt?

marijnh commented 10 years ago

(And some background: the patch you linked to concerns a behavior where, if a token is requested at a point that was never parsed before, CM will try to find a parser state for a line before that. It will scan a certain range, and if it doesn't find a state, will simply use the start state, so that it doesn't have to synchronously parse the whole document. However, for mixed modes there is no obvious way to determine whether we are in a sub-mode, so CM will scan a larger range, to increase the chances of seeing the token that started the sub-mode. The computed state will be saved, so that subsequent calls to get nearby tokens will reuse it. The only way I can see this making a large difference is if you're making multiple calls to getTokenAt in a really big document, such that the computed ranges somehow don't overlap, or if you're doing something like setOption("mode", ...) between getTokenAt calls, which would clear all the cached parser state. Or is only the first call slow?)
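The lookup-and-cache strategy described above can be sketched as a toy. (This is an illustration of the idea only, not CodeMirror's actual code; the state shape, function names, and default range are all invented for the sketch.)

```javascript
// Toy illustration of the caching strategy: to tokenize line N we need a
// parser state; we look back up to `range` lines for a cached state, fall
// back to the start state if none is found, then cache every state we
// compute so nearby lookups become cheap.
const stateCache = new Map();          // lineNo -> parser state
const START_STATE = { depth: 0 };      // hypothetical start state

function findStartingPoint(lineNo, range) {
  for (let l = lineNo; l >= Math.max(0, lineNo - range); l--) {
    if (stateCache.has(l)) return { line: l, state: stateCache.get(l) };
  }
  // Nothing cached in range: give up and use the start state.
  return { line: Math.max(0, lineNo - range), state: { ...START_STATE } };
}

function getStateAt(lineNo, advanceLine, range = 100) {
  let { line, state } = findStartingPoint(lineNo, range);
  while (line < lineNo) {
    state = advanceLine(state, line);   // run the tokenizer over one line
    line++;
    stateCache.set(line, state);        // cache for subsequent lookups
  }
  return state;
}
```

Because each computed state is cached, a second lookup near the first one only has to advance over the few lines in between, which is the saving that is lost when the cache is bypassed.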

redmunds commented 10 years ago

You asked for a simple demo, not necessarily a small demo :)

In general, the code is synchronously parsing an entire htmlmixed document. The purpose is to extract JS from <script> blocks and maintain line numbers as it goes.

It starts at the beginning of the doc and calls getTokenAt() until it hits the end. I have only looked at the average time for calls to getTokenAt(), so I am not sure if the first call takes significantly longer.

The code calls CodeMirror.innerMode(outerMode, ctx.token.state).mode.name in the loop for each token, so setOption("mode", ...) could be getting called implicitly by some other method.

It would take me a while to put together a similar demo in an HTML page, so hopefully you can take a quick look at the code.

The main loop is in the HTMLUtils.findBlocks() function: https://github.com/adobe/brackets/blob/master/src/language/HTMLUtils.js#L476

The TokenUtils.moveNextToken() function is defined here: https://github.com/adobe/brackets/blob/master/src/utils/TokenUtils.js#L76

Any advice on better techniques for parsing an htmlmixed (or any other mode) document would be appreciated. Thanks.
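The per-token scan described above (start at the top of the document and call getTokenAt until the end) can be sketched like this. The editor is stubbed with a trivial whitespace tokenizer so the sketch is self-contained; the function names are illustrative and are not the actual Brackets code.

```javascript
// Walk a document one token at a time by repeatedly asking for the token
// just past the end of the previous one -- the pattern used by
// TokenUtils.moveNextToken. Each getTokenAt call may re-tokenize text,
// which is why this is expensive compared to a single streaming pass.
function collectTokens(cm, lastLine) {
  const tokens = [];
  let pos = { line: 0, ch: 0 };
  for (;;) {
    const token = cm.getTokenAt({ line: pos.line, ch: pos.ch + 1 });
    if (token.end <= pos.ch) {            // no progress: end of this line
      if (pos.line >= lastLine) break;    // end of document
      pos = { line: pos.line + 1, ch: 0 };
      continue;
    }
    tokens.push(token);
    pos = { line: pos.line, ch: token.end };
  }
  return tokens;
}

// Stub editor: one token per run of non-space or space characters.
function stubEditor(lines) {
  const lineTokens = lines.map((text) => {
    const toks = [];
    for (const m of text.matchAll(/\S+|\s+/g)) {
      toks.push({ start: m.index, end: m.index + m[0].length, string: m[0] });
    }
    return toks;
  });
  return {
    getTokenAt({ line, ch }) {
      const hit = lineTokens[line].find((t) => t.start < ch && ch <= t.end);
      return hit || { start: 0, end: 0, string: "" };
    },
  };
}
```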

marijnh commented 10 years ago

Aha, the problem is the precise flag that Narciso added in [1]. It does not play well with the caching mechanism.

I'd suggest that, since you're going to traverse the whole document anyway, you use the runmode addon to do your own iteration, which is going to be orders of magnitude more efficient than getTokenAt.

[1]: https://github.com/marijnh/CodeMirror/commit/52558bf921b61ac194b1e71be4ab3c8cda50aa06
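For reference, the runmode addon's entry point is CodeMirror.runMode(text, modeSpec, callback): it tokenizes the string in a single streaming pass, invoking the callback once per token with (tokenText, style) and, if I recall the addon correctly, once with "\n" at each line break. Below is a minimal sketch of that usage shape; a stub stands in for CodeMirror.runMode (with an invented numbers-vs-text "mode") so the sketch runs without the addon and mode scripts loaded. In the browser you would pass the real CodeMirror.runMode and a mode spec such as "htmlmixed".

```javascript
// Collect (text, style) pairs from a runMode-style single-pass tokenizer.
function tokensOf(runMode, source, mode) {
  const out = [];
  runMode(source, mode, (text, style) => out.push({ text, style }));
  return out;
}

// Stub with a trivial "mode": runs of digits are "number", everything
// else is unstyled (null), mimicking the callback contract described above.
function stubRunMode(source, _mode, callback) {
  for (const m of source.matchAll(/\d+|\D+/g)) {
    callback(m[0], /\d/.test(m[0]) ? "number" : null);
  }
}
```

Because the whole document is tokenized front to back in one pass, no state lookup or cache is consulted per token, which is where the speedup over repeated getTokenAt calls comes from.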

marijnh commented 10 years ago

See also https://github.com/marijnh/CodeMirror/commit/99192a96f44e55eb395ad007330168e24d866c8d , which fixes the problem that caused this slowdown, and will probably help a lot here. But I'd still recommend using runmode instead.


redmunds commented 10 years ago

Yes, the precise flag is the culprit. cc: @njx

Time spent in getTokenAt() dropped from 1523ms to 82ms! What a difference a cache makes :)

I'll also pass the info about the runmode addon on to the team.

Thanks! Closing.

njx commented 10 years ago

Ah. Originally, the flag was intended only for cases where we were looking up a single token, not for scanning the whole document. But it makes sense to reset the frontier afterwards to make future lookups faster.