Parsing and tokenizing large views thrashes CPU, Memory & GC

SimonOrdo commented 8 years ago

I'm running RC1-Final. Pages are served fine, but with a large view, the server gets so bogged down, apparently trying to parse all the tags, that the server complains and I end up with HTTP errors at times.

In this instance, I'm serving a view with no razor "@" tags and no kind of inline code. Aside from two "section" tags, it's purely HTML and Javascript. The view defines several data tables and is 188kb in size. Here is what the start and end of requests look like in the VS performance tools (red arrows are start of request and end of request):

Memory snapshot before the page request

Memory snapshot during the page request

Memory snapshot after the page request

Any ideas how to mitigate this beyond "write smaller views with fewer tags to parse" ?

rynowak commented 8 years ago

/cc @Eilon this is exactly what we were discussing.

rynowak commented 8 years ago

The proposed fixes here:

Rename ISymbol to IToken - because these are tokens not symbols\ :trollface: Consider using an abstract base class here because there's going to plenty of functionality that's language-agnostic.

HTML-span based tokens should have a pointer to the source and a range (start, end) of the document. We can't afford to create copies of the HTML regions of the page as they are numerous, arbitrarily large and don't have a ton of duplication. If we can't do this without buffering (because of concurrency, multiple passes on the input), when we might want to substring here, but with de-duplication.

For CSharp-span based tokens, they will be extremely repetitive. We should dedupe these as well.

We should also consider creating a token-stream data structure, and storing in syntax nodes the range of tokens, rather than an explicit list.

For building of spans, it's the wrong approach to ToString() and then write it out. The span should hold on to lower-level constructs like tokens or nodes, and ask those items to write themselves out to a TextWriter. https://github.com/aspnet/Razor/blob/dev/src/Microsoft.AspNet.Razor/Chunks/ChunkTreeBuilder.cs#L144

There's some further reading here about Roslyn's implementation: http://ericlippert.com/2012/06/08/red-green-trees/ https://roslyn.codeplex.com/discussions/541953

The red-green tree discussion is potentially interesting if we want to do incremental parsing for editor scenarios, but in general we need to solve a different problem. They have a lot of text that's punctuation, or very repetitive.

rynowak commented 8 years ago

Updated with some more info

Eilon commented 8 years ago

@rynowak how related is this to https://github.com/aspnet/Razor/issues/614?

rynowak commented 8 years ago

@rynowak how related is this to #614?

They are both razor bugs we need to fix :+1: These are both issues you encounter with trying to use large razor pages that are usually completely blocking.

Eilon commented 8 years ago

Cool. Nicholas will take care of these!

dougbu commented 8 years ago

For building of spans, it's the wrong approach to ToString() and then write it out

Not sure whether this is part of the same bug but Razor makes extensive use of ToString(). Mostly this extra complication improves the debugging experience, eases testing, or supports tracing. But CSharpCodeWriter writes out literals char-by-char, calling ToString() on each. Would be much more efficient for the underlying CodeWriter to expose a Write(char) method.

NTaylorMullen commented 8 years ago

@rynowak had a great suggestion to optimize our usage of HtmlSymbols and take a more Roslyn approach when creating them (caching them). When working on the issue I discovered that Razor's SyntaxTree has gaps in it. Meaning you could have a SyntaxTreeNode spanning 0-10 of the document and then its sibling spanning 12-20 (leaving a gap 10-12). In order to cache HtmlSymbols I had to decouple them from their positions and errors and instead rely on their parents positions. I had to decide between two approaches:

Change the Razor parser to not create gaps in the SyntaxTree. This would also make the SyntaxTree more accurate.
Have the parser knowingly create SyntaxTreeNodes with a specified position (and leave the gaps).

The problem with 2 is with how the parser is written today logic code does not always know when/if it's creating a Span. Sooo to make it create Spans with specific positions it'd be a muchhhh larger re-design.

So I decided to attempt 1. In my branch I was able to get Razor into a runnable state for mainline scenarios and profiled cached HtmlSymbols vs non-cached for MSN's home page (very large). In the end the cached HtmlSymbols resulted in a 4% reduction in allocations.

Overall the change is 100% positive and enforces Razor's data structures to be more accurate; however, with RTM around the corner, optimizing HtmlSymbols for a 4% gain would be a very risky change! I had to touch nearly every part of the parser and would need to touch more if I were to fix all the edge case scenarios in my branch. Assuming I fixed every test in the Razor repo I'd be confident in our runtime / design time tests coverage. However, we lack very thorough integration tests, ones that'd be required for a drastic parser change. Also, tooling has expectations on the format of the SyntaxTree that we'd be invalidating.

@rynowak and I talked and we suggest pushing this out to post RTM, BUT should definitely be done because it is an extremely positive/enforcing change. The 4% that we see now would also become more apparent/larger as other parts of the parser are optimized.

/cc @Eilon

rynowak commented 8 years ago

RTM, BUT should definitely be done because it is an extremely positive/enforcing change. The 4% that we see now would also become more apparent/larger as other parts of the parser are optimized.

To add more to this, one of the best reasons to do this change in the future is that it will fix bugs we have and prevent us from re-introducing those bugs.

Agreed that this is probably more of a change that we want to take right now given the time it would take us to react to any bugs we'd introduce and the difficulty of testing the impact on editor/design-time.

SimonOrdo commented 8 years ago

@NTaylorMullen , @rynowak I see that this was merged with another issue. Can you clarify for me... will large views (like those with datatables) still have trouble being parsed as described in this original post or has that part of the issue been resolved

rynowak commented 8 years ago

will large views (like those with datatables) still have trouble being parsed as described in this original post

We're still working on this. Expect to see some concrete improvement in the next week or so.

All of the discussion here has fundamentally been about strategies to address the performance issues of parsing large pages. We investigated one avenue to address the issue and decided not to take that particular fix for this release.

Eilon commented 8 years ago

@rynowak from your previous comment it wasn't clear to me if you want this for RC2, RTM, or post-RTM?

rynowak commented 8 years ago

@Eilon - we should do the proposed design change for the tokenizer post RTM.

rynowak commented 8 years ago

@NTaylorMullen and I are breaking up the items and working through them.

Here's a baseline (15 iterations doing codegen on MSN.cshtml) CoreCLR x64:

Partial fix - 6a4a9544a14ac272841a4cc4845de0319b66b6c8 - Removing copies of empty RazorError[] - This is about 50mb

rynowak commented 8 years ago

ffdf5d2827c0b5d9087a583d90c703f5df6c389c

rynowak commented 8 years ago

95ea4cc06fab7c3296457b456ccbd9ba16b7ba3b

NTaylorMullen commented 8 years ago

7a0a9c6caa8dbba2be4b7709f914616416c0b3a2

rynowak commented 8 years ago

08fef959698ca70e3ef3445a3a3aac44b0fd703e

Eilon commented 8 years ago

Ok so do whatever we need now via this issue, and log a separate issue for whatever we want to do post-RTM.

rynowak commented 8 years ago

315d79ff2b94e98b7bbcc87e957ff029a7aed9b0

NTaylorMullen commented 8 years ago

1ce9180a3eace0134cc63f168dff57efc93cf294

rynowak commented 8 years ago

e68c55ab411c5bf5d1d416d0ebe8ca4be4927bf2

NTaylorMullen commented 8 years ago

We were able to reduce allocations by 42% and have filed the additional follow up issues to reduce it further:

https://github.com/aspnet/Razor/issues/674 https://github.com/aspnet/Razor/issues/675 https://github.com/aspnet/Razor/issues/676

aspnet / Razor

Parsing and tokenizing large views thrashes CPU, Memory & GC #635