AngleSharp / AngleSharp.Css

:angel: Library to enable support for cascading stylesheets in AngleSharp.
https://anglesharp.github.io
MIT License

GetInnerText() performance #55

Open GeneThomas opened 4 years ago

GeneThomas commented 4 years ago

Bug Report

I am writing what I would think is a fairly simple use of AngleSharp[.Css]: extracting an HTML table of COVID-19 cases by country. The headers (or other cells) can contain HTML <br> elements. INode.Text() (an extension method) and INode.TextContent drop the <br>, returning values like “TotalCases”. My implementation parses the 3000-ish cells in 4.6 seconds. Using AngleSharp.Css’s ElementExtensions method string GetInnerText(this IElement element) takes over 8 minutes, making it unusable.

I assume you must implement CSS’s display:none and visibility:hidden. I do not require that functionality, just as I do not require an implementation of JavaScript. If GetInnerText() cannot be sped up, a reasonable solution would be to use something like my code together with your implementation of HTML entities such as © etc.

The interesting code in the attached project is in AngleSharpCssSpeedFault.cs. Attachment: AngleSharpCssSpeedFault.zip

The last method InnerText(IElement) has a #if to switch between the two implementations of InnerText().

Prerequisites

Run the attached solution.

Description

see above

Steps to Reproduce

  1. Run the solution
  2. Change the #if in the last method InnerText()
  3. Run the solution again.

Possible Solution

Use my InnerText(), but add expansion of all HTML & entities, since that is missing.
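
For illustration, a CSS-free InnerText() along these lines could be sketched as below. This is not the code from the attached project: the class name `InnerTextSketch`, the helper `PlainInnerText`, and the sample HTML are hypothetical, and the sketch assumes only the core AngleSharp package (`HtmlParser`, `INode.ChildNodes`, `NodeType`). It walks the DOM, appends text nodes, and emits a newline for each <br>, with no style computation at all. Because it reads text nodes rather than raw markup, character references such as `&copy;` should already have been decoded by the HTML parser.

```csharp
using System;
using System.Text;
using AngleSharp.Dom;
using AngleSharp.Html.Parser;

public static class InnerTextSketch
{
    // Hypothetical helper: concatenates all text beneath a node,
    // writing '\n' wherever a <br> element occurs. No CSS is consulted,
    // so display:none and visibility:hidden are deliberately ignored.
    public static string PlainInnerText(INode node)
    {
        var sb = new StringBuilder();
        Append(node, sb);
        return sb.ToString();
    }

    private static void Append(INode node, StringBuilder sb)
    {
        foreach (var child in node.ChildNodes)
        {
            if (child.NodeType == NodeType.Text)
            {
                // Text nodes hold already-parsed character data.
                sb.Append(child.TextContent);
            }
            else if (child is IElement element)
            {
                if (element.LocalName == "br")
                    sb.Append('\n');
                else
                    Append(element, sb); // recurse into nested elements
            }
            // Comments and other node types are skipped.
        }
    }

    public static void Main()
    {
        var document = new HtmlParser().ParseDocument(
            "<table><tr><th>Total<br>Cases</th><td>&copy; 2020</td></tr></table>");

        foreach (var cell in document.QuerySelectorAll("th, td"))
            Console.WriteLine(PlainInnerText(cell).Replace("\n", " "));
    }
}
```

Unlike GetInnerText(), this sketch does not insert line breaks around block elements such as paragraphs or divs; for table cells that contain only text and <br>, that difference may not matter.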

Seyden commented 8 months ago

I debugged it: what slows it down is essentially the computation of the style rules. Since I also don't need styles for InnerText, apart from the default rules (line breaks for paragraphs, divs, and so on), I added two null checks.

With that change I can use InnerText without specifying .WithCss and without calling WithRenderDevice; your code then parses in 25 ms instead of 8 minutes.

I will use my fork for now, because this is probably not an acceptable solution for Florian.