AngleSharp / AngleSharp

:angel: The ultimate angle brackets parser library parsing HTML5, MathML, SVG and CSS to construct a DOM based on the official W3C specifications.
https://anglesharp.github.io
MIT License

Text extraction interpreting <br> #877

Closed GeneThomas closed 4 years ago

GeneThomas commented 4 years ago

New Feature Proposal

When extracting information from web pages, such as the main table #main_table_countries_today in https://www.worldometers.info/coronavirus/, one wants the text as the user sees it: \n and \t converted to a space, <br> rendered as \n, HTML entities expanded (e.g. &gt; → >), multiple spaces collapsed to one space, and the text Trim()med.

Presently the .Text() and .TextContent of the second header cell are both “TotalCases”, because the <br> has been removed. Implementing this on top of AngleSharp is not ideal, as one has to implement all of the HTML entities, such as &quot;.

If you do not want to break existing users of .Text (you could, since you have not reached v1.0 yet), a property called .UserText would be good.

The following code does most of what I require:

    static string ExtractText(IElement elem)
    {
        string str = elem.InnerHtml;
        str = str.Replace("\n", " ");
        str = str.Replace("<br>", "\n");
        str = str.Replace("&nbsp;", " ");
        str = Regex.Replace(str, "<[^>]+>", ""); // remove html elements
        str = Regex.Replace(str, " +", " "); // multiple spaces to one space
        str = str.Replace("&gt;", ">");
        str = str.Replace("&lt;", "<");
        str = str.Replace("&quot;", "\"");
        str = str.Replace("&apos;", "'");
        str = str.Replace("&amp;", "&");
        // any other HTML entity is possible
        str = str.Trim();
        return str;
    }

Attached is a Visual Studio C# project that shows the feature, the user sees:

one
two

but .Text() and .TextContent return “onetwo”:

AngleSharpBrFault.zip .

GeneThomas commented 4 years ago

Small tweak to text extraction code to match web browsers’ “one <br> two” -> “one\ntwo”

    static string ExtractText(IElement elem)
    {
        //elem.Text;
        //elem.TextContent

        string str = elem.InnerHtml;
        str = Regex.Replace(str, "\\s", " ");
        str = str.Replace("&nbsp;", " ");
        str = str.Replace("<br>", "\n");
        str = Regex.Replace(str, "<[^>]+>", "");  // remove elements
        str = Regex.Replace(str, " +", " ");      // many spaces are one space
        str = Regex.Replace(str, " *\n *", "\n"); // “one <br> two” -> “one\ntwo”
        str = str.Replace("&gt;", ">");
        str = str.Replace("&lt;", "<");
        str = str.Replace("&quot;", "\"");
        str = str.Replace("&apos;", "'");
        str = str.Replace("&amp;", "&");
        // any other html &entity; possible :o(
        str = str.Trim(' '); // leave leading and trailing \n (<br>)s
        return str;
    }
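
A possible way around the hand-written entity list, offered as a sketch rather than anything AngleSharp provides: .NET's own System.Net.WebUtility.HtmlDecode can decode the entities in one call. The class name TextExtractionSketch is hypothetical, and the method takes the markup string (for example elem.InnerHtml) instead of an IElement so that it runs without any AngleSharp reference. It also accepts <br/> and <BR />, which the Replace("<br>", ...) call above would miss.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class TextExtractionSketch
{
    // Hypothetical helper: pass an HTML fragment, e.g. elem.InnerHtml.
    public static string ExtractText(string innerHtml)
    {
        // Source whitespace (\n, \t, runs of spaces) counts as a single space.
        string str = Regex.Replace(innerHtml, @"\s+", " ");
        // <br>, <br/> and <BR /> become line breaks.
        str = Regex.Replace(str, @"<br\s*/?>", "\n", RegexOptions.IgnoreCase);
        // Remove every remaining tag.
        str = Regex.Replace(str, "<[^>]+>", "");
        // Decode entities only after the tags are gone, so "&lt;b&gt;" stays text.
        str = WebUtility.HtmlDecode(str);
        // &nbsp; decodes to U+00A0; treat it as an ordinary space.
        str = str.Replace('\u00A0', ' ');
        str = Regex.Replace(str, " +", " ");        // many spaces are one space
        str = Regex.Replace(str, " *\n *", "\n");   // "one <br> two" -> "one\ntwo"
        return str.Trim(' ');                       // keep leading/trailing line breaks
    }
}
```

Called as TextExtractionSketch.ExtractText(elem.InnerHtml), this keeps the behaviour of the code above. It is still regex-based tag stripping, so markup such as comments or script content is not handled.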

FlorianRappl commented 4 years ago

Have you tried the CSS extension innerText?

It's available when you include AngleSharp.Css.

GeneThomas commented 4 years ago

Hello:

I have:

using AngleSharp.Css;

...

private static string ExtractText(IElement elem)
{
  // there is no elem.InnerText?

}

FlorianRappl commented 4 years ago

It's GetInnerText; innerText is the IDL / JS name. C# does not have extension properties.

Hope that helps!

GeneThomas commented 4 years ago

Are you sure that that is found by using AngleSharp.Css? I'm using that but IElement.GetInnerText is not found?

FlorianRappl commented 4 years ago

Depends what you mean by "using AngleSharp.Css". Using as in "referencing" or as in "putting in the using keyword"?

https://github.com/AngleSharp/AngleSharp.Css/blob/master/src/AngleSharp.Css/Extensions/ElementExtensions.cs#L41

You need to reference AngleSharp.Css, properly add it to your configuration, and import the AngleSharp.Dom namespace with a using directive.
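
For readers following along, a minimal sketch of those three steps, assuming the AngleSharp and AngleSharp.Css NuGet packages are referenced. The render device values are the ones reported to work later in this thread; the class name, the method name ReadCellAsync and the sample markup are made up for illustration.

```csharp
using System;
using System.Threading.Tasks;
using AngleSharp;        // Configuration, BrowsingContext, WithCss(), WithRenderDevice()
using AngleSharp.Css;    // DefaultRenderDevice
using AngleSharp.Dom;    // IElement and the GetInnerText() extension from AngleSharp.Css

static class InnerTextSketch
{
    // Hypothetical helper: parses the markup and reads one element both ways.
    public static async Task<(string TextContent, string InnerText)> ReadCellAsync(
        string html, string selector)
    {
        // Step 2: add CSS support (and a render device) to the configuration.
        IConfiguration config = Configuration.Default
            .WithCss()
            .WithRenderDevice(new DefaultRenderDevice
            {
                DeviceHeight = 768,
                DeviceWidth = 1024,
            });

        IBrowsingContext context = BrowsingContext.New(config);
        IDocument document = await context.OpenAsync(req => req.Content(html));
        IElement cell = document.QuerySelector(selector)
            ?? throw new InvalidOperationException("selector matched nothing");

        // Step 3: GetInnerText() is an extension method, visible via "using AngleSharp.Dom".
        return (cell.TextContent, cell.GetInnerText());
    }

    static async Task Main()
    {
        var result = await ReadCellAsync(
            "<table><tr><th id=\"h\">Total<br>Cases</th></tr></table>", "#h");
        Console.WriteLine(result.TextContent); // DOM text, the <br> contributes nothing
        Console.WriteLine(result.InnerText);   // rendered text as computed by AngleSharp.Css
    }
}
```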

GeneThomas commented 4 years ago

Thanks, I understand now. Having AngleSharp.Css as a namespace in AngleSharp.dll will confuse other users too. In fact, is there a reason to have multiple projects/DLLs? The size of packages is not important these days, only the developer's time spent dealing with multiple projects/DLLs. Do I need to initialise AngleSharp.Css? The package is throwing from within GetInnerText() in v0.14 and on the master branch:

    AngleSharp.Css.dll!AngleSharp.Css.Dom.CssStyleDeclaration.ChangeDeclarations(System.Collections.Generic.IEnumerable decls, System.Predicate defaultSkip, System.Func<AngleSharp.Css.Dom.ICssProperty, AngleSharp.Css.Dom.ICssProperty, bool> removeExisting) Line 362 C#
    AngleSharp.Css.dll!AngleSharp.Css.Dom.CssStyleDeclaration.SetDeclarations(System.Collections.Generic.IEnumerable decls) Line 309 C#
    AngleSharp.Css.dll!AngleSharp.Css.StyleCollectionExtensions.ComputeCascadedStyle(System.Collections.Generic.IEnumerable styleCollection, AngleSharp.Dom.IElement element, AngleSharp.Css.Dom.ICssStyleDeclaration parent) Line 90 C#
    AngleSharp.Css.dll!AngleSharp.Css.StyleCollectionExtensions.ComputeDeclarations(System.Collections.Generic.IEnumerable rules, AngleSharp.Dom.IElement element, string pseudoSelector) Line 58 C#
    AngleSharp.Css.dll!AngleSharp.Dom.WindowExtensions.GetComputedStyle(AngleSharp.Dom.IWindow window, AngleSharp.Dom.IElement element, string pseudo) Line 77 C#
    AngleSharp.Css.dll!AngleSharp.Dom.CssApiExtensions.ComputeCurrentStyle(AngleSharp.Dom.IElement element) Line 25 C#
    AngleSharp.Css.dll!AngleSharp.Dom.ElementExtensions.GetInnerText(AngleSharp.Dom.IElement element) Line 52 C#
    Covid19ByContinent.exe!Covid19ByContinent.ExtractText(AngleSharp.Dom.IElement elem) Line 371 C#
    Covid19ByContinent.exe!Covid19ByContinent.ExtractTable(string html, string tableSelector, int keyColumn) Line 424 C#
    Covid19ByContinent.exe!Covid19ByContinent.Main(string[] args) Line 107 C#

In public static ICssStyleDeclaration ComputeCascadedStyle(this IEnumerable styleCollection, IElement element, ICssStyleDeclaration parent = null), in src\AngleSharp.Css\Extensions\StyleCollectionExtensions.cs, element.GetStyle() returns null; that null is used later, which throws.

Thanks for the assistance...,

Gene Thomas.

FlorianRappl commented 4 years ago

Sorry but you need to show your configuration. I assume you did not configure it properly.

I explicitly wrote:

You need to reference AngleSharp.Css, properly add it to your configuration, and import the AngleSharp.Dom namespace with a using directive.

I guess you asked here:

Do I need to initialise AngleSharp.Css?

Well, the README is quite explicit here.

var config = Configuration.Default
    .WithCss(); // from AngleSharp.Css

(https://github.com/AngleSharp/AngleSharp.Css#basic-configuration, it's pretty much the first thing there)

Also, having multiple projects is important, because not everyone wants to have CSS in their config. The lib size may not matter to you, but it certainly matters to others. Don't generalize from your use case to the majority.

I'll close it for now.

GeneThomas commented 4 years ago

It works now, the config that works has both WithCss() and WithRenderDevice():

    IConfiguration config = Configuration.Default
        .WithCss()
        .WithRenderDevice(new DefaultRenderDevice
        {
            DeviceHeight = 768,
            DeviceWidth = 1024,
        });

Now, rather than taking 5 seconds, it will take 15 minutes! Is there a way to stop AngleSharp.Css doing so much work?

Yours Sincerely,

Gene Thomas.

GeneThomas commented 4 years ago

Hello, thanks for your assistance so far. I thought the reason I am using AngleSharp might interest you: I am crunching numbers for the Covid-19 pandemic along cultural bounds, i.e. grouping similar countries, such as the Middle East:

    Continent           Population       %     Cases  Deaths      Tests Deaths/Cases  Cases/M Deaths/M   Tests/M Relative Cases/M  Relative Deaths/M  Relative Tests/M
    ──────────────── ───────────── ─────── ───────── ─────── ────────── ──────────── ──────── ──────── ───────── ──────────────────── ──────────────────── ────────────────────
    Europe             625,823,853   8.03% 1,414,759 143,608 14,384,048       10.15% 2,260.63   229.47 22,984.18 █████████████  ████████████████████  █████████████████
    North America      368,744,805   4.73% 1,273,607  73,775  8,381,799        5.79% 3,453.90   200.07 22,730.62 ████████████████████  █████████████████▌  █████████████████
    Latin America      649,376,336   8.33%   270,985  14,363  2,033,210        5.30%   417.30    22.12  3,131.02 ██▌  ██  ██▌
    Middle East        502,303,294   6.44%   210,382   8,267  3,419,502        3.93%   418.83    16.46  6,807.64 ██▌  █▌  █████
    Russia etc..       300,517,069   3.85%   195,736   2,137  5,818,812        1.09%   651.33     7.11 19,362.67 ████  ▌  ██████████████▌
    Australia and Nz    30,322,117   0.39%     8,333     116    818,320        1.39%   274.82     3.83 26,987.56 █▌  ▌  ████████████████████
    Asia             2,346,060,124  30.09%   160,207   7,104  2,173,275        4.43%    68.29     3.03    926.35 ▌  ▌
    India etc..      1,817,448,317  23.31%    79,466   2,243  1,527,972        2.82%    43.72     1.23    840.72 ▌
    Africa           1,145,344,967  14.69%    29,792     688    656,983        2.31%    26.01     0.60    573.61 ▌
    Polynesia           11,495,169   0.15%        26       0      3,409        0.00%     2.26     0.00    296.56
    ──────────────── ───────────── ─────── ───────── ─────── ────────── ──────────── ──────── ──────── ───────── ──────────────────── ──────────────────── ────────────────────
    World            7,797,436,051 100.00% 3,643,293 252,301 39,217,330        6.93%   467.24    32.36  5,029.52
    ════════════════ ═════════════ ═══════ ═════════ ═══════ ══════════ ════════════ ════════ ════════ ═════════ ════════════════════ ════════════════════ ════════════════════

Yours Sincerely,

Gene Thomas

FlorianRappl commented 4 years ago

Now, rather than taking 5 seconds, it will take 15 minutes! Is there a way to stop AngleSharp.Css doing so much work?

I cannot help you in full detail unless you show your config. For instance, it makes a big difference if you have resource requesting activated. Evaluating the style sheets is costly as render tree evaluations are much more expensive than DOM tree evaluations.

Usually, even for large sites, AngleSharp evaluates in a fraction of a second, and AngleSharp.Css does not take that long either. If you think it takes too long, we first need to know where the time is spent (e.g., style sheet evaluation(s) or the GetInnerText call - if the latter, where in there... can we optimize?).
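
One simple way to find out where the time goes is to time each phase separately. The sketch below uses only System.Diagnostics.Stopwatch from the base class library; PhaseTimer and Measure are hypothetical names, not part of AngleSharp.

```csharp
using System;
using System.Diagnostics;

static class PhaseTimer
{
    // Runs one phase, prints how long it took, and passes its result through.
    public static T Measure<T>(string label, Func<T> phase)
    {
        var watch = Stopwatch.StartNew();
        T result = phase();
        watch.Stop();
        Console.WriteLine($"{label}: {watch.ElapsedMilliseconds} ms");
        return result;
    }
}
```

Wrapping the document load in one call, for example PhaseTimer.Measure("open", () => context.OpenAsync(url).GetAwaiter().GetResult()), and each cell extraction in another, for example PhaseTimer.Measure("innerText", () => cell.GetInnerText()), should show which of the two dominates. Here context, url and cell stand for the caller's own variables.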

Coming back to the OP - I assume now that GetInnerText satisfies your needs. If perf. can / should be improved, I would suggest opening a new item on AngleSharp.Css. This way we can track progress properly and have a discussion with the folks there.

Remark: Indeed the use case is great and I fully appreciate the info! Wonderful stuff! :rocket: