Closed GeneThomas closed 4 years ago
Small tweak to text extraction code to match web browsers’ “one <br> two” -> “one\ntwo”
using System.Text.RegularExpressions;
using AngleSharp.Dom;

static string ExtractText(IElement elem)
{
    //elem.Text;
    //elem.TextContent
    string str = elem.InnerHtml;
    str = Regex.Replace(str, "\\s", " "); // \n, \t etc. become spaces
    str = str.Replace("&nbsp;", " ");
    str = str.Replace("<br>", "\n");
    str = Regex.Replace(str, "<[^>]+>", ""); // remove elements
    str = Regex.Replace(str, " +", " "); // many spaces are one space
    str = Regex.Replace(str, " *\n *", "\n"); // “one <br> two” -> “one\ntwo”
    str = str.Replace("&gt;", ">");
    str = str.Replace("&lt;", "<");
    str = str.Replace("&quot;", "\"");
    str = str.Replace("&apos;", "'");
    str = str.Replace("&amp;", "&"); // last, so "&amp;lt;" is not decoded twice
    // any other html &entity; possible :o(
    str = str.Trim(' '); // leave leading and trailing \n (<br>)s
    return str;
}
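The "any other html &entity;" gap in the code above could probably be closed with the framework's own decoder. Below is a minimal sketch, assuming the markup is passed in as a string (for instance elem.InnerHtml); the HtmlText class and Extract method are hypothetical names used only for illustration, and WebUtility.HtmlDecode from System.Net stands in for the hand-written Replace calls.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of AngleSharp.
static class HtmlText
{
    public static string Extract(string innerHtml)
    {
        // Source whitespace (\n, \t, ...) renders as a plain space.
        string s = Regex.Replace(innerHtml, @"\s", " ");
        // <br>, <br/> and <BR> become real line breaks.
        s = Regex.Replace(s, @"<br\s*/?>", "\n", RegexOptions.IgnoreCase);
        // Drop all remaining tags.
        s = Regex.Replace(s, "<[^>]+>", "");
        // Decode entities only after the tags are gone, so "&lt;b&gt;" survives as text.
        s = WebUtility.HtmlDecode(s);
        // &nbsp; decodes to U+00A0; treat it as an ordinary space here.
        s = s.Replace('\u00A0', ' ');
        // Many spaces are one space, and spaces around a line break disappear.
        s = Regex.Replace(s, " +", " ");
        s = Regex.Replace(s, " *\n *", "\n");
        // Keep leading and trailing \n, as in the original code.
        return s.Trim(' ');
    }
}

static class Program
{
    static void Main()
    {
        Console.WriteLine(HtmlText.Extract("one <br> two &amp; &quot;three&quot;"));
    }
}
```

This is still regex-based tag stripping, so it shares the limits of the original approach (comments, script and style content are not handled specially).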
Have you tried the CSS extension innerText?
It's available when you include AngleSharp.Css.
Hello:
I have:
using AngleSharp.Css;
...
private static string ExtractText(IElement elem)
{
// there is no elem.InnerText?
}
On Mon, 4 May 2020 at 18:15, Florian Rappl notifications@github.com wrote:
Have you tried the CSS extension innerText https://developer.mozilla.org/en-US/docs/Web/API/HTMLElement/innerText?
It's available when you include AngleSharp.Css.
-- Gene Thomas
021 436384 http://genethomas.com gene@genethomas.com
It's GetInnerText; innerText is the IDL / JS name. C# does not have extension properties.
Hope that helps!
Are you sure that that is found by using AngleSharp.Css? I'm using that but IElement.GetInnerText is not found?
Depends what you mean by "using AngleSharp.Css". Using as in "referencing" or as in "putting in the using keyword"?
You need to reference AngleSharp.Css, properly add it to your configuration, and using the AngleSharp.Dom namespace.
Thanks, I understand now. Having AngleSharp.Css as a namespace in AngleSharp.dll will confuse other users too. In fact, is there a reason to have multiple projects/DLLs? The size of packages is not important these days, just the developer's time spent dealing with multiple projects/DLLs. Do I need to initialise AngleSharp.Css? The package is throwing from within GetInnerText() in v0.14 and on the master branch:
AngleSharp.Css.dll!AngleSharp.Css.Dom.CssStyleDeclaration.ChangeDeclarations(System.Collections.Generic.IEnumerable
In public static ICssStyleDeclaration ComputeCascadedStyle(this
IEnumerable
Thanks for the assistance...,
Gene Thomas.
Sorry but you need to show your configuration. I assume you did not configure it properly.
I explicitly wrote:
You need to reference AngleSharp.Css, properly add it to your configuration, and using the AngleSharp.Dom namespace.
I guess you asked here:
Do I need to initialise AngleSharp.Css?
Well, the README is quite explicit here.
var config = Configuration.Default
.WithCss(); // from AngleSharp.Css
(https://github.com/AngleSharp/AngleSharp.Css#basic-configuration, it's pretty much the first thing there)
Also having multiple projects is important, because not everyone wants to have CSS in their config. The lib size may not matter to you, but it certainly matters. Don't infer from your use case to the majority.
I'll close it for now.
It works now, the config that works has both WithCss() and WithRenderDevice():
IConfiguration config = Configuration.Default
.WithCss()
.WithRenderDevice(new DefaultRenderDevice
{
DeviceHeight = 768,
DeviceWidth = 1024,
});
Now rather than taking 5 seconds it shall take 15 minutes! Is there a way so I can stop AngleSharp.Css doing so much work?
Yours Sincerely,
Gene Thomas.
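For anyone landing here later, the pieces from this thread can be put together roughly as follows. This is only a sketch under assumptions: it needs the AngleSharp and AngleSharp.Css packages referenced, the inline markup is made up for illustration, and the namespace of DefaultRenderDevice is assumed to be AngleSharp.Css as in the using line quoted earlier.

```csharp
using System;
using System.Threading.Tasks;
using AngleSharp;      // Configuration, BrowsingContext, WithCss()
using AngleSharp.Css;  // DefaultRenderDevice (namespace assumed)
using AngleSharp.Dom;  // GetInnerText() extension lives in this namespace

static class InnerTextSketch
{
    static async Task Main()
    {
        // Same configuration as above: CSS support plus a render device.
        IConfiguration config = Configuration.Default
            .WithCss()
            .WithRenderDevice(new DefaultRenderDevice
            {
                DeviceHeight = 768,
                DeviceWidth = 1024,
            });

        IBrowsingContext context = BrowsingContext.New(config);

        // Illustrative markup only; a real page would be loaded or passed in here.
        IDocument document = await context.OpenAsync(
            req => req.Content("<p id='p'>one <br> two</p>"));

        IElement p = document.QuerySelector("#p");

        // TextContent drops the <br>; GetInnerText() is the call suggested in this thread.
        Console.WriteLine(p.TextContent);
        Console.WriteLine(p.GetInnerText());
    }
}
```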
On Tue, 5 May 2020 at 19:51, Florian Rappl notifications@github.com wrote:
Closed #877 https://github.com/AngleSharp/AngleSharp/issues/877.
Hello, thanks for your assistance so far. I thought the reason I am using AngleSharp may interest you: I am crunching numbers for the Covid-19 pandemic along cultural bounds, i.e. grouping similar countries, such as the Middle East:
Continent           Population       %     Cases  Deaths      Tests Deaths/Cases  Cases/M Deaths/M   Tests/M
──────────────── ───────────── ─────── ───────── ─────── ────────── ──────────── ──────── ──────── ─────────
Europe             625,823,853   8.03% 1,414,759 143,608 14,384,048       10.15% 2,260.63   229.47 22,984.18
North America      368,744,805   4.73% 1,273,607  73,775  8,381,799        5.79% 3,453.90   200.07 22,730.62
Latin America      649,376,336   8.33%   270,985  14,363  2,033,210        5.30%   417.30    22.12  3,131.02
Middle East        502,303,294   6.44%   210,382   8,267  3,419,502        3.93%   418.83    16.46  6,807.64
Russia etc..       300,517,069   3.85%   195,736   2,137  5,818,812        1.09%   651.33     7.11 19,362.67
Australia and Nz    30,322,117   0.39%     8,333     116    818,320        1.39%   274.82     3.83 26,987.56
Asia             2,346,060,124  30.09%   160,207   7,104  2,173,275        4.43%    68.29     3.03    926.35
India etc..      1,817,448,317  23.31%    79,466   2,243  1,527,972        2.82%    43.72     1.23    840.72
Africa           1,145,344,967  14.69%    29,792     688    656,983        2.31%    26.01     0.60    573.61
Polynesia           11,495,169   0.15%        26       0      3,409        0.00%     2.26     0.00    296.56
──────────────── ───────────── ─────── ───────── ─────── ────────── ──────────── ──────── ──────── ─────────
World            7,797,436,051 100.00% 3,643,293 252,301 39,217,330        6.93%   467.24    32.36  5,029.52
════════════════ ═════════════ ═══════ ═════════ ═══════ ══════════ ════════════ ════════ ════════ ═════════
(Three further columns, Relative Cases/M, Relative Deaths/M and Relative Tests/M, showed the per-million figures as bars.)
Yours Sincerely,
Gene Thomas
Now rather than taking 5 seconds it shall take 15 minutes! Is there a way so I can stop AngleSharp.Css doing so much work?
I cannot help you in full detail unless you show your config. For instance, it makes a big difference if you have resource requesting activated. Evaluating the style sheets is costly as render tree evaluations are much more expensive than DOM tree evaluations.
Usually, even for large sites AngleSharp evaluates in a fraction of a second, and AngleSharp.Css does not take that long either. If you think it takes too long we first need to know where the time is spent (e.g., style sheet evaluation(s) or the getInnerText call - if the latter, where in there... can we optimize?).
Coming back to the OP - I assume now that getInnerText satisfies your needs. If perf. can / should be improved I would suggest opening a new item on AngleSharp.Css. This way we can track progress properly and have a discussion with the folks there.
Remark: Indeed the use case is great and I fully appreciate the info! Wonderful stuff! :rocket:
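To find out where the time is spent, one option is to time document loading and the text extraction separately. The following is a rough sketch only, assuming the same configuration as earlier in the thread and a made-up inline table; the selector and markup are illustrative, and Stopwatch timings of a single run are indicative at best.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using AngleSharp;
using AngleSharp.Css;  // DefaultRenderDevice (namespace assumed)
using AngleSharp.Dom;  // GetInnerText() extension

static class TimingSketch
{
    static async Task Main()
    {
        var config = Configuration.Default
            .WithCss()
            .WithRenderDevice(new DefaultRenderDevice
            {
                DeviceHeight = 768,
                DeviceWidth = 1024,
            });
        var context = BrowsingContext.New(config);

        // Illustrative markup; replace with the real page source when profiling.
        const string html =
            "<table id='t'><tr><th>Country</th><th>Total<br>Cases</th></tr></table>";

        // Phase 1: parsing the document (includes any style sheet handling done on load).
        var loadWatch = Stopwatch.StartNew();
        var document = await context.OpenAsync(req => req.Content(html));
        loadWatch.Stop();

        // Phase 2: the GetInnerText() calls themselves.
        var textWatch = Stopwatch.StartNew();
        foreach (var cell in document.QuerySelectorAll("#t th"))
        {
            _ = cell.GetInnerText();
        }
        textWatch.Stop();

        Console.WriteLine($"load: {loadWatch.ElapsedMilliseconds} ms");
        Console.WriteLine($"GetInnerText: {textWatch.ElapsedMilliseconds} ms");
    }
}
```

If the second number dominates, that would be the detail worth putting into a new item on AngleSharp.Css.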
New Feature Proposal
When extracting information from web pages, such as the main table #main_table_countries_today in https://www.worldometers.info/coronavirus/, one wants the text as the user sees it: \n and \t converted to space; <br> as \n; HTML entities expanded, e.g. &gt; → >; multiple spaces collapsed to one space; and the text Trim()med.
Presently the .Text() and .TextContent of the second header cell are “TotalCases” because the <br> has been removed. Implementing this on top of AngleSharp is not ideal, as one has to handle all of the HTML entities, such as &quot;.
If you do not want to break existing users of .Text [you could since you have not got to v1.0 yet] a property called .UserText would be good.
The following code does most of what I require:
Attached is a Visual Studio C# project that shows the feature, the user sees:
but .Text() and .TextContent return “onetwo”:
AngleSharpBrFault.zip .