Strumenta / SmartReader

SmartReader is a library to extract the main content of a web page, based on a port of the Readability library by Mozilla.
https://smartreader.inre.me
Apache License 2.0

Readability.js returns content, SmartReader does not #45

Closed: wardboumans closed this issue 2 years ago

wardboumans commented 2 years ago

Testing on https://kotaku.com/destiny-2-witch-deepsight-resonance-crafting-solstice-1849392326 I get no result with SmartReader, but I do with Readability.js.

I use Playwright (headless Chrome) to get the HTML and feed it to SmartReader: no content. If I use Playwright and evaluate Readability.js against the page, I do get content (the built-in Firefox reader also works fine).

Strange, if it's a direct port.

My test code:

using System;
using Microsoft.Playwright;

// Render the page in headless Chromium so its JavaScript runs before extraction
using var playwright = await Playwright.CreateAsync();
await using var browser = await playwright.Chromium.LaunchAsync();
var page = await browser.NewPageAsync();

var url = "https://kotaku.com/destiny-2-witch-deepsight-resonance-crafting-solstice-1849392326";
await page.GotoAsync(url);
var html = await page.ContentAsync();

// Readability.js, evaluated in the browser context
var json = await page.EvaluateAsync("async () => { const readability = await import('https://cdn.skypack.dev/@mozilla/readability'); return (new readability.Readability(document)).parse(); }");
Console.WriteLine(json.ToString()); // CONTENT

// SmartReader, fed the already-rendered HTML
SmartReader.Reader sr = new SmartReader.Reader(url, html);
sr.Debug = true;
sr.LoggerDelegate = Console.WriteLine;
sr.Logging = SmartReader.ReportLevel.Info;
SmartReader.Article article = sr.GetArticle();
Console.WriteLine(article.IsReadable);  // FALSE

gabriele-tomassetti commented 2 years ago

Thanks for your feedback.

It is a direct port of the main algorithm, but I can see some potential explanations: for instance, is it possible that when you evaluate the code fragment for Readability you also evaluate JavaScript present on the page, thus changing the content?

I will try to troubleshoot the issue this weekend.

wardboumans commented 2 years ago

> Is it possible that when you evaluate the code fragment for Readability you also evaluate JavaScript present on the page, thus changing the content?

JavaScript is already executed, since I use Playwright to render the page instead of only downloading the HTML. I read that Readability can change the DOM, but I made sure I saved the HTML before doing the Eval.

Thanks for having a look at this!

gabriele-tomassetti commented 2 years ago

The problem is that AngleSharp behaves differently from the browser's parser. Basically, document.querySelectorAll returns nothing in AngleSharp, while it does return a few nodes in the browser or in Node. I still have not found why this happens, but I am going to see if there is a fix.
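
A minimal way to reproduce that difference, assuming the Playwright-rendered page is in a string named html and using "p" as a stand-in selector (SmartReader's actual internal selectors are not shown in this thread):

using System;
using AngleSharp.Html.Parser;

// Parse the saved page with AngleSharp and count the nodes a simple selector finds;
// compare with document.querySelectorAll('p').length in the browser's DevTools.
var parser = new HtmlParser();
var document = parser.ParseDocument(html);
var paragraphs = document.QuerySelectorAll("p");
Console.WriteLine($"AngleSharp sees {paragraphs.Length} <p> nodes");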

acidus99 commented 2 years ago

For what it's worth, Kotaku article pages don't render if JavaScript is turned off. You can try it using the submitted URL. Disable JavaScript in your browser (I tested both Safari and Chrome), load that URL, and no content at all is rendered.

What I suspect is happening: SmartReader ends up working on the un-rendered, JavaScript-free HTML, which for this page contains no article content.

@wardboumans as an aside, other websites behave this way too. Kotaku does include an embedded JSON-LD block, which contains the content of the article. You lose some formatting, but this works as a fallback for CMSs that require JS but still want to present metadata to crawlers.
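
As a rough illustration of that fallback (not part of SmartReader; the schema.org property name articleBody and the single-object JSON-LD shape are assumptions about how the page structures its block), something like this could pull the text out of the embedded JSON-LD:

using System;
using System.Text.Json;
using AngleSharp.Html.Parser;

// Look for the page's JSON-LD block and, if it is a schema.org Article object,
// print its articleBody as a plain-text fallback.
var parser = new HtmlParser();
var document = parser.ParseDocument(html);
var ldJson = document.QuerySelector("script[type='application/ld+json']")?.TextContent;
if (ldJson != null)
{
    using var json = JsonDocument.Parse(ldJson);
    if (json.RootElement.ValueKind == JsonValueKind.Object &&
        json.RootElement.TryGetProperty("articleBody", out var body))
    {
        Console.WriteLine(body.GetString());
    }
}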

wardboumans commented 2 years ago

@acidus99 You are correct, but I don't use SmartReader to download the HTML. I take the rendered output HTML from Playwright and feed it to SmartReader. At least, that is what I'm trying to do. Am I calling SmartReader the wrong way?

gabriele-tomassetti commented 2 years ago

I was able to fix this issue. It was due to a noscript tag. It was a mistake on my part, because AngleSharp was handling the tag correctly. All we had to do was change an option in the HtmlParser; see the documentation and this issue.
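
The comment does not name the option, so the sketch below is an assumption: it uses AngleSharp's IsScripting setting, which, when enabled, makes the parser treat <noscript> content as raw text (as a JavaScript-enabled browser would) instead of parsing it into DOM nodes.

using AngleSharp.Html.Parser;

// Parse with scripting enabled so <noscript> content stays as text rather than
// being expanded into DOM nodes (assumed to be the option change mentioned above).
var parser = new HtmlParser(new HtmlParserOptions
{
    IsScripting = true
});
var document = parser.ParseDocument(html);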

wardboumans commented 2 years ago

Awesome, thanks man!