Strumenta / SmartReader

SmartReader is a library to extract the main content of a web page, based on a port of the Readability library by Mozilla.
https://smartreader.inre.me
Apache License 2.0

Readability.js returns content, SmartReader does not #45

Closed: wardboumans closed this issue 2 years ago

wardboumans commented 2 years ago

Testing on https://kotaku.com/destiny-2-witch-deepsight-resonance-crafting-solstice-1849392326 I get no result with SmartReader, but I do with Readability.js.

I use Playwright (headless Chrome) to get the HTML and feed it to SmartReader: no content. If I use Playwright and evaluate Readability.js against the page, I do get content (the built-in Firefox reader also works fine).

Strange, if it's a direct port.

My test code:

using System;
using Microsoft.Playwright;

// Render the page in headless Chromium so its JavaScript runs before extraction
using var playwright = await Playwright.CreateAsync();
await using var browser = await playwright.Chromium.LaunchAsync();
var page = await browser.NewPageAsync();

var url = "https://kotaku.com/destiny-2-witch-deepsight-resonance-crafting-solstice-1849392326";
await page.GotoAsync(url);
var html = await page.ContentAsync();

// Readability.js, evaluated in the browser context
var json = await page.EvaluateAsync("async () => { const readability = await import('https://cdn.skypack.dev/@mozilla/readability'); return (new readability.Readability(document)).parse(); }");
Console.WriteLine(json.ToString()); // CONTENT

// SmartReader, fed the already-rendered HTML
SmartReader.Reader sr = new SmartReader.Reader(url, html);
sr.Debug = true;
sr.LoggerDelegate = Console.WriteLine;
sr.Logging = SmartReader.ReportLevel.Info;
SmartReader.Article article = sr.GetArticle();
Console.WriteLine(article.IsReadable);  // FALSE

gabriele-tomassetti commented 2 years ago

Thanks for your feedback.

It is a direct port of the main algorithm, but I can see some potential explanations: for instance, is it possible that when you evaluate the code fragment for Readability you also evaluate JavaScript present on the page, thus changing the content?

I will try to troubleshoot the issue this weekend.

wardboumans commented 2 years ago

> Is it possible that when you evaluate the code fragment for Readability you also evaluate JavaScript present on the page, thus changing the content?

JavaScript is already executed, since I use Playwright to render the page instead of only downloading the HTML. I read that Readability can change the DOM, but I made sure I saved the HTML before doing the Eval.

Thanks for having a look at this!

gabriele-tomassetti commented 2 years ago

The problem is that AngleSharp behaves differently from the browser's parser. Basically, document.querySelectorAll returns nothing in AngleSharp, while it does return a few nodes in the browser or in Node. I still have not found why this happens, but I am going to see if there is a fix.
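
A minimal way to reproduce that difference, assuming the Playwright-rendered page is in a string named html and using "p" as a stand-in selector (SmartReader's actual internal selectors are not shown in this thread):

using System;
using AngleSharp.Html.Parser;

// Parse the saved page with AngleSharp and count the nodes a simple selector finds;
// compare with document.querySelectorAll('p').length in the browser's DevTools.
var parser = new HtmlParser();
var document = parser.ParseDocument(html);
var paragraphs = document.QuerySelectorAll("p");
Console.WriteLine($"AngleSharp sees {paragraphs.Length} <p> nodes");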

acidus99 commented 2 years ago

For what it's worth, Kotaku article pages don't render if JavaScript is turned off. You can try it using the submitted URL. Disable JavaScript in your browser (I tested both Safari and Chrome), load that URL, and no content at all is rendered.

What I suspect is happening: SmartReader ends up working on the un-rendered, JavaScript-free HTML, which for this page contains no article content.

@wardboumans as an aside, other websites behave this way too. Kotaku does include an embedded JSON-LD block, which contains the content of the article. You lose some formatting, but this works as a fallback for CMSs that require JS but still want to present metadata to crawlers.
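
As a rough illustration of that fallback (not part of SmartReader; the schema.org property name articleBody and the single-object JSON-LD shape are assumptions about how the page structures its block), something like this could pull the text out of the embedded JSON-LD:

using System;
using System.Text.Json;
using AngleSharp.Html.Parser;

// Look for the page's JSON-LD block and, if it is a schema.org Article object,
// print its articleBody as a plain-text fallback.
var parser = new HtmlParser();
var document = parser.ParseDocument(html);
var ldJson = document.QuerySelector("script[type='application/ld+json']")?.TextContent;
if (ldJson != null)
{
    using var json = JsonDocument.Parse(ldJson);
    if (json.RootElement.ValueKind == JsonValueKind.Object &&
        json.RootElement.TryGetProperty("articleBody", out var body))
    {
        Console.WriteLine(body.GetString());
    }
}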

wardboumans commented 2 years ago

@acidus99 You are correct, but I don't use SmartReader to download the HTML. I take the rendered output HTML from Playwright and feed it to SmartReader. At least, that is what I'm trying to do. Am I calling SmartReader the wrong way?

gabriele-tomassetti commented 2 years ago

I was able to fix this issue. It was due to a noscript tag. It was a mistake on my part, because AngleSharp was handling the tag correctly. All we had to do was change an option in the HtmlParser; see the documentation and this issue.
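
The comment does not name the option, so the sketch below is an assumption: it uses AngleSharp's IsScripting setting, which, when enabled, makes the parser treat <noscript> content as raw text (as a JavaScript-enabled browser would) instead of parsing it into DOM nodes.

using AngleSharp.Html.Parser;

// Parse with scripting enabled so <noscript> content stays as text rather than
// being expanded into DOM nodes (assumed to be the option change mentioned above).
var parser = new HtmlParser(new HtmlParserOptions
{
    IsScripting = true
});
var document = parser.ParseDocument(html);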

wardboumans commented 2 years ago

Awesome, thanks man!