wardboumans closed this issue 2 years ago
Thanks for your feedback.
It is a direct port of the main algorithm, but I can see some potential explanations:
I will try to troubleshoot the issue this weekend.
Is it possible that when you evaluate the code fragment for Readability you also evaluate JavaScript present on the page, thus changing the content?
JavaScript is already executed, since I use Playwright to render the page instead of only downloading the HTML. I read that Readability can change the DOM, but I made sure I saved the HTML before doing the Eval.
Thanks for having a look at this!
The problem is that AngleSharp behaves differently from the web parser. Basically, document.querySelectorAll returns nothing, while it does return a few nodes in the browser or Node. I still have not found why this happens, but I am going to see if there is a fix.
For what it's worth, Kotaku article pages don't render if JavaScript is turned off. You can try it using the submitted URL. Disable JavaScript in your browser (I tested both Safari and Chrome), load that URL, and no content at all is rendered.
What I suspect is happening:
@wardboumans as an aside, other websites behave this way too. Kotaku does include an embedded JSON-LD block, which contains the content of the article. You lose some formatting, but this works as a fallback for CMSs that require JS but still want to present metadata to crawlers.
@acidus99 You are correct, but I don't use SmartReader to download the HTML. I take the rendered output HTML from Playwright and feed it to SmartReader. At least that is what I'm trying to do. Am I calling SmartReader the wrong way?
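To illustrate the JSON-LD fallback mentioned above, here is a minimal sketch in Python using only the standard library. It assumes the page embeds a `<script type="application/ld+json">` block whose top-level object (or one of the objects in a top-level array) carries the schema.org `articleBody` property. The names `JsonLdCollector` and `article_body_from_json_ld` are hypothetical helpers invented for this example, not part of SmartReader or Readability.js, and real pages may nest the article under other keys that this sketch does not handle.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        # The attribute value can be None (e.g. <script type>), so guard with "or".
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it.
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.blocks.append("".join(self._buffer))
            self._in_json_ld = False


def article_body_from_json_ld(html):
    """Return the first non-empty articleBody found in JSON-LD, or None."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # Skip malformed blocks instead of failing the whole page.
        # Assumption: the article object sits at the top level or in a top-level list.
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and item.get("articleBody"):
                return item["articleBody"]
    return None
```

A caller could try the normal extraction first and call `article_body_from_json_ld(html)` on the same saved HTML only when that returns nothing, accepting that the fallback text loses the article's formatting.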
I was able to fix this issue. It was due to a noscript tag. It was a mistake on my part, because AngleSharp was handling the tag correctly. All we had to do was change an option in the HtmlParser; see the documentation and this issue.
Awesome, thanks man!
Testing on https://kotaku.com/destiny-2-witch-deepsight-resonance-crafting-solstice-1849392326 I get no result with SmartReader, but I do with Readability.js.
I use Playwright (headless Chrome) to get the HTML and feed it to SmartReader: no content. If I use Playwright and Eval Readability.js against the page, I do get content (the built-in Firefox reader also works fine).
Strange if it's a direct port.
My test code: