Strumenta / SmartReader

SmartReader is a library to extract the main content of a web page, based on a port of the Readability library by Mozilla
https://smartreader.inre.me
Apache License 2.0

Constructor to pass already parsed IHtmlDocument? #46

Closed: acidus99 closed this issue 2 years ago

acidus99 commented 2 years ago

This is not a typical request, so I understand if it doesn't make sense.

I am using SmartReader in a project. I use AngleSharp to parse the document and extract some data from outside the main article content (like navigation links inside <nav> or <header>), and then I use SmartReader to extract the article content. This means the code parses the entire HTML document twice: first when I call AngleSharp's parser.ParseDocument(html) to do my own analysis, and a second time inside SmartReader's Reader() constructor. Since I'm working with 1-2 MB HTML files, this double parsing has a noticeable impact on overall performance.

Would you consider adding an additional Reader constructor (or a static ParseArticle() method) where I can pass in an IHtmlDocument from AngleSharp to avoid this double parsing?

gabriele-tomassetti commented 2 years ago

There is already an alternative approach that could work for your problem. SmartReader has a feature to perform custom operations either at the beginning or at the end of the article extraction process.

You could add your custom code to extract the data you need in a custom operation at the start, before the article is processed. This way you would just parse the article with AngleSharp once, inside SmartReader. Would this approach work for you?
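For illustration, the custom-operation approach might look roughly like the sketch below. It assumes SmartReader's `AddCustomOperationStart` accepts an `Action<IElement>` and invokes it on the document's root element before extraction begins; the class and method names (`CustomOperationSketch`, `ExtractWithNavLinks`) are hypothetical and exist only for this example.

```csharp
using System.Collections.Generic;
using AngleSharp.Dom;
using SmartReader;

public static class CustomOperationSketch
{
    // Hypothetical helper: extracts the article and, in the same parse,
    // collects the href of every link found inside <nav> or <header>.
    public static List<string> ExtractWithNavLinks(string url, string html, out Article article)
    {
        var navLinks = new List<string>();
        var reader = new Reader(url, html);

        // Assumption: the start operation receives the root element
        // before SmartReader strips navigation and other non-article markup.
        reader.AddCustomOperationStart(root =>
        {
            foreach (IElement link in root.QuerySelectorAll("nav a[href], header a[href]"))
            {
                string href = link.GetAttribute("href");
                if (href != null)
                {
                    navLinks.Add(href);
                }
            }
        });

        // The HTML is parsed once, inside SmartReader.
        article = reader.GetArticle();
        return navLinks;
    }
}
```

With this shape, the caller's own analysis rides along with SmartReader's single parse, at the cost of always running the article extraction.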

acidus99 commented 2 years ago

Thanks for the note about custom operations. I had looked at those. The challenge is that when I fetch and parse a webpage, I decide whether to use SmartReader at all based on the metadata I find.

As an example, I fetch https://www.wired.com and look for OpenGraph metadata tags, specifically <meta property="og:type">, to see what type of page this is. The homepage uses type website, so there isn't valuable content to extract. Instead, I collect the links to the news stories to show the user. When I fetch something like this news story on Wired, I parse it and see an OpenGraph type of article, in which case I use SmartReader to extract the text.

So I only want to use SmartReader some of the time. And when I do, I already have a parsed IHtmlDocument from AngleSharp. Like I said, it's probably not a common use case for your users.
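A minimal sketch of that flow, to make the double parse concrete. The class and method names (`ConditionalExtractionSketch`, `GetOpenGraphType`, `ExtractIfArticle`) are hypothetical; the sketch uses only AngleSharp's `HtmlParser` and SmartReader's existing `Reader(string, string)` constructor, not the constructor being requested.

```csharp
using AngleSharp.Html.Dom;
using AngleSharp.Html.Parser;
using SmartReader;

public static class ConditionalExtractionSketch
{
    // Returns the content of <meta property="og:type">, or null if absent.
    public static string GetOpenGraphType(IHtmlDocument document)
    {
        return document
            .QuerySelector("meta[property='og:type']")
            ?.GetAttribute("content");
    }

    // Returns the extracted article HTML, or null when the page is not an
    // article (for example, a homepage with og:type "website").
    public static string ExtractIfArticle(string url, string html)
    {
        // First parse: my own analysis with AngleSharp.
        IHtmlDocument document = new HtmlParser().ParseDocument(html);
        if (GetOpenGraphType(document) != "article")
        {
            return null;
        }

        // Second parse: Reader only accepts the raw HTML string here,
        // so the same document is parsed again inside SmartReader.
        Article article = new Reader(url, html).GetArticle();
        return article.IsReadable ? article.Content : null;
    }
}
```

The already parsed `document` cannot be handed to `Reader` in this sketch, which is the gap the requested constructor would close.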

gabriele-tomassetti commented 2 years ago

Well, we already have an explicit dependency on AngleSharp, since we allow users of the library to define custom operations that depend on AngleSharp code. So it would not make things any worse to also have a constructor that explicitly uses an AngleSharp element. I think we can add this constructor.

acidus99 commented 2 years ago

Thank you!