georgjaehnig / webpages-to-ebook

Create an EPUB from a list of URLs. Standing on the shoulders of Wget, Readability and Pandoc.
MIT License

Question about the readability module dependency: mozilla or luin? #14

Open m040601 opened 4 years ago

m040601 commented 4 years ago

A small question about the readability dependency. On the README.md page you write:

Create an EPUB from a list of URLs. Standing on the shoulders of Wget → Readability → Pandoc

where "Readability" links to https://github.com/mozilla/readability

But your project actually uses https://github.com/luin/readability, which installs a module called "node-readability".

I know that "luin" is probably a fork or something pulling from "mozilla". I just wanted to make sure there is a reason for this, and for not pulling directly from mozilla.

I ask this because I've been testing dozens of Node-based readability projects, and very frequently they choose to name their binary "readability", so you end up with a mess of different packages whose installed binaries are all named "readability".

georgjaehnig commented 4 years ago

Good question – I think I just took the readability project that appeared first in my npm search.

And in the meantime I was actually trying out https://www.npmjs.com/package/article-parser, yet I'm still having some issues with it. Do you know the project?

I understand you recommend using

npm install @mozilla/readability

?

m040601 commented 4 years ago

https://www.npmjs.com/package/article-parser Do you know the project?

No, never heard of it. Actually I search GitHub more than the npm registry for this kind of thing. But what I can tell you from my experience over the last years is that many of these Node modules or readability libraries make a big splash at the beginning, but end up totally abandoned and unmaintained after a few months. No matter if it is Node or Go or Python. So I tend to watch for the ones one can trust for the long run.

I understand you recommend using npm install @mozilla/readability ?

I'm not a developer or programmer. And I wouldn't touch Node even with a pole :-). So don't take my opinion as coming from an expert. I got my information from here:

https://github.com/qutebrowser/qutebrowser/pull/5009 Looks like the readability library is available via npm now:

It seems before you had to do

npm install -g https://github.com/mozilla/readability.git

But now you can do,

npm install -g @mozilla/readability

You can't do

npm install -g readability

Because someone else already took that name "readability".

But I've been collecting and testing lots of this kind of "readability" apps in Node/Python/etc. for the last year.

Since the Firefox Reader Mode seems to do a good job and Mozilla has lots of resources and developers, I always "assumed" that the Mozilla Node one might be the best one to use. Because I don't understand Node, I could never make a simple ready-made CLI tool out of the Mozilla library myself. That's why I looked for other ready-made ones, even though I don't like to install Node on my system. And I hate having to pull dozens or hundreds of small Node modules as dependencies.

But honestly, after having tested so many of these readability extractors, I think it doesn't make that big a difference at all. It depends very much on the website. Modern websites are so complicated. Sometimes even the simplest Python script based on the original readability algorithm does the job acceptably.

Let me see if I can find more (node tools) in my notes:

Mozilla Readability based:

  1. gardenappl / readability-cli · GitLab: This one seems to pull the "official" readability directly from Mozilla. https://gitlab.com/gardenappl/readability-cli

  2. NightMachinary/readability-cli: A CLI for Mozilla Readability. Get clean, uncluttered, ... It messes up my system because it installs a binary simply called "readability" that conflicts with others. https://github.com/NightMachinary/readability-cli

  3. aarmea/readability-scrape: Last update 2018. It imports Readability.js, the library used in Firefox's reader view, directly from Mozilla's repository. https://github.com/aarmea/readability-scrape

  4. qutebrowser/readability-js at master: This is the Node script that qutebrowser uses to get a "Reader Mode" just like Firefox. It uses the official Mozilla readability library (npm install -g @mozilla/readability). https://github.com/qutebrowser/qutebrowser/blob/master/misc/userscripts/readability-js

  5. enrico-kaack/markdown-clipper: Very interesting Firefox extension, also uses the official Mozilla readability. https://github.com/enrico-kaack/markdown-clipper

  6. pirate/readability-extractor: Wrapper around mozilla/readability to keep archivebox free from nodejs https://github.com/pirate/readability-extractor

  7. danburzo/percollate: A command-line tool to turn web pages into beautiful, readable documents in PDF, EPUB, or HTML format. It's a "big machine". Pulls in almost 400 MB of Chromium/Puppeteer (a fake browser?). Tries to do everything and the kitchen sink in Node, so it's not like yours, "on the shoulders of giants". But it seems polished and well maintained. Also seems to use Mozilla readability. https://github.com/danburzo/percollate

Not Mozilla Readability based:

  1. croqaz/clean-mark: Convert an article into a clean text. Also a very nice project. https://github.com/croqaz/clean-mark