facundoolano / feedi

RSS + Mastodon feed reader
GNU Affero General Public License v3.0

Use a node.js readability package for article extraction #10

Closed facundoolano closed 1 year ago

facundoolano commented 1 year ago

After trying several Python libraries (newspaper, news-please, python-readability, readabilipy, trafilatura, goose3), I found that none of them extracts HTML content as precisely as Mozilla's reader mode, which is available as a standalone Node.js library. This PR adds a (hacky) Node script so that the library can be called from a Python subprocess.
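For illustration, the Python side of such a subprocess bridge could look roughly like the sketch below. It is not the code in this PR: the script name `extract_article.js`, the `extract_article` function, and the convention of sending the HTML on stdin and reading a JSON object from stdout are all assumptions made for the example.

```python
import json
import subprocess

# Hypothetical default command; the actual script name and location may differ.
DEFAULT_COMMAND = ("node", "extract_article.js")


def extract_article(url, html, command=DEFAULT_COMMAND):
    """Run an external extractor and return its JSON output as a dict.

    Assumed protocol (not confirmed by the PR): the page URL is passed as the
    last command-line argument, the raw HTML is written to stdin, and the
    script prints a single JSON object to stdout.
    """
    result = subprocess.run(
        [*command, url],
        input=html,           # HTML goes to the child's stdin
        capture_output=True,  # collect stdout and stderr
        text=True,            # exchange str instead of bytes
        check=True,           # raise CalledProcessError on a non-zero exit
        timeout=30,           # don't hang the reader on a stuck child process
    )
    return json.loads(result.stdout)
```

Keeping the command as a parameter makes it easy to swap in another extractor, or a stub in tests, without requiring Node to be installed.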

Note that the readabilipy lib does a similar job, but it doesn't pass the right arguments to the JSDOM library, so relative image URLs are not processed properly. That's why I'm using this custom hack, at least for now.
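To make the relative image problem concrete, here is a small Python sketch of the resolution step that has to happen somewhere in the pipeline. It only illustrates the idea with the standard library; it is not how JSDOM or this PR does it, and the `absolutize_image_urls` helper and the example URLs are hypothetical.

```python
from urllib.parse import urljoin


def absolutize_image_urls(image_srcs, page_url):
    """Resolve each image src against the URL of the page it came from."""
    return [urljoin(page_url, src) for src in image_srcs]


srcs = ["../img/a.png", "/static/b.png", "https://cdn.example.com/c.png"]
print(absolutize_image_urls(srcs, "https://example.com/posts/article.html"))
# → ['https://example.com/img/a.png', 'https://example.com/static/b.png', 'https://cdn.example.com/c.png']
```

Without the page URL, an extractor has nothing to resolve `../img/a.png` against, and a relative src left as-is in the extracted HTML would be resolved against whatever page displays the article rather than the original site.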

We could also consider assuming the user has Firefox or Safari reader mode available and sending articles there instead of trying to reproduce that behavior within the app.