After trying several Python libraries (newspaper, news-please, python-readability, readabilipy, trafilatura, goose3), I found none of them as precise at producing HTML content as Mozilla's reader mode, which is available as a standalone Node.js library. This PR adds a (hacky) Node script so that the library can be called from a Python subprocess.
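For reviewers, here is a minimal sketch of what the Python side of such a subprocess call could look like. It is not the code in this PR: the script name `readability.js` and its interface (page URL as the only argument, HTML on stdin, one JSON object on stdout) are assumptions made for illustration.

```python
import json
import subprocess


def extract_readable(html, url, script_path="readability.js", run=subprocess.run):
    """Run a Node script on `html` and return its parsed JSON output.

    `script_path` and the CLI contract below are hypothetical.
    """
    # Assumed contract: page URL as the only argument,
    # raw HTML on stdin, one JSON object on stdout.
    proc = run(
        ["node", script_path, url],
        input=html,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
        timeout=30,  # avoid hanging forever on a stuck Node process
    )
    return json.loads(proc.stdout)
```

The `run` parameter exists only so the call can be swapped out in tests; in normal use it stays at its default of `subprocess.run`.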
Note that the readabilipy library does a similar job, but it doesn't pass the right arguments to the JSDOM library, so relative image URLs are not processed properly. Hence this custom hack, at least for now.
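To make the relative-image problem concrete: a relative `src` can only be turned into a usable link if the page's own URL is known. The snippet below shows that resolution step with Python's standard library, not JSDOM, and the URLs are made up for illustration.

```python
from urllib.parse import urljoin

page_url = "https://example.com/blog/post.html"  # illustrative page URL
img_src = "images/figure.png"                    # relative src as found in the page

# Resolve the relative path against the page URL.
absolute_src = urljoin(page_url, img_src)
print(absolute_src)  # https://example.com/blog/images/figure.png
```

Without the page URL there is nothing to resolve against, which is why the base URL needs to reach whatever parses the document.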
We could also consider assuming the user has Firefox or Safari reader mode available and sending the page there instead of trying to reproduce the behavior within the app.