Status: Closed (oliverpool closed 11 months ago)
That's one of the libraries I tried, but the output wasn't as good as the mozilla one.
> the output wasn't as good as the mozilla one
That's good to know. I've been using https://github.com/buriy/python-readability for a similar personal project and may take inspiration from your approach here of shelling out to Node + Mozilla's readability implementation. What are the specific differences you've noticed?
The error I remember most clearly now was that it would skip paragraphs at the beginning and/or at the end of the document. For example, I used this article for testing, and python-readability would chop off the first 10 and the last 14 paragraphs, whereas the mozilla one shows all the content.
I tried several other options (newsplease, newspaper3k, trafilatura, goose3, etc.) and all had problems: either they didn't recognize the content properly, or they forced it into plaintext, didn't support images, etc.
There was also ReadabiliPy, which pretty much does the same as I did here (the external node.js process thing), but with more boilerplate and slightly off configuration, so I preferred to use my own little node.js script.
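For readers wondering what the "external node.js process" pattern looks like from the Python side, here is a minimal sketch. It is not the actual script from this project: the `readability.js` file name, the `extract_article` helper, and the JSON-over-stdin/stdout protocol are all assumptions made for illustration.

```python
import json
import subprocess


def extract_article(html, url, command=("node", "readability.js")):
    """Pipe HTML to an external extractor process and parse its JSON reply.

    Assumption: the command (a hypothetical readability.js wrapper by
    default) reads {"html": ..., "url": ...} as JSON on stdin and prints
    a single JSON object, e.g. with "title" and "content", on stdout.
    """
    payload = json.dumps({"html": html, "url": url})
    result = subprocess.run(
        list(command),
        input=payload,
        capture_output=True,
        text=True,
        check=True,   # raise CalledProcessError if the process exits non-zero
        timeout=30,   # do not hang forever on a stuck child process
    )
    return json.loads(result.stdout)
```

Keeping the command as a parameter makes it easy to swap in a different extractor, or a stub when testing, without touching the calling code.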
Thank you for taking the time to let us know about your experience with this lib!
Hi, I just read your blog post.
Would the project https://github.com/buriy/python-readability help you integrate the reader parsing on the server side?
(I saw it being used by offpunk)