facundoolano / feedi

RSS + Mastodon feed reader
GNU Affero General Public License v3.0

python-readability #61

Closed by oliverpool 11 months ago

oliverpool commented 11 months ago

Hi, I just read your blog post.

Would the project https://github.com/buriy/python-readability help you integrate the reader parsing on the server side?

(I saw it being used by offpunk)

facundoolano commented 11 months ago

That's one of the libraries I tried, but the output wasn't as good as the mozilla one.

scttnlsn commented 11 months ago

> the output wasn't as good as the mozilla one

That's good to know. I've been using https://github.com/buriy/python-readability for a similar personal project and may take inspiration from your approach here of shelling out to Node + Mozilla's readability implementation. What are the specific differences you've noticed?

facundoolano commented 11 months ago

The error I remember most clearly now was that it would skip paragraphs at the beginning and/or at the end of the document. For example, I used this article for testing, and python-readability would chop off the first 10 and the last 14 paragraphs, whereas the mozilla one shows all the content.
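A rough way to quantify this kind of truncation is to compare the paragraphs of the original page with those that survive extraction. The sketch below uses only the standard library; the helper names (`paragraphs`, `missing_edges`) are made up for illustration, and it assumes the extractor keeps paragraph text unchanged, which will not hold for every library.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the whitespace-normalized text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._current = None  # text chunks of the <p> being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush()  # an unclosed <p> ends when the next one starts
            self._current = []

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def flush(self):
        if self._current is not None:
            text = " ".join("".join(self._current).split())
            if text:
                self.paragraphs.append(text)
            self._current = None


def paragraphs(html):
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    collector.flush()
    return collector.paragraphs


def missing_edges(original_html, extracted_html):
    """Return (dropped_at_start, dropped_at_end) paragraph counts."""
    kept = set(paragraphs(extracted_html))
    present = [p in kept for p in paragraphs(original_html)]
    if not any(present):
        return len(present), 0
    return present.index(True), present[::-1].index(True)
```

Running `missing_edges` on the same page with two extractors gives a quick side-by-side number, although comparing the rendered output by eye is still needed to judge images and formatting.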

I tried several other options (newsplease, newspaper3k, trafilatura, goose3, etc.) and all of them had problems: either they didn't recognize the content properly, or they forced it into plaintext, didn't support images, etc.

There was also ReadabiliPy, which pretty much does the same as I did here (the external node.js process thing), but with more boilerplate and a slightly off configuration, so I preferred to use my own little node.js script.
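For readers unfamiliar with the "external process" approach, here is a minimal Python sketch of the general idea, not the actual feedi code. It assumes a hypothetical Node script (called `extract_article.js` here) that reads HTML from stdin, runs Mozilla's Readability on it, and prints the parsed result as JSON on stdout. The script name, the default command, and the JSON shape are all assumptions.

```python
import json
import subprocess

# Hypothetical command: a Node script that wraps Mozilla's Readability,
# reads HTML on stdin and writes a JSON object on stdout.
DEFAULT_COMMAND = ["node", "extract_article.js"]


def extract_article(html, command=DEFAULT_COMMAND, timeout=30):
    """Pipe HTML to an external extractor and return its parsed JSON reply.

    Raises subprocess.CalledProcessError if the process exits non-zero,
    and subprocess.TimeoutExpired if it runs longer than `timeout` seconds.
    """
    result = subprocess.run(
        command,
        input=html,
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,
        timeout=timeout,
    )
    return json.loads(result.stdout)
```

Keeping the command as a parameter makes it easy to swap in a different extractor, or a stub process in tests, without touching the calling code.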

oliverpool commented 11 months ago

Thank you for taking the time to let us know about your experience with this lib!