PlaidWeb / reblob

Python program and library for extracting a quoted blog reply
MIT License

Better content blob detection mechanism #8

Open fluffy-critter opened 5 years ago

fluffy-critter commented 5 years ago

The current content detection mechanism completely hecks up on this article:

https://www.webdistortion.com/2019/05/16/can-we-all-please-stop-using-medium-now/

Instead of the article text I just got a list of commenters:

$ reblob https://www.webdistortion.com/2019/05/16/can-we-all-please-stop-using-medium-now/ --format markdown_github
[Riz](https://www.webdistortion.com/2019/05/16/can-we-all-please-stop-using-medium-now/):

[Robbah99](https://www.webdistortion.com/2019/05/16/can-we-all-please-stop-using-medium-now/):

[Gene](http://saasgenius.io):

[Dan Tompkins](https://l-o-o-s-e-d.net):

[Daniel Kulesz](https://www.kulesz.me):

[Paul Reiber](https://medium.com/@reiber):

[Judith A Sweeney](https://www.webdistortion.com/2019/05/16/can-we-all-please-stop-using-medium-now/):

[Melissa](http://www.melissagutierrezdesign.com):

[Diego Fernandes de Oliveira](http://Atibaia-SP):

[T00M](http://optrickmedia.com/):

[Joni](https://www.webdistortion.com/jonimueller.com):

[Joni](https://www.webdistortion.com/jonimueller.com):

[Elu Cia](https://www.webdistortion.com/2019/05/16/can-we-all-please-stop-using-medium-now/):

[AJ Danelz](http://your-media.netlify.com):

[Zach](https://www.webdistortion.com/2019/05/16/can-we-all-please-stop-using-medium-now/):

[Dogo](https://www.webdistortion.com/2019/05/16/can-we-all-please-stop-using-medium-now/):

[Bret Perry](https://www.webdistortion.com/2019/05/16/can-we-all-please-stop-using-medium-now/):

[Michael Long](https://www.webdistortion.com/2019/05/16/can-we-all-please-stop-using-medium-now/):

[Michael](https://www.webdistortion.com/2019/05/16/can-we-all-please-stop-using-medium-now/):

[L J Laubenheimer](http://blog.laubenheimer.net):

[Eduardo Weidman Barijan](https://www.brdigital.com):

[Edie Shack](https://www.webdistortion.com/2019/05/16/can-we-all-please-stop-using-medium-now/):

[barry begus](http://swagdesignfactory.com):

The content detection mechanism could use some better heuristics.

This particular failure seems to be mf2py hecking up. Given that this page has no mf2 markup on it, there should probably be a more aggressive fallback to the heuristic mechanism, with the decision based on what content the mf2 parse actually determined.
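To make the fallback idea concrete, here is a rough sketch of one possible check. It is not reblob's actual code. It assumes the usual parsed-mf2 JSON shape (a dict with an `items` list, each item carrying `type`, `properties`, and optional `children`), and the helper names and the `min_chars` threshold of 200 are made up for illustration.

```python
def iter_items(items):
    """Yield every mf2 item, descending into children and nested property values."""
    for item in items:
        # Plain strings and e-* content dicts have no "type" and are not items.
        if not isinstance(item, dict) or "type" not in item:
            continue
        yield item
        yield from iter_items(item.get("children", []))
        for values in item.get("properties", {}).values():
            yield from iter_items(values)


def mf2_content_length(parsed):
    """Length of the longest h-entry content text found in a parsed mf2 dict."""
    longest = 0
    for item in iter_items(parsed.get("items", [])):
        if "h-entry" not in item.get("type", []):
            continue
        for content in item.get("properties", {}).get("content", []):
            # e-content is normally {"html": ..., "value": ...}; tolerate bare strings.
            if isinstance(content, dict):
                text = content.get("value", "")
            else:
                text = str(content)
            longest = max(longest, len(text.strip()))
    return longest


def should_use_mf2(parsed, min_chars=200):
    """Trust the mf2 result only if it has a reasonably long entry body.

    min_chars is an arbitrary illustrative threshold, not a tuned value.
    """
    return mf2_content_length(parsed) >= min_chars
```

With something like this, the caller would use the mf2 result only when `should_use_mf2(parsed)` is true and otherwise hand the page to the heuristic mechanism, so a parse that yields no entry body at all would no longer win by default.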