fivefilters / ftr-site-config

Site-specific article extraction rules to aid content extractors, feed readers, and 'read later' applications.
https://www.fivefilters.org/full-text-rss/

Lesserwrong.com page cannot be parsed #385

Open tinloaf opened 6 years ago

tinloaf commented 6 years ago

Hi,

I'd like to parse pages from www.lesserwrong.com. I've tried creating a site config based on this page: https://www.lesserwrong.com/rationality/what-do-we-mean-by-rationality

This is what my site config looks like:

title: //div[contains(concat(' ',normalize-space(@class),' '),' posts-page-content-header-title ')]//h1
body: //div[contains(concat(' ',normalize-space(@class),' '),' posts-page-content-body-html ')]
date: //div[contains(concat(' ',normalize-space(@class),' '),' posts-page-content-body-metadata-date ')]
author: //div[contains(concat(' ',normalize-space(@class),' '),' posts-page-content-header-author ')]
test_url: https://www.lesserwrong.com/rationality/what-do-we-mean-by-rationality
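For context, the `contains(concat(' ', normalize-space(@class), ' '), ' foo ')` pattern in these rules is the standard XPath 1.0 idiom for matching a single class token inside a possibly multi-class attribute. A minimal Python sketch of the same logic (the function name is mine, purely for illustration):

```python
def has_class_token(class_attr: str, wanted: str) -> bool:
    """Replicate the XPath idiom: collapse whitespace, pad with spaces,
    then look for the token with a space on each side."""
    normalized = " " + " ".join(class_attr.split()) + " "
    return f" {wanted} " in normalized

# Matches even when the element carries several classes...
print(has_class_token("header posts-page-content-header-title",
                      "posts-page-content-header-title"))  # True
# ...but not a mere substring of a longer class name.
print(has_class_token("posts-page-content-header-title-wrapper",
                      "posts-page-content-header-title"))  # False
```

The padding is what prevents false positives on longer class names, which a plain `contains(@class, '...')` would not.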

As far as I can tell, these XPaths all point to the correct elements on that page. However, the tool at https://f43.me/feed/test still fails to parse it. Did I mess up the site config, or is this a bug in the parser (and if so, is this the right repository to report it)?

j0k3r commented 6 years ago

I think the problem isn't on your side but on lesserwrong.com, which uses Cloudflare, so I suspect it's the same issue as https://github.com/wallabag/wallabag/issues/1399#issuecomment-350988404

tinloaf commented 6 years ago

That might very well be it. Is there a way to see the HTML that the parser sees? Then I could verify that it's in fact the Cloudflare anti-bot page.
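One way to check without touching the parser is to fetch the URL with a plain HTTP client (roughly what a server-side extractor sees, since it runs no JavaScript) and scan the response for strings that Cloudflare challenge pages typically contain. A minimal sketch; the marker list is an assumption based on common Cloudflare "checking your browser" pages, not anything graby-specific:

```python
import urllib.request

# Strings commonly seen on Cloudflare challenge/anti-bot pages (assumption).
CLOUDFLARE_MARKERS = (
    "Checking your browser",
    "cf-browser-verification",
    "Attention Required! | Cloudflare",
)

def looks_like_cloudflare_challenge(html: str) -> bool:
    """Heuristic: does the HTML contain any typical challenge marker?"""
    return any(marker in html for marker in CLOUDFLARE_MARKERS)

def fetch_html(url: str) -> str:
    """Fetch without a browser, similar to what a parser would receive."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

# Usage (requires network access):
#   html = fetch_html("https://www.lesserwrong.com/rationality/what-do-we-mean-by-rationality")
#   print(looks_like_cloudflare_challenge(html))
```

If the heuristic fires, the extractor is almost certainly being served the interstitial page instead of the article.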

j0k3r commented 6 years ago

Not without going into the code of wallabag/graby, no. Open that file and var_dump() the $html variable: https://github.com/j0k3r/graby/blob/master/src/Extractor/ContentExtractor.php#L203