ageitgey / node-unfluff

Automatically extract body content (and other cool stuff) from an html document
Apache License 2.0
2.15k stars, 221 forks

calculateBestNode claims no nodesWithText on facebook developer page #44

Closed: cmkimerer closed this issue 9 years ago

cmkimerer commented 9 years ago

I was testing unfluff on the URL https://developers.facebook.com/docs/facebook-login/access-tokens and noticed that no article text is being extracted. It successfully pulled an image, description, and title, but the text field comes back blank.

ageitgey commented 9 years ago

If you look at the HTML source of https://developers.facebook.com/docs/facebook-login/access-tokens, all of the actual page text appears to be commented out (wrapped in HTML comments) rather than present as normal markup. My guess is that client-side JavaScript runs after the initial page load and renders what you see on screen.
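One quick way to check whether a page is affected by this is to measure how much of its raw HTML sits inside comments. The sketch below is only an illustration: `commentedTextRatio` is a hypothetical helper (not part of unfluff), the sample markup is made up, and the regex is a rough approximation of comment parsing rather than a full HTML parser.

```javascript
// Hypothetical helper (not part of unfluff): returns the fraction of the
// raw HTML string that is wrapped in <!-- ... --> comments.
function commentedTextRatio(html) {
  // Rough approximation: non-greedy match of everything between the
  // comment delimiters, across newlines.
  const commentPattern = /<!--([\s\S]*?)-->/g;
  let commentedLength = 0;
  for (const match of html.matchAll(commentPattern)) {
    commentedLength += match[1].length;
  }
  return html.length === 0 ? 0 : commentedLength / html.length;
}

// Made-up sample: the paragraph exists only inside a comment.
const sample =
  '<html><body><div><!-- <p>Article text lives here.</p> --></div></body></html>';
console.log(commentedTextRatio(sample).toFixed(2));
```

A high ratio suggests the content is being injected by script after load, which would explain an empty `text` result.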

So you would need to do your own custom processing to capture what the browser actually renders after page load, because the initial HTML returned by their servers doesn't include the page text as normal markup. That's a special case specific to this website and beyond anything unfluff could support directly.
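Rendering the page in a real browser is the general fix. As a cheaper workaround that might be enough when the text is merely commented out, you could strip the comment delimiters before extraction. This is a minimal sketch under that assumption, not something verified against the Facebook page; `uncommentHtml` is a hypothetical helper, and it will also expose genuine comments (notes, conditional comments) as markup.

```javascript
// Hypothetical pre-processing step (not part of unfluff): removes the
// <!-- and --> delimiters but keeps whatever was between them, so the
// hidden markup becomes ordinary markup.
function uncommentHtml(html) {
  return html.replace(/<!--([\s\S]*?)-->/g, '$1');
}

// Made-up sample input for illustration.
const rawHtml = '<div><!-- <p>Article text lives here.</p> --></div>';
const visibleHtml = uncommentHtml(rawHtml);
console.log(visibleHtml);
```

The result could then be passed to unfluff in the usual way, for example `require('unfluff')(visibleHtml)`. Whether the extractor then picks the right node still depends on the page's structure.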