ageitgey / node-unfluff

Automatically extract body content (and other cool stuff) from an html document
Apache License 2.0

unfluff

An automatic web page content extractor for Node.js!


Automatically grab the main text out of a webpage like this:

const extractor = require('unfluff');
const data = extractor(my_html_data);
console.log(data.text);

In other words, it turns pretty webpages into boring plain text/json data:

This might be useful for:

Please don't use this for:

Credits / Thanks

This library is largely based on python-goose by Xavier Grangier, which is in turn based on goose by Gravity Labs. However, it's not an exact port, so it may behave differently on some pages, and the feature set is a little different. If you are looking for a Python or Scala/Java/JVM solution, check out those libraries!

Install

To install the command-line unfluff utility:

npm install -g unfluff

To install the unfluff module for use in your Node.js project:

npm install --save unfluff

Usage

You can use unfluff from node or right on the command line!

Extracted data elements

This is what unfluff will try to grab from a web page:

This is returned as a simple JSON object.

Command line interface

You can pass a webpage to unfluff and it will try to parse out the interesting bits.

You can either pass in a file name:

unfluff my_file.html

Or you can pipe it in:

curl -s "http://somesite.com/page" | unfluff

You can easily chain this together with other unix commands to do cool stuff. For example, you can download a web page, parse it, and then use jq to print just the body text:

curl -s "https://www.polygon.com/2014/6/26/5842180/shovel-knight-review-pc-3ds-wii-u" | unfluff | jq -r .text

And here's how to find the top 10 most common words in an article:

curl -s "https://www.polygon.com/2014/6/26/5842180/shovel-knight-review-pc-3ds-wii-u" | unfluff | tr -c '[:alnum:]' '[\n*]' | sort | uniq -c | sort -nr | head -10
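
The same kind of word count can be done inside Node once you have the extracted text. This is a minimal sketch, not part of unfluff: topWords is a hypothetical helper, and the sample string stands in for the text field you would get from the extractor. Unlike the shell pipeline above, it lowercases words, treats only ASCII letters and digits as word characters, and counts only the text you hand it rather than the whole JSON output.

```javascript
// Count the most frequent words in a block of plain text.
// `topWords` is a hypothetical helper, not part of unfluff.
function topWords(text, limit = 10) {
  const counts = new Map();
  // Split on any run of characters that is not an ASCII letter or digit,
  // roughly what `tr -c '[:alnum:]'` does in the pipeline above.
  for (const word of text.toLowerCase().split(/[^a-z0-9]+/)) {
    if (word === '') continue; // skip empty pieces at the edges
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  // Sort by count (descending), then alphabetically to break ties.
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, limit);
}

// With unfluff installed, this string would be `extractor(html).text`.
const sample = 'The knight dug. The knight jumped. The end.';
console.log(topWords(sample, 2)); // [ [ 'the', 3 ], [ 'knight', 2 ] ]
```

In a real script you would pass data.text from the extractor instead of the sample string.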

Module Interface

extractor(html, language)

html: The html you want to parse

language (optional): The document's two-letter language code. This will be auto-detected as best as possible, but there might be cases where you want to override it.

The extraction algorithm depends heavily on the language, so it probably won't work if you have the language set incorrectly.

const extractor = require('unfluff');

const data = extractor(my_html_data);

Or supply the language code yourself:

const extractor = require('unfluff');

const data = extractor(my_html_data, 'en');

data will then be a JSON object that looks like this:

{
  "title": "Shovel Knight review",
  "softTitle": "Shovel Knight review: rewrite history",
  "date": "2014-06-26T13:00:03Z",
  "copyright": "2016 Vox Media Inc Designed in house",
  "author": [
    "Griffin McElroy"
  ],
  "publisher": "Polygon",
  "text": "Shovel Knight is inspired by the past in all the right ways — but it's far from stuck in it. [.. snip ..]",
  "image": "http://cdn2.vox-cdn.com/uploads/chorus_image/image/34834129/jellyfish_hero.0_cinema_1280.0.png",
  "tags": [],
  "videos": [],
  "canonicalLink": "http://www.polygon.com/2014/6/26/5842180/shovel-knight-review-pc-3ds-wii-u",
  "lang": "en",
  "description": "Shovel Knight is inspired by the past in all the right ways — but it's far from stuck in it.",
  "favicon": "http://cdn1.vox-cdn.com/community_logos/42931/favicon.ico",
  "links": [
    { "text": "Six Thirty", "href": "http://www.sixthirty.co/" }
  ]
}
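
Since the result is a plain object, you can read its fields directly. Here is a small sketch of turning it into a one-line summary. It assumes the field names shown in the sample above; summarize is a hypothetical helper, and the inline object is a hand-written stand-in for real extractor output.

```javascript
// Build a one-line summary from an extracted-data object.
// `summarize` is a hypothetical helper, not part of unfluff.
function summarize(data) {
  // `author` is an array in the sample output; fall back if it is missing or empty.
  const authors = (data.author || []).join(', ') || 'unknown author';
  // Rough word count: split the body text on whitespace.
  const words = (data.text || '').split(/\s+/).filter(Boolean).length;
  return `${data.title} by ${authors} (${data.publisher}, ${words} words)`;
}

// Stand-in for the object returned by extractor(my_html_data).
const example = {
  title: 'Shovel Knight review',
  author: ['Griffin McElroy'],
  publisher: 'Polygon',
  text: 'one two three',
};
console.log(summarize(example));
// Shovel Knight review by Griffin McElroy (Polygon, 3 words)
```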

extractor.lazy(html, language)

Lazy version of extractor(html, language).

The text extraction algorithm can be somewhat slow on large documents. If you only need access to elements like title or image, you can use the lazy extractor to get them more quickly without running the full processing pipeline.

This returns an object just like the regular extractor except all fields are replaced by functions and evaluation is only done when you call those functions.

extractor = require('unfluff');

data = extractor.lazy(my_html_data, 'en');

// Access whichever data elements you need directly.
console.log(data.title());
console.log(data.softTitle());
console.log(data.date());
console.log(data.copyright());
console.log(data.author());
console.log(data.publisher());
console.log(data.text());
console.log(data.image());
console.log(data.tags());
console.log(data.videos());
console.log(data.canonicalLink());
console.log(data.lang());
console.log(data.description());
console.log(data.favicon());

Some of these data elements require computing intermediate representations of the html document. Those results are cached, so looking up several data elements, or the same element several times, should be as fast as possible.
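
To illustrate the idea of lazy, cached accessors, here is a minimal sketch. It is not unfluff's actual implementation: makeLazy and the counter are hypothetical names used only to show that each value is computed on first access and reused afterwards.

```javascript
// Wrap a table of compute functions so each one runs at most once.
// `makeLazy` is a hypothetical helper for illustration only.
function makeLazy(computations) {
  const cache = new Map();
  const lazy = {};
  for (const [name, compute] of Object.entries(computations)) {
    lazy[name] = () => {
      // Compute on first access, then serve the cached value.
      if (!cache.has(name)) cache.set(name, compute());
      return cache.get(name);
    };
  }
  return lazy;
}

let calls = 0; // counts how often the "expensive" step actually runs
const page = makeLazy({
  title: () => {
    calls += 1;
    return 'Shovel Knight review';
  },
});

page.title();
page.title();
console.log(calls); // 1
```

Nothing runs until an accessor is called, which is why asking only for the title can skip the slower text extraction.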

Demo

The easiest way to try out unfluff is to just install it:

$ npm install -g unfluff
$ curl -s "http://www.cnn.com/2014/07/07/world/americas/mexico-earthquake/index.html" | unfluff

But if you can't be bothered, you can check out fetch text. It's a site by Andy Jiang that uses unfluff. You send it an email containing a url and it emails back the cleaned content of that url. It should give you a good idea of how unfluff handles different urls.

What is broken