Get parts of content independently

ageitgey / node-unfluff

Automatically extract body content (and other cool stuff) from an html document

Apache License 2.0


Get parts of content independently #13

Closed bradvogel closed 10 years ago

bradvogel commented 10 years ago

It'd be nice to be able to get the title, image, and description for a page without extracting the full text. Parsing the text can be very slow for long pages (e.g. http://en.wikipedia.org/wiki/Apple_Inc takes 2 seconds on my MacBook).

Perhaps something like:

var extractor = require('unfluff');
extractor(my_html_data, {
    lang: 'en', // Optional language
    text: false // Don't fetch text
});

or perhaps just change the API to expose the functions separately on the exports.

ageitgey commented 10 years ago

Thanks for the suggestion. That's a reasonable idea, but I'm worried about making the API too complex when most people use this library just for the full-text functionality. I don't want to add a lot of flags and stuff. Let me think about a simple way to implement this.

bradvogel commented 10 years ago

You're right, and I agree that most people probably use it for text(). But the other functions are really useful too. What about just exposing them on the exports, e.g.:

var extractor = require('unfluff');
var everything = extractor(html);
var justTitle = extractor.title(html);
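
A minimal sketch of how that shape could work, assuming naive regex-based helpers in place of the library's real parsing. The helper names extractTitle and extractDescription are hypothetical and are not unfluff's actual code; the point is only that a JavaScript function is an object, so per-field extractors can be attached to the main export:

```javascript
// Hypothetical sketch: NOT unfluff's actual implementation.
// Naive regex helpers stand in for the real HTML parsing logic.
function extractTitle(html) {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match ? match[1].trim() : '';
}

function extractDescription(html) {
  const match = /<meta\s+name=["']description["']\s+content=["']([^"']*)["']/i.exec(html);
  return match ? match[1].trim() : '';
}

// Main entry point: extracts every field in one call.
function extractor(html) {
  return {
    title: extractTitle(html),
    description: extractDescription(html),
  };
}

// Functions are objects, so individual extractors can hang off the export.
extractor.title = extractTitle;
extractor.description = extractDescription;

// In a real CommonJS module this would end with: module.exports = extractor;
```

With this layout, extractor(html) still returns everything, while extractor.title(html) runs only the title logic and skips the rest.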

ageitgey commented 10 years ago

Something like that is pretty reasonable. PRs welcome or I'll take a look when I have a few minutes free.

Thanks again for the feedback! :)

bradvogel commented 10 years ago

Thanks for doing this! Unfortunately I won't have time for a PR this week.

franza commented 10 years ago

I would like to contribute. Mind if I take a look at this?

ageitgey commented 10 years ago

Sure @franza, you are welcome to take a pass at it. Also feel free to share a work in progress even if it's not totally done and tested.

ageitgey commented 10 years ago

Thanks! Released in v0.7.0. It is called extractor.lazy().
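
For readers unfamiliar with the idea, here is a rough sketch of what lazy extraction means in general: each field is computed only when it is first requested, and the result is cached. This is an illustration under stated assumptions, not unfluff's implementation; lazyExtractor, memo, and the regex-based field logic are all hypothetical names made up for the example.

```javascript
// Hypothetical sketch of lazy, memoized extraction.
// NOT unfluff's actual implementation; the field logic is deliberately naive.
function lazyExtractor(html) {
  const cache = {};

  // Wrap a compute function so it runs only on the first request for `key`.
  function memo(key, compute) {
    return function () {
      if (!(key in cache)) {
        cache[key] = compute(html);
      }
      return cache[key];
    };
  }

  return {
    // Cheap field: only looks at the <title> element.
    title: memo('title', function (h) {
      const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(h);
      return match ? match[1].trim() : '';
    }),
    // Expensive field (in a real extractor): here it just strips tags.
    text: memo('text', function (h) {
      return h.replace(/<[^>]+>/g, ' ').replace(/\s+/g, ' ').trim();
    }),
  };
}
```

Calling lazyExtractor(html).title() never touches the text logic, which is the behavior the original request asked for on long pages.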