gabceb / node-metainspector

Node npm package for web scraping. It scrapes a given URL and returns its title, meta description, meta keywords, an array with all the links, all the images in it, etc. Inspired by the metainspector Ruby gem.
MIT License

Big GIFs can cause application crash #11

Open lorenzos opened 9 years ago

lorenzos commented 9 years ago

I don't know whether this affects all big files or only specific ones, but this code:

var mi = require('node-metainspector');
new mi("http://media.giphy.com/media/euBj15T6nrp6g/giphy.gif", {}).fetch();

Crashes the application with:

node_modules/node-metainspector/node_modules/cheerio/lib/parse.js:47
  _.each(dom, function(elem) {
                      ^
RangeError: Maximum call stack size exceeded
    at /home/pi/hangouts-bots/node_modules/node-metainspector/node_modules/cheerio/lib/parse.js:47:23
    at Array.forEach (native)
    at Function._.each._.forEach (/home/pi/hangouts-bots/node_modules/node-metainspector/node_modules/cheerio/node_modules/underscore/underscore.js:78:11)
    at exports.connect (/home/pi/hangouts-bots/node_modules/node-metainspector/node_modules/cheerio/lib/parse.js:47:5)
    at /home/pi/hangouts-bots/node_modules/node-metainspector/node_modules/cheerio/lib/parse.js:64:7
    at Array.forEach (native)
    at Function._.each._.forEach (/home/pi/hangouts-bots/node_modules/node-metainspector/node_modules/cheerio/node_modules/underscore/underscore.js:78:11)
    at exports.connect (/home/pi/hangouts-bots/node_modules/node-metainspector/node_modules/cheerio/lib/parse.js:47:5)
    at /home/pi/hangouts-bots/node_modules/node-metainspector/node_modules/cheerio/lib/parse.js:64:7
    at Array.forEach (native)

If this is hard to fix, I would be glad to know of a workaround that at least keeps my application running.

alexymik commented 9 years ago

I had the same issue; it seems to be related to large requests. I set a request size limit, which seems to have fixed the issue for now:

new MetaInspector(param, { limit: 3000000 } );
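
For readers wondering what a size limit of this kind usually amounts to, here is a minimal sketch of capping a streamed body at a fixed number of bytes. It is not node-metainspector's actual code: `readWithLimit` is a hypothetical helper, and the sketch assumes the stream emits `Buffer` chunks.

```javascript
const { Readable } = require('stream');

// Hypothetical helper (not part of node-metainspector): collects a stream's
// body, but gives up as soon as more than `limit` bytes have arrived.
// Assumes the stream emits Buffer chunks.
function readWithLimit(stream, limit) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    let received = 0;

    stream.on('data', (chunk) => {
      received += chunk.length;
      if (received > limit) {
        // Stop reading so the rest of the body is never buffered or parsed.
        stream.destroy();
        reject(new Error('Response exceeded ' + limit + ' bytes'));
        return;
      }
      chunks.push(chunk);
    });

    stream.on('end', () => resolve(Buffer.concat(chunks)));
    stream.on('error', reject);
  });
}
```

With a cap like this, an oversized GIF is rejected before its bytes ever reach the HTML parser, which would explain why setting a limit makes the crash go away.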

19h commented 9 years ago

Why would you crawl or inspect images? This is for meta information about websites.

alexymik commented 9 years ago

From the project description:

"You give it an URL, and it lets you easily get its title, links, images, description, keywords, meta tags...."

It should not crash when given a URL, even if it's given a URL to an image. The Ruby MetaInspector gem also returns image resolution, so a future feature would be to port that functionality over as well.

19h commented 9 years ago

Here's the source code, so maybe submit a PR?

sshen81 commented 8 years ago

I've run into similar issues with non-HTML URIs causing errors. Specifically, passing non-HTML body data into the cheerio module can cause call stack exceptions. I've submitted PR #25 to help address this.
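
One way to guard against non-HTML bodies in general is to check the response's Content-Type before handing anything to the parser. The sketch below only illustrates that idea and makes no claim about what PR #25 actually changes; `looksLikeHtml` is a hypothetical helper name.

```javascript
// Hypothetical helper: decides from a Content-Type header value whether a
// response body is worth handing to an HTML parser such as cheerio.
function looksLikeHtml(contentType) {
  // A missing header gives no evidence of HTML, so treat it as "no".
  if (typeof contentType !== 'string') return false;

  // Drop parameters such as "; charset=utf-8" and normalise the case.
  const mediaType = contentType.split(';')[0].trim().toLowerCase();
  return mediaType === 'text/html' || mediaType === 'application/xhtml+xml';
}
```

A caller could skip parsing and report an error when `looksLikeHtml(response.headers['content-type'])` is false, so a GIF or other binary body never reaches the recursive DOM walk seen in the stack trace.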