amoilanen / js-crawler

Web crawler for Node.JS
MIT License
253 stars, 55 forks

I think shouldCrawl code example is incorrect #53

Closed: JeffreyConnected closed this issue 6 years ago

JeffreyConnected commented 6 years ago

The code example shows:

  shouldCrawl: function (url) {
    return url.indexOf("google.com") < 0;
  },

But I think the "<" is wrong and should be ">"
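
For reference, here is a minimal sketch of what each comparison evaluates to. It relies only on the standard behaviour of String.prototype.indexOf, which returns -1 when the substring is absent. The helper names excludesGoogle and includesGoogle and the example.org URL are made up for illustration and are not part of js-crawler.

```javascript
// indexOf returns the position of the first match, or -1 when there is none.
var googleUrl = "https://www.google.com/search?q=foo";
var otherUrl = "https://example.org/page"; // hypothetical URL, for illustration only

// "< 0" is true only when "google.com" does NOT occur in the url.
function excludesGoogle(url) {
  return url.indexOf("google.com") < 0;
}

// ">= 0" is true when "google.com" occurs anywhere, including at position 0.
function includesGoogle(url) {
  return url.indexOf("google.com") >= 0;
}

console.log(excludesGoogle(googleUrl)); // false
console.log(excludesGoogle(otherUrl));  // true
console.log(includesGoogle(googleUrl)); // true

// A plain "> 0" misses a match at the very start of the string.
console.log("google.com/maps".indexOf("google.com") > 0); // false
```

So "< 0" means "skip anything on google.com". If the intent is to match only google.com URLs, ">= 0" is the safer comparison than ">".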

amoilanen commented 6 years ago

Thanks for reporting the issue. It looks like no links will be crawled in the following example:

var Crawler = require("js-crawler");

var crawler = new Crawler().configure({
  shouldCrawlLinksFrom: function(url) {
    return url.indexOf("google.com") < 0;
  }
});

crawler.crawl("https://www.google.com/search?q=foo", function(page) {
  console.log(page.url);
});
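
The reason can be sketched without running the crawler at all. This assumes shouldCrawlLinksFrom is called with the URL of the page whose links are about to be followed. The variable names and the allowStartPage variant are illustrative only.

```javascript
var startUrl = "https://www.google.com/search?q=foo";

// Same predicate as in the configuration above.
function shouldCrawlLinksFrom(url) {
  return url.indexOf("google.com") < 0;
}

// Assumption: the crawler evaluates the predicate for the page it just fetched.
// The start page itself is on google.com, so its links are never followed.
console.log(shouldCrawlLinksFrom(startUrl)); // false

// Hypothetical variant that still follows links found on the start page.
function allowStartPage(url) {
  return url === startUrl || url.indexOf("google.com") < 0;
}

console.log(allowStartPage(startUrl)); // true
console.log(allowStartPage("https://www.google.com/maps")); // false
```

Under that assumption, the crawl stops after the first page because the predicate rejects the very URL the crawl starts from.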

There are 2 configuration options: shouldCrawl and shouldCrawlLinksFrom.

I will try to provide better examples and document the feature better.

amoilanen commented 6 years ago

I have updated the documentation. The new example is as follows:

var Crawler = require("js-crawler");

var rootUrl = "http://www.reddit.com/";

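// Matches only top-level subreddit pages such as www.reddit.com/r/<name>/
function isSubredditUrl(url) {
  return /www\.reddit\.com\/r\/[a-zA-Z0-9]+\/$/.test(url);
}

var crawler = new Crawler().configure({
  // Crawl the root page and subreddit pages, skip everything else
  shouldCrawl: function(url) {
    return isSubredditUrl(url) || url === rootUrl;
  }
});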

crawler.crawl(rootUrl, function(page) {
  console.log(page.url);
});
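
To see which URLs this predicate accepts, it can be exercised on its own, without the crawler. This is only a sketch: the sample URLs are made up for illustration, and the predicate is copied into a standalone shouldCrawl function rather than passed to configure.

```javascript
var rootUrl = "http://www.reddit.com/";

// Same pattern as in the example above: /r/<letters or digits>/ at the end of the url.
function isSubredditUrl(url) {
  return /www\.reddit\.com\/r\/[a-zA-Z0-9]+\/$/.test(url);
}

function shouldCrawl(url) {
  return isSubredditUrl(url) || url === rootUrl;
}

// Hypothetical sample URLs, for illustration only.
console.log(shouldCrawl("http://www.reddit.com/"));                          // true (root page)
console.log(shouldCrawl("http://www.reddit.com/r/programming/"));            // true
console.log(shouldCrawl("http://www.reddit.com/r/programming"));             // false (no trailing slash)
console.log(shouldCrawl("http://www.reddit.com/r/programming/comments/1/")); // false (deeper path)
console.log(shouldCrawl("http://www.reddit.com/r/some_name/"));              // false (underscore not in the class)
```

Note that the character class excludes underscores, so a subreddit with an underscore in its name would be skipped unless the pattern is widened.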