JeffreyConnected closed this issue 6 years ago
Thanks for reporting the issue. It looks like no links will be crawled in the following example:
var Crawler = require("js-crawler");
var crawler = new Crawler().configure({
  shouldCrawlLinksFrom: function(url) {
    return url.indexOf("google.com") < 0;
  }
});
crawler.crawl("https://www.google.com/search?q=foo", function(page) {
  console.log(page.url);
});
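One way to see why nothing is followed here is to evaluate the predicate by hand on the start URL. The snippet below is only a standalone sketch: it copies the predicate out of the configuration and does not call js-crawler itself, and the second URL is a made-up sample.

```javascript
// Same predicate as in the configuration above, pulled out so it can be called directly.
function shouldCrawlLinksFrom(url) {
  return url.indexOf("google.com") < 0;
}

// Start URL from the example: it contains "google.com", so indexOf returns a
// non-negative position and the predicate is false (links on this page are skipped).
console.log(shouldCrawlLinksFrom("https://www.google.com/search?q=foo")); // false

// Hypothetical non-Google URL: indexOf returns -1, so the predicate is true.
console.log(shouldCrawlLinksFrom("https://example.org/page")); // true
```

So "< 0" means "follow links only from pages whose URL does not contain google.com", and ">" would invert that.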
There are 2 configuration options:
shouldCrawl, which determines whether a given URL should be crawled
shouldCrawlLinksFrom, which determines whether child links from a given URL should be crawled
Will try to provide better examples/document the feature better.
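To make the difference between the two options concrete, here is a toy model of the semantics described above. It is not js-crawler's actual implementation: the `toyCrawl` function, the in-memory `links` table, and the `a.test` / `b.test` URLs are all hypothetical, and only the two option names come from the library.

```javascript
// Hypothetical in-memory "web": each URL maps to the links found on that page.
const links = {
  "http://a.test/": ["http://a.test/docs", "http://b.test/"],
  "http://a.test/docs": ["http://a.test/docs/api"],
  "http://b.test/": ["http://b.test/deep"],
  "http://a.test/docs/api": [],
  "http://b.test/deep": [],
};

// Toy breadth-first crawl; the option names mirror the two described above.
function toyCrawl(start, options) {
  const visited = [];
  const seen = new Set([start]);
  const queue = [start];
  while (queue.length > 0) {
    const url = queue.shift();
    // shouldCrawl: is this URL itself visited at all?
    if (!options.shouldCrawl(url)) continue;
    visited.push(url);
    // shouldCrawlLinksFrom: are the links found on this page followed?
    if (!options.shouldCrawlLinksFrom(url)) continue;
    for (const next of links[url] || []) {
      if (!seen.has(next)) {
        seen.add(next);
        queue.push(next);
      }
    }
  }
  return visited;
}

const visited = toyCrawl("http://a.test/", {
  // Skip any page whose URL ends in "/api".
  shouldCrawl: (url) => !url.endsWith("/api"),
  // Visit b.test pages, but do not follow the links found on them.
  shouldCrawlLinksFrom: (url) => url.indexOf("b.test") < 0,
});
console.log(visited);
```

In this model `http://b.test/` is still visited, but `http://b.test/deep` is never reached because links from `b.test` pages are not followed, while `http://a.test/docs/api` is discovered and then rejected by `shouldCrawl`.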
Updated the documentation. The new example is as follows:
var Crawler = require("js-crawler");
var rootUrl = "http://www.reddit.com/";
function isSubredditUrl(url) {
  return !!url.match(/www\.reddit\.com\/r\/[a-zA-Z0-9]+\/$/g);
}
var crawler = new Crawler().configure({
  shouldCrawl: function(url) {
    return isSubredditUrl(url) || url == rootUrl;
  }
});
crawler.crawl(rootUrl, function(page) {
  console.log(page.url);
});
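For anyone who wants to check which URLs this filter accepts without running a crawl, the predicate can be exercised directly. This is a standalone sketch that does not use js-crawler; the subreddit and comment URLs are made-up samples.

```javascript
var rootUrl = "http://www.reddit.com/";

// Matches URLs that end in /r/<name>/ where <name> is letters and digits only.
function isSubredditUrl(url) {
  return !!url.match(/www\.reddit\.com\/r\/[a-zA-Z0-9]+\/$/g);
}

// Mirrors the shouldCrawl option from the example above.
function shouldCrawl(url) {
  return isSubredditUrl(url) || url === rootUrl;
}

console.log(shouldCrawl("http://www.reddit.com/"));                              // true (root)
console.log(shouldCrawl("http://www.reddit.com/r/javascript/"));                 // true
console.log(shouldCrawl("http://www.reddit.com/r/javascript/comments/abc123/")); // false
console.log(shouldCrawl("http://www.reddit.com/r/javascript"));                  // false (no trailing slash)
```

Note that the pattern requires a trailing slash and allows only letters and digits in the subreddit name, so a name containing an underscore would be rejected.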
The code example shows:
But I think the "<" is wrong and should be ">"