bda-research / node-crawler

Web Crawler/Spider for NodeJS + server-side jQuery ;-)
MIT License
6.7k stars 877 forks

Can this tool actually crawl/spider, or just scrape pages? #383

bogdancss closed 2 years ago

bogdancss commented 3 years ago

Hey,

I may not fully understand these terms, but can this tool actually crawl/spider all the pages under a domain, or does it just scrape a specific URL?

When I say crawl/spider, I am thinking of something like the ScreamingFrog Spider tool, where you can provide a URL, and it will find all (or most) of the other pages on that site.

Please feel free to close this issue, but I feel the tool description needs to be a bit clearer.

Thanks

paulkre commented 3 years ago

I agree, node-scraper would be a more fitting name for this tool. Or is there an easy configuration to make it behave like an actual crawler?

mike442144 commented 2 years ago

Yes, "scraper" would be a much better name. But never mind, you may implement a spider by yourself based on this.

raquelmsmith commented 2 years ago

> But never mind, you may implement a spider by yourself based on this.

How do you do this?

mike442144 commented 2 years ago

  1. Pick the home page or another entrance URL that is a good place to start.
  2. Send a request to the URL(s).
  3. Parse the page content from the response to collect all the URLs you care about, for example those on the same domain as the previous page.
  4. Save the page content to a file, a database, or wherever you want.
  5. Repeat from step 2 until there are no URLs left to visit.
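
The steps above can be sketched roughly as follows. This is only a minimal illustration, not node-crawler's own API: `extractLinks`, `crawlSite`, `fetchPage`, and `maxPages` are hypothetical names made up for this example, the fetching step is passed in as a function so any HTTP client can be plugged in, and links are pulled out with a simple regex where a real HTML parser would be more robust.

```javascript
// Sketch of a same-site spider. The fetching step is injected, so any
// HTTP client can be used. All names here are illustrative assumptions.

// Step 3 helper: collect absolute URLs from <a href="..."> attributes.
// A regex is a simplification; an HTML parser such as cheerio is sturdier.
function extractLinks(html, baseUrl) {
  const links = [];
  const hrefPattern = /<a\s[^>]*?href\s*=\s*["']([^"']+)["']/gi;
  let match;
  while ((match = hrefPattern.exec(html)) !== null) {
    try {
      const url = new URL(match[1], baseUrl); // resolve relative links
      url.hash = '';                          // ignore #fragments
      links.push(url.href);
    } catch (err) {
      // Skip hrefs that cannot be parsed as URLs.
    }
  }
  return links;
}

// fetchPage(url) is a hypothetical async function that resolves to the
// page's HTML string. maxPages is a safety cap for this sketch.
async function crawlSite(startUrl, fetchPage, { maxPages = 100 } = {}) {
  const start = new URL(startUrl);  // step 1: the entrance URL
  const origin = start.origin;
  const queue = [start.href];       // URLs waiting to be requested
  const seen = new Set(queue);      // URLs already queued, to avoid loops
  const pages = new Map();          // url -> html (stand-in for a file or DB)

  while (queue.length > 0 && pages.size < maxPages) {
    const url = queue.shift();
    let html;
    try {
      html = await fetchPage(url);  // step 2: send the request
    } catch (err) {
      continue;                     // skip pages that fail to load
    }
    pages.set(url, html);           // step 4: save the content
    for (const link of extractLinks(html, url)) { // step 3: parse the URLs
      const sameSite = new URL(link).origin === origin;
      if (sameSite && !seen.has(link)) {
        seen.add(link);
        queue.push(link);           // step 5: repeat with the new URLs
      }
    }
  }
  return pages;
}
```

Assuming a Node.js version with a global `fetch` (18 or later), it could be called as `crawlSite('https://example.com', (url) => fetch(url).then((r) => r.text()))`. With node-crawler itself, the same loop would more likely live inside the crawler's `callback`, selecting anchors through `res.$` and passing each unseen same-site link to `c.queue(...)`.
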
mike442144 commented 2 years ago

Solved in #420.