bda-research / node-crawler

Web Crawler/Spider for NodeJS + server-side jQuery ;-)
MIT License
6.68k stars 876 forks

robots.txt awareness #35

Open reezer opened 11 years ago

reezer commented 11 years ago

Does it lie in the scope of this project to support robots.txt?

sylvinus commented 11 years ago

Should be optional but yes I think so!

sylvinus commented 11 years ago

Should we add a dependency on https://github.com/ekalinin/robots.js?
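As a rough illustration of the kind of check such a dependency would provide, here is a minimal hand-rolled robots.txt matcher. This is a sketch only, not robots.js's actual API: it handles just `User-agent` groups and `Disallow` prefix rules, and a real integration would also need to fetch and cache the file per host.

```javascript
// Sketch: parse robots.txt text into user-agent groups with Disallow rules.
// Hypothetical helper names; a real crawler would use a library instead.
function parseRobots(text) {
  const groups = [];
  let current = null;
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.replace(/#.*$/, '').trim(); // strip comments
    const m = line.match(/^([A-Za-z-]+)\s*:\s*(.*)$/);
    if (!m) continue;
    const field = m[1].toLowerCase();
    const value = m[2];
    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group of rules.
      if (!current || current.rules.length) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
    } else if (field === 'disallow' && current) {
      current.rules.push(value);
    }
  }
  return groups;
}

// Return true if `path` may be fetched by `userAgent` under the parsed rules.
function canFetch(groups, userAgent, path) {
  const ua = userAgent.toLowerCase();
  // Prefer a group naming this agent specifically, else fall back to '*'.
  const group =
    groups.find(g => g.agents.some(a => a !== '*' && ua.includes(a))) ||
    groups.find(g => g.agents.includes('*'));
  if (!group) return true; // no applicable rules: allowed
  // An empty Disallow value means "allow everything", so skip falsy rules.
  return !group.rules.some(rule => rule && path.startsWith(rule));
}
```

For example, with `User-agent: *` / `Disallow: /private/`, `canFetch` would reject `/private/page.html` and allow `/public.html`.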

reezer commented 11 years ago

Maybe. Just a side note: crawlers could also use a sitemap, and robots.js currently doesn't parse them.

koolma commented 7 years ago

Hi @sylvinus, I know that this issue is very old already, but is there any support of robots.txt files and/or sitemaps yet?

sylvinus commented 7 years ago

Hi!

Sorry about that, but I'm not the maintainer anymore.

Best,

mike442144 commented 7 years ago

@koolma Can you provide more details?

koolma commented 7 years ago

@mike442144 I would like to know whether crawlers implemented with this project respect the robots.txt files on the servers being crawled, and, respectively, whether they make use of a sitemap to discover URLs.

mike442144 commented 7 years ago

Not yet; the crawler module isn't the kind of spider that fetches web pages for search-engine use. However, I think respecting robots.txt when visiting web pages could be another option. I don't have much free time now, so go ahead and add this feature if you need it. We can discuss further if any problems come up. What do you think?
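One way such an option could be wired in, sketched here without touching node-crawler internals: wrap the queue behind a filter that consults an injected robots-check function. The wrapper and the `isAllowed` callback are hypothetical names, not part of the library's API; injecting the check keeps the robots.txt fetching strategy pluggable and the sketch network-free.

```javascript
// Sketch of an opt-in robots-aware queue wrapper (hypothetical API).
// `queueFn` is whatever enqueues a URL (e.g. a bound crawler.queue);
// `isAllowed(url)` decides, by whatever means, if robots.txt permits it.
function robotsAwareQueue(queueFn, isAllowed) {
  const skipped = [];
  return {
    queue(url) {
      if (isAllowed(url)) queueFn(url); // permitted: pass through
      else skipped.push(url);           // disallowed: record and drop
    },
    skipped,
  };
}
```

Usage might look like `const q = robotsAwareQueue(u => crawler.queue(u), checkFn)`, with `checkFn` backed by a cached robots.txt parser per host.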

knoxcard commented 6 years ago

This one is a beast...

"A streaming parser for sitemap files. Is able to deal with deeply nested sitemaps with 100+ million urls in them."

https://www.npmjs.com/package/sitemap-stream-parser
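To show the shape of the data such a parser yields, here is a minimal hand-rolled extractor that pulls `<loc>` URLs out of sitemap XML held in memory. This is not sitemap-stream-parser's API; a real crawler would stream-parse (as that package does) rather than buffer sitemaps with 100+ million URLs.

```javascript
// Sketch: extract <loc> entries from a sitemap XML string.
// Regex-based for brevity; fine for an illustration, not for huge files.
function extractLocs(xml) {
  const urls = [];
  const re = /<loc>\s*([^<]+?)\s*<\/loc>/g;
  let m;
  while ((m = re.exec(xml)) !== null) urls.push(m[1]);
  return urls;
}
```

Each extracted URL could then be fed straight into the crawler's queue.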