codelibs / elasticsearch-river-web

Web Crawler for Elasticsearch
Apache License 2.0

unexpected behavior of robots_txt option #134

Open viktor-svirsky opened 6 years ago

viktor-svirsky commented 6 years ago

Hi @johtani, I have run into unexpected behavior when I try to crawl a site with the robots_txt option enabled.

The robots.txt looks like:

User-agent: *
Disallow: /

and my expected result is that the site will not be crawled.
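To illustrate the expectation, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not River Web's own robots.txt handling, which may differ; the helper name `is_allowed` and the example URL are assumptions made up for this illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper for illustration; it parses the rules from a
    string instead of downloading them from the site.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The rules from this report: every user agent is disallowed everywhere.
ROBOTS_TXT = "User-agent: *\nDisallow: /\n"

# Assumed example URL; any path on the site is covered by "Disallow: /".
print(is_allowed(ROBOTS_TXT, "RiverWeb", "http://example.com/index.html"))
```

Under the standard interpretation, a crawler that honors these rules should fetch nothing from the site, whichever user agent name it sends.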

I have also tried changing the user agent to

User-agent: River Web
User-agent: RiverWeb

and there are results.

Please advise.

marevol commented 6 years ago

The behavior depends on the crawling configuration.