AsyncHttpClient / async-http-client

Asynchronous Http and WebSocket Client library for Java

Obeying the robots.txt file #1989

Closed. TechnologyClassroom closed this issue 2 weeks ago

TechnologyClassroom commented 2 weeks ago

I recently found this project while reading server logs. Someone is scraping one of the sites that I help administer, supposedly using AHC/2.1, and they are not obeying the robots.txt file. There should be several seconds of delay between requests, but the requests appear to be arriving at 1 request/second. Is this normal behavior for AHC, or is this a user misconfiguration of some kind? If this is normal, could support for robots.txt Crawl-delay values be added by default?

hyperxpro commented 2 weeks ago

The user must have configured it to crawl your web server once per second. AHC is an HTTP client library, and it's up to the user how they intend to use it. Also, there are no plans to support robots.txt at the moment.
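
Since the library leaves crawl etiquette to the caller, here is a minimal sketch of how a user could read a Crawl-delay value themselves before scheduling requests. It is not part of AHC: the class `CrawlDelay` and its `parse` method are hypothetical names made up for this illustration, and the code works on robots.txt text that has already been downloaded by whatever means. Crawl-delay is a non-standard extension, so the parsing rules below (value in seconds, substring match on the user agent, a named group preferred over `*`) are assumptions rather than a specification.

```java
import java.time.Duration;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper, not an AHC API: extracts Crawl-delay from robots.txt text. */
public final class CrawlDelay {
    private CrawlDelay() {}

    /**
     * Returns the Crawl-delay that applies to the given user agent, preferring
     * a group that names the agent over the wildcard "*" group.
     */
    public static Optional<Duration> parse(String robotsTxt, String userAgent) {
        String agent = userAgent.toLowerCase(Locale.ROOT);
        Duration specific = null;
        Duration wildcard = null;
        boolean matchesAgent = false;
        boolean matchesWildcard = false;
        boolean inAgentLines = false; // true while reading consecutive User-agent lines

        for (String raw : robotsTxt.split("\\R")) {
            // Drop trailing comments, then split "field: value".
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();

            if (field.equals("user-agent")) {
                if (!inAgentLines) { // first User-agent line of a new group
                    matchesAgent = false;
                    matchesWildcard = false;
                    inAgentLines = true;
                }
                if (value.equals("*")) {
                    matchesWildcard = true;
                } else if (!value.isEmpty() && agent.contains(value.toLowerCase(Locale.ROOT))) {
                    matchesAgent = true;
                }
                continue;
            }

            inAgentLines = false; // any other field ends the run of User-agent lines
            if (field.equals("crawl-delay")) {
                try {
                    double seconds = Double.parseDouble(value);
                    if (!(seconds >= 0) || Double.isInfinite(seconds)) {
                        continue; // ignore negative, NaN, or infinite values
                    }
                    Duration delay = Duration.ofMillis(Math.round(seconds * 1000));
                    if (matchesAgent && specific == null) {
                        specific = delay;
                    }
                    if (matchesWildcard && wildcard == null) {
                        wildcard = delay;
                    }
                } catch (NumberFormatException ignored) {
                    // Unparseable value: treat as if the directive were absent.
                }
            }
        }
        return Optional.ofNullable(specific != null ? specific : wildcard);
    }
}
```

A crawler built on any HTTP client could then wait for the returned duration between requests to the same host, for example with `Thread.sleep(delay.toMillis())` or a scheduled executor, and fall back to its own default when the `Optional` is empty.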