Issue closed by GoogleCodeExporter.
I think the WordPress site has some plugins that filter user agents, beyond
just robots.txt.
Enable logger output:
BasicConfigurator.configure();
Set the logger to WARN level:
Logger.getRootLogger().setLevel(Level.WARN);
I can say the crawling is blocked by the server. If you change the user-agent
string to empty with the code below, it crawls the data:
config.setUserAgentString(""); // note: you can use your name as well
So I think it's nothing to do with crawler4j. Crawler4j sets a default
user-agent string, which I think is blocked, or its user-agent string is
blacklisted by such plugins.
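Putting the two suggestions above together, a minimal sketch of the setup being described (assuming crawler4j 3.x with its log4j 1.x dependency on the classpath; the storage-folder path is illustrative, not from the thread):

```java
import org.apache.log4j.BasicConfigurator;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;

public class CrawlerSetup {
    public static void main(String[] args) {
        // Enable logger output, then reduce verbosity to warnings only.
        BasicConfigurator.configure();
        Logger.getRootLogger().setLevel(Level.WARN);

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j"); // illustrative path
        // Empty user-agent string: the server no longer sees the
        // blacklisted default crawler4j agent. Any string works here.
        config.setUserAgentString("");
    }
}
```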
Original comment by jeba.ride@gmail.com on 8 May 2014 at 11:35
Original comment by avrah...@gmail.com on 18 Aug 2014 at 3:51
It appears that wordpress.com has blocked our crawler, identified by its
user-agent string.
Somebody in the past probably crawled wordpress.com; they saw it as an
attack on their systems and blocked our user agent.
This problem is easy to solve, though: just set your own custom user agent and
you can crawl any wordpress.com site you want.
How to do that?
config.setUserAgentString(""); // set it to any string...
Original comment by avrah...@gmail.com on 20 Aug 2014 at 12:26
No, setting:
config.setUserAgentString("");
does not work for me.
Original comment by aman.dha...@gmail.com on 3 Nov 2014 at 4:12
Set it to a known user agent; just search for known user-agent strings, like
these from Firefox:
http://www.useragentstring.com/pages/Firefox/
So, for example, try this:
config.setUserAgentString("Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0");
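For completeness, here is how that user-agent setting slots into a full crawl, sketched against the crawler4j API of that era (crawler4j 3.x class and package names; `MyCrawler`, the seed URL, and the storage path are assumptions standing in for your own `WebCrawler` subclass and target site):

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class WordpressCrawl {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j"); // illustrative path
        // A known browser user agent instead of crawler4j's blocked default.
        config.setUserAgentString(
            "Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0");

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer =
            new RobotstxtServer(robotstxtConfig, pageFetcher);

        CrawlController controller =
            new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed("http://example.wordpress.com/"); // illustrative seed
        controller.start(MyCrawler.class, /* numberOfCrawlers */ 1);
    }
}
```

Note that the user agent is still subject to robots.txt handling here; changing it only avoids the user-agent blacklist the commenters describe.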
Original comment by avrah...@gmail.com on 3 Nov 2014 at 3:12
Original issue reported on code.google.com by akshay22...@gmail.com on 2 May 2014 at 6:15