crawler4j dont crawl some sites

dwachira / crawler4j

Automatically exported from code.google.com/p/crawler4j


crawler4j dont crawl some sites #257

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. I crawl several sites. Most work fine, but a few do not, e.g. morhipo.com.
2. Why can't I crawl morhipo.com?

What is the expected output? What do you see instead?
The crawler does not follow any links. After it starts, nothing happens (I 
print "hello" with System.out.println after the start call, and even that is not shown).

What version of the product are you using?
3.5

Please provide any additional information below.
The site's robots.txt:

User-agent: *
Sitemap: http://www.morhipo.com/sitemap
Allow: /
User-agent: Googlebot-Image
Disallow:
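
To rule out robots.txt as the cause, here is a minimal sketch of how these rules could be evaluated. It is not crawler4j's own parser: the class name RobotsCheck and the method isAllowed are hypothetical, the matching is plain prefix matching with the longest rule winning, and wildcard patterns (* and $ inside paths) are not handled. Under those assumptions, the file above allows "/" for any user agent. The configuration posted in the follow-up comment also calls robotstxtConfig.setEnabled(false), so these rules should not be consulted in that run anyway.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of crawler4j: a simplified robots.txt check.
class RobotsCheck {

    // Returns true if `path` may be fetched by `userAgent` under `robotsTxt`.
    static boolean isAllowed(String robotsTxt, String userAgent, String path) {
        String agent = userAgent.toLowerCase(Locale.ROOT);
        List<String[]> specific = new ArrayList<>(); // rules from groups naming this agent
        List<String[]> wildcard = new ArrayList<>(); // rules from "User-agent: *" groups
        boolean inSpecific = false, inWildcard = false;
        boolean lastWasAgent = false, sawSpecific = false;

        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw.replaceAll("#.*", "").trim(); // drop comments
            int colon = line.indexOf(':');
            if (colon < 0) continue;
            String key = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();

            if (key.equals("user-agent")) {
                // A User-agent line that follows a rule line starts a new group.
                if (!lastWasAgent) { inSpecific = false; inWildcard = false; }
                String token = value.toLowerCase(Locale.ROOT);
                if (token.equals("*")) {
                    inWildcard = true;
                } else if (!token.isEmpty() && agent.contains(token)) {
                    inSpecific = true;
                    sawSpecific = true;
                }
                lastWasAgent = true;
            } else if (key.equals("allow") || key.equals("disallow")) {
                if (inSpecific) specific.add(new String[] {key, value});
                if (inWildcard) wildcard.add(new String[] {key, value});
                lastWasAgent = false;
            }
            // Other directives (e.g. Sitemap) are ignored and do not end a group.
        }

        // A group that names the agent takes precedence over the "*" group.
        List<String[]> rules = sawSpecific ? specific : wildcard;

        boolean allowed = true; // no matching rule means the path is allowed
        int bestLen = -1;
        for (String[] rule : rules) {
            String prefix = rule[1];
            // An empty value (e.g. "Disallow:") matches nothing.
            if (prefix.isEmpty() || !path.startsWith(prefix)) continue;
            boolean isAllow = rule[0].equals("allow");
            // Longest matching prefix wins; on a tie, Allow wins.
            if (prefix.length() > bestLen || (prefix.length() == bestLen && isAllow)) {
                bestLen = prefix.length();
                allowed = isAllow;
            }
        }
        return allowed;
    }

    public static void main(String[] args) {
        String robotsTxt = String.join("\n",
                "User-agent: *",
                "Sitemap: http://www.morhipo.com/sitemap",
                "Allow: /",
                "User-agent: Googlebot-Image",
                "Disallow:");
        System.out.println(isAllowed(robotsTxt, "crawler4j", "/")); // prints true
    }
}
```

Since the rules permit everything, the problem is more likely in the crawl setup or in how the site responds than in robots.txt.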

How can I crawl sites like this? Where am I making a mistake?

Original issue reported on code.google.com by muhammet...@gmail.com on 10 Mar 2014 at 12:22

GoogleCodeExporter commented 9 years ago

My config:

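import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder(rootFolder);
config.setMaxPagesToFetch(100000);
config.setPolitenessDelay(1); // delay between requests, in milliseconds
config.setUserAgentString("Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405");

PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
robotstxtConfig.setEnabled(false); // robots.txt rules are ignored
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

controller.addSeed("http://www.morhipo.com"); // the seed URL must be a string literal
controller.start(Morhipo.class, numberOfCrawlers); // blocks until the crawl finishes

System.out.println("hello"); // reached only after start() returns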

Original comment by muhammet...@gmail.com on 10 Mar 2014 at 12:23

GoogleCodeExporter commented 9 years ago

Original comment by avrah...@gmail.com on 18 Aug 2014 at 3:50

Changed state: Accepted
Added labels: Priority-High
Removed labels: Priority-Medium

GoogleCodeExporter commented 9 years ago

Tested it with the basic crawler example.

It works!

Everything gets crawled (except for several pages that return a server 500 error).

Please try again with the latest build from trunk and report back.

Original comment by avrah...@gmail.com on 20 Aug 2014 at 12:37

GoogleCodeExporter commented 9 years ago

Closed due to inactivity and the lack of a reproducible scenario.

Original comment by avrah...@gmail.com on 23 Sep 2014 at 2:13

Changed state: Invalid