Letractively / flaxcrawler

Automatically exported from code.google.com/p/flaxcrawler

Sample crawl hangs #2


GoogleCodeExporter commented 8 years ago
The sample crawl from the project homepage hangs at the end of the crawl. 

It seems there is no stop condition for the TaskQueueWorker. A fix might be to 
check in TaskQueueWorker.doWorkLoop() whether the task queue is empty and stop 
the thread when it is. 
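
For illustration, a minimal sketch of what such a check might look like; the 
stopped flag, taskQueue field, and process method are assumed names, not 
flaxcrawler's actual internals:

    // Hypothetical sketch of a stop condition in doWorkLoop();
    // field and method names are assumptions about TaskQueueWorker
    protected void doWorkLoop() {
        while (!stopped) {
            CrawlerTask task = taskQueue.poll();
            if (task == null) {
                // Queue drained: stop the worker instead of hanging forever
                break;
            }
            process(task);
        }
    }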

Original issue reported on code.google.com by matthijs...@gmail.com on 30 Dec 2010 at 11:11

GoogleCodeExporter commented 8 years ago
There is a stop condition; the problem was just in the sample's code.
I've changed the code on the homepage, and it should work now.

Original comment by ay.mesh...@gmail.com on 7 Jan 2011 at 10:49

GoogleCodeExporter commented 8 years ago
How did you change the sample's code? My crawler still hangs.

Original comment by office%g...@gtempaccount.com on 27 Apr 2011 at 7:07

GoogleCodeExporter commented 8 years ago
I've added two lines:

    // Join the crawler worker threads, waiting up to 60 seconds
    crawlerController.join(60000);
    // Stop the crawler
    crawlerController.stop();

In your project you can simply join the worker threads without a timeout in the 
main thread and call the stop() method from another thread when you need to 
stop executing. The stop() method is synchronous.
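
A sketch of that pattern, assuming a no-argument join() overload exists 
alongside the join(60000) shown above, and that crawlerController is set up as 
on the homepage:

    // Start the crawler as usual
    crawlerController.start();

    // From another thread, stop the crawler when needed;
    // stop() is synchronous, so it returns once the workers have finished
    new Thread(new Runnable() {
        public void run() {
            waitForShutdownSignal(); // hypothetical trigger in your application
            crawlerController.stop();
        }
    }).start();

    // Join the worker threads without a timeout in the main thread
    crawlerController.join();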

Original comment by ay.mesh...@gmail.com on 27 Apr 2011 at 7:23

GoogleCodeExporter commented 8 years ago
I need to collect the results of the crawler, so I have to find out when the 
crawler has finished.
Is checking CrawlerController.getTasksCount() == 0 in a loop reliable?
Or is there another way to find out whether the crawler has done its work?

Can one add new seed URLs after the controller has been started, 
or is it better to start a new Controller + Crawlers?

Original comment by office%g...@gtempaccount.com on 27 Apr 2011 at 8:11

GoogleCodeExporter commented 8 years ago
As I see it now, there's no good way to check whether the crawler has finished.

You may use the CrawlerController.getTasksCount() == 0 condition, but it is not 
enough on its own.

As a workaround, try this condition: if the afterCrawl method has not been 
called for some time (a minute, for example) and 
CrawlerController.getTasksCount() == 0, then you can be 100% sure that it has 
finished.
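
A sketch of that workaround (error handling trimmed; the afterCrawl signature 
shown in the comment is an assumption about your Crawler subclass):

    import java.util.concurrent.atomic.AtomicLong;

    // Updated from your Crawler's afterCrawl override, e.g.:
    //   protected void afterCrawl(CrawlerTask task, Page page) {
    //       lastCrawl.set(System.currentTimeMillis());
    //       super.afterCrawl(task, page);
    //   }
    final AtomicLong lastCrawl = new AtomicLong(System.currentTimeMillis());

    // Finished only when the queue is empty AND nothing has been
    // crawled for a quiet period (one minute here)
    long quietMillis = 60000;
    while (crawlerController.getTasksCount() > 0
            || System.currentTimeMillis() - lastCrawl.get() < quietMillis) {
        Thread.sleep(1000); // handle InterruptedException as appropriate
    }
    crawlerController.stop();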

About seed URLs: they are used only when the crawler starts for the first time, 
so you should start a new controller to crawl new seed URLs.

Original comment by ay.mesh...@gmail.com on 27 Apr 2011 at 8:27

GoogleCodeExporter commented 8 years ago
I have added a new download with the latest build. crawler-1.0.ZIP is old and 
has some bugs.

Original comment by ay.mesh...@gmail.com on 27 Apr 2011 at 11:58