asepaprianto / crawler4j

Automatically exported from code.google.com/p/crawler4j

Does it crawl every site only once? My crawler stops crawling after 355 sites #307

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Tried to crawl an event site.

What is the expected output? What do you see instead?
After 355 websites it stops running and reports that there are no more sites,
although more are present.

What version of the product are you using?
3.5

Please provide any additional information below.

Original issue reported on code.google.com by mansiawa...@gmail.com on 17 Sep 2014 at 7:00

GoogleCodeExporter commented 9 years ago
I want to ask: does it take care to visit every site only once, even when links
to it appear in more than one place?

Original comment by mansiawa...@gmail.com on 17 Sep 2014 at 7:01

GoogleCodeExporter commented 9 years ago
Please post here the site you wanted to crawl and your controller so I can 
check your scenario.

Maybe you put some restriction in "shouldVisit"?

Original comment by avrah...@gmail.com on 18 Sep 2014 at 5:38
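For readers following along, here is a minimal sketch of the kind of restriction the comment above refers to. It is not crawler4j code: `VisitFilterSketch`, its `shouldVisit(String, String)` helper, and the example.com URLs are hypothetical stand-ins for the decision a `shouldVisit` override typically makes (skip static assets, stay under one URL prefix). It only illustrates how a strict prefix check can quietly exclude links that look like part of the same site.

```java
import java.util.regex.Pattern;

public class VisitFilterSketch {
    // Hypothetical filter: skip links that end in common static-asset extensions.
    private static final Pattern ASSET_EXTENSIONS =
            Pattern.compile(".*\\.(css|js|gif|jpe?g|png|mp3|mp4|zip|gz|pdf)$");

    // Stand-in for the check inside a shouldVisit override (not the library signature).
    public static boolean shouldVisit(String url, String allowedPrefix) {
        String href = url.toLowerCase();
        return !ASSET_EXTENSIONS.matcher(href).matches()
                && href.startsWith(allowedPrefix);
    }

    public static void main(String[] args) {
        String prefix = "http://example.com/events/";
        // Accepted: under the prefix and not a static asset.
        System.out.println(shouldVisit("http://example.com/events/1", prefix));
        // Rejected: the "www." host no longer matches the prefix.
        System.out.println(shouldVisit("http://www.example.com/events/2", prefix));
        // Rejected: image extension.
        System.out.println(shouldVisit("http://example.com/events/logo.png", prefix));
    }
}
```

If a crawl ends earlier than expected, logging every URL that such a check rejects is a cheap way to see whether the filter, rather than the crawler, is discarding the remaining pages.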

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
I just want to know one thing: does this crawler take care not to visit the
same page again, or do we have to handle that explicitly?

Original comment by mansiawa...@gmail.com on 18 Sep 2014 at 6:55

GoogleCodeExporter commented 9 years ago
Any update?

Original comment by mansiawa...@gmail.com on 18 Sep 2014 at 8:48

GoogleCodeExporter commented 9 years ago
Hey, it's urgent. Can you please help?

Original comment by mansiawa...@gmail.com on 18 Sep 2014 at 8:52

GoogleCodeExporter commented 9 years ago
The crawler saves each and every URL it crawls in its internal DB, so it won't
crawl the same URL twice.

So if many pages reference a specific URL, it will visit that page only once!

Original comment by avrah...@gmail.com on 18 Sep 2014 at 9:41
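The behaviour described in the comment above can be pictured with a small sketch. This is an in-memory illustration only, not crawler4j's implementation: the class `SeenUrlFrontier` and its methods are hypothetical names, and a `HashSet` stands in for the internal DB. A URL is queued the first time it is discovered and ignored on every later discovery.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class SeenUrlFrontier {
    // Every URL ever scheduled; an in-memory stand-in for the crawler's internal DB.
    private final Set<String> seen = new HashSet<>();
    // URLs discovered but not yet fetched.
    private final Queue<String> pending = new ArrayDeque<>();

    // Queues the URL only on its first discovery; returns false for repeats.
    public boolean schedule(String url) {
        if (!seen.add(url)) {
            return false;
        }
        pending.add(url);
        return true;
    }

    // Returns the next URL to fetch, or null when nothing is left.
    public String next() {
        return pending.poll();
    }

    public int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        SeenUrlFrontier frontier = new SeenUrlFrontier();
        frontier.schedule("http://example.com/a");
        frontier.schedule("http://example.com/b");
        // The same link found again on another page is ignored.
        frontier.schedule("http://example.com/a");
        System.out.println(frontier.pendingCount()); // prints 2
    }
}
```

In this sketch the comparison is on the exact URL string, so two spellings of the same address would count as two pages. When `next()` returns null there is nothing left to fetch, which is the point at which a crawl of this shape would end.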

GoogleCodeExporter commented 9 years ago
OK, then I don't understand why the crawler stops after crawling 355 pages.
I am using http://schedule.sxsw.come as the seed, but it stops after crawling
355 URLs although there are many more.

Original comment by mansiawa...@gmail.com on 18 Sep 2014 at 11:01

GoogleCodeExporter commented 9 years ago
I wanted to test your scenario but this URL seems to be down now

Original comment by avrah...@gmail.com on 22 Sep 2014 at 2:39

GoogleCodeExporter commented 9 years ago
Does the problem still occur?

If there is no further activity on this issue, I will have to close it.

Original comment by avrah...@gmail.com on 5 Dec 2014 at 8:52

GoogleCodeExporter commented 9 years ago

Original comment by avrah...@gmail.com on 8 Dec 2014 at 3:50