bgarrels / crawler4j

Automatically exported from code.google.com/p/crawler4j

Making a focused crawler based on the page content? #151

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Hi,

I want to turn this into a focused web crawler based on the content of the pages it 
retrieves. So for instance, once page A is downloaded, a classifier determines 
whether page A is relevant to my domain; the outlinks of that page are added to 
the frontier only if the page is considered relevant.

I don't think this is possible by changing the 'shouldVisit()' method (as in 
post id 119), because that method only sees the URL, and I want to make my 
decision based on the page content.

Where should I be looking in the code in order to integrate this?

Your help is really appreciated. Thanks.

Original issue reported on code.google.com by smsa...@gmail.com on 10 May 2012 at 5:40
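The core idea being asked about, scheduling a page's outlinks only after the downloaded content itself passes a relevance classifier, can be illustrated independently of crawler4j's internals. The sketch below is not crawler4j's actual API: `PageRec`, the in-memory "web" map, and the keyword classifier are all hypothetical stand-ins used to show the focused-crawler frontier rule.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

public class FocusedCrawlerSketch {

    // Hypothetical in-memory "web": a page has text content and outlinks.
    static class PageRec {
        final String text;
        final List<String> outlinks;
        PageRec(String text, List<String> outlinks) {
            this.text = text;
            this.outlinks = outlinks;
        }
    }

    // BFS crawl loop: a page's outlinks enter the frontier only if the
    // downloaded page itself is judged relevant by the classifier.
    static List<String> crawl(Map<String, PageRec> web, String seed,
                              Predicate<String> isRelevant) {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> scheduled = new HashSet<>();
        List<String> visited = new ArrayList<>();
        frontier.add(seed);
        scheduled.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            PageRec page = web.get(url);
            if (page == null) continue;   // fetch failed / unknown URL
            visited.add(url);
            // Focused step: classify the fetched *content*, not the URL.
            if (isRelevant.test(page.text)) {
                for (String out : page.outlinks) {
                    if (scheduled.add(out)) frontier.add(out);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, PageRec> web = new HashMap<>();
        web.put("A", new PageRec("on-topic: crawler research", List.of("B", "C")));
        web.put("B", new PageRec("off-topic: cooking recipes", List.of("D")));
        web.put("C", new PageRec("on-topic: crawler survey", List.of("E")));
        web.put("D", new PageRec("never fetched", List.of()));
        web.put("E", new PageRec("on-topic: crawler leaf", List.of()));

        // Stand-in for a real classifier.
        Predicate<String> classifier = text -> text.contains("crawler");

        List<String> visited = crawl(web, "A", classifier);
        // B is still visited (it was linked from the relevant page A), but
        // because B is classified irrelevant, its outlink D is never scheduled.
        if (!visited.contains("B") || visited.contains("D")) {
            throw new AssertionError(visited);
        }
        System.out.println(visited);   // prints [A, B, C, E]
    }
}
```

In crawler4j terms, the natural place for the classification is the `visit(Page page)` callback, which does receive the parsed content; later crawler4j releases also appear to pass the referring `Page` into `shouldVisit`, which would allow gating a link on the relevance of the page it was found on. Verify against the version of `WebCrawler` you are actually using.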

GoogleCodeExporter commented 9 years ago
I have the same question; is there a solution to this?

Original comment by mrkevind...@gmail.com on 2 Aug 2013 at 3:08

GoogleCodeExporter commented 9 years ago
Not a bug or feature request

Original comment by avrah...@gmail.com on 11 Aug 2014 at 1:50