This is for my wishlist :-)
Could you please include these two classes in your engine, to ease the URL
filtering process? I took them from the Nutch package and changed them a bit to
fit my needs (originally they were a Nutch plugin; now they are standalone).
You can filter URLs with regexps, prefixing each one with '+' to include or '-' to exclude the URLs it matches.
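To illustrate the '+' / '-' rule format, here is a minimal sketch. It is not the attached RegexRule/RegexURLFilter code: the class name UrlFilterSketch, the sample rules and the example.com URLs are all made up for illustration. It assumes that the first rule whose regexp is found in the URL decides the outcome, and that a URL matching no rule is rejected; the attached classes may behave differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch; NOT the attached RegexRule/RegexURLFilter classes.
class UrlFilterSketch {

    // One rule: a sign ('+' = include, '-' = exclude) plus a compiled regexp.
    private static final class Rule {
        final boolean include;
        final Pattern pattern;

        Rule(boolean include, String regex) {
            this.include = include;
            this.pattern = Pattern.compile(regex);
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Parses rule lines; blank lines and lines starting with '#' are skipped.
    UrlFilterSketch(List<String> lines) {
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException(
                        "Rule must start with '+' or '-': " + line);
            }
            rules.add(new Rule(sign == '+', line.substring(1)));
        }
    }

    // Assumption: the first rule whose regexp is found in the URL decides;
    // a URL that matches no rule is rejected.
    boolean accept(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.include;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UrlFilterSketch filter = new UrlFilterSketch(Arrays.asList(
                "# skip common image files",
                "-\\.(gif|jpg|png)$",
                "# stay inside one (made-up) host",
                "+^https?://www\\.example\\.com/"));

        System.out.println(filter.accept("http://www.example.com/index.html")); // true
        System.out.println(filter.accept("http://www.example.com/logo.png"));   // false
        System.out.println(filter.accept("http://other.example.org/"));         // false
    }
}
```

Because the first matching rule wins in this sketch, exclusions have to be listed before broader inclusions, which is why the image rule comes first.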
Files:
- "regex-urlfilter.crawl.txt": an example rule file I use during crawling;
- "RegexRule.java" and "RegexURLFilter.java": the two main classes;
- "SampleCrawler.java": the sample crawler.
I hope this helps.
Regards,
Emmanuel
Original issue reported on code.google.com by zygolech...@gmail.com on 13 May 2013 at 6:54