Open thammegowda opened 6 years ago
Great job, Thamme! If I may, this is "focused language crawling" as opposed to, e.g., "focused multimedia crawling" or "web page crawling". We should update the issue title to reflect that. Great job filing the issue.
Thanks for the suggestion. The title is now updated 👍 Focused crawling is needed by everybody, but no existing crawler seems to do it right. We (Sparkler) now have our thinking caps on for this task, and we will propose a good solution for languages, multimedia, etc.
Yeah - this could be really cool!
The first task is defining and expressing the focused crawling specification. The second task will be implementing that specification in Sparkler.
Currently, we support URL-based focus/filters. This has to be extended with content-based focus.
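To make the idea of combining URL-based and content-based focus concrete, here is a minimal sketch in Python. It is not Sparkler's API or its configuration format: `url_filter`, `keyword_filter`, `all_of`, the `example.org` pattern, and the keyword list are all hypothetical names and values chosen for illustration, and a real content-based focus would likely use something richer than keyword matching (e.g. language detection or a classifier).

```python
import re
from typing import Callable, List

# Hypothetical filter type: takes (url, page_text), returns True to keep the page.
Filter = Callable[[str, str], bool]


def url_filter(pattern: str) -> Filter:
    """URL-based focus: keep pages whose URL matches the regex."""
    regex = re.compile(pattern)
    return lambda url, text: regex.search(url) is not None


def keyword_filter(keywords: List[str], min_hits: int = 1) -> Filter:
    """Content-based focus (a crude stand-in): keep pages whose text
    contains at least `min_hits` of the given keywords."""
    lowered = [k.lower() for k in keywords]

    def check(url: str, text: str) -> bool:
        body = text.lower()
        hits = sum(1 for k in lowered if k in body)
        return hits >= min_hits

    return check


def all_of(*filters: Filter) -> Filter:
    """A 'focus' expressed as a conjunction of filters."""
    return lambda url, text: all(f(url, text) for f in filters)


# Assumed example focus: pages under *.example.org that mention both keywords.
focus = all_of(
    url_filter(r"^https?://[^/]+\.example\.org/"),
    keyword_filter(["crawler", "spark"], min_hits=2),
)
```

Treating each filter as a plain predicate keeps the specification composable: further combinators (such as an `any_of`) or other content-based checks could be added without changing how a focus is evaluated.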
An example task could be:
Sparkler should be able to express and accept this first 'focus' requirement, which is a combination of two filters: