USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0

Support for flexible focus language crawling framework #139

Open thammegowda opened 6 years ago

thammegowda commented 6 years ago

The first subtask is defining and expressing the focus crawling specification. The second subtask is implementing that specification in Sparkler.

Currently, we have support for URL-based focus/filters. This has to be extended with content-based focus.

Example tasks:

  1. "Crawl top news in Kannada language"
  2. "Crawl sports news in XYZ language"
  3. "Crawl cooking blogs that are in XYZ language"
  4. "Crawl poetry or song lyrics in XYZ language"
  5. "Crawl news about earthquakes in XYZ language"

Sparkler should be able to express and accept this first 'focus' requirement, which is a combination of two filters:

  1. A language filter, often for rare languages (i.e. languages that are not supported by Google Translate). There are a few thousand of these.
  2. A domain filter, such as cooking, news, or sports news. There are maybe a few tens or hundreds of these at most.
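
As a rough illustration, the two-filter focus described above could be expressed along these lines. This is a minimal Python sketch, not Sparkler code: `FocusSpec`, `make_content_filter`, and both toy helper functions are hypothetical names, and the language detector and domain classifier are trivial stand-ins for real models.

```python
from dataclasses import dataclass
from typing import Callable


# Hypothetical names throughout; none of this is part of Sparkler's API.
@dataclass(frozen=True)
class FocusSpec:
    language: str  # language code, e.g. "kn" for Kannada
    domain: str    # topic label, e.g. "news", "sports-news", "cooking"


def make_content_filter(
    spec: FocusSpec,
    detect_language: Callable[[str], str],
    classify_domain: Callable[[str], str],
) -> Callable[[str], bool]:
    """Build a predicate that accepts page text only if both filters pass."""

    def accept(text: str) -> bool:
        # Both conditions must hold: the language matches AND the domain matches.
        if detect_language(text) != spec.language:
            return False
        return classify_domain(text) == spec.domain

    return accept


# Toy stand-in for a language identifier: any character in the Kannada
# Unicode block (U+0C80..U+0CFF) marks the text as Kannada.
def toy_detect_language(text: str) -> str:
    return "kn" if any("\u0c80" <= ch <= "\u0cff" for ch in text) else "en"


# Toy stand-in for a topic classifier: a keyword lookup.
def toy_classify_domain(text: str) -> str:
    return "news" if "ಸುದ್ದಿ" in text or "news" in text.lower() else "other"


# "Crawl top news in Kannada language" expressed as a spec plus a filter.
kannada_news = make_content_filter(
    FocusSpec(language="kn", domain="news"),
    toy_detect_language,
    toy_classify_domain,
)
```

Keeping the specification (`FocusSpec`) separate from the detectors means the same spec could be paired with whatever language-identification or topic model turns out to work for a given rare language.
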

chrismattmann commented 6 years ago

Great job, Thamme! If I may, this is "focused language crawling" as opposed to, e.g., "focused multimedia crawling" or "web page crawling". We should update the issue title to reflect that. Great job filing the issue.

thammegowda commented 6 years ago

Thanks for the suggestion. The title is now updated 👍 Focus crawling is needed by everybody, but no existing crawler seems to do it right. We (Sparkler) now have our thinking cap on for this task, and we will propose a good solution for languages, multimedia, etc.

wmburke commented 6 years ago

Yeah - this could be really cool!