USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0

Debugging crawl in Sparkler #16

Closed karanjeets closed 6 years ago

karanjeets commented 8 years ago

URL Partitioner

Input: URLs obtained by querying Solr for records due to be fetched, e.g. documents matching

status:NEW

Output: Files with a list of URLs partitioned by host (group), such that every file corresponds to one host
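The grouping step above can be sketched as follows. This is only an illustration, not Sparkler's actual partitioner: the URL list is hardcoded here, whereas in the real pipeline it would come from the Solr query (`status:NEW`) described above.

```python
from collections import defaultdict
from urllib.parse import urlparse

def partition_by_host(urls):
    """Group URLs by host so each group can be written to its own file."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlparse(url).hostname].append(url)
    return dict(groups)

# Hardcoded sample input; the real input would be the Solr query results.
urls = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.org/c",
]
for host, group in partition_by_host(urls).items():
    print(host, group)
```

Each `(host, group)` pair would then be written to a separate file, giving one fetch list per host.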

Fetch

Input: URL to fetch
Output: Request and response headers written to a file
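A minimal sketch of the debug artifact this step would produce. The header values are hardcoded stand-ins; a real fetch step would populate them from the HTTP request/response.

```python
def format_headers(request_headers, response_headers):
    """Render request and response headers as the per-URL debug file body."""
    lines = ["== Request Headers =="]
    lines += [f"{k}: {v}" for k, v in request_headers.items()]
    lines += ["", "== Response Headers =="]
    lines += [f"{k}: {v}" for k, v in response_headers.items()]
    return "\n".join(lines)

# Sample headers for illustration only.
text = format_headers(
    {"User-Agent": "sparkler-debug"},
    {"Content-Type": "text/html", "Status": "200"},
)
print(text)
# Writing `text` to a file gives the per-URL debug output described above.
```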

Parse

Input: URL (which will be fetched and parsed) OR the already fetched content
Output: Extracted content
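As a toy version of this parse step, the sketch below extracts visible text and outlinks from fetched HTML using the standard library. Sparkler's real parser is not this; it is only meant to show the input/output shape.

```python
from html.parser import HTMLParser

class TextAndLinkExtractor(HTMLParser):
    """Collect visible text fragments and outlink hrefs from HTML."""
    def __init__(self):
        super().__init__()
        self.text_parts, self.links = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())

# Sample fetched content, hardcoded for illustration.
html = '<html><body><p>Hello</p><a href="http://example.com/next">next</a></body></html>'
p = TextAndLinkExtractor()
p.feed(html)
print(p.text_parts, p.links)
```

The extracted outlinks are what would feed back into the crawl frontier as new candidate URLs.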

Fair Fetcher

Input: List of URLs; uses the crawl policy
Output: Fetched and/or parsed content in separate files under a directory
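One way to read "fair" here is politeness across hosts: no single host should be hit with consecutive requests. The sketch below round-robins across hosts as a stand-in for such a crawl policy; Sparkler's actual policy (delays, per-host limits) is not shown.

```python
from collections import defaultdict
from itertools import zip_longest
from urllib.parse import urlparse

def fair_order(urls):
    """Interleave URLs across hosts, round-robin style, so requests to
    the same host are spread out rather than issued back to back."""
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlparse(url).hostname].append(url)
    ordered = []
    for batch in zip_longest(*by_host.values()):
        ordered.extend(u for u in batch if u is not None)
    return ordered

print(fair_order([
    "http://a.com/1", "http://a.com/2", "http://b.com/1",
]))
```

Each URL in the resulting order would then be fetched (and optionally parsed), with the output landing in its own file under the run directory.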

rahulpalamuttam commented 8 years ago

@karanjeets @thammegowda

Thamme suggested writing debug info to log.debug. Given that, let's not log the fetched content itself; the rest is fair game.

thammegowda commented 8 years ago

@rahulpalamuttam Yes, logging content would be overkill for debugging.

We need tools to invoke individual pieces of the crawl pipeline (e.g. the fetch and parse stages) from the command line for testing/debugging.
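A skeleton of what such a debug CLI could look like, with one subcommand per pipeline stage. The program name, subcommands, and arguments are hypothetical; Sparkler's real entry points may differ, and the stage implementations are left out entirely.

```python
import argparse

def build_cli():
    """Hypothetical CLI with one subcommand per debuggable stage."""
    parser = argparse.ArgumentParser(prog="sparkler-debug")
    sub = parser.add_subparsers(dest="stage", required=True)

    fetch = sub.add_parser("fetch", help="fetch one URL and dump its headers")
    fetch.add_argument("url")

    parse = sub.add_parser("parse", help="fetch and parse one URL")
    parse.add_argument("url")

    return parser

# Example invocation, parsed programmatically for illustration:
args = build_cli().parse_args(["fetch", "http://example.com"])
print(args.stage, args.url)
```

The idea is that `sparkler-debug fetch <url>` or `sparkler-debug parse <url>` exercises exactly one stage in isolation, without running a full crawl.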

thammegowda commented 6 years ago

The suggested tools are log.debug and the debugger in an IDE (I use IntelliJ IDEA). Closing this; will reopen if better tools are needed.