USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0

Debugging crawl in Sparkler #16

Closed karanjeets closed 6 years ago

karanjeets commented 8 years ago

URL Partitioner

Input: URLs obtained by querying Solr for records due to be fetched, e.g. documents matching

status:NEW

Output: Files with a list of URLs partitioned by host (group), such that every file corresponds to one host
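The grouping step above can be sketched as follows. This is only an illustration, not Sparkler's actual partitioner: the URL list is hardcoded here, whereas in the real pipeline it would come from the Solr query (`status:NEW`) described above.

```python
from collections import defaultdict
from urllib.parse import urlparse

def partition_by_host(urls):
    """Group URLs by host so each group can be written to its own file."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlparse(url).hostname].append(url)
    return dict(groups)

# Hardcoded sample input; the real input would be the Solr query results.
urls = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.org/c",
]
for host, group in partition_by_host(urls).items():
    print(host, group)
```

Each `(host, group)` pair would then be written to a separate file, giving one fetch list per host.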

Fetch

Input: URL to fetch
Output: Request and response headers written to a file
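A minimal sketch of the debug artifact this step would produce. The header values are hardcoded stand-ins; a real fetch step would populate them from the HTTP request/response.

```python
def format_headers(request_headers, response_headers):
    """Render request and response headers as the per-URL debug file body."""
    lines = ["== Request Headers =="]
    lines += [f"{k}: {v}" for k, v in request_headers.items()]
    lines += ["", "== Response Headers =="]
    lines += [f"{k}: {v}" for k, v in response_headers.items()]
    return "\n".join(lines)

# Sample headers for illustration only.
text = format_headers(
    {"User-Agent": "sparkler-debug"},
    {"Content-Type": "text/html", "Status": "200"},
)
print(text)
# Writing `text` to a file gives the per-URL debug output described above.
```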

Parse

Input: URL (which will be fetched and parsed) OR the already fetched content
Output: Extracted content
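As a toy version of this parse step, the sketch below extracts visible text and outlinks from fetched HTML using the standard library. Sparkler's real parser is not this; it is only meant to show the input/output shape.

```python
from html.parser import HTMLParser

class TextAndLinkExtractor(HTMLParser):
    """Collect visible text fragments and outlink hrefs from HTML."""
    def __init__(self):
        super().__init__()
        self.text_parts, self.links = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())

# Sample fetched content, hardcoded for illustration.
html = '<html><body><p>Hello</p><a href="http://example.com/next">next</a></body></html>'
p = TextAndLinkExtractor()
p.feed(html)
print(p.text_parts, p.links)
```

The extracted outlinks are what would feed back into the crawl frontier as new candidate URLs.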

Fair Fetcher

Input: List of URLs; uses the crawl policy
Output: Fetched and/or parsed content in separate files under a directory
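One way to read "fair" here is politeness across hosts: no single host should be hit with consecutive requests. The sketch below round-robins across hosts as a stand-in for such a crawl policy; Sparkler's actual policy (delays, per-host limits) is not shown.

```python
from collections import defaultdict
from itertools import zip_longest
from urllib.parse import urlparse

def fair_order(urls):
    """Interleave URLs across hosts, round-robin style, so requests to
    the same host are spread out rather than issued back to back."""
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlparse(url).hostname].append(url)
    ordered = []
    for batch in zip_longest(*by_host.values()):
        ordered.extend(u for u in batch if u is not None)
    return ordered

print(fair_order([
    "http://a.com/1", "http://a.com/2", "http://b.com/1",
]))
```

Each URL in the resulting order would then be fetched (and optionally parsed), with the output landing in its own file under the run directory.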

rahulpalamuttam commented 8 years ago

@karanjeets @thammegowda

Thamme suggested writing debug info to log.debug. Given that, let's not log the fetched content itself; the rest is fair game.

thammegowda commented 8 years ago

@rahulpalamuttam Yes, logging content would be overkill for debugging.

We need tools to invoke individual pieces of the crawl pipeline (e.g. the fetch and parse stages) from the command line for testing/debugging.
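A skeleton of what such a debug CLI could look like, with one subcommand per pipeline stage. The program name, subcommands, and arguments are hypothetical; Sparkler's real entry points may differ, and the stage implementations are left out entirely.

```python
import argparse

def build_cli():
    """Hypothetical CLI with one subcommand per debuggable stage."""
    parser = argparse.ArgumentParser(prog="sparkler-debug")
    sub = parser.add_subparsers(dest="stage", required=True)

    fetch = sub.add_parser("fetch", help="fetch one URL and dump its headers")
    fetch.add_argument("url")

    parse = sub.add_parser("parse", help="fetch and parse one URL")
    parse.add_argument("url")

    return parser

# Example invocation, parsed programmatically for illustration:
args = build_cli().parse_args(["fetch", "http://example.com"])
print(args.stage, args.url)
```

The idea is that `sparkler-debug fetch <url>` or `sparkler-debug parse <url>` exercises exactly one stage in isolation, without running a full crawl.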

thammegowda commented 6 years ago

The suggested tools are log.debug and the debugger in an IDE (I use IntelliJ IDEA). Closing this; will reopen if better tools are needed.