Closed by karanjeets 6 years ago
@karanjeets @thammegowda
Thamme suggested writing debug info to log.debug. Given that, let's not write the content to the log. The rest is fair game.
@rahulpalamuttam Yes, logging the content would be overkill for debugging.
We need tools to invoke individual pieces of the crawl pipeline (such as the fetch and parse stages) from the command line for testing and debugging.
The suggested tools are log.debug and the IDE's debugger (I use IntelliJ IDEA). Closing this; will reopen if better tools are needed.
URL Partitioner
Input: Query Solr for the URLs to be generated
Output: Files with a list of URLs partitioned by host (group) such that every file corresponds to one host
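The partitioning step described above could be sketched roughly as follows. This is only an illustration in Python, not the project's implementation: the Solr query is replaced by a plain list of URLs, and the names `partition_by_host` and `write_partitions` and the `<host>.txt` file naming are assumptions made for the example.

```python
from collections import defaultdict
from pathlib import Path
from urllib.parse import urlsplit


def partition_by_host(urls):
    """Group URLs by host name, preserving input order within each group."""
    groups = defaultdict(list)
    for url in urls:
        # hostname is lower-cased by urlsplit; fall back for URLs without a host
        host = urlsplit(url).hostname or "unknown"
        groups[host].append(url)
    return dict(groups)


def write_partitions(groups, out_dir):
    """Write one file per host (hypothetical naming: <host>.txt), one URL per line."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for host, host_urls in sorted(groups.items()):
        path = out / f"{host}.txt"
        path.write_text("\n".join(host_urls) + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

In a real tool the list passed to `partition_by_host` would come from the Solr query results rather than being hard-coded.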
Fetch
Input: URL to fetch
Output: Request and response headers written to a file
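A standalone fetch tool along these lines might look like the sketch below. It uses only Python's standard `urllib.request`, which is a stand-in and not the fetcher used by the crawler; the function names, the `debug-fetch/0.1` user agent, and the report layout are all assumptions made for illustration. In keeping with the discussion above, only headers are written and the body is never read.

```python
import urllib.request


def format_headers(title, headers):
    """Render (name, value) pairs as a titled block of 'Name: value' lines."""
    lines = [f"== {title} =="]
    lines += [f"{name}: {value}" for name, value in headers]
    return "\n".join(lines) + "\n"


def fetch_headers(url, out_path, timeout=10):
    """Fetch url and write request and response headers (not the body) to out_path."""
    # Hypothetical user agent, chosen only for this example
    request = urllib.request.Request(url, headers={"User-Agent": "debug-fetch/0.1"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # header_items() also includes headers urllib added while sending
        report = format_headers("Request", request.header_items())
        report += format_headers("Response", response.headers.items())
    with open(out_path, "w", encoding="utf-8") as fh:
        fh.write(report)
    return report
```

A call such as `fetch_headers("http://example.com/", "headers.txt")` would leave both header blocks in `headers.txt`.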
Parse
Input: URL (which will be fetched and parsed) OR the fetched content
Output: Extracted content
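For the "fetched content" input, a parse tool could be approximated as below. This sketch uses Python's standard `html.parser` purely as a stand-in for whatever parser the pipeline actually uses, and it only handles HTML; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the bodies of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the text content of an HTML string, chunks joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The URL variant of the tool would fetch the page first and then pass the decoded body to `extract_text`.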
Fair Fetcher
Input: List of URLs (fetched according to the crawl policy)
Output: Fetched and/or parsed content in separate files under a directory
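The thread does not spell out the crawl policy, so the sketch below assumes one simple interpretation of "fair": visit hosts round-robin so that no single host receives a burst of consecutive requests. The function name `fair_order` and the round-robin rule are assumptions for illustration, not the project's policy, and the sketch covers only the ordering, not the fetching or the per-URL output files.

```python
from collections import OrderedDict, deque
from urllib.parse import urlsplit


def fair_order(urls):
    """Interleave URLs round-robin across hosts (assumed fairness rule)."""
    queues = OrderedDict()
    for url in urls:
        host = urlsplit(url).hostname or "unknown"
        queues.setdefault(host, deque()).append(url)

    ordered = []
    while queues:
        # Take one URL from each host that still has work, in first-seen order
        for host in list(queues):
            ordered.append(queues[host].popleft())
            if not queues[host]:
                del queues[host]
    return ordered
```

A fetch loop would then walk the returned list, optionally sleeping between requests to the same host, and write each result to its own file under the output directory.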