Ultimately I'd like a command-line utility where the arguments are (a rough sketch of the interface follows the list):
start url
depth of the crawl
whether PDFs, Word documents, or images are scraped
time limit
maximum number of files to download
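
A minimal sketch of what that argument surface could look like with Python's argparse; the flag names, defaults, and the file-type vocabulary here are my own assumptions for illustration, not settled decisions:

```python
import argparse

def parse_args():
    # Flag names and defaults below are assumptions, not part of the
    # original request.
    parser = argparse.ArgumentParser(description="recursive site scraper")
    parser.add_argument("start_url", help="URL to begin crawling from")
    parser.add_argument("--depth", type=int, default=2,
                        help="how many link levels deep to crawl")
    parser.add_argument("--types", nargs="*", choices=["pdf", "doc", "img"],
                        default=[], help="extra file types to download")
    parser.add_argument("--time-limit", type=int, default=300,
                        help="overall time budget in seconds")
    parser.add_argument("--max-files", type=int, default=500,
                        help="stop after downloading this many files")
    parser.add_argument("--out-dir", default="scraped",
                        help="directory to write downloaded files into")
    return parser.parse_args()

if __name__ == "__main__":
    print(parse_args())
```

So an invocation might look like `python scraper.py https://example.com --depth 3 --types pdf img --time-limit 120 --max-files 100`.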
I'd like the scraper to output the HTML and other files into a directory under their original names, ideally retaining the directory structure of the website
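
One plausible way to mirror the site layout on disk, assuming URL paths map cleanly onto filesystem paths; the `out_dir` default and the `index.html` fallback for directory-style URLs are assumptions:

```python
from pathlib import Path
from urllib.parse import urlsplit

def local_path_for(url: str, out_dir: str = "scraped") -> Path:
    # Mirror the site layout, e.g.
    #   https://example.com/docs/guide.pdf -> scraped/example.com/docs/guide.pdf
    # A directory-style URL (empty path or trailing slash) falls back to
    # index.html -- an assumption on my part.
    parts = urlsplit(url)
    rel = parts.path.lstrip("/")
    if not rel or rel.endswith("/"):
        rel += "index.html"
    return Path(out_dir) / parts.netloc / rel

# Parent directories need to exist before writing each file:
#   path = local_path_for(url)
#   path.parent.mkdir(parents=True, exist_ok=True)
```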
Ideally it would handle redirects from the starting URL gracefully
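
If fetching were done with the requests library, redirects are followed by default; here is a sketch of how the final URL could be captured so the crawl continues from wherever the start URL actually lands (the function name and timeout value are placeholders):

```python
import requests

def fetch(url: str, timeout: float = 10.0) -> requests.Response:
    # requests follows redirects by default; resp.history is non-empty
    # when at least one redirect occurred, and resp.url is the final
    # location after all redirects.
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()
    if resp.history:
        print(f"redirected: {url} -> {resp.url}")
    return resp
```

The crawl would then proceed from `resp.url` rather than the original start URL, so that relative links resolve against the final location.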