Exclude URL paths from being reconstructed and/or crawled

GoogleCodeExporter commented 9 years ago

It is sometimes useful to exclude URLs in order to reduce the scope of the 
reconstruction job. 

Example: A site where every page has an 'edit' URL. These pages should be 
excluded from even being crawled in the first place, for example by excluding 
the /edit pattern.

Example: A site that has many browsing and searching paths that lead to the 
same content pages. The browsing and search pages should be crawled to ensure 
complete coverage of the content pages, but not reconstructed, for example by 
excluding a pattern such as ?page=[0-9]+

Example: Only wanting to reconstruct a section of a website (everything under a 
particular subdirectory) by excluding specific other subdirectories from 
reconstruction.

Because excluding URLs from crawling and excluding them from reconstruction 
address separate use cases (although excluding a URL from crawling obviously 
also excludes it from reconstruction), I recommend separate command-line 
switches for each.

Original issue reported on code.google.com by mich...@urbsly.com on 27 Apr 2012 at 3:03

GoogleCodeExporter commented 9 years ago

Original comment by jbrunell...@gmail.com on 1 May 2012 at 10:33

Changed state: Started

GoogleCodeExporter commented 9 years ago

Breaking this into two parts:
The first is the subdirectory issue: only recovering from a particular 
subdirectory and not creeping into parent directories.

The second is the ability to exclude patterns of resources, such as the 
?page=... example in the feature request. I've started the first.

Original comment by jbrunell...@gmail.com on 1 May 2012 at 10:36

GoogleCodeExporter commented 9 years ago

Just to be clear:

1. *Crawling* parent directories may still be necessary to ensure complete 
coverage of the subdirectory contents for reconstruction.

2. IMHO, excluding a pattern is the general capability of which excluding 
particular directories is a specific use case.

So I still think that the best form for these capabilities is something like 
the following:

  --donotcrawl PATTERN
  --donotreconstruct PATTERN

...where PATTERN is a regex that is matched against the directory path starting 
from the site root (that is, from the initial /, not including the domain).

Of the two, --donotreconstruct is more immediately useful, whereas --donotcrawl 
is more of a performance optimization.

It might also be nice to be able to specify a file with several patterns, one 
per line, rather than using a --donot switch several times on the command line.
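
To make the proposed semantics concrete, here is a minimal sketch in Python. It is an illustration only, not Warrick code: --donotcrawl and --donotreconstruct are just the switch names proposed above, and the function names, the unanchored re.search matching, and the inclusion of the query string in the matched path are all assumptions.

```python
import re
from urllib.parse import urlsplit


def site_path(url):
    """Return the path from the site root, plus any query string (no domain)."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return path + ("?" + parts.query if parts.query else "")


def decide(url, do_not_crawl, do_not_reconstruct):
    """Return (crawl, reconstruct) for a URL given two lists of regex strings."""
    target = site_path(url)
    crawl = not any(re.search(p, target) for p in do_not_crawl)
    # A URL that is never crawled can never be reconstructed.
    reconstruct = crawl and not any(
        re.search(p, target) for p in do_not_reconstruct
    )
    return crawl, reconstruct


def load_patterns(lines):
    """Read one pattern per line (e.g. from an open file), skipping blank lines."""
    return [line.strip() for line in lines if line.strip()]
```

With do_not_crawl set to [r"/edit$"] and do_not_reconstruct set to [r"\?page=[0-9]+"], an 'edit' URL would be skipped entirely, a paged listing would be crawled but not reconstructed, and an ordinary content page would be both crawled and reconstructed. Note that the ? has to be escaped for the pattern to be a valid regex.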

Original comment by mich...@urbsly.com on 1 May 2012 at 1:09

GoogleCodeExporter commented 9 years ago

Completed the -sd flag (indicating that Warrick should only recover content in 
the specified subdirectory).

For example:
If you provide http://myfolia.com/plants/, Warrick will only recover resources 
that come from the myfolia.com/plants subdirectory (no going up to the parents).
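
The subdirectory restriction described above can be pictured as a simple prefix check on the URL path. The following Python sketch is an assumption about the behaviour, not the actual Perl implementation in Warrick; in_subdirectory is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def in_subdirectory(candidate_url, seed_url):
    """True if candidate_url is on the seed's host and at or below its directory."""
    seed = urlsplit(seed_url)
    cand = urlsplit(candidate_url)
    if cand.netloc.lower() != seed.netloc.lower():
        return False
    # Normalise both paths to end with "/" so "/plants" does not match "/plantsale".
    base = seed.path if seed.path.endswith("/") else seed.path + "/"
    cand_path = cand.path if cand.path.endswith("/") else cand.path + "/"
    return cand_path.startswith(base)
```

Under this sketch, anything below http://myfolia.com/plants/ is kept, while http://myfolia.com/ itself and any sibling directory are rejected.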

I also provided the -ex|--exclude <FILE> feature. This will allow you to 
provide a file of regular expressions that you want to exclude from the 
recovery. For example, my test file looks like this:
myfolia\.com\/plants\/3581.*
staticweb\.archive\.org\/.*
myfolia\.com\/plants\/search\?page=.*

Meaning:
1) I don't want anything that starts with myfolia.com/plants/3581
2) I don't want any stylings or JS from the archive
3) I don't want any search pages from the plants subdirectory

So I can call Warrick as:
perl warrick.pl -sd -ex /home/jbrunelle/regex.in http://myfolia.com/plants

And I won't get any of the URLs matching the regular expressions.
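
For anyone writing their own exclusion file, here is a rough way to check which URLs a set of expressions would filter out. This Python sketch is not how Warrick itself applies the file (Warrick is written in Perl); it assumes an unanchored search against the full URL, which seems necessary because the expressions above omit the http:// scheme. compile_patterns and is_excluded are hypothetical helper names.

```python
import re

# The three expressions from the test file quoted above.
EXCLUDE_LINES = [
    r"myfolia\.com\/plants\/3581.*",
    r"staticweb\.archive\.org\/.*",
    r"myfolia\.com\/plants\/search\?page=.*",
]


def compile_patterns(lines):
    """Compile one regular expression per non-blank line."""
    return [re.compile(line.strip()) for line in lines if line.strip()]


def is_excluded(url, patterns):
    """True if any pattern matches somewhere in the URL."""
    # Assumption: unanchored search, since the patterns omit the scheme.
    return any(p.search(url) for p in patterns)
```

With these patterns, a URL such as http://myfolia.com/plants/search?page=2 would be excluded, while a plant page whose path does not start with /plants/3581 would be kept.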

Original comment by jbrunell...@gmail.com on 4 May 2012 at 2:51

Changed state: Fixed

TheProjecter / warrick

Exclude URL paths from being reconstructed and/or crawled #6