Crawl always starts from server root

Bren9393 / skipfish

Automatically exported from code.google.com/p/skipfish

Apache License 2.0


Crawl always starts from server root #193

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago

When calling skipfish as:

./skipfish -o ../out https://test.com/foo/bar/baz.html

The crawler always starts from https://test.com/, ignoring the path and 
parameters (and judging from the code in database.c, it seems to do this every 
time a link points to a new host).

I'd like to submit a patch to change this behavior (through a command-line 
switch), but before I do that I'd like to know the rationale for the current 
code, so that I don't break any useful use case.

Best regards,
Mattia

Original issue reported on code.google.com by mattiaba...@gmail.com on 1 Jul 2013 at 8:45

GoogleCodeExporter commented 8 years ago

There is a separate command-line parameter to limit the scan to a specific path 
(or to exclude specific paths). Without it, the scanner accepts any number of 
"seed" URLs on the command line but still brute-forces the entire site. All of 
the seed URLs should still get crawled, just not right away.

Original comment by lcam...@google.com on 1 Jul 2013 at 8:56

GoogleCodeExporter commented 8 years ago


To expand on what Michal said: using -I /foo/bar/ for explicit inclusion limits 
active testing to /foo/bar/*.
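
The inclusion switch mentioned above could be combined with the command from the original report roughly as follows. This is only a sketch: the target URL, output directory, and /foo/bar/ path are the example values used in this thread, and the variable names are placeholders chosen for illustration.

```shell
# Example values taken from this thread; replace them for a real scan.
target="https://test.com/foo/bar/baz.html"
include="/foo/bar/"
outdir="../out"

# -I restricts the scan to URLs matching the given string,
# -o names the output directory, and the last argument is the seed URL.
cmd="./skipfish -I $include -o $outdir $target"

# Print the assembled command so it can be reviewed before running it by hand.
echo "$cmd"
```

Running the printed command from the skipfish directory should keep active testing within /foo/bar/ instead of the whole host.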

Are you concerned about / or /foo/ being actively tested? This should not 
happen with -I. Or is there a different problem?

Original comment by niels.he...@gmail.com on 2 Jul 2013 at 8:18

GoogleCodeExporter commented 8 years ago

Original comment by niels.he...@gmail.com on 17 Nov 2013 at 8:16

Changed state: Invalid