Open mr-bo-jangles opened 3 years ago
This looks correct. The best way to know is by testing it, and I would love to see the result of such a test. If you can build this directory tree, just serve it with a web server and run suckit against localhost.
@mr-bo-jangles Did it work?
Maybe we can add an option to output URL filtering information to stdout or a file, e.g. whether the include or exclude regex matched? I think this would make it more transparent what suckit is doing. I also plan to implement functionality to rewrite local URLs, which could profit from this debug feature.
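To make the idea concrete, here is a minimal sketch of what such debug output could look like. Everything in it is hypothetical: `FilterDecision`, `decide` and `filter_log_line` are illustrative names, not suckit's code. The two closures stand in for the include and exclude regexes so the sketch needs no external crate. It assumes a URL is kept when the include filter matches and the exclude filter does not, which may differ from what suckit actually does.

```rust
/// Outcome of checking one URL against the filters (hypothetical type).
#[derive(Debug, PartialEq)]
enum FilterDecision {
    Accepted,
    NotIncluded,
    Excluded,
}

/// Decide what happens to `url`. The closures stand in for the include
/// and exclude regexes. Assumed semantics: keep a URL only if the include
/// filter matches and the exclude filter does not.
fn decide(
    url: &str,
    include: impl Fn(&str) -> bool,
    exclude: impl Fn(&str) -> bool,
) -> FilterDecision {
    if !include(url) {
        FilterDecision::NotIncluded
    } else if exclude(url) {
        FilterDecision::Excluded
    } else {
        FilterDecision::Accepted
    }
}

/// Format one line of debug output: the decision, a tab, then the URL.
fn filter_log_line(
    url: &str,
    include: impl Fn(&str) -> bool,
    exclude: impl Fn(&str) -> bool,
) -> String {
    format!("{:?}\t{}", decide(url, include, exclude), url)
}
```

Printing one such line per discovered URL to stdout or a file would show at a glance which filter rejected a link.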
> Maybe we can add an option to output URL filtering information to stdout or a file
Good idea
> I also plan to implement functionality to rewrite local URLs, which could profit from this debug feature.
What do you mean?
> What do you mean?
To download a phpBB forum, I added a hack to rewrite some URLs, namely to remove the ?sid=<hash> parameter. Otherwise the same pages get downloaded over and over again with different sid hashes.
If you want to take a look:
https://github.com/raphCode/suckit/blob/fusornet_hack/src/scraper.rs#L191
I originally planned to flesh this out into a dedicated feature / command line option, but eventually didn't. I already achieved my goal and I could not figure out a way to do it properly.
The problem with removing parameters such as ?sid is that they may change the content of the requested page. If you remove them, two links that differ only in those parameters collapse into a single page downloaded by suckit, even though they should yield two different pages.
In general you are correct, but in the specific case of phpBB the content is always the same, no matter the ?sid parameter value.
One solution would be to just ignore all links with this parameter, like suggested here, but this may create a swath of broken links. I just removed the parameter from the URL and collapsed all links into their "canonical" form without the session id parameter.
I actually just found a different solution, namely to send session cookies, which avoids ?sid parameters getting appended to links in the first place.
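For reference, collapsing links into a form without the session id can be sketched roughly as below. This is not the code from the linked hack. `strip_query_param` is a hypothetical helper that uses plain string handling from the standard library, and a real implementation would more likely go through a proper URL parser.

```rust
/// Remove every occurrence of the query parameter `name` from `url`.
/// Hypothetical helper: plain string handling, no percent-decoding.
fn strip_query_param(url: &str, name: &str) -> String {
    // Split off the fragment first so it can be re-attached unchanged.
    let (without_fragment, fragment) = match url.split_once('#') {
        Some((head, frag)) => (head, Some(frag)),
        None => (url, None),
    };
    // Separate the part before '?' from the query string.
    let (base, query) = match without_fragment.split_once('?') {
        Some((base, query)) => (base, query),
        None => (without_fragment, ""),
    };
    // Keep every key=value pair whose key is not exactly `name`.
    let kept: Vec<&str> = query
        .split('&')
        .filter(|pair| !pair.is_empty())
        .filter(|pair| pair.split('=').next() != Some(name))
        .collect();

    let mut out = String::from(base);
    if !kept.is_empty() {
        out.push('?');
        out.push_str(&kept.join("&"));
    }
    if let Some(frag) = fragment {
        out.push('#');
        out.push_str(frag);
    }
    out
}
```

With this, `strip_query_param("https://forum.example/viewtopic.php?t=42&sid=abc123", "sid")` yields the same URL without the `sid` pair, so every session-specific variant of a link maps to a single download. The host `forum.example` is made up for the example.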
We could imagine a solution where you would have a list of tuples, each pairing a regex with a list of parameters to remove:
Vec<(regex, Vec<parameter>)>
But it might be really costly
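A rough sketch of that rule list, to make the cost visible: each URL is checked once against every rule, and its query string is rewritten only if at least one rule matches. `StripRule` and `canonicalize` are hypothetical names. A plain substring test stands in for the regex so the sketch needs no external crate, and URLs are assumed to arrive without a fragment.

```rust
/// One rewrite rule: a URL pattern plus the query parameters to drop when
/// it matches. A substring stands in for the regex of the proposal.
struct StripRule {
    url_contains: String,
    params: Vec<String>,
}

/// Apply all matching rules to `url` and return the rewritten URL.
/// Assumes `url` carries no `#fragment`.
fn canonicalize(url: &str, rules: &[StripRule]) -> String {
    // Collect parameter names from every rule whose pattern matches.
    let to_drop: Vec<&str> = rules
        .iter()
        .filter(|rule| url.contains(rule.url_contains.as_str()))
        .flat_map(|rule| rule.params.iter().map(String::as_str))
        .collect();
    if to_drop.is_empty() {
        return url.to_string();
    }
    // Nothing to rewrite when there is no query string.
    let (base, query) = match url.split_once('?') {
        Some(parts) => parts,
        None => return url.to_string(),
    };
    // Keep only the pairs whose key is not listed for removal.
    let kept: Vec<&str> = query
        .split('&')
        .filter(|pair| !pair.is_empty())
        .filter(|pair| {
            let key = pair.split('=').next().unwrap_or("");
            !to_drop.contains(&key)
        })
        .collect();
    if kept.is_empty() {
        base.to_string()
    } else {
        format!("{}?{}", base, kept.join("&"))
    }
}
```

The work per URL grows with the number of rules, so with a handful of rules the overhead is probably small next to the network requests, but that is a guess and would need measuring.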
So my specific use case here is attempting to mirror a site with a lot of directories for various languages, but skipping the static files at a higher level.
Example Folder Structure
I want to be sure that by running a command similar to
suckit https://domain.tld -i "/Books/[a-Z0-9]+/"
I will download the tree under /Books/ while excluding anything under ./, ../, and ===/