Skallwar / suckit

Suck the InTernet
Apache License 2.0

Unicode handling of --include and --exclude #145

Open · mr-bo-jangles opened this issue 3 years ago

mr-bo-jangles commented 3 years ago

So my specific usecase here is attempting to mirror a site with a lot of directories of various languages, but skipping the static files at a higher level.

Example Folder Structure

/Static/<collection of unwanted static files>
/Assets/<collection of unwanted static files>
/Books/
       ./ -> /Books/
       ../ -> /
       ===/<directory tree of unwanted static files>
       121/<directory tree of static files>
       Help/<directory tree of static files>
       مساعدة/<directory tree of static files>
       Помощь/<directory tree of static files>

I want to be sure that by running a command similar to suckit https://domain.tld -i "/Books/[a-Z0-9]+/" I will download the tree under /Books/ while excluding anything under ./, ../, and ===/.
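
For illustration, here is a minimal stand-alone check of how such a pattern behaves against the Unicode directory names, assuming the -i/-e filters are Rust regex crate patterns (a hypothetical test, not suckit's own code; note that a-Z is not a valid class range in that crate, so [a-zA-Z0-9]+ stands in for it here):

```rust
// Hypothetical stand-alone check, assuming the -i/-e filters are Rust `regex` patterns.
use regex::Regex;

fn main() {
    // ASCII-only class, as in the command above (with the a-Z range corrected).
    let ascii_only = Regex::new(r"/Books/[a-zA-Z0-9]+/").unwrap();
    // \w is Unicode-aware by default in the regex crate.
    let unicode_word = Regex::new(r"/Books/\w+/").unwrap();

    for path in ["/Books/121/", "/Books/Help/", "/Books/مساعدة/", "/Books/Помощь/", "/Books/===/"] {
        println!(
            "{path}: ascii-only={} unicode-word={}",
            ascii_only.is_match(path),
            unicode_word.is_match(path)
        );
    }
}
```

Run against the tree above, the ASCII class skips مساعدة and Помощь, while \w+ keeps them and still rejects ===/.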

Skallwar commented 3 years ago

This looks correct. The best way to know is by testing it, and I would love to see the result of such a test. If you can build this directory tree, just serve it with a webserver and try running suckit against it on localhost.

Skallwar commented 3 years ago

@mr-bo-jangles Did it work?

raphCode commented 2 years ago

Maybe we can add an option to output URL filtering information to stdout or a file, e.g. whether the include or exclude regex matches? I think this would give more transparency about what suckit is doing. I also plan to implement functionality to rewrite the local URLs, which could profit from this debug feature.
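
To sketch the idea, the decision logging could look roughly like this (should_visit and its signature are placeholders, not suckit's current API):

```rust
// Rough sketch of logging every include/exclude decision; names are hypothetical.
use regex::Regex;

fn should_visit(url: &str, include: &[Regex], exclude: &[Regex], verbose: bool) -> bool {
    let included = include.is_empty() || include.iter().any(|re| re.is_match(url));
    let excluded = exclude.iter().any(|re| re.is_match(url));
    let keep = included && !excluded;
    if verbose {
        eprintln!("filter: {url} included={included} excluded={excluded} keep={keep}");
    }
    keep
}

fn main() {
    let include = vec![Regex::new(r"/Books/\w+/").unwrap()];
    should_visit("https://domain.tld/Books/Помощь/index.html", &include, &[], true);
    should_visit("https://domain.tld/Static/style.css", &include, &[], true);
}
```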

Skallwar commented 2 years ago

Maybe we can add an option to output URL filtering information to stdout or a file

Good idea

I also plan to implement functionality to rewrite the local URLs, which could profit from this debug feature.

What do you mean?

raphCode commented 2 years ago

What do you mean?

To download a phpBB forum, I added a hack to rewrite some URLs, namely remove a ?sid=<hash> parameter. Otherwise the same pages get downloaded over and over again with different sid hashes. If you want to take a look: https://github.com/raphCode/suckit/blob/fusornet_hack/src/scraper.rs#L191

I originally planned to flesh this out into a dedicated feature / command-line option, but eventually didn't: I had already achieved my goal, and I could not figure out a way to do it properly.
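
For reference, the rewrite boils down to something like this (a stand-alone sketch using the url crate; strip_sid is a made-up helper, not the code in scraper.rs):

```rust
// Drop the `sid` query parameter so links collapse to one canonical URL.
use url::Url;

fn strip_sid(raw: &str) -> String {
    let mut url = Url::parse(raw).expect("valid URL");
    let kept: Vec<(String, String)> = url
        .query_pairs()
        .filter(|(k, _)| k != "sid")
        .map(|(k, v)| (k.into_owned(), v.into_owned()))
        .collect();
    if kept.is_empty() {
        url.set_query(None);
    } else {
        url.query_pairs_mut().clear().extend_pairs(kept);
    }
    url.to_string()
}

fn main() {
    // Prints https://forum.example/viewtopic.php?f=2&t=10
    println!("{}", strip_sid("https://forum.example/viewtopic.php?f=2&t=10&sid=abc123"));
}
```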

Skallwar commented 2 years ago

The problem with removing parameters such as ?sid is that they might change the content of the requested page. If you remove them, two links that are identical except for those parameters will share a single page downloaded by suckit, while they should have two different pages.

raphCode commented 2 years ago

In general you are correct, but in the specific case of phpBB the content is always the same no matter the ?sid parameter value. One solution would be to simply ignore all links with this parameter, as suggested here, but that may create a swath of broken links. Instead, I just removed the parameter from the URL and collapsed all links into their "canonical" form without the session id parameter.

I actually just found a different solution: sending session cookies, which avoids ?sid parameters getting appended to links in the first place.
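
For completeness, the cookie approach is roughly this on the client side (assuming reqwest with the cookies and blocking features enabled; not necessarily how suckit builds its client):

```rust
// Keep the phpBB session in a cookie so ?sid=... never gets appended to links.
use reqwest::blocking::Client;

fn main() -> Result<(), reqwest::Error> {
    let client = Client::builder()
        .cookie_store(true) // persist session cookies between requests
        .build()?;
    let body = client.get("https://forum.example/index.php").send()?.text()?;
    println!("fetched {} bytes", body.len());
    Ok(())
}
```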

Skallwar commented 2 years ago

We could imagine a solution where you would have a list of tuples, each with a regex and a list of parameters to remove:

Vec<(regex, Vec<parameter>)>

But it might be really costly
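
For illustration, the rule list could be something like Vec<(Regex, Vec<String>)> in Rust, with the cost coming from matching every rule's regex against every discovered URL (names and types here are only a sketch):

```rust
// Illustrative only: each rule pairs a URL regex with the query parameters to drop.
use regex::Regex;

type RewriteRules = Vec<(Regex, Vec<String>)>;

// Union of the parameters to strip from this URL over all matching rules;
// the actual stripping could reuse the same query rebuilding as the sid sketch above.
fn params_to_drop<'a>(url: &str, rules: &'a RewriteRules) -> Vec<&'a str> {
    rules
        .iter()
        .filter(|(pattern, _)| pattern.is_match(url))
        .flat_map(|(_, params)| params.iter().map(String::as_str))
        .collect()
}

fn main() {
    let rules: RewriteRules =
        vec![(Regex::new(r"viewtopic\.php").unwrap(), vec!["sid".to_string()])];
    println!("{:?}", params_to_drop("https://forum.example/viewtopic.php?t=10&sid=abc", &rules));
}
```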