Skallwar / suckit

Suck the InTernet
Apache License 2.0

suckit crashes when mirroring docs.sentry.io #116

Closed untitaker closed 3 years ago

untitaker commented 3 years ago

suckit crashes when mirroring docs.sentry.io to disk. In the interest of reproducibility and speed, a zipfile with all static contents is attached; for simulation purposes it can be served with python3 -m http.server, http-server, or nginx (I recommend http-server, the npm package: it is easy to use and fast). The site is also open-source, so it can be built locally.

public.zip -- this file is copyright Sentry and licensed under the BSL.

unzip public.zip
http-server public/ &
suckit -c -j200 http://localhost:8080/ -o copy-of-public/

thread '<unnamed>' panicked at 'Couldn't create suckit/localhost/platform-redirect/index.html?next=/configuration/releases/: Is a directory (os error 21)', src/logger.rs:42:9
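The panic makes sense once you look at the URL: the query string `?next=/configuration/releases/` contains slashes, so a naive URL-to-path mapping produces a path that ends in `/` and collides with directories created for other pages. The sketch below (my own illustration, not suckit's actual code; the function names are hypothetical) shows the failure mode and one possible fix, escaping path separators inside the query component:

```rust
use std::path::PathBuf;

// Hypothetical helper illustrating the failure mode: keeping the query
// string verbatim means the `/` characters inside
// `?next=/configuration/releases/` introduce extra directory levels, so a
// later attempt to create that location as a *file* can hit
// "Is a directory (os error 21)".
fn url_to_path_naive(output: &str, url_path_and_query: &str) -> PathBuf {
    PathBuf::from(output).join(url_path_and_query.trim_start_matches('/'))
}

// One possible fix: percent-escape `/` inside the query component so the
// query never creates additional directories.
fn url_to_path_escaped(output: &str, url_path_and_query: &str) -> PathBuf {
    let escaped = match url_path_and_query.split_once('?') {
        Some((path, query)) => format!("{}?{}", path, query.replace('/', "%2F")),
        None => url_path_and_query.to_string(),
    };
    PathBuf::from(output).join(escaped.trim_start_matches('/'))
}

fn main() {
    let url = "/platform-redirect/index.html?next=/configuration/releases/";
    // The naive mapping ends in `/`, i.e. it names a directory:
    println!("{}", url_to_path_naive("copy-of-public", url).display());
    // The escaped mapping is a single file name under platform-redirect/:
    println!("{}", url_to_path_escaped("copy-of-public", url).display());
}
```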

Unrelated:

Skallwar commented 3 years ago

Thanks for reporting all of this. At the moment I am very busy with projects for my various courses.

I will look into this when I have free time. I will mark this as a good first issue.

untitaker commented 3 years ago

I will try to work on this and the other issues I filed.

Skallwar commented 3 years ago

Awesome!!

untitaker commented 3 years ago

#117 fixes the immediate panic. Now I get this:

thread '<unnamed>' panicked at 'Url http://localhost:8080/clients/cocoa/dsym/ was not found in the path map', src/logger.rs:42:9
stack backtrace:
   0: rust_begin_unwind
             at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:493:5
   1: std::panicking::begin_panic_fmt
             at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/panicking.rs:435:5
   2: suckit::logger::Logger::error
   3: suckit::scraper::Scraper::handle_url

I probably need help here; I do not understand the data flow into and out of the path map.

Skallwar commented 3 years ago

> I probably need help here, I do not understand the data flow from/into path map

The goal of the path map is to assign a unique path to a URL at the time we discover new URLs. This way we can cache the result of the url -> path conversion. The problem here is that the URL being handled at line 180 has already been encountered before, so it should have been added to the map at line 190, back when we were downloading and fixing the page that references the page we are currently working on.
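To make the invariant concrete, here is a minimal sketch of that idea (my own illustration, not suckit's actual code; the type and method names are hypothetical): discovery registers a unique path for each URL before it is ever scheduled, so a lookup miss at download time is exactly the "was not found in the path map" panic from the report.

```rust
use std::collections::HashMap;

// Illustrative path map: caches the url -> on-disk-path conversion.
struct PathMap {
    map: HashMap<String, String>,
    counter: usize,
}

impl PathMap {
    fn new() -> Self {
        PathMap { map: HashMap::new(), counter: 0 }
    }

    // Called while discovering links in a downloaded page: assigns a
    // unique path the first time a URL is seen, and returns the cached
    // path on every later sighting.
    fn register(&mut self, url: &str) -> String {
        if let Some(path) = self.map.get(url) {
            return path.clone();
        }
        let path = format!("page-{}.html", self.counter);
        self.counter += 1;
        self.map.insert(url.to_string(), path.clone());
        path
    }

    // Called at download time. By the invariant, the URL must already be
    // registered; a `None` here is the bug the backtrace shows.
    fn lookup(&self, url: &str) -> Option<&String> {
        self.map.get(url)
    }
}

fn main() {
    let mut pm = PathMap::new();
    // Discovery registers the URL first...
    let p = pm.register("http://localhost:8080/clients/cocoa/dsym/");
    // ...so the download stage can rely on the cached path.
    assert_eq!(
        pm.lookup("http://localhost:8080/clients/cocoa/dsym/"),
        Some(&p)
    );
    println!("{}", p);
}
```

Under this model, the crash means some code path schedules a URL for download without going through the registration step first.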

I don't know if I'm being clear enough :sweat_smile:. Feel free to tell me if I'm not.