Make binary for downloading entire websites with `wget` #27

Closed funguscolander closed 4 years ago

funguscolander commented 4 years ago

Use the code described in this gist to download entire websites. Maybe call it `wgetfullsite`?

Possibly add --no-hsts to the command to get rid of the following error:

Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.
ERROR: could not open HSTS store at '/home/$USER/.wget-hsts'. HSTS will be disabled.

This Stack Exchange Post states that HSTS is an optional security measure on top of HTTPS. Since the error says wget just won't apply HSTS when it can't use the store, and it continues with the download anyway, disabling it is probably fine.
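As a quick sanity check, a single-page fetch with the flag should come back without the HSTS warning (the URL here is just a placeholder):

# Hypothetical one-off check: the HSTS warning should no longer appear.
wget --no-hsts https://example.com/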

It would probably look something like this, written as the long version rather than the one-liner, and with `--no-hsts` added as described above (in the explained version the flags live in an array so each one can carry a comment, since a comment can't follow a trailing backslash):

# One liner
wget --recursive --page-requisites --adjust-extension --span-hosts --convert-links --restrict-file-names=windows --no-hsts --domains $site_dot_extension --no-parent $site_dot_extension

# Explained
wget_options=(
    --recursive                    # Download the whole site.
    --page-requisites              # Get all assets/elements (CSS/JS/images).
    --adjust-extension             # Save files with .html on the end.
    --span-hosts                   # Include necessary assets from offsite as well.
    --convert-links                # Update links to still work in the static version.
    --restrict-file-names=windows  # Modify filenames to work on Windows as well.
    --no-hsts                      # Disable HTTP Strict Transport Security, an optional protection against HTTPS man-in-the-middle/downgrade attacks.
    --domains yoursite.com         # Do not follow links outside this domain.
    --no-parent                    # Don't follow links outside the directory you pass in.
)
wget "${wget_options[@]}" yoursite.com/whatever/path  # The URL to download.
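A minimal sketch of what the proposed `wgetfullsite` script could look like, wrapping the long-form command above (the argument handling and the start-URL default are assumptions, not anything settled in this issue):

#!/usr/bin/env bash
# wgetfullsite: download an entire website for offline viewing (sketch).
# Usage: wgetfullsite <domain> [start-url]
set -euo pipefail

site="${1:?usage: wgetfullsite <domain> [start-url]}"
url="${2:-$site}"   # Default to starting from the domain root.

wget \
    --recursive \
    --page-requisites \
    --adjust-extension \
    --span-hosts \
    --convert-links \
    --restrict-file-names=windows \
    --no-hsts \
    --domains "$site" \
    --no-parent \
    "$url"

So `wgetfullsite yoursite.com yoursite.com/whatever/path` would reproduce the command above.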

Also:

- `--mirror` instead of `--recursive`: Turn on options suitable for mirroring. This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings. It is currently equivalent to `-r -N -l inf --no-remove-listing`. Source

- `--no-clobber`: Don't overwrite any existing files (used in case the download is interrupted and resumed).

- `--wait=1 --random-wait`: Some websites block scrapers by comparing request times. `--random-wait` varies the pause between requests to between 0.5 and 1.5 times the `--wait` value, so `--wait=1` gives a random pause of between 0.5 and 1.5 seconds.

- `-e robots=off`: Makes Wget actually download the entire website instead of obeying the disallowed subdirectories in robots.txt. Look here.

- `--user-agent=Mozilla`: Pretend to be Mozilla Firefox rather than Wget, so websites that block robots won't block the download. (Might be necessary with `-e robots=off`?)
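For reference, the extras above could slot into the long-form command roughly like this (a sketch only; yoursite.com is a placeholder, and `--no-clobber` is left out because wget refuses to combine it with the timestamping that `--mirror` turns on):

# Base command plus the optional extras (--mirror replaces --recursive).
wget \
    --mirror \
    --page-requisites \
    --adjust-extension \
    --span-hosts \
    --convert-links \
    --restrict-file-names=windows \
    --no-hsts \
    --wait=1 \
    --random-wait \
    -e robots=off \
    --user-agent=Mozilla \
    --domains yoursite.com \
    --no-parent \
    yoursite.com/whatever/path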