climate-mirror / datasets

For tracking data mirroring progress

Readme of climate-mirror/datasets needs updating #278

Open gabefair opened 7 years ago

gabefair commented 7 years ago

https://github.com/climate-mirror/datasets is missing information about the wget commands we are using, and needs updating in general.

siennathesane commented 7 years ago

feel free to submit a PR. 👍

jetbalsa commented 7 years ago

A torrent RSS feed managed by the maintainers would be nice. It would allow some auto-seeding, which would help spread the data across more hosts.

baobrien commented 7 years ago

What wget commands are you guys using?

gabefair commented 7 years ago

@baobrien, here is an example for HTTP crawls:

    wget --mirror --warc-file=www.bvo-dmo.org.warc --warc-cdx \
      --page-requisites --html-extension --convert-links \
      --execute robots=off --directory-prefix=. --span-hosts \
      --domains=bco-dmo.org,usjgofs.whoi.edu \
      --exclude-domains=mapservice.bco-dmo.org \
      --user-agent='Mozilla (mailto:flyingmana@googlemail.com)' \
      --wait=10 --random-wait http://www.bco-dmo.org/data
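If it helps, the crawl above could be wrapped in a small shell function so that the site-specific parts become arguments. This is only a sketch: `crawl_site`, its argument order, the `WGET` override, and the example.org address in the usage note are my own assumptions, not something the project uses. The --exclude-domains option is left out because it is specific to that one site.

```shell
# crawl_site START_URL WARC_NAME DOMAINS CONTACT_EMAIL
# Hypothetical helper: runs the same kind of WARC crawl with the
# site-specific values passed in as arguments.
crawl_site() {
    start_url=$1   # page where the crawl begins
    warc_name=$2   # base name for the WARC output
    domains=$3     # comma-separated list of domains wget may follow
    contact=$4     # contact address advertised in the user agent

    # WGET can be overridden (for example WGET=echo) to preview the
    # options without fetching anything.
    "${WGET:-wget}" --mirror "--warc-file=$warc_name" --warc-cdx \
        --page-requisites --html-extension --convert-links \
        --execute robots=off --directory-prefix=. --span-hosts \
        "--domains=$domains" \
        "--user-agent=Mozilla (mailto:$contact)" \
        --wait=10 --random-wait "$start_url"
}
```

Running `WGET=echo crawl_site http://www.bco-dmo.org/data bco-dmo.warc bco-dmo.org,usjgofs.whoi.edu you@example.org` prints the options instead of starting a crawl, which is a cheap way to check them before committing to a long download.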

gabefair commented 7 years ago

wget -N -m /*

wantonwonton commented 7 years ago

Some notes on wget options:

-r/--recursive - Turns on recursive retrieval. Note that the maximum depth defaults to 5 levels!

-l/--level - Specifies how deep the recursion should go. You can specify "inf" for infinite recursion.

-N/--timestamping - For files that were previously downloaded, fetches them again only if the remote copy is newer than the local one or the file sizes differ.

-m/--mirror - Equivalent to -r -N -l inf --no-remove-listing. (The last option keeps .listing files, which contain the raw directory listings from the FTP server.)

-c/--continue - Treats each previously downloaded file as possibly incomplete and requests any data past the end of the local file (if the server supports it). This is good for resuming an interrupted download of a single large file, as long as the file has not changed. For files that have changed, unless the changes were only appended to the end, this option could result in a corrupted file (the first half from the previous download combined with a second half that doesn't match it).
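To make the notes above concrete, here is a rough sketch that spells out --mirror option by option and shows a separate resume command for a single large file. The example.org URLs are placeholders I made up, and the commands are only printed rather than run, so nothing gets fetched by accident.

```shell
# Hypothetical URLs, used only for illustration.
site_url="https://example.org/data/"
big_file_url="https://example.org/data/archive.tar.gz"

# --mirror written out in full: recursive, timestamp-checked,
# unlimited depth, and keeping the FTP .listing files.
mirror_cmd="wget --recursive --timestamping --level=inf --no-remove-listing $site_url"

# Resuming one interrupted download. Only safe if the remote file
# has not changed since the first attempt.
resume_cmd="wget --continue $big_file_url"

# Print the commands instead of executing them.
echo "$mirror_cmd"
echo "$resume_cmd"
```

Keeping the resume case as its own command, instead of adding -c to the mirror run, avoids the mismatched-halves problem for files that changed on the server between runs.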

nickrsan commented 7 years ago

Agreed with mxplusb - we'll definitely act on a Pull Request that improves any of the documentation, including one on specific commands to run and tools to use. It'd probably be best to make a new markdown file and link to it from a table of contents in the main readme, but whatever you submit that improves it is welcome.