Open nickrsan opened 7 years ago
Some of the files seem to link to http://pds-geosciences.wustl.edu --looks like this could be scraped using wget fairly trivially
Just saw the initiative, don't have much time to help, but thought I'd at least try to crawl something that isn't claimed yet. I've started two parallel crawls:
(1) wget --mirror --page-requisites --convert-links -e robots=off -H -Dpds.nasa.gov http://pds.nasa.gov/
(2) wget --mirror --page-requisites --convert-links -e robots=off -H -Dpds-geosciences.wustl.edu http://pds-geosciences.wustl.edu/
Let's see if 150 GB of storage is enough.
Suggestions for better commands welcome.
Remarks: NASA is okay but wustl.edu is at risk? Y'might want to go easier on those hosts by setting appropriate "--wait=" and "--random-wait" delays. If you don't, there's a chance of missing some files. Also, we're trying to HELP these sites, not DoS them.
See the "Best Practices" page of the wiki for the Azimuth Backup Project for more tips.
On Wed, Jan 18, 2017, at 07:52, Kai Engert wrote:
Just saw the initiative, don't have much time to help, but thought I'd at least try to crawl something that isn't claimed yet. I've started two parallel crawls:
(1) wget --mirror --page-requisites --convert-links -e robots=off -H -Dpds.nasa.gov http://pds.nasa.gov/
(2) wget --mirror --page-requisites --convert-links -e robots=off -H -Dpds-geosciences.wustl.edu http://pds-geosciences.wustl.edu/
Let's see if 150 GB storage are enough?
Suggestions for better commands welcome.
I couldn't find the "best practices" page you mentioned. I've added --wait=1 --random-wait
Do you think it's unnecessary to mirror http://pds-geosciences.wustl.edu/ ? If yes, I'll stop that.
Note that pds-geo*.wustl.edu seems big, but fast. I've already downloaded 1650 files, 2.5 GB, from that host, so restarting with the wait option will cause a delay before the download continues.
pds.nasa is much slower. In the same time, it downloaded just 560 files, 36 MB.
I'd use "--wait=5" at least.
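The throttling advice above can be sketched as a small helper that assembles the crawl command with the delay flags in place. This is only an illustration: `polite_mirror_cmd` is a hypothetical name, the 5 second default comes from the suggestion in this thread, and every wget flag is one already quoted here. The helper prints the command for review and downloads nothing.

```shell
#!/bin/sh
# polite_mirror_cmd is a hypothetical helper, not part of wget.
# It prints a mirror command for one domain with --wait and --random-wait set.
polite_mirror_cmd() {
  domain="$1"
  wait_secs="${2:-5}"   # default of 5 seconds, as suggested in this thread
  printf '%s ' wget --mirror --page-requisites --convert-links \
    -e robots=off -H "-D${domain}" \
    "--wait=${wait_secs}" --random-wait \
    "http://${domain}/"
  printf '\n'
}

# Print the command only; nothing is fetched at this point.
polite_mirror_cmd pds.nasa.gov
```

With --random-wait, wget varies the pause between requests around the --wait value instead of using a fixed interval.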
On Wed, Jan 18, 2017, at 08:23, Kai Engert wrote:
I couldn't find the "best practices" page you mentioned.
I've added --wait=1 --random-wait
I can't speak to "necessary", and it's just my opinion, but I think the effort would be better used on something else.
On Wed, Jan 18, 2017, at 08:24, Kai Engert wrote:
Do you think it's unnecessary to mirror http://pds-geosciences.wustl.edu/ ? If yes, I'll stop that.
Due to a mishap which was entirely my fault, we had a copy of this and then it was inadvertently deleted.
Is there anyone out there with a fat pipe and a fast server whom I can ask to try downloading it again?
I have an attempted mirror, with the following sizes:
$ du -sm *
226 atmos.pds.nasa.gov
1740 geo.pds.nasa.gov
130 img.pds.nasa.gov
6 mgmt.pds.nasa.gov
1 mirror.sh
1 naif.pds.nasa.gov
405 pds.nasa.gov
332 ppi.pds.nasa.gov
1 rings.pds.nasa.gov
100192 sbn.pds.nasa.gov
Unfortunately I'm having issues with wget. On a server I was using in a datacenter, it always crashed very early, and always at the same step, around 36 GB into the download.
Now, on my local Fedora 25 system with the most recent wget, it progressed further, but it still aborted with: wget: memory exhausted
Also, wget reports "Last-modified header missing -- time-stamps turned off." for many pages on that server, but maybe it's limited to the index pages.
Is the apparently unstable wget really our best choice for creating mirrors?
Would you like me to upload what I have? Or, if you can wait, I'll restart and let it update the existing local mirror.
(As advised earlier, I stopped the download of the wustl.edu page.)
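One thing that might be worth trying on a restart is a separate wget run per subdomain instead of a single crawl spanning all of them. This is an untested idea, not a confirmed fix for the "memory exhausted" abort. The sketch below only prints the commands; `per_host_cmds` is a hypothetical name, the host list is copied from the du output above, and the https scheme and the --wait=5 value are assumptions.

```shell
#!/bin/sh
# Subdomains taken from the du listing in this thread.
PDS_HOSTS="atmos geo img mgmt naif ppi rings sbn"

# per_host_cmds is a hypothetical helper: it prints one wget command per
# subdomain. Without -H, each run stays on its own host.
per_host_cmds() {
  for h in $PDS_HOSTS; do
    echo "wget --mirror --page-requisites --convert-links --wait=5 --random-wait https://${h}.pds.nasa.gov/"
  done
}

# Print the commands for review; nothing is downloaded here.
per_host_cmds
```

Because --mirror turns on timestamping, rerunning a command in the same directory checks files that are already present rather than starting from an empty tree, although pages without a Last-Modified header will still be fetched again.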
Just something I discovered and am trying out: httrack is open source and available for Linux. It can be installed with yum install httrack (presumably with sudo privileges). I am trying it on a .gov site, using, for example:
httrack "https://podaac.jpl.nasa.gov" -O . --mirror --depth=8 --ext-depth=3 --max-rate=100000000 %c500 --sockets=30 \
--retries=30 --host-control=0 TN 60 --near --robots=0 %s
$ wget --mirror --page-requisites --convert-links -e robots=off --span-hosts -Dpds.nasa.gov --warc-cdx --warc-file="pds.nasa" --continue --no-verbose --backup-converted --adjust-extension https://pds.nasa.gov/
...
Cannot write to ‘sbn.pds.nasa.gov/holdings/ro-c-osinac-2-esc2-67pchuryumov-m15-v1.0/DOWNLOAD/roosi_1109.tgz’ (No space left on device).
FINISHED --2017-01-22 01:58:04--
Total wall clock time: 2h 26m 18s
Downloaded: 16819 files, 12G in 49m 36s (3.98 MB/s)
$ zpool list
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
zroot 926G 267G 659G - 25% 28% 1.00x ONLINE -
???
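zpool list reports free space for the whole pool, while the write error applies to the filesystem that holds the target directory. A quick check of that filesystem before and during a crawl might help narrow this down. The sketch below is only an illustration: `free_mb` and the `need_mb` threshold are hypothetical, and a dataset quota is just one possible explanation for the mismatch, not something confirmed in this thread.

```shell
#!/bin/sh
# free_mb is a hypothetical helper: free space, in MiB, on the filesystem
# that holds the given directory (default: the current directory).
free_mb() {
  # -P selects the portable one-line-per-filesystem format, -k reports KiB.
  df -Pk "${1:-.}" | awk 'NR==2 { print int($4 / 1024) }'
}

# need_mb is an assumed threshold; choose one that fits the data set.
need_mb=1024
if [ "$(free_mb .)" -lt "$need_mb" ]; then
  echo "less than ${need_mb} MiB free in $(pwd)" >&2
else
  echo "at least ${need_mb} MiB free in $(pwd)"
fi

# On ZFS, per-dataset limits can be inspected with, for example:
#   zfs list -o name,used,avail,quota
```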
It looks like at least some of the site is mirrored: https://mirrors.asun.co/climate-mirror/pds/
Also, ftp://pds-geosciences.wustl.edu/ seems to hold at least 12.3 TiB of data. That is how far FileZilla has counted after running for the past 10 hours, so there's a ton of data in there.
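If a directory listing from the FTP server has been saved to a file, the total can also be estimated offline rather than by waiting on a GUI client. The sketch below assumes Unix "ls -l" style lines, which not every FTP server produces. `sum_listing_bytes` is a hypothetical helper and the sample listing is made up for illustration.

```shell
#!/bin/sh
# sum_listing_bytes is a hypothetical helper. It assumes "ls -l" style
# lines, where field 1 is the mode string and field 5 is the size in bytes.
# Only regular files (mode starting with "-") are counted.
sum_listing_bytes() {
  awk '$1 ~ /^-/ { total += $5 } END { printf "%.0f\n", total }'
}

# Made-up sample listing, for illustration only.
sum_listing_bytes <<'EOF'
drwxr-xr-x  2 ftp ftp     4096 Jan 18  2017 data
-rw-r--r--  1 ftp ftp  1048576 Jan 18  2017 a.img
-rw-r--r--  1 ftp ftp  2097152 Jan 18  2017 b.img
EOF
```

When wget mirrors an FTP site with --mirror, it keeps the per-directory .listing files, which could be fed to a helper like this if they are in the assumed format.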
And I just found that http://www.jpl.nasa.gov/copyrights.php states that reproducing needs permission.
Given that there apparently is a mirror already and my attempts haven't been fruitful, I'll stop my work on this ticket, and have already deleted my local copy.
Please don't delete anything - it may be useful if my server goes down or my mirror is garbage, etc.
No doubt. That's why it's important to grab.
On Sun, Jan 22, 2017, at 10:14, Andrew Sun wrote:
Also, on ftp://pds-geosciences.wustl.edu/, there seems to be at least 12.3 TiB of data, and I left FileZilla counting for the past 10 hours. So there's a ton of data in there.
I have restored a large part of what Jan accidentally deleted earlier. Work continues.
Name: Planetary Data System
Organization: NASA
Description:
URL: https://pds.nasa.gov/
Download URL:
File Types:
Size:
Status: