climate-mirror / datasets

For tracking data mirroring progress

Planetary Data System #12

Open nickrsan opened 7 years ago

nickrsan commented 7 years ago

Name: Planetary Data System
Organization: NASA
Description:
URL: https://pds.nasa.gov/
Download URL:
File Types:
Size:
Status:

blueacid commented 7 years ago

Some of the files seem to link to http://pds-geosciences.wustl.edu. It looks like this could be scraped fairly trivially using wget.

kaie commented 7 years ago

Just saw the initiative, don't have much time to help, but thought I'd at least try to crawl something that isn't claimed yet. I've started two parallel crawls:

(1) wget --mirror --page-requisites --convert-links -e robots=off -H -Dpds.nasa.gov http://pds.nasa.gov/

(2) wget --mirror --page-requisites --convert-links -e robots=off -H -Dpds-geosciences.wustl.edu http://pds-geosciences.wustl.edu/

Let's see if 150 GB storage are enough?

Suggestions for better commands welcome.

ghost commented 7 years ago

Remarks: NASA is okay but wustl.edu is at risk? Y'might want to go easier on those hosts by setting appropriate "--wait=" and "--random-wait" delays. If you don't, there's a chance of missing some files. Also, we're trying to HELP these sites, not DoS them.

See the "Best Practices" page of Wiki for The Azimuth Backup Project for more tips.
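As a rough sketch of the throttling suggested above, the crawl command from this thread can be extended with wget's `--wait` and `--random-wait` options. The helper name `polite_mirror_cmd` and the delay of 5 seconds are illustrative assumptions, and the function only prints the command rather than starting a crawl.

```shell
#!/bin/sh
# Sketch only: prints a throttled wget mirror command instead of running it.
# polite_mirror_cmd is a hypothetical helper name, not part of wget.
polite_mirror_cmd() {
    domain=$1   # host to mirror, e.g. pds.nasa.gov
    wait_s=$2   # base delay in seconds between requests (assumed value)
    # --wait sets the base delay between retrievals;
    # --random-wait varies each pause between 0.5x and 1.5x of that value.
    printf 'wget --mirror --page-requisites --convert-links -e robots=off -H -D%s --wait=%s --random-wait http://%s/\n' \
        "$domain" "$wait_s" "$domain"
}

polite_mirror_cmd pds.nasa.gov 5
```

Printing first makes it easy to review the flags before committing to a crawl that may run for days.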

On Wed, Jan 18, 2017, at 07:52, Kai Engert wrote:

Just saw the initiative, don't have much time to help, but thought I'd at least try to crawl something that isn't claimed yet. I've started two parallel crawls:

(1) wget --mirror --page-requisites --convert-links -e robots=off -H -Dpds.nasa.gov http://pds.nasa.gov/

(2) wget --mirror --page-requisites --convert-links -e robots=off -H -Dpds-geosciences.wustl.edu http://pds-geosciences.wustl.edu/

Let's see if 150 GB storage are enough?

Suggestions for better commands welcome.

kaie commented 7 years ago

I couldn't find the "best practices" page you mentioned. I've added --wait=1 --random-wait

kaie commented 7 years ago

Do you think it's unnecessary to mirror http://pds-geosciences.wustl.edu/ ? If yes, I'll stop that.

kaie commented 7 years ago

Note that pds-geo*.wustl.edu seems big, but fast. I've already downloaded 1650 files, 2.5 GB, from that host. So restarting with the wait options will cause a delay before the download continues.

pds.nasa.gov is much slower. In the same time, it had downloaded just 36 MB, 560 files.

ghost commented 7 years ago

I'd use "--wait=5" at least.
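For a sense of what that costs, here is a back-of-the-envelope sketch: one pause per file, using the file counts reported earlier in this thread. It assumes the pauses simply add up and ignores transfer time. With `--random-wait` each pause varies around the base value, so the total should average out to roughly the same. The helper name `added_delay` is made up for illustration.

```shell
#!/bin/sh
# Rough estimate of the extra time --wait adds: files x seconds per pause.
# added_delay is a hypothetical helper name.
added_delay() {
    files=$1
    wait_s=$2
    total=$((files * wait_s))
    printf '%dh %dm %ds\n' $((total / 3600)) $((total % 3600 / 60)) $((total % 60))
}

added_delay 1650 5   # files fetched so far from pds-geosciences.wustl.edu -> 2h 17m 30s
added_delay 560 5    # files fetched so far from pds.nasa.gov -> 0h 46m 40s
```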

On Wed, Jan 18, 2017, at 08:23, Kai Engert wrote:

I couldn't find the "best practices" page you mentioned.

I've added --wait=1 --random-wait

ghost commented 7 years ago

I can't speak to "necessary", and it's just my opinion, but I think the effort would be better used on something else.

On Wed, Jan 18, 2017, at 08:24, Kai Engert wrote:

Do you think it's unnecessary to mirror http://pds-geosciences.wustl.edu/ ? If yes, I'll stop that.

ghost commented 7 years ago

Due to a mishap which was entirely my fault, we had a copy of this and then it was inadvertently deleted.

Is there anyone out there who has a fat pipe and a fast server that I can appeal to to try to download it again?

kaie commented 7 years ago

I have an attempted mirror, with the following sizes:
$ du -sm *
226 atmos.pds.nasa.gov
1740 geo.pds.nasa.gov
130 img.pds.nasa.gov
6 mgmt.pds.nasa.gov
1 mirror.sh
1 naif.pds.nasa.gov
405 pds.nasa.gov
332 ppi.pds.nasa.gov
1 rings.pds.nasa.gov
100192 sbn.pds.nasa.gov

Unfortunately I'm having issues with wget. On a server I was using in a datacenter, it always crashed early, at the same step, around 36 GB into the download.

Now, on my local Fedora 25 system with the most recent wget, it progressed further, but it still aborted with: wget: memory exhausted

Also, for many pages on that server wget reports "Last-modified header missing -- time-stamps turned off.", but maybe it's limited to the index pages.

Is the apparently unstable wget really our best choice for creating mirrors?

Would you like me to upload what I have? Or, if you can wait, I'll restart, and let it update the already local mirror.

(As advised earlier, I stopped the download of the wustl.edu page.)
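To put the per-host sizes in the listing above into one number, the first column can be summed with awk. This is only a sketch: `sum_mb` is a hypothetical helper, and it assumes input in the `du -sm` format shown, where each line starts with a size in units of 1024*1024 bytes.

```shell
#!/bin/sh
# Sum the first column of `du -sm` style output and also show the total in GiB.
# sum_mb is a hypothetical helper name.
sum_mb() {
    awk '{ total += $1 } END { printf "%d MB (%.1f GiB)\n", total, total / 1024 }'
}

# The listing from the comment above, fed in as a here-document.
sum_mb <<'EOF'
226 atmos.pds.nasa.gov
1740 geo.pds.nasa.gov
130 img.pds.nasa.gov
6 mgmt.pds.nasa.gov
1 mirror.sh
1 naif.pds.nasa.gov
405 pds.nasa.gov
332 ppi.pds.nasa.gov
1 rings.pds.nasa.gov
100192 sbn.pds.nasa.gov
EOF
```

For this listing it prints `103034 MB (100.6 GiB)`, almost all of it from sbn.pds.nasa.gov. On a live mirror the same helper can be used as `du -sm * | sum_mb`.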

ghost commented 7 years ago

Just something I discovered and am trying out: httrack is now available as open source for Linux, installable with yum install httrack (presumably with sudo privs). I am trying it on a .gov site, using, for example:

httrack "https://podaac.jpl.nasa.gov" -O . --mirror --depth=8 --ext-depth=3 --max-rate=100000000 %c500 --sockets=30 \
        --retries=30 --host-control=0 TN 60 --near --robots=0 %s
as-com commented 7 years ago
$ wget --mirror --page-requisites --convert-links -e robots=off --span-hosts -Dpds.nasa.gov --warc-cdx --warc-file="pds.nasa" --continue --no-verbose --backup-converted --adjust-extension https://pds.nasa.gov/
...
Cannot write to ‘sbn.pds.nasa.gov/holdings/ro-c-osinac-2-esc2-67pchuryumov-m15-v1.0/DOWNLOAD/roosi_1109.tgz’ (No space left on device).
FINISHED --2017-01-22 01:58:04--
Total wall clock time: 2h 26m 18s
Downloaded: 16819 files, 12G in 49m 36s (3.98 MB/s)
$ zpool list
NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
zroot   926G   267G   659G         -    25%    28%  1.00x  ONLINE  -

???
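One thing that might be worth checking here, purely as an assumption since nothing in this thread confirms the cause: `zpool list` shows pool-wide free space, which is not necessarily what is writable in the directory wget was using (a quota or a different mount could apply). A small sketch that asks `df` about the target directory itself, with `avail_kb` as a made-up helper name:

```shell
#!/bin/sh
# Report free space on the filesystem that actually holds a given directory.
# avail_kb is a hypothetical helper; `df -Pk` is the POSIX portable format,
# one line per filesystem with sizes in 1024-byte units.
avail_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

echo "available in current directory: $(avail_kb .) KiB"
```

Running it from inside the mirror directory shows what the filesystem itself reports as available there.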

as-com commented 7 years ago

It looks like at least some of the site is mirrored: https://mirrors.asun.co/climate-mirror/pds/

as-com commented 7 years ago

Also, on ftp://pds-geosciences.wustl.edu/, there seems to be at least 12.3 TiB of data, and I left FileZilla counting for the past 10 hours. So there's a ton of data in there.

kaie commented 7 years ago

And I just found that http://www.jpl.nasa.gov/copyrights.php states that reproducing needs permission.

Given that there apparently is a mirror already and my attempts haven't been fruitful, I'll stop my work on this ticket, and have already deleted my local copy.

as-com commented 7 years ago

Please don't delete anything - it may be useful if my server goes down or my mirror is garbage, etc.

ghost commented 7 years ago

No doubt. That's why it's important to grab.

On Sun, Jan 22, 2017, at 10:14, Andrew Sun wrote:

Also, on ftp://pds-geosciences.wustl.edu/, there seems to be at least 12.3 TiB of data, and I left FileZilla counting for the past 10 hours. So there's a ton of data in there.
sakaal commented 7 years ago

I have restored a large part of what Jan accidentally deleted earlier. Work continues.