climate-mirror / datasets

For tracking data mirroring progress

NOAA GIBBS #358

Open nickrsan opened 7 years ago

nickrsan commented 7 years ago

New Dataset

Please fill out the New Issue form so we can easily organise the issues and put a priority on certain data. The title should reflect the dataset you want us to save. In the body, please include the following information.

I received an email with this information as a suggestion. Some of the facts are a little off, but it may be worth looking into if an archive exists somewhere of this data.

It has come to my attention that NOAA's NCEI (National Centers for Environmental Information) GIBBS archive is being taken offline on March 31. I have been unable to find mirrors of this data from other sources.

GIBBS is a public browsing system of image data collected by the International Satellite Cloud Climatology Project. The archive contains global geostationary satellite data, from nearly every location on the Earth at three-hour intervals from 1980 to present, roughly 0.6 terabytes of images.

A message on the website urges users to send them an email about how they use GIBBS, so that they can create a suitable substitute, but knowing how these things often go, I have a lot of doubt that we'll ever see it online again. Better safe than sorry. Additionally, updates to GIBBS stopped only two months after the Presidential inauguration.

rustyguts commented 7 years ago

Any recommended wget commands for grabbing this site? @nickrsan
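
Something like this is what I had in mind as a starting point (untested against this site, and the rate limits are guesses; GIBBS serves a huge number of script-generated HTML pages, so a blind mirror may spend most of its time on those rather than on the images):

```
# recursive mirror, staying inside /gibbs/ and being gentle with the server
wget --mirror --no-parent --wait=1 --random-wait \
     https://www.ncdc.noaa.gov/gibbs/
```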

JeremiahCurtis commented 7 years ago

Wonder if httrack would work here

On a side note, given this dataset, I wonder how much other NCEI data is not incorporated into https://www.ncei.noaa.gov/data/ ?

rustyguts commented 7 years ago

httrack came back with about 100 MB.

kimmerin commented 7 years ago

Hi,

I started a WinHTTrack session and it seems to work. After 3 hours it has saved 13,000 files (just the calendar overview pages) and about 850 MB of data.

I'll keep it running until it fails or my provider starts complaining. The problem is that I'm on an asymmetric ADSL line, so I won't be able to set up a server here that can be reached. I think that's what you're calling a "private mirror" ;-)

JeremiahCurtis commented 7 years ago

Fascinating answer (?) to the question I posed earlier regarding our coverage of NCEI data:

from https://www.nodc.noaa.gov/oceanacidification/stewardship/data_portal_help.html

"National Centers for Environmental Information (NCEI) Oceans NOAA's National Centers for Environmental Information (NCEI) hosts and provides public access to one of the most significant archives for environmental data on Earth. Through the Center for Weather and Climate and the Center for Coasts, Oceans, and Geophysics, we provide over 25 petabytes of comprehensive atmospheric, coastal, oceanic, and geophysical data."

Also, and I hope I'm freaking out for nothing, but on this same page I'm getting the following: "This National Centers for Environmental Information (NCEI) Oceans web site is currently unavailable. We apologize for any inconvenience." I can't seem to reach ftp://ftp.nodc.noaa.gov/pub/ either; I was attempting to grab some of the symlinks in that directory and finish mirroring the archive directory.

kimmerin commented 7 years ago

BTW:

Additionally, updates to GIBBS stopped only two months after the Presidential inauguration.

Make that five days. The inauguration was on Jan 20, 2017; the last image was uploaded on Jan 25, 2017.

kimmerin commented 7 years ago

Short status update: WinHTTrack downloads the HTML pages sequentially and doesn't start on the images until all of the HTML pages are done. With a throughput of a bit over one page per second and roughly 1.6 million pages to download, that would never finish before the deadline.

The good thing is that the HTML pages are generated generically by a script, so there is no real need to download them; you just need to know which images are available, and that information is in the availability pages (the files under /gibbs/availability). The rest can be regenerated dynamically with a homebrewed script.

So I switched tactics and wrote my own Java programs that download the availability files, extract the image links into a file, and then download those images in parallel. At the moment I have downloaded about 691,000 images, with 865,000 still to go.

If you do the math and compare that sum with the reported number of images hosted by the site, you might wonder about the difference of nearly 20,000 images. By the end of this download that number will rise to more than 100,000. The organization of these images on the server seems to be less than optimal. My theory is that a URL like https://www.ncdc.noaa.gov/gibbs/image/GRD-1/WV/2016-12-31-21 is served from a filesystem path that looks something like /some/path/image/GRD-1/WV/2016-12-31-21, and this example is already one of the problematic ones: there are about 200,000 images under the GRD-1 path, and on most filesystems access times degrade significantly as the number of files in a directory grows. My personal record was 60 seconds just to check for the existence of a file (without accessing its content) in a directory containing a million files.

Attempts to download images from the GRD-1 path lead to more and more HTTP 503 errors that slow everything down, so I excluded them from the main download process. This increased the throughput from 50,000 images per day to now about 200,000 images per day.

I will start a parallel download from another system just for the GRD-1 files. At one image per second this should be finished within a day, and there is plenty of time before the deadline in case the throughput drops below that.
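
For the curious, the download step boils down to something like the sketch below (heavily simplified compared to my actual programs; the file names, thread count and retry limit here are just placeholders):

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.*;
import java.util.List;
import java.util.concurrent.*;

public class GibbsImageFetcher {

    // Try an image URL several times, sleeping between attempts,
    // because the server intermittently answers with HTTP 503.
    static void fetch(String url, Path outDir, int maxAttempts) throws Exception {
        Path target = outDir.resolve(new URL(url).getPath().substring(1));
        Files.createDirectories(target.getParent());
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
            if (con.getResponseCode() == HttpURLConnection.HTTP_OK) {
                try (InputStream in = con.getInputStream()) {
                    Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
                }
                return;
            }
            con.disconnect();
            Thread.sleep(1000); // one second between attempts
        }
        System.err.println("giving up on " + url);
    }

    public static void main(String[] args) throws Exception {
        // image-urls.txt: one image URL per line, extracted from the availability pages
        List<String> urls = Files.readAllLines(Paths.get("image-urls.txt"));
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (String url : urls) {
            pool.submit(() -> {
                try {
                    fetch(url, Paths.get("mirror"), 20);
                } catch (Exception e) {
                    System.err.println("failed: " + url + " (" + e.getMessage() + ")");
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(7, TimeUnit.DAYS);
    }
}
```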

JeremiahCurtis commented 7 years ago

@kimmerin As of 3/20/2017, NOAA reports 1,573,397 images. Are the GRD-1 images the only ones not entirely available from the main directory, or are there others? Would be glad to help, but this one appears to be a labyrinth...

Also, the /availability URL appears not to be working... maybe it's just me

kimmerin commented 7 years ago

@JeremiahCurtis The files are available per se. The problem is that there seems to be some kind of timeout on how long the server is willing to wait for a resource to become available, and it sends an HTTP 503 error if that takes too long. So you have to make multiple attempts to get the actual file. This happens across all satellite images (at the moment MET-7 images also lead to multiple 503 errors before succeeding), but GRD-1 images often exceeded the maximum number of attempts I programmed (20). With a one-second wait between attempts, that really slowed down the whole download process.

The GRD-1 images are "only" the result of stitching the individual satellite images together into one image covering the whole world. So even if we're unsuccessful in downloading the GRD-1 images in time, it would be possible to do the stitching ourselves from the images that have been downloaded.

Also, the /availability url appears not to be working...maybe it's just me

It wasn't a complete URL; a complete one would be e.g. https://www.ncdc.noaa.gov/gibbs/availability/2017-03-17.html. The URL scheme is quite straightforward, so it was easy to create a program that generates the URLs by formatting the corresponding dates.
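
For illustration, generating those availability URLs is little more than formatting a sequence of dates; a quick sketch (the start date is just a rough guess at where the archive begins):

```java
import java.time.LocalDate;

public class AvailabilityUrls {
    public static void main(String[] args) {
        // One availability page per day; the file name is simply the ISO date.
        LocalDate day = LocalDate.of(1980, 1, 1);
        LocalDate end = LocalDate.of(2017, 3, 31);
        while (!day.isAfter(end)) {
            System.out.println("https://www.ncdc.noaa.gov/gibbs/availability/" + day + ".html");
            day = day.plusDays(1);
        }
    }
}
```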

For what it's worth, here is the list of GRD-1 images that are excluded from my current download: grdfiles_offset.zip

wc -l grdfiles_offset.txt
103328 grdfiles_offset.txt

pelerin commented 7 years ago

@kimmerin As of tonight looks like GIBBS is offline. Hope this is temporary, or that you got all the images!!

kimmerin commented 7 years ago

I hope it's temporary as well since I haven't been able to finish the download, yet. About 400,000 images are still missing.

JeremiahCurtis commented 7 years ago

Still not working... wonder if we could contact NOAA about this issue

sa7mon commented 7 years ago

Good news!

Information

The GIBBS web service will not end March 31, 2017.

Based on feedback from users, we will continue to support the GIBBS service. However, we apologize for any gaps in service that may occur (likely to occur in early April) as we strive to maintain the same level of service.

entr0p1 commented 7 years ago

I've found that all of the sat images I've looked at so far have the id "satImage" on their image tags in the HTML code. I'm working on a crawler to grab the images at the moment.
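
For what it's worth, here's a rough sketch of that extraction step, assuming the id really is "satImage" on every page (a real crawler should handle either attribute order, or use a proper HTML parser):

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SatImageExtractor {

    // Pull the src attribute of the element with id="satImage" out of a fetched page.
    // Assumes the id appears before src in the tag; returns null if nothing matches.
    static String extractImageUrl(String pageHtml, String pageUrl) {
        Pattern p = Pattern.compile("id=\"satImage\"[^>]*src=\"([^\"]+)\"");
        Matcher m = p.matcher(pageHtml);
        if (!m.find()) return null;
        // Resolve a relative src against the page's own URL.
        return URI.create(pageUrl).resolve(m.group(1)).toString();
    }
}
```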

entr0p1 commented 7 years ago

Update: the script is coming together and is almost finished, albeit a little hacked together. I've got it scraping links from the site at the moment and will publish it once it's ready for everyone to use.

Given the time it takes to scrape the links (well over 12 hours now), I've coded in a "Seed File" function which lets the script use a file that has all of the links to every satellite image pre-defined. The script can either use a seed file to save the massive scrape time, or generate its own data and work directly from the site. At each "layer" of the GIBBS website (home page/years, months/days, image selection, image display), named Layer 1-3 and SatImg for the final layer (the page displaying the image), the script writes out a seed file for that layer. So, for example, if you manage to get the second layer of links downloaded and something bombs out midway through Layer 3 for whatever reason, you don't have to wait another 12 hours and can pick up from the start of Layer 3. I'll explain this better (hopefully) in the docs, and I'll also be uploading my seed files as part of the GitHub repository for the script.

I've also factored in the 503 errors by simply making it retry things a few times and write out the failed ones to a file for manual download using another tool I have (already written). This will all be documented and uploaded soon, stay tuned guys.

entr0p1 commented 7 years ago

Script is finished, for anyone brave enough to try it out (it's working for me at least), here it is: https://github.com/dojobel/cm_gibbs

Edit: I'm downloading the sat images now.

kimmerin commented 7 years ago

Hi, sorry for not writing (I've been on business travel so wasn't able to). I'm also downloading again. I created a similar file (I haven't looked at dojobel's scripts, but I assume they do more or less the same as my Java programs). My complete image-URL file is about 100 MB.

My download program tells me that there are about 50,000 files still to go. I throttled the download process a bit since there is no longer a deadline looming. At about one download per second on average, the whole thing should be finished tomorrow.

hawken93 commented 7 years ago

I'm currently writing a PHP script that parses the whole site into SQLite. In an hour or so I will have meta-information about each image; from there I plan to generate URL lists to feed to wget for the actual images.

hawken93 commented 7 years ago

Correction: that will probably take a few more hours. Each "availability" page takes a few seconds to dump into SQL, and I don't want to try to speed that up because then I would have to write more error-tolerant code.. :P

hawken93 commented 7 years ago

Finished indexing most of the metadata into SQLite. "gibbs.db" is now a database that contains information about year, day and image (corresponding to 3 levels of meta-info). The file is around 80 MB. From the image table one can generate the URL and file path for each image. I'm sitting on 63 GB of pictures at the moment, but I think this will increase roughly tenfold. At the current speed, the download should complete in around 80 hours.

Currently missing: the long name of each satellite, as well as the µm values and long names of the channels. I think that metadata gives enough dynamic information to recreate the site.

From here one could make cron-driven scripts to automatically keep the mirror up to date..

If you already have some of the images, I could avoid taking them from this server if I could get them from you faster :)

https://pic.thehawken.org/ul/Gvky9F.png

kimmerin commented 7 years ago

Hi, I already finished downloading the images (1,574,205 images, 567 GB of data). Since the site continues to operate, there should be new images available by now (and the most recent days get updated with additional images that aren't available immediately).

With an upstream bandwidth of 5 MBit/s, I'm not sure how to upload this data somewhere else in a reasonable time, though.

hawken93 commented 7 years ago

Ready to release my own fancy tool here:

https://github.com/hawken93/noaa-gibbs-mirror

It's not perfect, but when you run it, it generates lists of missing files, and when you rerun it, it will update and make a list of the now-missing files.

I discovered 4 files that are inconsistent:

https://www.ncdc.noaa.gov/gibbs/image/GOES-12/IR/2004-05-30-18
https://www.ncdc.noaa.gov/gibbs/image/GOES-12/WV/2004-05-30-18
https://www.ncdc.noaa.gov/gibbs/image/GOES-12/IR/2004-06-01-09
https://www.ncdc.noaa.gov/gibbs/image/GOES-12/WV/2004-06-01-09

They exist on the "availability" page but not when you click them. Luckily they exist as "GOE-12".

Making a torrent when I can get around to it..

entr0p1 commented 6 years ago

@hawken93 just trying out your tool and it looks really neat! I've hit a snag though: when executing it with php mirror.php, I get a bunch of nicely flowing, positive-looking output, but then this happens and it exits:

PHP Warning:  rename(nifi/.links104.json,nifi/links104.json): No such file or directory in /mnt/ProjectArk/temp/noaa-gibbs-mirror/inc/nifi.php on line 24
nifi/links104.json

Any ideas on what I can do to fix that? PHP isn't really my forte.. The other files seem to rename OK and the destination file does exist:

# ls -lah nifi/ | grep links104
-rw-r--r--. 1 root root 765K Aug 13 01:05 links104.json

I've tried renaming the nifi folder, but that doesn't seem to fix it.

hawken93 commented 6 years ago

@dojobel firstly, it really makes me happy to see others use my code :D

Oh yes, I just tested my code again. It looks like more recent versions of PHP have changed their garbage collection. I'll find a smarter way than relying on garbage collection, and I believe that will fix it.

hawken93 commented 6 years ago

I believe my commit should sort that out :)

hawken93 commented 6 years ago

520 GB of pictures. I compressed them, so they are in .tar.xz files; the sheer number of individual files made the torrent way too huge otherwise.

magnet:?xt=urn:btih:4712b6b2d69e06a4c0a2f7974da3e170dd236d79

entr0p1 commented 6 years ago

@hawken93 that fixed it, thanks mate! Nice solid code btw; I've stopped using my own util in favour of this one for all the smarts that are coded in. I've run mirror.php and it goes to completion, but it doesn't seem to download anything. Do I need to do something special, like setting a flag somewhere, to make it download the images?

hawken93 commented 6 years ago

@dojobel The HTTP server returned lots of errors while downloading the actual images, so the script just builds a giant list of URLs you need to download separately. The images should be downloaded into the correct folder (www.ncdc.noaa.gov/path). When images are downloaded and mirror.php is rerun, the URL list will shrink.

If you want to get going fast, you can use urls.txt with wget and try to make that work, but you also need to think about how you want to remove the files that failed to download properly. If you don't use nifi, then most likely urls.txt will be all you need.
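
For anyone taking the quick-and-dirty route, something like this might do for a first pass (untested; you'd still want a follow-up check that deletes truncated or corrupt files before rerunning mirror.php):

```
# read the URL list and recreate the www.ncdc.noaa.gov/... directory layout
wget -i urls.txt --force-directories --no-clobber --tries=3 --wait=1
```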

hawken93 commented 6 years ago

In summary: my code does not download the actual data. The reason is that the web server was pretty sketchy, and there probably needs to be another coding effort to build something that actually downloads and verifies all the files (and deletes them if they are corrupt). It should probably be threaded, because that will be faster, and it should tune the number of threads to keep the ratio of failed requests fairly low.

I also posted a magnet link to ~520 GB of data, which means that you don't necessarily need to get all the data from their website.

hawken93 commented 6 years ago

Oh, one more comment about nifi. When mirror.php is rerun and finds new images (you can run it as a daily job), it should produce a JSON file that e.g. nifi consumes. Nifi can then automatically download and store them. If this is left running, a mirror will be maintained and updated daily :)
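
A minimal cron entry for that kind of daily run might look like this (paths are placeholders, adjust to wherever the mirror lives):

```
# rerun the mirror script every night at 03:00 so new images get picked up
0 3 * * * cd /data/noaa-gibbs-mirror && php mirror.php >> mirror.log 2>&1
```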