dshean / demcoreg

Utilities for DEM and point cloud co-registration
MIT License

Create a `demcoreg_init.sh` wrapper for all of the `get_*.sh` scripts #17

Open dshean opened 4 years ago

dshean commented 4 years ago

Ideally, we would fetch all of these layers (e.g., NLCD, bareground) on the fly through services like "Earth on AWS" registry: https://aws.amazon.com/earth/

At present, they still require local download, extraction and processing.

We should give the user the option to get all demcoreg layers in one shot, or provide instructions on how to run the necessary get_*.sh scripts. Right now, when a user runs dem_align.py, it starts with a bunch of downloads - no good.
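For reference, a one-shot wrapper could be a thin shell script that just calls the existing fetch scripts in sequence. A minimal sketch (script names taken from this thread; the `DATADIR` convention and script behavior are assumptions):

```bash
#!/bin/bash
# demcoreg_init.sh -- hypothetical one-shot wrapper around the existing get_*.sh scripts.
# Assumes the get_*.sh scripts are on PATH and write their output under $DATADIR.
set -e

export DATADIR=${DATADIR:-$HOME/data}
mkdir -p "$DATADIR"

for script in get_rgi.sh get_nlcd.sh get_bareground.sh; do
    echo "Running $script"
    "$script"
done
```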

Alternatively, we could include all auxiliary data in a docker image, or store it ourselves in the cloud. Should discuss with @scottyhq.

ShashankBice commented 4 years ago

Based on discussion with @scottyhq, I am listing what data we fetch, whether they are available on "Earth on AWS", and, if not, where we download them from without a login.

None of these is currently available on "Earth on AWS", so maybe including auxiliary data in a docker image is the better option. Just to confirm, all of this will come into play if we set up a binder hub for lightweight computations, right @dshean? Or is the thought here that the user can fetch the relevant compressed .tif files stored on the docker image directly?

Note: The repository does have shell scripts to fetch these locally.

dshean commented 4 years ago

OK, thanks for checking. The idea was to have them ready to go "locally" in the docker image with all of the necessary dependencies.

Also, it's not just about downloading the files; there are also some processing steps in the shell scripts to prepare for dem_align.py or dem_mask.py. Some of these steps may be less relevant for newer versions of the products, and we could probably come up with a better solution for combining all of the relevant RGI region shapefiles.
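As one illustration of that last point, the per-region RGI shapefiles could be appended into a single layer with ogr2ogr. The directory layout and file naming below are hypothetical:

```bash
#!/bin/bash
# Hypothetical sketch: merge all regional RGI shapefiles into one layer.
# Assumes the regional zips were already downloaded and extracted under $DATADIR/rgi60/regions/.
rgi_dir=$DATADIR/rgi60/regions
out=$DATADIR/rgi60/rgi60_merge.shp

first=true
for shp in "$rgi_dir"/*rgi60*.shp; do
    if $first; then
        # Create the output layer from the first region
        ogr2ogr "$out" "$shp"
        first=false
    else
        # Append subsequent regions to the same layer
        ogr2ogr -update -append "$out" "$shp"
    fi
done
```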

scottyhq commented 4 years ago

After talking with @ShashankBice I spent a couple hours with the ASP docker image and ran through the demcoreg beginners doc. To some extent we already have a nice solution for the preconfigured computing environment. I just tried it with the geohackweek tutorial contents:

`dem_align.py -mode nuth tutorial_contents/raster/data/rainier/20080901_rainierlidar_10m-adj.tif tutorial_contents/raster/data/rainier/20150818_rainier_summer-tile-0.tif`

and you can too ;) [binder badge]

> Alternatively, we could include all auxiliary data in a docker image

Embedding in the image could be practical for data volumes < 1 GB, but it seems all these datasets could easily be 10 GB+. So my suggestion is to let users run get_*.sh as-needed, or host "analysis-ready" data (unzipped, etc.) externally on S3 or elsewhere. Perhaps some code refactoring could allow streaming only portions of these global datasets from agency servers or FTP locations.
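One way to stream only the needed portion would be GDAL's `/vsicurl/` driver, which reads remote files via HTTP range requests. A rough, untested sketch, with a placeholder URL and an illustrative bounding box:

```bash
# Untested sketch: pull just the window covering a DEM footprint from a remotely
# hosted GeoTIFF via /vsicurl/, instead of downloading the full global layer.
# The URL is a placeholder for wherever the analysis-ready data end up.
url=https://example-bucket.s3.amazonaws.com/demcoreg/bare2010.tif

# -projwin ulx uly lrx lry (in the raster's CRS); coordinates here are illustrative
gdal_translate -projwin -122.0 47.1 -121.5 46.7 \
    /vsicurl/$url bare2010_clip.tif
```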

scottyhq commented 4 years ago

I didn't try all the get_*.sh scripts (just nlcd, rgi, and bareground), and it looks like the bareground hosting URL changed:

Downloading bare2010.zip
--2020-04-29 04:27:35--  http://edcintl.cr.usgs.gov/downloads/sciweb1/shared/gtc/downloads/bare2010.zip
Resolving edcintl.cr.usgs.gov (edcintl.cr.usgs.gov)... 152.61.136.26, 2001:49c8:4000:122c::26
Connecting to edcintl.cr.usgs.gov (edcintl.cr.usgs.gov)|152.61.136.26|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://edcintl.cr.usgs.gov/downloads/sciweb1/shared/gtc/downloads/bare2010.zip [following]
--2020-04-29 04:27:35--  https://edcintl.cr.usgs.gov/downloads/sciweb1/shared/gtc/downloads/bare2010.zip
Connecting to edcintl.cr.usgs.gov (edcintl.cr.usgs.gov)|152.61.136.26|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2020-04-29 04:27:35 ERROR 404: Not Found.

Unzipping bare2010.zip
Archive:  bare2010.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of bare2010.zip or
        bare2010.zip.zip, and cannot find bare2010.zip.ZIP, period.
ls: cannot access 'bare2010/bare2010_v3/*bare2010_v3.tif': No such file or directory

dshean commented 4 years ago

Thanks for taking a look. I remember this coming up in an email thread with @cmcneil-usgs in Jan 2020. Here are my notes:

Hmmm. Yeah, looks like the USGS landcover site disappeared. I found the data on the UMD site: https://glad.umd.edu/dataset/global-2010-bare-ground-30-m. Looks like they've posted individual tif tiles here: https://glad.umd.edu/Potapov/Bare_2010/ If you want to update get_bareground.sh to download and clean up these tifs, that would be great! As a stopgap, I pushed the original bare2010.zip file to Google Drive here: https://drive.google.com/file/d/1YDaaOm7aWG1URH8eviIYr69d-7ZwD8pj/view?usp=sharing

I see updated forest cover products here: http://earthenginepartners.appspot.com/science-2013-global-forest/download_v1.6.html The bareground product is just the thresholded inverse of forest cover percentage, so it could also be useful to create a new script to download and process these data, which provide more flexibility with timestamps.
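Since bareground is essentially the thresholded inverse of tree cover, such a script could derive a bare-ground mask directly from a forest cover tile with gdal_calc.py. The input filename and the 10% threshold below are placeholders, not values from demcoreg:

```bash
# Illustrative only: derive a bare-ground mask from a forest cover percentage tile
# by thresholding the inverse. The filename and threshold are placeholders.
gdal_calc.py -A treecover2010_tile.tif \
    --calc="(A <= 10)*1" --type=Byte --NoDataValue=255 \
    --outfile=bareground_from_treecover.tif
```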

dshean commented 4 years ago

> Embedding in the image could be practical for data volumes < 1 GB, but it seems all these datasets could easily be 10 GB+. So my suggestion is to let users run get_*.sh as-needed, or host "analysis-ready" data (unzipped, etc.) externally on S3 or elsewhere. Perhaps some code refactoring could allow streaming only portions of these global datasets from agency servers or FTP locations.

@scottyhq, I agree with all of these thoughts. The simplest solution is to maintain get_*.sh and have better documentation on initial setup, but then the user has to download and store everything locally. If we can prepare and host core data layers on S3, that would be great, especially if it falls under an existing project with credits. Would be nice if the providers/agencies took the lead on this, but I'm not going to hold my breath.

For the datasets with tif tiles on the web (like the new link for the bareground dataset), I expect we could prepare and distribute a vrt that would do the trick. Anybody want to do a quick test with a few tiles?
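A quick test could look something like the sketch below, pointing gdalbuildvrt at a few of the remote tiles through /vsicurl/. The tile filenames are illustrative and not verified against the UMD listing:

```bash
# Rough test of the "vrt over remote tiles" idea: build a VRT over a couple of the
# hosted bare ground tiles, then read only a small window out of it.
base=https://glad.umd.edu/Potapov/Bare_2010

gdalbuildvrt bare2010.vrt \
    /vsicurl/$base/tile_50N_130W.tif \
    /vsicurl/$base/tile_50N_120W.tif

# Sanity check: warp a small window (xmin ymin xmax ymax) out of the VRT
gdalwarp -te -122.0 46.7 -121.5 47.1 bare2010.vrt bare2010_test.tif
```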

dshean commented 4 years ago

> To some extent we already have a nice solution for the preconfigured computing environment. I just tried it with the geohackweek tutorial contents: `dem_align.py -mode nuth tutorial_contents/raster/data/rainier/20080901_rainierlidar_10m-adj.tif tutorial_contents/raster/data/rainier/20150818_rainier_summer-tile-0.tif` and you can too ;)

I launched pangeo binder, and while upload speed is not great, I successfully uploaded a >100 MB DEM. Seems like we can recommend this for new users who have a one-off application, though we should disable RGI glacier masking by default (I'll create a separate issue).

I played around with a fresh install and successfully ran the Rainier DEM samples from the geohackweek raster tutorial (great idea!). Let's keep hacking on this and update the README/doc with a simple example...

dshean commented 4 years ago

> though we should disable RGI glacier masking by default (I'll create a separate issue)

Done in https://github.com/dshean/demcoreg/commit/bd48b4f36354d7b40966ba0ec89c906ac7ecdd3a

ShashankBice commented 4 years ago

> > To some extent we already have a nice solution for the preconfigured computing environment. I just tried it with the geohackweek tutorial contents: `dem_align.py -mode nuth tutorial_contents/raster/data/rainier/20080901_rainierlidar_10m-adj.tif tutorial_contents/raster/data/rainier/20150818_rainier_summer-tile-0.tif` and you can too ;)
>
> I launched pangeo binder, and while upload speed is not great, I successfully uploaded a >100 MB DEM. Seems like we can recommend this for new users who have a one-off application, though we should disable RGI glacier masking by default (I'll create a separate issue).
>
> Playing around with a fresh install, I successfully ran the Rainier DEM samples from the geohackweek raster tutorial (great idea!). Let's keep hacking on this and update the README/doc with a simple example...

I will add this example as an extension to the ASP DEM tutorial over the weekend.

dshean commented 4 years ago

Sounds great @ShashankBice! Probably best to keep it separate from the core ASP processing tutorial though - modular is good. What if we had a separate tutorial in demcoreg?

ShashankBice commented 4 years ago

> Sounds great @ShashankBice! Probably best to keep it separate from the core ASP processing tutorial though - modular is good. What if we had a separate tutorial in demcoreg?

makes sense :) !

dshean commented 4 years ago

> To some extent we already have a nice solution for the preconfigured computing environment

@scottyhq I think you're using https://github.com/uw-cryo/asp-binder-dev/blob/master/binder/postBuild

Looks like it pulls the latest source from GitHub and does a dev install. Strangely, I'm not seeing the latest commits when launching via pangeo binder. Firing up a terminal and running `dem_align.py -h` still shows the default `-mask_list ['glaciers']`. Is this a caching issue?

scottyhq commented 4 years ago

Good catch, it was a bit of a hack solution to try things out. Anything in postBuild is (somewhat confusingly) baked into the image at build time. BinderHub doesn't rebuild an image if the repo hasn't changed, so one solution is to edit the readme, or add a comment to any file, to trigger rebuilding.

I'm trying to move those pip install commands to the start script, which runs when the image launches; that is probably the easiest way to get the latest source code when testing things out.
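For reference, a minimal `binder/start` hook could look like the sketch below. repo2docker runs `start` at container launch and expects it to hand off with `exec "$@"`; the clone path here is arbitrary and the git/pip steps are assumptions based on the postBuild approach described above:

```bash
#!/bin/bash
# binder/start -- runs at container launch, so a dev install here picks up the
# latest demcoreg source without rebuilding the image. Paths are illustrative.
git clone https://github.com/dshean/demcoreg.git /tmp/demcoreg || true
pip install --user -e /tmp/demcoreg

# repo2docker requires start to exec the command it was given
exec "$@"
```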

That seems to work @dshean, you can keep using the same binder link and you'll have the latest from GitHub:

jovyan@jupyter-uw-2dcryo-2dasp-2dbinder-2ddev-2d23hlp096:/srv/dshean/demcoreg$ git status
On branch master
Your branch is up to date with 'origin/master'.

dshean commented 4 years ago

Nice! That makes sense, and seems like a good solution. Thanks for looking into it!