ceholden / open-geo-tutorial

Tutorial of basic remote sensing and GIS methodologies using open source software (GDAL in Python or R)

question about data subsets! #8

Open lwasser opened 3 years ago

lwasser commented 3 years ago

Hey there @ceholden, I really love the random forest tutorial that you created! I've been working on a new version of it for my courses and would love to eventually publish it on our website, earthdatascience.org, with you as a co-author if you are open to this. Do you by chance still have the original data that were used? I'd like to start from the way Landsat is normally packaged. I could create a new data set, but of course making training data takes time!

If you want to have a look at what I'm doing, you can view a draft here: https://ipynb.pub/view/eac8fe9fe71b355c79edf0c719a4546ae73b0ebfbbf39ff5e3e7225a43d5d1ca#displayOptions=enable-annotations

ceholden commented 3 years ago

Hi @lwasser! Thank you for the nice feedback and for taking this old lesson into the 2020s! I really love seeing how others remix this sort of stuff using new libraries, and I fully support you republishing this lesson elsewhere. I would be happy to be added as a co-author, but I'm afraid I don't have much time outside work for this sort of fun stuff anymore. I'm not sure I would have enough time going forward to deserve "co-author"; "adapted from lesson by @ceholden" feels like a better descriptor.

I don't have the raw data that I used to generate the subset anymore, but a lot has changed since I wrote this that makes getting the data easier. I think I might have run the atmospheric correction (LEDAPS) on this myself, or otherwise had to manually ask the USGS to do it. Since then the USGS has done some amazing work making these "Level 2" products available, especially with Landsat Collection 2, which has these data immediately available through conventional methods like EarthExplorer (no more batch processing requests to wait for!)

One route to getting the modern equivalent of the raw data would be to use EarthExplorer to select the "Landsat 7 ETM+ C2 L2" data set and filter by "Landsat Scene Identifier" of "LE70220491999322EDC01". You'll need a free USGS account to download, but otherwise you can download the entire scene as a bundle in only a few clicks. Hosting these raw images for this lesson is a little difficult though because they're ~600MB in size. Happy to send it or help you download it if you have ideas for hosting!
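For anyone curious what that scene identifier encodes, here is a minimal sketch that splits it into fields. It assumes the 21-character pre-Collection Landsat scene ID layout (sensor, satellite, WRS-2 path and row, year, day of year, ground station, version); the function name `parse_scene_id` and the dictionary keys are made up for illustration.

```python
from datetime import date, timedelta


def parse_scene_id(scene_id):
    """Split a 21-character Landsat scene ID into its fields.

    Assumed layout: L X S PPP RRR YYYY DDD GSI VV
    (sensor X, satellite S, path, row, year, day of year, station, version).
    """
    if len(scene_id) != 21:
        raise ValueError(f"expected 21 characters, got {len(scene_id)}")

    year = int(scene_id[9:13])
    day_of_year = int(scene_id[13:16])

    return {
        "sensor": scene_id[1],           # "E" for ETM+
        "satellite": int(scene_id[2]),   # 7 for Landsat 7
        "path": int(scene_id[3:6]),      # WRS-2 path
        "row": int(scene_id[6:9]),       # WRS-2 row
        # Day of year is 1-based, so day 1 is January 1.
        "acquired": date(year, 1, 1) + timedelta(days=day_of_year - 1),
        "station": scene_id[16:19],      # receiving ground station
        "version": scene_id[19:21],
    }


info = parse_scene_id("LE70220491999322EDC01")
print(info["path"], info["row"], info["acquired"].isoformat())
```

For the scene above this gives path 22, row 49, acquired on 1999-11-18, which can also be used to search EarthExplorer by path/row and date instead of by identifier.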

With Collection 2 of Landsat data, the USGS is now also hosting it on Amazon Web Services (AWS) in a "requester pays" S3 bucket (usgs-landsat - see this dataset landing page for more). Having an AWS account with billing set up to access the data may be very out of scope for what you're thinking, but these days we could include an entire workflow to get this example subset without downloading much. These data are formatted as "Cloud Optimized GeoTIFFs" (see https://www.cogeo.org/), which let users efficiently read just the subsets they're interested in, so there's no more downloading huge images if you don't need to. A lot of other geospatial raster datasets have been added to AWS this way, including Sentinel-2, Sentinel-1, Harmonized Landsat-Sentinel (coming soon with v2!), MODIS NBAR products, NAIP aerial photos, elevation data, and lots of climate and weather data from NOAA and others.
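A rough sketch of what "read just the subset" looks like in practice, assuming `rasterio` and `boto3` are installed and AWS credentials with billing are configured. The first helper is plain arithmetic showing how map bounds turn into a pixel window for a north-up raster; the second does the actual windowed read from a requester-pays bucket. Both function names are hypothetical, and the `url` argument is a placeholder for a real object path in the `usgs-landsat` bucket, which is not given here.

```python
import math


def bounds_to_pixel_window(bounds, origin_x, origin_y, pixel_size):
    """Convert map bounds to (row_off, col_off, height, width).

    Assumes a north-up raster whose upper-left corner is at
    (origin_x, origin_y) with square pixels of `pixel_size` map units.
    """
    left, bottom, right, top = bounds
    col_off = math.floor((left - origin_x) / pixel_size)
    row_off = math.floor((origin_y - top) / pixel_size)
    col_end = math.ceil((right - origin_x) / pixel_size)
    row_end = math.ceil((origin_y - bottom) / pixel_size)
    return row_off, col_off, row_end - row_off, col_end - col_off


def read_subset(url, bounds, band=1):
    """Read one band of a COG for `bounds` only (left, bottom, right, top).

    `url` is a placeholder such as "s3://usgs-landsat/..." and `bounds`
    must be in the raster's own coordinate reference system.
    """
    # Imported lazily so the helper above works without these packages.
    import boto3
    import rasterio
    from rasterio.session import AWSSession
    from rasterio.windows import from_bounds

    # requester_pays=True means the reader's AWS account is billed.
    session = AWSSession(boto3.Session(), requester_pays=True)
    with rasterio.Env(session):
        with rasterio.open(url) as src:
            window = from_bounds(*bounds, transform=src.transform)
            # Only the blocks overlapping the window are fetched.
            return src.read(band, window=window)


# A 600 m x 300 m box inside a 30 m grid whose corner is at (0, 3000):
print(bounds_to_pixel_window((300, 2400, 900, 2700), 0, 3000, 30))
```

Because the file is tiled internally, the windowed read only requests the byte ranges covering that window rather than the whole ~600MB scene.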

I had some time and wrote up a little notebook example demonstrating how to read and regenerate this subset using Collection 2 imagery from AWS. The output is a little different because the raw product has changed so much (the subset I wrote is scaled and the cloud QA band is a little different), but it looks like the image I remember! https://gist.github.com/ceholden/98a19102ef0916308d5b0c900b2d141d
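On the scaling difference: Collection 2 Level-2 surface reflectance bands are stored as integers that convert to reflectance with a scale of 0.0000275 and an offset of -0.2. A minimal sketch of that conversion is below; the function name is hypothetical, and this is not necessarily the scaling used in the notebook or in the original subset.

```python
# Collection 2 Level-2 surface reflectance scaling.
SR_SCALE = 0.0000275
SR_OFFSET = -0.2


def dn_to_reflectance(dn):
    """Convert stored surface reflectance values to unitless reflectance.

    Works on plain numbers or NumPy arrays. Fill pixels (stored as 0)
    are not handled here and should be masked before converting.
    """
    return dn * SR_SCALE + SR_OFFSET


print(round(dn_to_reflectance(10000), 6))
```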

Sorry I'll probably be slow to reply during the week, but hope this helps with your course!