ContinuumIO / elm

Phase I & part of Phase II of NASA SBIR - Parallel Machine Learning on Satellite Data
http://ensemble-learning-models.readthedocs.io

Deprecate the elm-data repo in favor of S3 #146

Closed PeterDSteinberg closed 7 years ago

PeterDSteinberg commented 7 years ago

Installing the example / test data from elm-data is not a hard requirement of elm. It is currently used in Travis CI and in some of the example notebooks. We will move this data to S3, because elm-data currently requires installing git LFS, which involves a few git commands / system installs that some users are less familiar with. Use the datashader/examples/ approach for downloading sample data.
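A minimal sketch of what a datashader-style sample-data fetcher could look like. The bucket URL and archive name below are placeholders (the real S3 location is decided later in this thread), and the archive format follows the .tar.bz2 choice discussed below:

```python
import os
import tarfile
import urllib.request

# Hypothetical URL; the real bucket/key were decided later in this issue.
SAMPLE_DATA_URL = "https://s3.amazonaws.com/elm-data/example-data.tar.bz2"


def download_sample_data(url=SAMPLE_DATA_URL, dest="./elm-example-data"):
    """Fetch a sample-data archive over plain HTTP(S), no git LFS needed,
    and unpack it locally, mirroring the datashader/examples/ approach."""
    os.makedirs(dest, exist_ok=True)
    archive = os.path.join(dest, os.path.basename(url))
    if not os.path.exists(archive):  # skip re-download if cached
        urllib.request.urlretrieve(url, archive)
    with tarfile.open(archive, mode="r:*") as tf:  # "r:*" autodetects bz2/gz
        tf.extractall(dest)
    return dest
```

The caching check means re-running a notebook does not re-download the archive.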

gbrener commented 7 years ago

Sounds great. I'll create a new bucket and post details internally once I have them.

PeterDSteinberg commented 7 years ago

Also relevant to this issue, @gbrener, is a module in what is now earthio for downloading from Amazon's LANDSAT S3 store. I made it in preparation for the AnacondaCON demo presentation. At that time I also added this more general LANDSAT util, which deals with spatial and band metadata rather than S3 and downloading. Neither of these LANDSAT-related modules is affected by the earthio PR 1, except for the file move from elm/readers/ to earthio/.

gbrener commented 7 years ago

Ok, good to know @PeterDSteinberg - thanks for the link. Before seeing your comment I wrote a similar script (https://github.com/ContinuumIO/elm-readers/blob/cd89b89108f0542d26b77044c4d4fb7a68b1ca63/scripts/download_test_data.py), except it's a bit more generic. It currently expects files in the .tar.bz2 format, but I wrote it with extensibility/flexibility in mind, so we can always add more formats. The choice of bzip2 over gzip was fairly arbitrary: conda uses bzip2 for packaging, but gzip is more ubiquitous, so I'm happy to switch to gzip if you have a strong preference for it. Please let me know your thoughts on whether I should combine the two scripts, and/or change the compression to gzip.
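For what it's worth, supporting both compression choices in one code path is cheap in Python: `tarfile`'s `"r:*"` mode autodetects bz2 vs. gzip, so a helper along these lines (a sketch, not the actual download_test_data.py code) would accept either format without a flag:

```python
import tarfile


def open_archive(path):
    """Open a .tar.bz2 or .tar.gz archive with one call.

    Mode "r:*" asks the stdlib to autodetect the compression, so the
    bzip2-vs-gzip choice does not leak into the calling code.
    """
    return tarfile.open(path, mode="r:*")
```

With this, switching the published archives from bz2 to gz would need no code change on the download side.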

PeterDSteinberg commented 7 years ago

@gbrener, one idea is keeping s3_landsat_util.py in place with the LANDSAT-specific code. Alternatively, we could move the LANDSAT-specific code from s3_landsat_util.py into landsat_util.py, and the S3 part of s3_landsat_util.py into your download_test_data.py script. Up to you, whichever is easier. The s3_landsat_util.py I mentioned is specifically for downloading from AWS's LANDSAT store, with logic for finding scenes in a file called scene_list.gz downloaded from AWS, while yours is geared toward moving what is now elm-data into S3 buckets we control. Here's the SceneDownloader class that we could optionally move to landsat_util.py or your download_test_data.py, or keep in place. I also have some code from a notebook that finds the lowest-cloud-cover image in scene_list.gz from the AWS LANDSAT store - https://aws.amazon.com/public-datasets/landsat/ . I can commit my notebook to elm-examples soon, then generalize and commit those changes in this project later. I think the AWS LANDSAT store (among others) is a good test data set for us: it is already a large data set of interest, it is highly available without any maintenance on our part, and there are interesting problems that can be tackled with a smaller subset of the data.
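The lowest-cloud-cover lookup mentioned above could be sketched roughly as below. This is not the notebook code; it is a stand-in that assumes the public scene list is a CSV with `cloudCover`, `path`, and `row` columns, which matches my reading of the AWS Landsat scene list but should be verified against the real file:

```python
import csv
import gzip


def lowest_cloud_cover(scene_list_path, path_row=None):
    """Scan a scene_list CSV (optionally gzipped, as AWS publishes it)
    and return the record with the smallest cloud cover.

    path_row: optional (path, row) tuple of strings to restrict the
    search to one WRS-2 footprint. Column names are assumptions.
    """
    opener = gzip.open if scene_list_path.endswith(".gz") else open
    best = None
    with opener(scene_list_path, "rt") as f:
        for rec in csv.DictReader(f):
            if path_row is not None and (rec["path"], rec["row"]) != path_row:
                continue
            if best is None or float(rec["cloudCover"]) < float(best["cloudCover"]):
                best = rec
    return best
```

Streaming the CSV row by row keeps memory flat even though the full scene list has millions of entries.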

gbrener commented 7 years ago

Ok, sounds good - I'm fine with keeping them separate.

gbrener commented 7 years ago

Just so this is written down somewhere: in order to fully deprecate elm-data, we'll need to update the documentation so that it no longer references that repo. Based on an offline conversation with @PeterDSteinberg, we're planning to do this at a later, to-be-determined date.

PeterDSteinberg commented 7 years ago

TODO items/PRs remaining before we close this issue:

PeterDSteinberg commented 7 years ago

I deleted the elm-data repo, but first downloaded and zipped its latest contents just to be safe. This logic is now handled by the S3 downloading. See also #134