Applied-GeoSolutions / gips

Geospatial Image Processing System
GNU General Public License v3.0

Add AWS landsat fetch #470

Closed by ircwaves 6 years ago

ircwaves commented 6 years ago

step zero:

#469

step one:

Create a fetch that uses the landsat data available in public S3 buckets.

Add a setting to the REPOS dict so that the landsat driver can use a different fetch method depending on that configuration. The AWS fetch should use an AWS-specific asset query service, and the Landsat C1 asset will be present as a set of S3 URLs.

Also, GDAL has a /vsis3/ interface which allows direct reading of the files out of the S3 bucket without an explicit download step.
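For illustration, a minimal sketch of that direct-read path using GDAL's Python bindings; this is assumed usage rather than driver code, the scene path is the band-8 scene referenced later in this thread, and AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY are assumed to be set:

    from osgeo import gdal

    gdal.UseExceptions()

    # Band 8 of the Collection-1 scene referenced later in this thread.
    band_uri = ('/vsis3/landsat-pds/c1/L8/139/045/'
                'LC08_L1TP_139045_20170304_20170316_01_T1/'
                'LC08_L1TP_139045_20170304_20170316_01_T1_B8.TIF')

    ds = gdal.Open(band_uri)
    print(ds.RasterXSize, ds.RasterYSize)

    # Read a small window instead of the whole band to keep the transfer small.
    window = ds.GetRasterBand(1).ReadAsArray(0, 0, 512, 512)
    print(window.shape, window.dtype)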

ra-tolson commented 6 years ago

(#469 isn't really a prereq for this task per se, it's just how we'd planned on doing it)

ra-tolson commented 6 years ago

If the env vars are set, here's how GDAL can access the data: gdalinfo /vsis3_streaming/landsat-pds/c1/L8/139/045/LC08_L1TP_139045_20170304_20170316_01_T1/LC08_L1TP_139045_20170304_20170316_01_T1_B8.TIF. That took about 35 seconds on the company network.

Note this is no faster than downloading the file with wget and running gdalinfo on the local copy.

ra-tolson commented 6 years ago

The env vars that need setting are AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID. There are some related to performance too, but the default is close enough for us (us-east-1): AWS_DEFAULT_REGION.
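A trivial pre-flight check along those lines (just a sketch; the variable names are the ones listed above):

    import os

    # Credentials GDAL/boto need for the vsis3 route.
    required = ('AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY')
    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        raise SystemExit('Missing AWS credentials: ' + ', '.join(missing))

    # Region only affects performance; default to us-east-1 if unset.
    os.environ.setdefault('AWS_DEFAULT_REGION', 'us-east-1')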

ra-tolson commented 6 years ago

This is a proof-of-concept/demo/prototype that lets you fetch, archive, inventory, and process a not-really-there Landsat 8 asset stored in AWS S3 :cloud: CLOOOUUUD :cloud: STORAGE:

https://github.com/Applied-GeoSolutions/gips/compare/dev...470-s3-ls-fetch

See the README at the top of that diff for config details. Here's a sample NDVI generated by this code, for tile 139045 and date 2017-063:

[Image: 139045-2017063-l8-s3-ndvi]

matthewhanson commented 6 years ago

@ags-tolson FYI: because the landsat-pds bucket is public, you don't actually need AWS creds or boto3; you can use plain HTTP. Example: https://landsat-pds.s3.amazonaws.com/L8/001/002/LC80010022016230LGN00/index.html is the landing page. Each file there can be accessed directly, e.g. with wget, or with GDAL using the vsicurl driver (rather than the vsis3 driver).
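A sketch of that anonymous route, using the landing page above plus GDAL's /vsicurl/ driver; the band filename here is illustrative only, so scrape index.html for the real names:

    import urllib.request

    from osgeo import gdal

    gdal.UseExceptions()

    scene = 'https://landsat-pds.s3.amazonaws.com/L8/001/002/LC80010022016230LGN00/'

    # The landing page lists every band and metadata file for the scene.
    index_html = urllib.request.urlopen(scene + 'index.html').read().decode()
    print(len(index_html), 'bytes of index page')

    # Open one band over plain HTTP; no AWS credentials required.
    ds = gdal.Open('/vsicurl/' + scene + 'LC80010022016230LGN00_B8.TIF')
    print(ds.RasterXSize, ds.RasterYSize)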

ra-tolson commented 6 years ago

How do you learn that second date (which I guess is the processing date) ahead of time, so you can generate the URL? That's where I got stuck when trying to do it anonymously.

ircwaves commented 6 years ago

I think what Matt is saying is that you can use a search (via sat-search or usgs>=0.2.1) to get the scene ID, and then scrape the index page to get the path/key of each band and the metadata.
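Roughly, the scrape step could look like this (a sketch only; the href pattern is an assumption about the index page layout, and the URL is the C1 scene from later in this thread):

    import re
    import urllib.request

    index_url = ('https://landsat-pds.s3.amazonaws.com/c1/L8/139/045/'
                 'LC08_L1TP_139045_20170304_20170316_01_T1/index.html')

    html = urllib.request.urlopen(index_url).read().decode()

    # Hrefs on the index page are relative filenames, e.g. *_B4.TIF, *_MTL.txt.
    files = set(re.findall(r'href="([^"]+\.(?:TIF|txt))"', html, flags=re.I))
    bands = sorted(f for f in files if re.search(r'_B\d+\.TIF$', f, flags=re.I))
    metadata = [f for f in files if f.upper().endswith('_MTL.TXT')]
    print(bands)
    print(metadata)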

ircwaves commented 6 years ago

With AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID set in my environment, and the following settings:

 'landsat': {'6S': True,
             'MODTRAN': False,
             'extract': False,
             'password': 'my_password',
             'repository': '/var/gips-test-data/landsat',
             'source': 's3',
             'username': 'my_user_name'},

I get:

(venv) icooke@north:~/src/gips$ gips_inventory landsat \
              --fetch -t 012030 -d 2017-12-7 -v6
GIPS Data Inventory (v0.9.3-dev)
Retrieving inventory for site tiles
Landsat "DN" assets are no longer fetchable
Landsat "SR" assets are no longer fetchable
Searching AWS S3 bucket 'landsat-pds' for key fragment 'c1/L8/012/030/'
Found no complete S3 asset for (C1S3, 012030, 2017-12-07 00:00:00)
Problem fetching asset for C1S3, 012030, 17-12-07:
Traceback (most recent call last):
  File "/home/icooke/src/gips/gips/utils.py", line 571, in cli_error_handler
    yield
  File "/home/icooke/src/gips/gips/data/core.py", line 1070, in fetch
    cls.Asset.fetch(a, t, d)
  File "/home/icooke/src/gips/gips/data/landsat/landsat.py", line 557, in fetch
    cls.fetch_s3(tile, date)
  File "/home/icooke/src/gips/gips/data/landsat/landsat.py", line 460, in fetch_s3
    _30m_vsi_paths = [vsi_prefix + t for t in _30m_tifs]
TypeError: 'NoneType' object is not iterable

@ags-tolson: Are you able to fetch that scene using this branch?

ircwaves commented 6 years ago

s3_landsat_fetch_log.txt

Here's the log from a fresh DH container, popped over to the relevant branch.

ra-tolson commented 6 years ago

Right now it only handles T1 assets, and that scene is only available as an RT asset; sorry the UX isn't better (prototypes, whee).

ircwaves commented 6 years ago

I see a T1: https://s3-us-west-2.amazonaws.com/landsat-pds/c1/L8/196/057/LC08_L1TP_196057_20171130_20171207_01_T1/index.html and a message in that logfile saying:

2017-12-15 21:16:58,488 Response headers: {
  'date': 'Fri, 15 Dec 2017 21:16:58 GMT',
  'x-amz-id-2': 'SrS/H/knvyM2TeNrqYL7aH36a2Q/OTGSiEG2JQOmMYkcVxPmvpWNbM1lY+nR0OAa+XmnadTvTMM=',
  'server': 'AmazonS3',
  'transfer-encoding': 'chunked',
  'x-amz-request-id': 'EBE7A12CA5D6A17B',
  'x-amz-bucket-region': 'us-west-2',
  'content-type': 'application/xml'}
2017-12-15 21:16:58,488 Response body:
<?xml version="1.0" encoding="UTF-8"?>
<Error>
    <Code>PermanentRedirect</Code>
    <Message>The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.</Message>
    <Bucket>landsat-pds</Bucket>
    <Endpoint>landsat-pds.s3.amazonaws.com</Endpoint>
    <RequestId>EBE7A12CA5D6A17B</RequestId>
    <HostId>SrS/H/knvyM2TeNrqYL7aH36a2Q/OTGSiEG2JQOmMYkcVxPmvpWNbM1lY+nR0OAa+XmnadTvTMM=</HostId>
</Error>

For double coverage, I tried running this with the date that worked for you above, and got similar error messages.
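In case it helps with debugging, here's a sketch of listing that prefix with the bucket's region set explicitly (us-west-2, per the x-amz-bucket-region header above), which is one way to avoid a PermanentRedirect when talking to S3 with boto3:

    import boto3

    # Point the client at the bucket's own region to avoid the redirect.
    s3 = boto3.client('s3', region_name='us-west-2')
    resp = s3.list_objects_v2(
        Bucket='landsat-pds',
        Prefix='c1/L8/196/057/LC08_L1TP_196057_20171130_20171207_01_T1/',
    )
    for obj in resp.get('Contents', []):
        print(obj['Key'])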

ra-tolson commented 6 years ago

Which log file? I'm afraid I'm at a loss.

ircwaves commented 6 years ago

The one attached to the comment above: https://github.com/Applied-GeoSolutions/gips/issues/470#issuecomment-352114543

ra-tolson commented 6 years ago

Okay; I just got a good result from the scene listed here:

(venv) tolsonlt:~/src/gips$ time gips_inventory landsat -t 139045 -d 2017-063 -v4 --fetch
GIPS Data Inventory (v0.8.2)
Retrieving inventory for site tiles
Landsat "DN" assets are no longer fetchable
Landsat "SR" assets are no longer fetchable
Searching AWS S3 bucket 'landsat-pds' for key fragment 'c1/L8/139/045/'
Found complete C1S3 asset for (C1S3, 139045, 2017-03-04 00:00:00)
0...10...20...30...40...50...60...70...80...90...100 - done.
0...10...20...30...40...50...60...70...80...90...100 - done.
Attempting to load LC08_L1TP_139045_20170304_20170316_01_T1_S3.tar.gz
C1S3 asset
LC08_L1TP_139045_20170304_20170316_01_T1_S3.tar.gz -> /home/tolson/src/gips/data-root/landsat/tiles/139045/2017063/LC08_L1TP_139045_20170304_20170316_01_T1_S3.tar.gz
1 files (1 links) from /home/tolson/src/gips/data-root/landsat/stage added to archive in 0:00:00.004181
Attempting to load LC08_L1TP_139045_20170304_20170316_01_T1_S3.tar.gz

Though I'm still at a loss. :/ One thing to try, if you have the aws CLI utility:

$ aws s3 ls landsat-pds/c1/L8/139/045/LC08_L1TP_139045_20170608_20170616_01_T1/

ircwaves commented 6 years ago

Well, it seems like there is probably something deployment-related to sort out, but I guess I'd say push forward and combine this with the #469 work. We probably need a small example job that fetches from S3, processes and exports a handful of products using the NHseacoast shapefile, and then runs gips_stats on the exported directories. Does that make sense? Once that can run in the local docker environment, then push that job up to AWS?

ra-tolson commented 6 years ago

Okay, sounds like a plan.

ircwaves commented 6 years ago

aws_fetch_fail_log.txt

Following up on the deployment side, it looks like awscli is happy (see the attached logfile), but if I fetch immediately after aws s3 ls, I get the same issue (in both a fresh container and on my machine). As I said before, we can hold off on sorting this one out for now.