cogeotiff / www.cogeo.org

Cloud Optimized GeoTIFF website
http://cogeo.org
Apache License 2.0
90 stars 39 forks source link

How do you write a range request? #59

Open ahalota opened 1 year ago

ahalota commented 1 year ago

I'm trying to take advantage of COG format to write a python script to download only the data inside of a state border for example "california.shp". The data I want to download is hosted on AWS as requester pays, so it's important that I only download exactly what I need.

I have gone several pages in on google, and despite all the pages saying how great COG is because of the range requests, I just can't find a single real example showing how to do this.

So, given a shapefile, and an aws data COG data source (for example, naip-imagery), how would you do this? It would be great to have example scripts on this website (or anywhere, really).

This is the code I have so far. (Along with some creative suggestions from ChatGPT that didn't work out, I'm assuming it has just as little information to feed into the model as I did when trying to figure it out...)

import boto3
import shapefile
from shapely.geometry import Polygon, Point

# Read in shapefile, which is in WGS84
california_border = Polygon(shapefile.Reader("california.shp")).shapes()[0].points)

# Connect to AWS
client = boto3.client(
    "s3",
    aws_access_key_id=<MY_ACCESS_KEY>,
    aws_secret_access_key=<MY_SECRET_ACCESS_KEY>,
    region_name="us-west-2"
)

# List objects in the "naip-analytic" dataset
objects = client.list_objects_v2(
    Bucket="naip-analytic",
    Prefix="ca/2020/60cm/rgbir_cog/32114", #One out of 6 folders available for California, this holds 43 .tif files
    MaxKeys=1000,
    RequestPayer="requester"
)

for obj in objects["Contents"]:
    print(obj["Key"])
    # Get object metadata
    metadata = client.head_object(
        Bucket="naip-analytic",
        Key=obj["Key"],
        RequestPayer="requester" 
    )

    coords = metadata["Metadata"] #This doesn't actually hold any information (suggested by ChatGPT). Full output below

    #Everything below is non-working suggestions from ChatGPT, left in case they inspire any good ideas
    xmin = float(coords["s3:xmin"])
    ymin = float(coords["s3:ymin"])
    xmax = float(coords["s3:xmax"])
    ymax = float(coords["s3:ymax"])

    # Check if image intersects the California border
    if california_border.intersects(Polygon([(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax)])):
        # Calculate the overlap between the image and the California border
        overlap = california_border.intersection(Polygon([(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax)]))
        # Calculate the start and end bytes of the overlap
        start_x, start_y = overlap.exterior.coords[0]
        end_x, end_y = overlap.exterior.coords[2]
        start_byte = int((start_x - xmin) / (xmax - xmin) * obj["Size"]) #This was ChatGPT's guess. I am not sure if this is approrpiate
        end_byte = int((end_x - xmin) / (xmax - xmin) * obj["Size"])
        # Create a range request for the image
        url = f"https://naip-analytic.s3.amazonaws.com/{obj['Key']}?AWSAccessKeyId={client.meta.config.access_key}&Signature={client.meta.config.secret_key}&Expires={client.meta.config.signature_version}"
        headers = {
            "Range": f"bytes={start_byte}-{end_byte}"
        }
        print(url)

My metadata response from AWS looks like this:

{'ResponseMetadata': {'RequestId': 'BCZWDXWE7BJ1444B', 'HostId': 'BnfupuW09mBf8wBWNCYt2sLFLl78sY3niZ+TKZx6UkGZqCIFk3uNLCqPRG76I3qvAEWpkLLhR3s=', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amz-id-2': 'BnfupuW09mBf8wBWNCYt2sLFLl78sY3niZ+TKZx6UkGZqCIFk3uNLCqPRG76I3qvAEWpkLLhR3s=', 'x-amz-request-id': 'BCZWDXWE7BJ1444B', 'date': 'Thu, 05 Jan 2023 17:23:55 GMT', 'x-amz-request-charged': 'requester', 'last-modified': 'Fri, 28 Jan 2022 16:31:34 GMT', 'etag': '"3357a147f64f23e7b8471760b7faf242-44"', 'accept-ranges': 'bytes', 'content-type': 'image/tiff', 'server': 'AmazonS3', 'content-length': '361935973'}, 'RetryAttempts': 0}, 'AcceptRanges': 'bytes', 'LastModified': datetime.datetime(2022, 1, 28, 16, 31, 34, tzinfo=tzutc()), 'ContentLength': 361935973, 'ETag': '"3357a147f64f23e7b8471760b7faf242-44"', 'ContentType': 'image/tiff', 'Metadata': {}, 'RequestCharged': 'requester'}

I'm hoping someone here is familiar enough with the format to point me in the right direction, it's disappointing that something so simple isn't described anywhere.

I'm not sure where to find: 1) The AWS-hosted COG's bounding box coordinates (the ones above ["s3:xmin"] etc. do not appear when I run the code) 2) How do I calculate which bytes I need? 3) Can I get more specific and only get the exact bytes within my border, or do I have to do a bounding box?

Thanks! I really appreciate any help.

vincentsarago commented 1 year ago

checkout https://rasterio.readthedocs.io/en/latest/topics/masking-by-shapefile.html https://cogeotiff.github.io/rio-tiler/advanced/feature/

But that will only explain how to read a raster (COG). You have other issues to solve:

You don't need to write the range-request, rasterio/GDAL to do that for you

ref: https://github.com/kylebarron/naip-cogeo-mosaic https://github.com/developmentseed/cogeo-mosaic/blob/main/docs/src/examples/Create_a_Dynamic_RtreeBackend.ipynb

ahalota commented 1 year ago

Thanks for the links, I hadn't seen those before.

I'm trying to understand this in the most basic form possible (the naip-cogeo-mosaic link looks interesting, but I'd prefer not to get into another file format such as the mosaic.json described there).

After looking through those links, I'm still not clear on how I would proceed here.

I used the code above to grab my list of urls, and saved to "ca_urls.txt". A subset of a few here:

https://naip-analytic.s3.amazonaws.com/ca/2020/60cm/rgbir_cog/32114/m_3211419_nw_11_060_20200524.tif
https://naip-analytic.s3.amazonaws.com/ca/2020/60cm/rgbir_cog/32114/m_3211420_ne_11_060_20200524.tif
https://naip-analytic.s3.amazonaws.com/ca/2020/60cm/rgbir_cog/32114/m_3211420_nw_11_060_20200524.tif

I see several options for what should go in place of "https://" such as "s3://" or "/vsis3/", but don't know what the difference is.

Now I'm still not sure what to do with this. I looked at gdalbuildvrt, but couldn't find a configuration that actually generated a vrt. gdalbuildvrt -input_file_lst ca_urls.txt ca_mosaic.vrt returns this error for each line: "Warning 1: Can't open . Skipping it"

What is the overlap with the vrt and the mosaic.json project? They sound similar, so if I could just use the gdal vrt format instead that would be preferrable to minimize extra libraries.

Would I then need to feed the vrt file into rio_tiler Reader, where it is treated like a regular file and I can then follow the example you linked (https://cogeotiff.github.io/rio-tiler/advanced/feature/) at which point my file will be downloaded and saved?

Once I have my vrt file, should a command like this work? gdalwarp -cutline INPUT.shp INPUT.tif OUTPUT.tif (at this point I would expect to actually download the data and pay for it)

vincentsarago commented 1 year ago

I see several options for what should go in place of "https://" such as "s3://" or "/vsis3/", but don't know what the difference is.

because the naip bucket is requester pays, you'll need to use the s3:// prefix and set AWS_REQUEST_PAYER= requester in you environment.

This should fix the "Warning 1: Can't open . Skipping it" issue.

What is the overlap with the vrt and the mosaic.json project? They sound similar, so if I could just use the gdal vrt format instead that would be preferrable to minimize extra libraries.

MosaicJSON is kinda like a VRT but optimized for dynamic tiling. For your use case VRT will be enough.

Would I then need to feed the vrt file into rio_tiler Reader, where it is treated like a regular file

Yes

and I can then follow the example you linked (https://cogeotiff.github.io/rio-tiler/advanced/feature/) at which point my file will be downloaded and saved?

Rio-tiler is not meant to save output file, you'll better served with gdalwarp command line as you mentioned.

ahalota commented 1 year ago

Thanks for your help, it feels so close! I really appreciate it.

I now have my list of urls in "ca_32113_urls.txt":

s3://naip-analytic.s3.amazonaws.com/ca/2020/60cm/rgbir_cog/32114/m_3211419_nw_11_060_20200524.tif
s3://naip-analytic.s3.amazonaws.com/ca/2020/60cm/rgbir_cog/32114/m_3211420_ne_11_060_20200524.tif
s3://naip-analytic.s3.amazonaws.com/ca/2020/60cm/rgbir_cog/32114/m_3211420_nw_11_060_20200524.tif

And my command

gdalbuildvrt -input_file_list .\ca_32114_urls.txt ca_32114_urls.vrt 
--config AWS_REGION "us-west-2" 
--config AWS_ACCESS_KEY_ID <my_access_key> 
--config AWS_SECRET_ACCESS_KEY <my_secret_key> 
--config AWS_REQUEST_PAYER "requester"

But am still getting "Warning 1: Can't open . Skipping it" When you say said to set this in the environment, does it have to be a standard windows environment variable, or would you expect it to work as shown in the command above

vincentsarago commented 1 year ago

try with :

s3://naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211419_nw_11_060_20200524.tif
s3://naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211420_ne_11_060_20200524.tif
s3://naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211420_nw_11_060_20200524.tif
ahalota commented 1 year ago

That one didn't work either.

vincentsarago commented 1 year ago

🤦 my bad

it should be

/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211419_nw_11_060_20200524.tif
/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211420_ne_11_060_20200524.tif
/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211420_nw_11_060_20200524.tif

GDAL uses /vsis3/ prefix and not s3://

ahalota commented 1 year ago

Ah, I tried that one already as well, didn't work unfortunately.

vincentsarago commented 1 year ago

it worked for me 🤷‍♂️

ahalota commented 1 year ago

Ah, that's a good start!

Did you run the same command I had listed above, or do you have the environment variables set locally? Are you using Windows or Linux?

vincentsarago commented 1 year ago
$ cat test.txt 
/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211419_nw_11_060_20200524.tif
/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211420_ne_11_060_20200524.tif
/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211420_nw_11_060_20200524.tif

$ gdalbuildvrt -input_file_list test.txt ca_32114_urls.vrt --config AWS_REGION "us-west-2" --config AWS_REQUEST_PAYER "requester"
0...10...20...30...40...50...60...70...80...90...100 - done.

$ cat ca_32114_urls.vrt 
<VRTDataset rasterXSize="39990" rasterYSize="13100">
  <SRS dataAxisToSRSAxisMapping="1,2">PROJCS["NAD83 / UTM zone 11N",GEOGCS["NAD83",DATUM["North_American_Datum_1983",SPHEROID["GRS 1980",6378137,298.257222101,AUTHORITY["EPSG","7019"]],AUTHORITY["EPSG","6269"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4269"]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",0],PARAMETER["central_meridian",-117],PARAMETER["scale_factor",0.9996],PARAMETER["false_easting",500000],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","26911"]]</SRS>
  <GeoTransform>  7.1059800000000000e+05,  5.9999999999999998e-01,  0.0000000000000000e+00,  3.6265380000000000e+06,  0.0000000000000000e+00, -5.9999999999999998e-01</GeoTransform>
  <VRTRasterBand dataType="Byte" band="1">
    <ColorInterp>Red</ColorInterp>
    <SimpleSource>
      <SourceFilename relativeToVRT="0">/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211419_nw_11_060_20200524.tif</SourceFilename>
      <SourceBand>1</SourceBand>
      <SourceProperties RasterXSize="10680" RasterYSize="12440" DataType="Byte" BlockXSize="512" BlockYSize="512" />
      <SrcRect xOff="0" yOff="0" xSize="10680" ySize="12440" />
      <DstRect xOff="0" yOff="660" xSize="10680" ySize="12440" />
    </SimpleSource>
    <ComplexSource>
      <SourceFilename relativeToVRT="0">/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211420_ne_11_060_20200524.tif</SourceFilename>
      <SourceBand>1</SourceBand>
      <SourceProperties RasterXSize="10710" RasterYSize="12450" DataType="Byte" BlockXSize="512" BlockYSize="512" />
      <SrcRect xOff="0" yOff="0" xSize="10710" ySize="12450" />
      <DstRect xOff="29280" yOff="0" xSize="10710" ySize="12450" />
      <NODATA>0</NODATA>
    </ComplexSource>
    <SimpleSource>
      <SourceFilename relativeToVRT="0">/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211420_nw_11_060_20200524.tif</SourceFilename>
      <SourceBand>1</SourceBand>
      <SourceProperties RasterXSize="10700" RasterYSize="12440" DataType="Byte" BlockXSize="512" BlockYSize="512" />
      <SrcRect xOff="0" yOff="0" xSize="10700" ySize="12440" />
      <DstRect xOff="19520" yOff="230" xSize="10700" ySize="12440" />
    </SimpleSource>
  </VRTRasterBand>
  <VRTRasterBand dataType="Byte" band="2">
    <ColorInterp>Green</ColorInterp>
    <SimpleSource>
      <SourceFilename relativeToVRT="0">/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211419_nw_11_060_20200524.tif</SourceFilename>
      <SourceBand>2</SourceBand>
      <SourceProperties RasterXSize="10680" RasterYSize="12440" DataType="Byte" BlockXSize="512" BlockYSize="512" />
      <SrcRect xOff="0" yOff="0" xSize="10680" ySize="12440" />
      <DstRect xOff="0" yOff="660" xSize="10680" ySize="12440" />
    </SimpleSource>
    <ComplexSource>
      <SourceFilename relativeToVRT="0">/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211420_ne_11_060_20200524.tif</SourceFilename>
      <SourceBand>2</SourceBand>
      <SourceProperties RasterXSize="10710" RasterYSize="12450" DataType="Byte" BlockXSize="512" BlockYSize="512" />
      <SrcRect xOff="0" yOff="0" xSize="10710" ySize="12450" />
      <DstRect xOff="29280" yOff="0" xSize="10710" ySize="12450" />
      <NODATA>0</NODATA>
    </ComplexSource>
    <SimpleSource>
      <SourceFilename relativeToVRT="0">/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211420_nw_11_060_20200524.tif</SourceFilename>
      <SourceBand>2</SourceBand>
      <SourceProperties RasterXSize="10700" RasterYSize="12440" DataType="Byte" BlockXSize="512" BlockYSize="512" />
      <SrcRect xOff="0" yOff="0" xSize="10700" ySize="12440" />
      <DstRect xOff="19520" yOff="230" xSize="10700" ySize="12440" />
    </SimpleSource>
  </VRTRasterBand>
  <VRTRasterBand dataType="Byte" band="3">
    <ColorInterp>Blue</ColorInterp>
    <SimpleSource>
      <SourceFilename relativeToVRT="0">/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211419_nw_11_060_20200524.tif</SourceFilename>
      <SourceBand>3</SourceBand>
      <SourceProperties RasterXSize="10680" RasterYSize="12440" DataType="Byte" BlockXSize="512" BlockYSize="512" />
      <SrcRect xOff="0" yOff="0" xSize="10680" ySize="12440" />
      <DstRect xOff="0" yOff="660" xSize="10680" ySize="12440" />
    </SimpleSource>
    <ComplexSource>
      <SourceFilename relativeToVRT="0">/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211420_ne_11_060_20200524.tif</SourceFilename>
      <SourceBand>3</SourceBand>
      <SourceProperties RasterXSize="10710" RasterYSize="12450" DataType="Byte" BlockXSize="512" BlockYSize="512" />
      <SrcRect xOff="0" yOff="0" xSize="10710" ySize="12450" />
      <DstRect xOff="29280" yOff="0" xSize="10710" ySize="12450" />
      <NODATA>0</NODATA>
    </ComplexSource>
    <SimpleSource>
      <SourceFilename relativeToVRT="0">/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211420_nw_11_060_20200524.tif</SourceFilename>
      <SourceBand>3</SourceBand>
      <SourceProperties RasterXSize="10700" RasterYSize="12440" DataType="Byte" BlockXSize="512" BlockYSize="512" />
      <SrcRect xOff="0" yOff="0" xSize="10700" ySize="12440" />
      <DstRect xOff="19520" yOff="230" xSize="10700" ySize="12440" />
    </SimpleSource>
  </VRTRasterBand>
  <VRTRasterBand dataType="Byte" band="4">
    <SimpleSource>
      <SourceFilename relativeToVRT="0">/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211419_nw_11_060_20200524.tif</SourceFilename>
      <SourceBand>4</SourceBand>
      <SourceProperties RasterXSize="10680" RasterYSize="12440" DataType="Byte" BlockXSize="512" BlockYSize="512" />
      <SrcRect xOff="0" yOff="0" xSize="10680" ySize="12440" />
      <DstRect xOff="0" yOff="660" xSize="10680" ySize="12440" />
    </SimpleSource>
    <ComplexSource>
      <SourceFilename relativeToVRT="0">/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211420_ne_11_060_20200524.tif</SourceFilename>
      <SourceBand>4</SourceBand>
      <SourceProperties RasterXSize="10710" RasterYSize="12450" DataType="Byte" BlockXSize="512" BlockYSize="512" />
      <SrcRect xOff="0" yOff="0" xSize="10710" ySize="12450" />
      <DstRect xOff="29280" yOff="0" xSize="10710" ySize="12450" />
      <NODATA>0</NODATA>
    </ComplexSource>
    <SimpleSource>
      <SourceFilename relativeToVRT="0">/vsis3/naip-analytic/ca/2020/60cm/rgbir_cog/32114/m_3211420_nw_11_060_20200524.tif</SourceFilename>
      <SourceBand>4</SourceBand>
      <SourceProperties RasterXSize="10700" RasterYSize="12440" DataType="Byte" BlockXSize="512" BlockYSize="512" />
      <SrcRect xOff="0" yOff="0" xSize="10700" ySize="12440" />
      <DstRect xOff="19520" yOff="230" xSize="10700" ySize="12440" />
    </SimpleSource>
  </VRTRasterBand>
  <OverviewList resampling="nearest">2 4 8 16 32</OverviewList>
</VRTDataset>

I'm on Mac OS, I have couple env set but it shouldn't make any difference for accessing the files

$ env | grep "GDAL"
GDAL_CACHEMAX=75%
GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR
GDAL_HTTP_MERGE_CONSECUTIVE_RANGES=YES
GDAL_HTTP_MULTIPLEX=YES
GDAL_INGESTED_BYTES_AT_OPEN=32768
GDAL_DATA=/opt/homebrew/share/gdal
GDAL_HTTP_VERSION=2

GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR is maybe the most important, but this will only speed up the process

ahalota commented 1 year ago

Do you have additional AWS environment variables set? I saw you omitted AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, so I'm guessing they are set as environment variables.

vincentsarago commented 1 year ago

I'm guessing they are set as environment variables.

yes they are in my env by default so I omitted those

ahalota commented 1 year ago

I tried a few more things, and it seems something was corrupted with the my input file list, "ca_32114_urls.txt".

I had created that one by copy and pasting the text in from prior command line output, then saving it via Notepad++. I created a new file, test.txt and copied and pasted in the same text from the previous file and it worked. I also tried out a few different configurations as far as the file name (underscores, digits, etc), but none of these seemed to recreate the issue. Renaming the original file did not resolve the error on it, no matter what I did I cannot make that one file work. If I make a copy of the original file, it also does not work.

With the new, uncorrupted urls.txt file, I was able to create my ca_mosaic.vrt Then I can pass this into gdalwarp to download exactly the subset I am looking for. I ran it at a very low resolution initially to see if it was providing the data I wanted.

gdalwarp -cutline "california_boundary.shp" ca_32114_urls.vrt ca_32114_3pics.tiff 
-tr 600 -600 
--config AWS_REGION_NAME "us-west-2" 
--config AWS_REQUEST_PAYER "requester"
--config AWS_ACCESS_KEY_ID <my_access_key> 
--config AWS_SECRET_ACCESS_KEY <my_secret_key> 

Thank you so much for your help with this!