creare-com / podpac

Pipeline for Observational Data Processing Analysis and Collaboration
https://podpac.org
Apache License 2.0

rasterio requires CURL_CA_BUNDLE environment variable to open s3 paths on some systems #458

Open jmilloy opened 3 years ago

jmilloy commented 3 years ago

Description

Rasterio fails to open s3 paths on some systems due to missing SSL certificates. The resulting exception is misleading: it reports an unsupported file format rather than an SSL problem.

According to https://github.com/mapbox/rasterio/commit/b621d92c51f7c2021f89cd4487cecdd7c201f320, libcurl on Linux expects the SSL certificates to be at the CentOS default, /etc/pki/tls/certs/ca-bundle.crt, but on other systems they are at other locations.
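To illustrate the mismatch, here is a minimal sketch that looks for a CA bundle in a few locations commonly used by Linux distributions. The helper name `find_ca_bundle` and the candidate list are assumptions for illustration, not part of rasterio or podpac, and the list is not exhaustive.

```python
import os

# Commonly used CA bundle locations (an assumed, non-exhaustive list).
CANDIDATE_CA_BUNDLES = [
    "/etc/pki/tls/certs/ca-bundle.crt",    # CentOS / RHEL / Fedora
    "/etc/ssl/certs/ca-certificates.crt",  # Debian / Ubuntu
    "/etc/ssl/ca-bundle.pem",              # openSUSE
    "/etc/ssl/cert.pem",                   # Alpine
]


def find_ca_bundle(candidates=CANDIDATE_CA_BUNDLES):
    """Return the first candidate path that exists as a file, or None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None
```

If a path is found, assigning it to `os.environ["CURL_CA_BUNDLE"]` before calling `rasterio.open` should let libcurl verify the HTTPS connection, assuming the variable is read when the request is made.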

Steps to Reproduce

On an Ubuntu system:

import podpac
node = podpac.data.Rasterio(source='s3://noaa-gfs-pds/SOIM/0-10 m DPTH/20210315/1200/003')
node.dataset

You can produce the error without podpac as well:

import rasterio
session = rasterio.session.AWSSession(region_name='us-east-1', aws_unsigned=True)
with rasterio.env.Env(session=session) as env:
    dataset = rasterio.open('s3://noaa-gfs-pds/SOIM/0-10 m DPTH/20210315/1200/003')

Observed Behavior

WARNING:rasterio._env:CPLE_AppDefined in HTTP response code on https://noaa-gfs-pds.s3.amazonaws.com/SOIM/0-10%20m%20DPTH/20210315/1200/003.xml: 0
Traceback (most recent call last):
  File "rasterio/_base.pyx", line 216, in rasterio._base.DatasetBase.__init__
  File "rasterio/_shim.pyx", line 67, in rasterio._shim.open_dataset
  File "rasterio/_err.pyx", line 213, in rasterio._err.exc_wrap_pointer
rasterio._err.CPLE_OpenFailedError: '/vsis3/noaa-gfs-pds/SOIM/0-10 m DPTH/20210315/1200/003' not recognized as a supported file format.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jmilloy/Creare/Pipeline/podpac/podpac/core/utils.py", line 398, in wrapper
    value = fn(self)
  File "/home/jmilloy/Creare/Pipeline/podpac/podpac/core/data/rasterio_source.py", line 78, in dataset
    dataset = rasterio.open(self.source)  # This should pull AWS credentials automatically
  File "/home/jmilloy/Creare/Pipeline/_podpac-38_/lib/python3.8/site-packages/rasterio/env.py", line 433, in wrapper
    return f(*args, **kwds)
  File "/home/jmilloy/Creare/Pipeline/_podpac-38_/lib/python3.8/site-packages/rasterio/__init__.py", line 221, in open
    s = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)
  File "rasterio/_base.pyx", line 218, in rasterio._base.DatasetBase.__init__
rasterio.errors.RasterioIOError: '/vsis3/noaa-gfs-pds/SOIM/0-10 m DPTH/20210315/1200/003' not recognized as a supported file format.

Additional Notes

Currently, podpac (and rasterio) offer a partial workaround, which is to use HTTP instead of HTTPS:

node = podpac.data.Rasterio(source='s3://noaa-gfs-pds/SOIM/0-10 m DPTH/20210315/1200/003', aws_https=False)

which translates to

session = rasterio.session.AWSSession(region_name='us-east-1', aws_unsigned=True)
with rasterio.env.Env(session=session, AWS_HTTPS=False) as env:
    dataset = rasterio.open('s3://noaa-gfs-pds/SOIM/0-10 m DPTH/20210315/1200/003')

jmilloy commented 3 years ago

@mpu-creare I pulled this out to its own issue.

The correct solution is to set CURL_CA_BUNDLE on some systems. I see two options.

  1. podpac raises a better exception. On RasterioIOError, check for the existence of the file using s3fs and, if it exists, raise an exception that suggests setting CURL_CA_BUNDLE.

  2. podpac sets CURL_CA_BUNDLE when necessary. On my system, you can use curl-config --ca to get the correct path, so I think we could do this automatically. We could probably do this on any Linux system without adverse effect, and include a trait to disable setting it, for informed users doing something special in rare situations.
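Option 1 could look roughly like the sketch below. The names `open_with_hint` and `CABundleHintError` are hypothetical, and the sketch assumes a public bucket (anonymous s3fs access) and that the rasterio open error is an `OSError` subclass. The `opener` and `exists` parameters are only there so the logic can be exercised without rasterio or s3fs installed.

```python
class CABundleHintError(OSError):
    """Hypothetical exception that carries the CURL_CA_BUNDLE hint."""


def open_with_hint(source, opener=None, exists=None):
    """Open source; if that fails but the object exists, suggest CURL_CA_BUNDLE."""
    if opener is None:
        import rasterio  # lazy import so the sketch loads without rasterio

        opener = rasterio.open
    try:
        return opener(source)
    except OSError as error:  # assumes RasterioIOError subclasses OSError
        if not source.startswith("s3://"):
            raise
        if exists is None:
            import s3fs  # lazy import; anonymous access is an assumption

            exists = s3fs.S3FileSystem(anon=True).exists
        if exists(source):
            raise CABundleHintError(
                f"Could not open {source!r} although the object exists. "
                "libcurl may be unable to find the SSL certificates; "
                "try setting the CURL_CA_BUNDLE environment variable."
            ) from error
        raise  # the object really is missing: keep the original error
```

The original exception stays attached as the cause, so the rasterio traceback is still available for debugging.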

Rasterio has evidently chosen to document this issue and otherwise punt on it, but I think very few people are going to be able to figure out that exception and the required fix.

mpu-creare commented 3 years ago

At least 1 should help.

For 2, you could check whether 'CURL_CA_BUNDLE' is already in the environment variables and, if not, try the curl-config --ca trick.
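A minimal sketch of that check, assuming `curl-config` is on the PATH and reports the same bundle that rasterio's libcurl should use. The function name `ensure_curl_ca_bundle` is hypothetical, and the sketch leaves out the proposed trait for disabling the behavior.

```python
import os
import shutil
import subprocess


def ensure_curl_ca_bundle(environ=os.environ):
    """Set CURL_CA_BUNDLE from `curl-config --ca` unless it is already set.

    Returns the path in effect, or None if no bundle could be determined.
    """
    # Respect a value the user has already chosen.
    if environ.get("CURL_CA_BUNDLE"):
        return environ["CURL_CA_BUNDLE"]
    # curl-config is not installed everywhere; do nothing if it is missing.
    if shutil.which("curl-config") is None:
        return None
    try:
        result = subprocess.run(
            ["curl-config", "--ca"],
            capture_output=True,
            text=True,
            check=True,
            timeout=5,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    path = result.stdout.strip()
    # Only export a path that actually exists on this system.
    if path and os.path.isfile(path):
        environ["CURL_CA_BUNDLE"] = path
        return path
    return None
```

This would have to run before the first `rasterio.open` call on an s3 path. Because an existing value is never overwritten, users with a custom certificate setup are unaffected.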