earthlab / earthpy

A package built to support working with spatial data using open source python
https://earthpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

EarthlabData.get_data(url = ...) may be broken #254

Closed mbjoseph closed 5 years ago

mbjoseph commented 5 years ago

Describe the bug

The url argument to the get_data method of the EarthlabData class does not appear to work. I have tried a variety of URLs that point to different kinds of files, and I always get a KeyError when get_data tries to access the content-disposition field in the response headers.

To Reproduce

Using a URL to a zip file raises the KeyError.

import earthpy.io as eio
d = eio.EarthlabData()
d.get_data(url = 'https://ndownloader.figshare.com/files/10960109.zip')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-7-b501fa2d71be> in <module>
----> 1 d.get_data(url = 'https://ndownloader.figshare.com/files/10960109.zip')

~/Documents/earthpy/earthpy/io.py in get_data(self, key, replace, url)
    142             fname = None
    143             r = requests.head(url)
--> 144             content_disposition = r.headers["content-disposition"].split(";")
    145             for c in content_disposition:
    146                 if c.startswith("filename="):

~/.local/lib/python3.6/site-packages/requests/structures.py in __getitem__(self, key)
     50 
     51     def __getitem__(self, key):
---> 52         return self._store[key.lower()][1]
     53 
     54     def __delitem__(self, key):

KeyError: 'content-disposition'

This happens for every URL that I have tried so far, including:

d.get_data(url = 'https://ndownloader.figshare.com/files/10960109.zip')
d.get_data(url = 'https://www.google.com/robots.txt')
d.get_data(url = 'https://github.com/earthlab/earthpy/archive/master.zip')
d.get_data(url = 'https://raw.githubusercontent.com/earthlab/earthpy/master/earthpy/example-data/continental-div-trail.geojson')

Expected behavior

I would expect that for valid data types (e.g., files, zip files, tar, and tar.gz files), those files would be downloaded and I would get the path(s) to the data.

What Operating System Are you Running?

DISTRIB_ID=Ubuntu DISTRIB_RELEASE=18.04 DISTRIB_CODENAME=bionic DISTRIB_DESCRIPTION="Ubuntu 18.04.2 LTS"

Additional context

After digging into this more, it looks like the content-disposition header is not returned in the response to any of the above requests, and I can't find an equivalent header field that could be used to determine the file name. Maybe I'm missing something, or using the url argument incorrectly. @betatim might know.

For example:

import requests
r = requests.head('https://github.com/earthlab/earthpy/archive/master.zip')
dict(**r.headers)

which returns

{'Server': 'GitHub.com',
 'Date': 'Fri, 08 Mar 2019 20:10:31 GMT',
 'Content-Type': 'text/html; charset=utf-8',
 'Status': '302 Found',
 'Vary': 'X-PJAX',
 'Location': 'https://codeload.github.com/earthlab/earthpy/zip/master',
 'Cache-Control': 'max-age=0, private',
 'Set-Cookie': 'has_recent_activity=1; path=/; expires=Fri, 08 Mar 2019 21:10:31 -0000, logged_in=no; domain=.github.com; path=/; expires=Tue, 08 Mar 2039 20:10:31 -0000; secure; HttpOnly, _gh_sess=RkVkd29hU3pPVjkxUHd4RUJ1S0laRjdVbklFeVpiV2dEcE1ORkYrTXEzK1RCYmVaN1hjajc0UnRia0t2d1dlUWUyS3pVdVJSY1M4U0hjMEZ6SU56eVA4N2JTSmVKcUw5Yko0amtwY0ZmWlQrczlkTStpWWdzZVNPbGo2aXRRcTJicUEzTTZNSzhtMHArRHU3WmdEQTRDcEJlOXNGSWlrS2xnV2FMR2VUVlNMOW1VNDA4cW5Ya1BzSytkME5obHg1RlQxbzZxaCt4NzlsVWJ1S1VGSVNjQT09LS0wZmRXL0dGRndnWWJsKzNUdkw2M3V3PT0%3D--e041e8eb0c4172824db99361f7d668f8d2be8b01; path=/; secure; HttpOnly',
 'X-Request-Id': '27ba6a04-beaf-4a30-9780-c3c2c090a902',
 'Strict-Transport-Security': 'max-age=31536000; includeSubdomains; preload',
 'X-Frame-Options': 'deny',
 'X-Content-Type-Options': 'nosniff',
 'X-XSS-Protection': '1; mode=block',
 'Expect-CT': 'max-age=2592000, report-uri="https://api.github.com/_private/browser/errors"',
 'Content-Security-Policy': "default-src 'none'; base-uri 'self'; block-all-mixed-content; connect-src 'self' uploads.github.com www.githubstatus.com collector.githubapp.com api.github.com www.google-analytics.com github-cloud.s3.amazonaws.com github-production-repository-file-5c1aeb.s3.amazonaws.com github-production-upload-manifest-file-7fdce7.s3.amazonaws.com github-production-user-asset-6210df.s3.amazonaws.com wss://live.github.com; font-src github.githubassets.com; form-action 'self' github.com gist.github.com; frame-ancestors 'none'; frame-src render.githubusercontent.com; img-src 'self' data: github.githubassets.com identicons.github.com collector.githubapp.com github-cloud.s3.amazonaws.com *.githubusercontent.com; manifest-src 'self'; media-src 'none'; script-src github.githubassets.com; style-src 'unsafe-inline' github.githubassets.com",
 'X-GitHub-Request-Id': '86A5:336E:32D5D:5BC7C:5C82CC37'}
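
One possible way to avoid the KeyError is to treat content-disposition as optional and fall back to the last segment of the URL path. The following is only a minimal sketch of that idea, not what earthpy does: guess_filename is a hypothetical helper name, and it takes the headers as a plain mapping so it can be tried without a network connection. Note also that the response above is a 302; requests.head does not follow redirects unless allow_redirects=True is passed, so the headers shown are those of the redirect itself and not of the final file.

```python
import posixpath
from urllib.parse import unquote, urlparse


def guess_filename(headers, url):
    """Hypothetical helper (not part of earthpy): choose a file name.

    Prefers the filename given in a content-disposition header and
    falls back to the last segment of the URL path. Returns None if
    neither source yields a name.
    """
    # Plain dicts are case-sensitive, so normalize the header names.
    lowered = {name.lower(): value for name, value in headers.items()}

    # .get() avoids the KeyError raised when the header is absent.
    disposition = lowered.get("content-disposition", "")
    for part in disposition.split(";"):
        part = part.strip()
        if part.lower().startswith("filename="):
            return part[len("filename="):].strip('"')

    # Fallback: last path segment of the URL, e.g. "robots.txt".
    name = posixpath.basename(unquote(urlparse(url).path))
    return name or None
```

With this sketch, a URL such as https://www.google.com/robots.txt would yield robots.txt even though no content-disposition header is present, while a response that does carry the header would still use the name it provides.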

mbjoseph commented 5 years ago

An update: I have now noticed that I can get this to work for data on figshare by providing the URL without the file extension, e.g.,

d.get_data(url='https://ndownloader.figshare.com/files/7010681') # works

rather than

d.get_data(url='https://ndownloader.figshare.com/files/7010681.zip') # does not work

I'm going to close this, and will update the docs & tests to reflect this intended usage.