fsspec / filesystem_spec

A specification that python filesystems should adhere to.
BSD 3-Clause "New" or "Revised" License

FSSpec 0.8.4 is unable to read data from some http & https sites with aiohttp 3.7.0 where it succeeds with 3.6.2 #458

Closed holdenk closed 4 years ago

holdenk commented 4 years ago

Here are the repro steps:

import fsspec

# Open the remote gzip file lazily; the HTTP request happens on entering the context
of = fsspec.open('https://data.githubarchive.org/2020-10-01-2.json.gz', compression='gzip')
with of as f:
    print(f.readline())

I get the exception:

Traceback (most recent call last):
  File "", line 1, in
  File "/opt/conda/lib/python3.8/site-packages/fsspec/core.py", line 102, in __enter__
    f = self.fs.open(self.path, mode=mode)
  File "/opt/conda/lib/python3.8/site-packages/fsspec/spec.py", line 897, in open
    f = self._open(
  File "/opt/conda/lib/python3.8/site-packages/fsspec/implementations/http.py", line 216, in _open
    size = self.size(path)
  File "/opt/conda/lib/python3.8/site-packages/fsspec/spec.py", line 608, in size
    return self.info(path).get("size", None)
  File "/opt/conda/lib/python3.8/site-packages/fsspec/asyn.py", line 121, in wrapper
    return maybe_sync(func, self, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/fsspec/asyn.py", line 100, in maybe_sync
    return sync(loop, func, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/fsspec/asyn.py", line 71, in sync
    raise exc.with_traceback(tb)
  File "/opt/conda/lib/python3.8/site-packages/fsspec/asyn.py", line 55, in f
    result[0] = await future
  File "/opt/conda/lib/python3.8/site-packages/fsspec/implementations/http.py", line 262, in _info
    raise FileNotFoundError(url)
FileNotFoundError: https://data.githubarchive.org/2020-10-01-2.json.gz

Note if I download the file with wget (e.g. wget https://data.githubarchive.org/2020-10-01-2.json.gz) it succeeds.

From running wget on this it seems that there is a 301 redirect to https://data.gharchive.org/2020-10-01-2.json.gz , but even manually following that redirect I get the same error.

My aiohttp version is 3.7.0. If I change my aiohttp version to 3.6.2, the read succeeds. I recreated this issue on both ARM64 and AMD64.

martindurant commented 4 years ago

Are you able to recreate the aiohttp call alone, perhaps from logging? If that also errors, then, we may need to make an issue on the aiohttp tracker.
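One way to try the aiohttp call alone is sketched below. It is only an approximation: the helper name `probe` and the returned fields are illustrative, and it does not claim to reproduce the exact request fsspec issues internally. It sends a single request with aiohttp, follows redirects, and reports the status, the final URL, and the redirect chain, so the result can be compared between aiohttp 3.6.2 and 3.7.0.

```python
import asyncio

import aiohttp


async def probe(url, method="HEAD"):
    """Issue one request with aiohttp alone and summarize the response.

    `probe` is a hypothetical helper for debugging, not part of fsspec.
    """
    async with aiohttp.ClientSession() as session:
        # allow_redirects=True so a 301 is followed, as wget does
        async with session.request(method, url, allow_redirects=True) as resp:
            return {
                "status": resp.status,
                "final_url": str(resp.url),
                # statuses of the intermediate responses, e.g. a 301 hop
                "redirects": [r.status for r in resp.history],
                "content_length": resp.headers.get("Content-Length"),
            }


# Example usage (needs network access):
# print(asyncio.run(probe("https://data.githubarchive.org/2020-10-01-2.json.gz")))
# print(asyncio.run(probe("https://data.githubarchive.org/2020-10-01-2.json.gz", method="GET")))
```

If this call also fails under 3.7.0 but works under 3.6.2, that would point at aiohttp rather than fsspec.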


holdenk commented 4 years ago

I haven't tried. How would I raise the log level?


martindurant commented 4 years ago

There is a logger called "fsspec.http"; you would need to configure it yourself. Or else, you can use pdb to go up the stack and see what call fsspec is making.
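A minimal sketch of configuring that logger with the standard `logging` module follows. The logger name "fsspec.http" comes from the comment above; the handler and format string are arbitrary choices, and which messages actually appear depends on the fsspec version installed.

```python
import logging

# Fetch the logger by the name mentioned above and let DEBUG messages through
logger = logging.getLogger("fsspec.http")
logger.setLevel(logging.DEBUG)

# Send its records to stderr with a timestamped format (format is an arbitrary choice)
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
)
logger.addHandler(handler)
```

Run this before the `fsspec.open(...)` call so that any records emitted while opening the URL are printed.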

holdenk commented 4 years ago

Cool, I’ll try next weekend (for now I’m just staying on my downgraded aiohttp).

martindurant commented 4 years ago

aiohttp 3.7.0 was very quickly followed by 3.7.1 - perhaps the problem fixed itself?

EDIT: I tried your code snippet, and it seems to work on 3.7.1

holdenk commented 4 years ago

That's great to hear :) I'll close the issue :)