fsspec / filesystem_spec

A specification that python filesystems should adhere to.
BSD 3-Clause "New" or "Revised" License
995 stars 351 forks source link

LocalFileSystem incompatibility / corruption when used to read FITS files #613

Open joseph-long opened 3 years ago

joseph-long commented 3 years ago

When opening the same local file via open() and via fsspec.open() I get garbled floating point values from the latter when the library loads it. Confusingly, the ascii header of the file seems to come through fine, so it's at least partially loading the file... (I have implemented an iRODS interface through fsspec, and don't see the issue when opening files from iRODS, so as far as I know this is specific to local paths with fsspec.)

When read through fsspec instead of regular open(), the astropy.io.fits.open function is able to parse the header (a sort of ascii preamble) but the associated data are garbled. I'm trying to discourage astropy.io.fits from trying to memory map or seek or anything (for compatibility with other fsspec backends), hence the mode='readonly', memmap=False in the fits.open call.

I'm totally stumped, but I can reproduce with a short script, included below. It's possible this is an issue that Astropy needs to fix, but I can't figure out what would conceivably be different about the file proxy object that would let it read the header and not the data...

To reproduce:

  1. Save this script as repro.py
import numpy as np
import fsspec
from astropy.io import fits

def main():
    data = np.arange(10, dtype=float)
    fits.PrimaryHDU(data).writeto('ex.fits', overwrite=True)
    print('\nUsing builtin open()')
    with open('ex.fits', 'rb') as fh:
        hdul = fits.open(fh, mode='readonly', memmap=False)
        hdul.info()
        print(hdul[0].data)

    print('\nUsing fsspec.open() with local path')
    with fsspec.open('ex.fits', 'rb') as fh:
        hdul = fits.open(fh, mode='readonly', memmap=False)
        hdul.info()
        print(hdul[0].data)

if __name__ == "__main__":
    main()
  1. Create a virtual env

    python -m venv env
    source env/bin/activate
    pip install astropy fsspec
  2. Compare output from two approaches to open ex.fits, e.g.

$ python repro.py

Using builtin open()
Filename: ex.fits
No.    Name      Ver    Type      Cards   Dimensions   Format
  0  PRIMARY       1 PrimaryHDU       5   (10,)   float64
[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]

Using fsspec.open() with local path
Filename: /Users/josephlong/devel/repro/ex.fits
No.    Name      Ver    Type      Cards   Dimensions   Format
  0  PRIMARY       1 PrimaryHDU       5   (10,)   float64
[1.64931512e+093 2.86446248e-014 6.01347002e-154 6.01347002e-154
 1.15963942e-152 3.59533712e+246 2.68090142e+092 1.07357647e+243
 6.01347002e-154 6.01347002e-154]

Here's the same from my own test with iRODS / the irods_fsspec backend

Using irods_fsspec through fsspec
Filename: <class '_io.BufferedRandom'>
No.    Name      Ver    Type      Cards   Dimensions   Format
  0  PRIMARY       1 PrimaryHDU       5   (10,)   float64
[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
martindurant commented 3 years ago

I am investigating. Certainly this was known to work in the past, see https://github.com/intake/intake-astro

martindurant commented 3 years ago

This is fixed by the following change

--- a/fsspec/implementations/local.py
+++ b/fsspec/implementations/local.py
@@ -292,6 +292,8 @@ class LocalFileOpener(io.IOBase):
         return self.f.__iter__()

     def __getattr__(self, item):
+        if item == "raw":
+            raise AttributeError
         return getattr(self.f, item)

astropy tries to get the .raw attribute of the file object, presumably worried about buffering or memory mapping outside of its control, but the seek calls are now applied to the wrapper object and the underlying object separately.

I don't know if there's a good reason for hiding the raw attribute in general like this, or really why astropy is trying to use it.

joseph-long commented 3 years ago

Interesting. Thanks for investigating so quickly! Do you think you'll incorporate that change? Or should I subclass LocalFileOpener/System and make that change for my application?

Only hint I can find as to why Astropy does this is https://github.com/astropy/astropy/blob/main/astropy/io/fits/util.py#L382-L392

def fileobj_open(filename, mode):
    """
    A wrapper around the `open()` builtin.

    This exists because `open()` returns an `io.BufferedReader` by default.
    This is bad, because `io.BufferedReader` doesn't support random access,
    which we need in some cases.  We must call open with buffering=0 to get
    a raw random-access file reader.
    """

    return open(filename, mode, buffering=0)

But fsspec takes pains to support seek, if I understand correctly?

martindurant commented 3 years ago

The astropy comment doesn't make much sense to me, binary-mode files are always seekable, io.BufferedReader.seek exists.

I think it's numpy's fromfile method that's probably at fault, and astropy uses .raw to determine if the file is appropriate to pass to this function or not (use frombuffer(f.read()) instead). So in the first version, numpy was doing something deeper, and the seek location between the low-level file handle and the buffered wrapper got mixed up. I can't tell where or why. It should be fine to extract the raw file handle for those that want to - I'd rather not forbid it.

joseph-long commented 3 years ago

Interesting. Thanks for looking in to it. How would you suggest working around this? Can i register my own subclass as the LocalFileSystem fallback, or should i monkeypatch, or something else?

martindurant commented 3 years ago

I think it should be raised as an issue with astropy. I have tried, and numpy.fromfile works just fine with (buffered) file objects created by fsspec - so I don't know why the check is necessary, and fsspec shouldn't need to block it. (not that I understand how the file position pointers became out of sync)

martindurant commented 3 years ago

It appears np.fromfile calls os.fspath on the file. I cannot tell where from (this happens in C code), but it suggests that maybe numpy if re-opening the file for its own use, and that's why the location pointer doesn't update.