fsspec / filesystem_spec

A specification that python filesystems should adhere to.
BSD 3-Clause "New" or "Revised" License

numpy.fromfile from https #39

Closed youngsaj closed 5 years ago

youngsaj commented 5 years ago

Question: I have one final hurdle in getting https files read. The metadata is read correctly, but np.fromfile doesn't like the fid. It's not clear to me whether readbytes is a close replacement. Since we end up looping with np.fromfile and fid.seek, I suspect there's now a cleaner way to subsample with fsspec (seek to where we need). What's the best replacement for np.fromfile when using fsspec?

martindurant commented 5 years ago

It seems to work for me:

import numpy as np
import fsspec

x = np.ones(10)
x.tofile('mydata')
file = fsspec.open('file:///Users/mdurant/mydata', 'rb')
with file as f:
    x = np.fromfile(f)

youngsaj commented 5 years ago

How about with https?

martindurant commented 5 years ago

You are right, it seems that fromfile is explicitly looking for a local file handle. I am testing np.load now, please stand by.

youngsaj commented 5 years ago

Thanks! I'll catch up tomorrow. If that can ultimately work, it would be awesome!

martindurant commented 5 years ago

On latest master, the following works:

In [1]: import numpy as np

In [2]: import fsspec

In [3]: url = 'https://s3.amazonaws.com/MDtemp/mydata.npy?AWSAccessKeyId=AKIAJNQATRCEFOYOJKBA&Signature=0o3dyAoPO9IY0V%2Bs4J82Qzh0xVg%3D&Expires=1555969053'

In [4]: with fsspec.open(url, 'rb') as f:
   ...:     x = np.load(f)
   ...:

where the file was created with np.save. Note that the token included in the URL here will have expired by the time you see this, but it should work for your circumstance.
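A minimal sketch of the np.save / np.load round trip through a generic file-like object. It assumes io.BytesIO as a stand-in for the handle that fsspec.open yields, so it runs without network access or a signed URL; the variable names are illustrative only.

```python
import io

import numpy as np

x = np.ones(10)

# Stand-in for a remote file: an in-memory, seekable binary buffer.
buf = io.BytesIO()
np.save(buf, x)   # writes the .npy header followed by the array data
buf.seek(0)       # rewind so the reader starts at the header

# np.load only needs a file-like object, not an OS-level file handle.
y = np.load(buf)
```

The same call should work on a handle opened with fsspec.open(url, 'rb'), provided the remote file was written with np.save.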

youngsaj commented 5 years ago

I'm confused about what to use. I have a specialized binary file (a custom format I have a reader for). I can do

import fsspec
fs = fsspec.filesystem('https')
f = fs.open(urlpath)

f.read(numbytes)
f.seek(numbytes)

but as soon as I do np.fromfile(f, .....

it doesn't like it. Am I right that np.fromfile is not allowed for HTTPFile?

I'm getting an odd Flush on closed file (line 859 in spec.py). Or maybe I'm not handling f properly and it's not being saved properly after fs.open? Or I'm not using the API quite right.

martindurant commented 5 years ago

> np.fromfile is not allowed for HTTPfile

Basically, it cannot. Because numpy has the ability to read memmap files (in the C code), np.fromfile must have a real operating-system-level file handle, which can only exist for a local file. Many functions in the Python ecosystem, such as np.load, can read from general file-like objects, but not this one. You will, I think, have to load the bytes using f.read() and use np.frombuffer.
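A minimal sketch of the f.read() plus np.frombuffer combination suggested here. It assumes io.BytesIO as a stand-in for a remote file handle, and the offsets and dtype are made up for illustration.

```python
import io

import numpy as np

original = np.arange(10, dtype="float64")
f = io.BytesIO(original.tobytes())  # stand-in for an HTTP file handle

itemsize = np.dtype("float64").itemsize  # 8 bytes per value
f.seek(2 * itemsize)                     # skip the first two values
raw = f.read(3 * itemsize)               # read three values' worth of bytes

# Interpret the raw bytes as float64 values without touching the file again.
values = np.frombuffer(raw, dtype="float64")
```

An array built by np.frombuffer from a bytes object is read-only; call values.copy() if it needs to be modified in place.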

> odd Flush on closed file

That's worth investigating, since flush should never be called on a read-mode file.

youngsaj commented 5 years ago

I'm finding issues with the closed file with this situation: print('opened:',self.fid.readable()) pdb.set_trace() self.read_raw_fun = lambda dim1range, dim2range: \ read_bip(self.fid, datasize, data_offset, datatype, bands, swapbytes, dim1range, dim2range, False)

In read_bip, the first thing I check is fid.readable(), and it throws ValueError: I/O operation on closed file

If I reopen within the function it works fine. Either there is something about with that I'm not clear about, or something unique to fsspec?

martindurant commented 5 years ago

Hm, neither the previous version of HTTPFile nor the current one raises that exception, so I'm not sure what you might be doing (this is not the same exception you referred to above). The code you are showing doesn't actually parse. Is self.fid an HTTPFile?

youngsaj commented 5 years ago

I was trying it with just a standard file to iron out the np.frombuffer first.

martindurant commented 5 years ago

I meant frombuffer in combination with bytes, not a file (i.e., what you would have following f.read())

youngsaj commented 5 years ago

Now I'm just at the point I need to iron out the byte counts and datatypes. Thanks!

martindurant commented 5 years ago

Probably we can close this issue, then?

youngsaj commented 5 years ago

I have that working fine with a standard file on local file system. I am doing sliced reads with a step of 10.

However, when I supply a urlpath and it uses HTTPFile, I get to the fid.read here:

npbuff = fid.read(np.uint64(bands) * np.uint64(bands) * dim2size)
single_line = np.frombuffer(npbuff,dtype=datatype,count=np.uint64(bands)*dim2size)

It tries to do the read, but I get this remaining stack trace:

--> 197             npbuff = fid.read(np.uint64(bands) * np.uint64(bands) * dim2size)
    198             single_line = np.frombuffer(npbuff,dtype=datatype,count=np.uint64(bands)*dim2size)
    199             for j in range(bands):  # Pixel intervleaved

C:\Apps\Anaconda3\envs\pyviz_dev\lib\site-packages\fsspec-0.2.0+21.gcd2e2fc-py3.7.egg\fsspec\implementations\http.py in read(self, length)
    195         if length == 0:
    196             return self._fetch_all()
--> 197         return super().read(length)
    198
    199     def _fetch_all(self):

C:\Apps\Anaconda3\envs\pyviz_dev\lib\site-packages\fsspec-0.2.0+21.gcd2e2fc-py3.7.egg\fsspec\spec.py in read(self, length)
    954             self._fetch(self.loc, self.loc + length)
    955         out = self.cache[self.loc - self.start:
--> 956                          self.loc - self.start + length]
    957         self.loc += len(out)
    958         if self.trim:

TypeError: slice indices must be integers or None or have an __index__ method

martindurant commented 5 years ago

Can you debug at that point to find out what self.loc, self.start, and length are? Better would be to provide the file and the minimum code required to expose the problem.

youngsaj commented 5 years ago

The file is easy, publicly available via https: 'https://six-library.s3.amazonaws.com/sicd_example_RMA_RGZERO_RE16I_IM16I.nitf'

I'll try to pull out a small set of code to expose the issue. It reads the first buffer and parses it with frombuffer, then seeks to the next location to read; the very next read has the issue.

Let me put together a small bit of code to reproduce.

youngsaj commented 5 years ago

import pdb

import fsspec
import numpy as np


def read_bip(input_file, datasize, offset=0, datatype='float32', bands=1,
             swapbytes=False, dim1range=None, dim2range=None, usenpfile=True):
    """Generic function for reading data band interleaved by pixel.

    Data is read directly from disk with no transformation.  The most quickly
    increasing dimension on disk will be the most quickly increasing dimension in
    the array in memory.  No assumptions are made as to what the bands
    represent (complex i/q, etc.)

    INPUTS:
       input_file: Path or URL of the file to read.  It is opened with
          fsspec.open() for reading as binary.
       datasize: 1x2 tuple/list (number of elements in first dimension, number
          of elements in the second dimension).  In keeping with the Python
          standard, the second dimension is the more quickly increasing as
          written in the file.
       offset: Index (in bytes) from the beginning of the file to the beginning
          of the data.  Default is 0 (beginning of file).
       datatype: Data type specifying binary data precision.  Default is
          dtype('float32').
       bands: Number of bands in data.  Default is 1.
       swapbytes: Whether the "endianness" of the data matches the "endianness"
          of our file reads.  Default is False.
       dim1range: ([start, stop,] step).  Similar syntax as Python range() or
          NumPy arange() functions.  This is the range of data to read in the
          less quickly increasing dimension (as written in the file).  Default
          is entire range.
       dim2range: ([start, stop,] step).  Similar syntax as Python range() or
          NumPy arange() functions.  This is the range of data to read in the
          more quickly increasing dimension (as written in the file).  Default
          is entire range.

    OUTPUT: Array of complex data values read from file.

    """

    # Check input arguments
#    datasize, dim1range, dim2range = chipper.check_args(
#        datasize, dim1range, dim2range)
    offset = np.array(offset, dtype='uint64')
    if offset.size == 1:   # Second term of offset allows for line prefix/suffix
        offset = np.append(offset, np.array(0, dtype='uint64'))
    # Determine element size
    datatype = np.dtype(datatype)  # Allows caller to pass dtype or string
    elementsize = np.uint64(datatype.itemsize * bands)
    # Read data (region of interest only)
    with fsspec.open(input_file,'rb') as fid:
        print('readable:',fid.readable())
        fid.seek(offset[0] +  # Beginning of data
                 (dim1range[0] * (datasize[1] * elementsize + offset[1])) +  # Skip to first row
                 (dim2range[0] * elementsize))  # Skip to first column
        dim2size = dim2range[1] - dim2range[0]
        lendim1range = len(range(*dim1range))
        dataout = np.zeros((bands, lendim1range, len(range(*dim2range))), datatype)
        # NOTE: MATLAB allows a "skip" parameter in its fread function.  This allows
        # one to do very fast reads when subsample equals 1 using only a single line
        # of code-- no loops!  Not sure of an equivalent way to do this in Python,
        # so we have to use "for" loops-- yuck!
        print('np.uint64(bands) * dim2size:',np.uint64(bands),' * ',dim2size,'=',np.uint64(bands) * dim2size)
        print('dim1range,dim2range,datasize:',dim1range,dim2range,datasize)
        for i in range(lendim1range):
            if(i>= lendim1range-5): print('i=',i)
            #single_line = np.fromfile(fid, datatype, np.uint64(bands) * dim2size)
            pdb.set_trace()
            npbuff = fid.read(np.uint64(bands)* np.uint64(bands) * dim2size)
            single_line = np.frombuffer(npbuff,dtype=datatype,count=np.uint64(bands)*dim2size)
            for j in range(bands):  # Pixel intervleaved
                dataout[j, i, :] = single_line[j::dim2range[2]*np.uint64(bands)]
            fid.seek(((datasize[1] * elementsize) + offset[1]) * (dim1range[2] - np.uint64(1)) +  # Skip unread rows
                     ((datasize[1] - dim2size) * elementsize) + offset[1], 1)  # Skip to beginning of dim2range
    if swapbytes:
        dataout.byteswap(True)
    return dataout

urlpath = 'https://six-library.s3.amazonaws.com/sicd_example_RMA_RGZERO_RE16I_IM16I.nitf'
dim1range=[0,9504,10]
dim2range=[0,8330,10]
read_bip(input_file=urlpath, datasize=[9504, 8330], offset=[929],
         datatype=np.dtype('int16'), bands=2, swapbytes=True,
         dim1range=dim1range, dim2range=dim2range)

martindurant commented 5 years ago

OK, so I've updated master to force the inputs for seek and read to ints. The docstrings say that the inputs must be ints. In general, you are using np.uint64 to cast your numbers to integers, but this doesn't work; you should use plain int instead.
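A minimal sketch of a row-subsampling reader that keeps every offset and byte count as a plain Python int. The helper read_rows, its parameters, and the in-memory test file are hypothetical and not part of fsspec or of the code in this thread. The first lines illustrate one likely reason np.uint64 causes trouble (an assumption, not confirmed above): NumPy promotes mixed signed and unsigned 64-bit integers to float64, which cannot be used as a slice index.

```python
import io

import numpy as np

# Mixing signed and unsigned 64-bit NumPy integers promotes to float64.
mixed = np.int64(2) * np.uint64(3)


def read_rows(fid, n_rows, n_cols, dtype, offset=0, row_step=1):
    """Read every `row_step`-th row of a row-major 2-D array from `fid`."""
    dtype = np.dtype(dtype)
    row_bytes = int(n_cols) * int(dtype.itemsize)  # plain int, not np.uint64
    rows = []
    for r in range(0, int(n_rows), int(row_step)):
        fid.seek(int(offset) + r * row_bytes)  # absolute seek with a Python int
        buf = fid.read(row_bytes)
        rows.append(np.frombuffer(buf, dtype=dtype, count=int(n_cols)))
    return np.stack(rows)


# Stand-in for a remote file: a 3-byte header followed by a 6x4 int16 array.
data = np.arange(24, dtype="int16").reshape(6, 4)
fid = io.BytesIO(b"HDR" + data.tobytes())
subset = read_rows(fid, n_rows=6, n_cols=4, dtype="int16", offset=3, row_step=2)
```

The sketch uses absolute seeks computed from the row index rather than relative seeks, which keeps the arithmetic in one place and makes each read independent of the previous one.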

youngsaj commented 5 years ago

Using int fixes that! Can close it now.

martindurant commented 5 years ago

excellent