[Closed] youngsaj closed this issue 5 years ago
It seems to work for me:

```python
x = np.ones(10)
x.tofile('mydata')
file = fsspec.open('file:///Users/mdurant/mydata', 'rb')
with file as f:
    x = np.fromfile(f)
```
How about with https?
You are right, it seems that `np.fromfile` is explicitly looking for a local file handle. I am testing `np.load` now, please stand by.
Thanks! I'll catch up tomorrow, if that can ultimately work it would be awesome!
On latest master, the following works:

```python
In [1]: import numpy as np

In [2]: import fsspec

In [3]: url = 'https://s3.amazonaws.com/MDtemp/mydata.npy?AWSAccessKeyId=AKIAJNQATRCEFOYOJKBA&Signature=0o3dyAoPO9IY0V%2Bs4J82Qzh0xVg%3D&Expires=1555969053'

In [4]: with fsspec.open(url, 'rb') as f:
   ...:     x = np.load(f)
   ...:
```

where the file was created with `np.save`. Note that the token included in the URL here will have expired by the time you see this, but it should work for your circumstance.
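The same `np.save`/`np.load` round trip can be exercised end-to-end locally, which may be easier to test (a sketch, using a temporary directory and fsspec's `file://` protocol in place of the https URL above):

```python
import os
import tempfile

import numpy as np
import fsspec

# Round-trip: np.save to a local .npy file, then read it back through
# fsspec's local-file protocol, mirroring the https example above.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'mydata.npy')
    np.save(path, np.arange(5.0))
    with fsspec.open('file://' + path, 'rb') as f:
        x = np.load(f)
```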
I'm confused about what to use. I have a specialized binary file (a custom format I have a reader for). I can do:

```python
import fsspec
fs = fsspec.filesystem('https')
f = fs.open(urlpath)
f.read(numbytes)
f.seek(numbytes)
```

but as soon as I do `np.fromfile(f, ...)` it doesn't like it. Am I right in thinking that `np.fromfile` is not allowed for an HTTPFile?
I'm also getting an odd "Flush on closed file" (line 859 in spec.py). Or maybe I'm not handling `f` properly and it's not being kept open properly after `fs.open`? Or I'm not using the API quite right.
> np.fromfile is not allowed for HTTPFile

Basically, it cannot work: because numpy has the ability to read memmap files (in the C code), `np.fromfile` must have a real operating-system-level file handle, which can only exist for a local file. Many functions in the python ecosystem, such as `np.load`, can read from general file-like objects, but not this one. You will, I think, have to load the bytes using `f.read()` and use `np.frombuffer`.
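A minimal sketch of that read-then-decode pattern (the `payload` bytes here just stand in for the result of `f.read(n)` on an HTTPFile):

```python
import numpy as np

# Stand-in for the raw bytes returned by f.read(n) on an fsspec
# file-like object (here: six little-endian int16 values).
payload = np.arange(6, dtype='int16').tobytes()

# Decode the bytes in memory instead of calling np.fromfile on the handle.
x = np.frombuffer(payload, dtype='int16', count=6)
```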
> odd Flush on closed file

That's worth investigating, since flush should never be called on a read-mode file.
I'm finding issues with the closed file in this situation:

```python
print('opened:', self.fid.readable())
pdb.set_trace()
self.read_raw_fun = lambda dim1range, dim2range: \
    read_bip(self.fid, datasize, data_offset, datatype, bands,
             swapbytes, dim1range, dim2range, False)
```

In `read_bip` the first thing I check is `fid.readable()`, and it throws:

```
ValueError: I/O operation on closed file
```

If I reopen the file within the function it works fine. Either there's something about `with` I'm not clear on, or something unique about fsspec?
Hm, neither the previous version of HTTPFile nor the current one raises that exception; I'm not sure what you might be doing (this is not the same exception you referred to above). The code you are showing doesn't actually parse. Is `self.fid` an HTTPFile?
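One guess at how this can happen, sketched with an `io.BytesIO` stand-in for the fsspec file (hypothetical; it assumes the handle was opened in a `with` block): the file is closed as soon as the `with` block exits, so a lambda that captured it and runs later sees a closed file.

```python
import io

# Hypothetical sketch: a handle captured by a lambda inside a `with`
# block, but only called after the block has exited and closed the file.
with io.BytesIO(b'abcd') as fid:
    read_later = lambda: fid.read(1)

try:
    read_later()          # the file is already closed here
    error_message = None
except ValueError as err:
    error_message = str(err)  # "I/O operation on closed file."
```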
I was trying it with just a standard file first, to iron out the `np.frombuffer` part.

I meant `frombuffer` in combination with bytes, not a file (i.e., what you would have following `f.read()`).
Now I'm just at the point I need to iron out the byte counts and datatypes. Thanks!
Probably we can close this issue, then?
I have that working fine with a standard file on the local file system. I am doing sliced reads with a step of 10. However, when I supply a urlpath now, so that it uses HTTPFile, it gets as far as this read:

```python
npbuff = fid.read(np.uint64(bands) * np.uint64(bands) * dim2size)
single_line = np.frombuffer(npbuff, dtype=datatype, count=np.uint64(bands) * dim2size)
```

It tries to do the read, but I get this remaining stack trace:

```
-> 197     npbuff = fid.read(np.uint64(bands) * np.uint64(bands) * dim2size)
   198     single_line = np.frombuffer(npbuff, dtype=datatype, count=np.uint64(bands)*dim2size)
   199     for j in range(bands):  # Pixel interleaved

C:\Apps\Anaconda3\envs\pyviz_dev\lib\site-packages\fsspec-0.2.0+21.gcd2e2fc-py3.7.egg\fsspec\implementations\http.py in read(self, length)
    195         if length == 0:
    196             return self._fetch_all()
--> 197         return super().read(length)
    198
    199     def _fetch_all(self):

C:\Apps\Anaconda3\envs\pyviz_dev\lib\site-packages\fsspec-0.2.0+21.gcd2e2fc-py3.7.egg\fsspec\spec.py in read(self, length)
    954         self._fetch(self.loc, self.loc + length)
    955         out = self.cache[self.loc - self.start:
--> 956                          self.loc - self.start + length]
    957         self.loc += len(out)
    958         if self.trim:

TypeError: slice indices must be integers or None or have an index method
```
Can you debug at that point to find out what `self.loc`, `self.start`, and `length` are? Better still would be to provide the file and the minimum code required to expose the problem.
The file is easy; it is publicly available via https: 'https://six-library.s3.amazonaws.com/sicd_example_RMA_RGZERO_RE16I_IM16I.nitf'

I'll try to pull out a small set of code to expose the issue. It reads the first buffer and runs `frombuffer` on it, then does a seek to the next location to read; the very next read hits the issue. Let me put together a small bit of code to reproduce.
```python
import pdb

import numpy as np
import fsspec


def read_bip(input_file, datasize, offset=0, datatype='float32', bands=1,
             swapbytes=False, dim1range=None, dim2range=None, usenpfile=True):
    """Generic function for reading data band interleaved by pixel.

    Data is read directly from disk with no transformation. The most quickly
    increasing dimension on disk will be the most quickly increasing dimension
    in the array in memory. No assumptions are made as to what the bands
    represent (complex i/q, etc.)

    INPUTS:
        input_file: Filename or URL of a file to open for reading as binary.
        datasize: 1x2 tuple/list (number of elements in first dimension,
            number of elements in the second dimension). In keeping with the
            Python standard, the second dimension is the more quickly
            increasing as written in the file.
        offset: Index (in bytes) from the beginning of the file to the
            beginning of the data. Default is 0 (beginning of file).
        datatype: Data type specifying binary data precision. Default is
            dtype('float32').
        bands: Number of bands in data. Default is 1.
        swapbytes: Whether the "endianness" of the data matches the
            "endianness" of our file reads. Default is False.
        dim1range: ([start, stop,] step). Similar syntax as Python range() or
            NumPy arange() functions. This is the range of data to read in the
            less quickly increasing dimension (as written in the file).
            Default is the entire range.
        dim2range: ([start, stop,] step). Similar syntax as Python range() or
            NumPy arange() functions. This is the range of data to read in the
            more quickly increasing dimension (as written in the file).
            Default is the entire range.

    OUTPUT: Array of complex data values read from file.
    """
    # Check input arguments
    # datasize, dim1range, dim2range = chipper.check_args(
    #     datasize, dim1range, dim2range)
    offset = np.array(offset, dtype='uint64')
    if offset.size == 1:  # Second term of offset allows for line prefix/suffix
        offset = np.append(offset, np.array(0, dtype='uint64'))
    # Determine element size
    datatype = np.dtype(datatype)  # Allows caller to pass dtype or string
    elementsize = np.uint64(datatype.itemsize * bands)
    # Read data (region of interest only)
    with fsspec.open(input_file, 'rb') as fid:
        print('readable:', fid.readable())
        fid.seek(offset[0] +  # Beginning of data
                 (dim1range[0] * (datasize[1] * elementsize + offset[1])) +  # Skip to first row
                 (dim2range[0] * elementsize))  # Skip to first column
        dim2size = dim2range[1] - dim2range[0]
        lendim1range = len(range(*dim1range))
        dataout = np.zeros((bands, lendim1range, len(range(*dim2range))), datatype)
        # NOTE: MATLAB allows a "skip" parameter in its fread function. This
        # allows one to do very fast reads when subsample equals 1 using only
        # a single line of code -- no loops! Not sure of an equivalent way to
        # do this in Python, so we have to use "for" loops -- yuck!
        print('np.uint64(bands) * dim2size:', np.uint64(bands), ' * ',
              dim2size, '=', np.uint64(bands) * dim2size)
        print('dim1range,dim2range,datasize:', dim1range, dim2range, datasize)
        for i in range(lendim1range):
            if i >= lendim1range - 5:
                print('i=', i)
            # single_line = np.fromfile(fid, datatype, np.uint64(bands) * dim2size)
            pdb.set_trace()
            npbuff = fid.read(np.uint64(bands) * np.uint64(bands) * dim2size)
            single_line = np.frombuffer(npbuff, dtype=datatype,
                                        count=np.uint64(bands) * dim2size)
            for j in range(bands):  # Pixel interleaved
                dataout[j, i, :] = single_line[j::dim2range[2] * np.uint64(bands)]
            fid.seek(((datasize[1] * elementsize) + offset[1]) * (dim1range[2] - np.uint64(1)) +  # Skip unread rows
                     ((datasize[1] - dim2size) * elementsize) + offset[1], 1)  # Skip to beginning of dim2range
    if swapbytes:
        dataout.byteswap(True)
    return dataout


urlpath = 'https://six-library.s3.amazonaws.com/sicd_example_RMA_RGZERO_RE16I_IM16I.nitf'
dim1range = [0, 9504, 10]
dim2range = [0, 8330, 10]
read_bip(input_file=urlpath, datasize=[9504, 8330], offset=[929],
         datatype=np.dtype('int16'), bands=2, swapbytes=True,
         dim1range=dim1range, dim2range=dim2range)
```
OK, so I've updated master to force the inputs (for seek and read) to ints. The docstrings say that the inputs must be ints. In general, you are using `np.uint64` to cast your numbers to integer, but this doesn't work; you should use plain `int` instead.
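As an illustration of why (a sketch; the exact promotion behaviour depends on your NumPy version): arithmetic that mixes a NumPy unsigned scalar with a plain Python int can promote to `float64` under NumPy's legacy casting rules, and a float is rejected as a slice index or byte count, whereas plain `int` always works.

```python
import numpy as np

bands, dim2size = 2, 833  # made-up values for illustration

# A float64 -- which uint64-mixed arithmetic may produce under legacy
# promotion rules -- cannot be used as a slice index:
try:
    [0, 1, 2, 3][:np.float64(2)]
    slice_error = None
except TypeError as err:
    slice_error = str(err)  # "slice indices must be integers ..."

# Casting to plain int before seek/read avoids the problem entirely:
nbytes = int(bands) * int(dim2size) * np.dtype('int16').itemsize
```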
Using `int` fixes that! You can close this now.
excellent
Question: I have one final hurdle in getting https files read. The metadata gets read correctly; it is just that `np.fromfile` doesn't like the fid. It's not clear to me whether `readbytes` is a close replacement. Since we end up looping over `np.fromfile` and `fid.seek`, I suspect there's now a cleaner way, using fsspec, to subsample (seek to where we need). What's the best `np.fromfile` replacement to use with fsspec?
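For what it's worth, one way the subsampling loop could look without `np.fromfile` (a sketch only; `io.BytesIO` stands in for the fsspec file, and the row/step sizes are made up): read one row of raw bytes, decode it with `np.frombuffer`, then use a relative `seek` to skip the unsampled rows.

```python
import io
import numpy as np

dtype = np.dtype('int16')
ncols, nrows, step = 4, 6, 2           # hypothetical layout and step
data = np.arange(ncols * nrows, dtype=dtype)
fid = io.BytesIO(data.tobytes())       # stand-in for fsspec.open(url, 'rb')

rowbytes = ncols * dtype.itemsize      # plain Python ints throughout
rows = []
for i in range(0, nrows, step):
    buf = fid.read(rowbytes)                       # one row of raw bytes
    rows.append(np.frombuffer(buf, dtype=dtype))   # decode in memory
    fid.seek((step - 1) * rowbytes, 1)             # relative seek past skipped rows
out = np.stack(rows)
```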