Closed nguy closed 9 years ago
@nguy This sounds like a great addition to Py-ART.
First, the functionality to read in a single field from a file is available in Py-ART but could be improved. The `field_names` and `exclude_fields` parameters in Py-ART's read functions can be used to select a single field variable, provided that you know beforehand the field names used in the file. That said, for many of the formats supported by Py-ART the current implementation requires reading the entire file even when a single field is specified by these parameters. As such, the speed-up might be less spectacular than expected. For some of the supported formats, reading only a single sweep or a limited number of sweeps might provide a more significant reduction in read time.
Dropping down a layer of abstraction below the Radar class to the underlying file format classes, such as MdvFile, NEXRADLevel2File and SigmetFile, allows for field- and/or sweep-specific reading of the data without additional data access, which would speed things up. Generalizing this functionality to all supported file formats through the Radar object might be challenging for a number of reasons, including that some formats interleave field and sweep data, making access to a single variable difficult. Still, given sufficient time, I think this could be accomplished.
For an implementation of this proposal I would suggest adding an `only_fields` parameter to `pyart.io.read` and the underlying `read_*` functions, which would contain a list of fields to read exclusively, as well as an `nsweeps` parameter which could be either an integer specifying that the first nsweeps should be read, or a list containing the sweep numbers from the file to read.
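For illustration, the proposed interface might look like the sketch below; `only_fields` and `nsweeps` are the parameters suggested above and do not exist in the current Py-ART API:

```
# Hypothetical usage of the proposed parameters -- not current Py-ART API.
import pyart

# Read only the reflectivity field and the first two sweeps.
radar = pyart.io.read('radar_file', only_fields=['reflectivity'], nsweeps=2)

# A list would instead select specific sweep numbers from the file.
radar = pyart.io.read('radar_file', only_fields=['reflectivity'],
                      nsweeps=[0, 3, 5])
```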
Let me add another experience here that may help with implementing this: don't give the user direct access to the array, but rather expose it through a function such as getValue(), as several other libraries do (e.g. Nio, ScientificNetcdf). There is even a way in Python of turning a slice [:] into a function call (via `__getitem__`); I believe this is what NetCDF4 and h5py are doing, but I'm not sure we should.
Besides only reading variables when needed, this had one unexpected advantage that I have already used once: by manipulating that getValue() I was able to add pseudo-variables, i.e. variables that look just like the others but are not present in the file (nor in memory). When needed, such a variable is calculated from the others (e.g. KDP from PHIDP); moreover, the parameters of this calculation can be changed at any time. I'm not saying at all that this kind of manipulation is a good programming idea, but it has been really helpful once.
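The pseudo-variable idea can be sketched in pure Python; the class name and the toy derived-field formula below are purely illustrative, not Py-ART code:

```python
class PseudoVariableFile:
    """Minimal sketch of the getValue() idea: real variables are read on
    demand; pseudo-variables are computed from the others when requested."""

    def __init__(self, stored, derived):
        self._stored = stored      # name -> values "on disk"
        self._derived = derived    # name -> function computing the values

    def getValue(self, name):
        if name in self._stored:
            return self._stored[name]   # read from the "file"
        # pseudo-variable: looks like the others but is computed on demand
        return self._derived[name](self)


# A toy derived field: a scaled copy standing in for e.g. KDP from PHIDP.
data = PseudoVariableFile(
    stored={'PHIDP': [0.0, 2.0, 4.0]},
    derived={'KDP': lambda f: [0.5 * v for v in f.getValue('PHIDP')]},
)
print(data.getValue('KDP'))  # [0.0, 1.0, 2.0]
```

Because the derived function is looked up on every access, swapping it out changes the pseudo-variable's values without touching anything else.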
Finally, there are also some negative points with getValue(). First, backwards compatibility. Second, we would not be able to close the file at the end of read(), but would rather have to keep the open file within Radar; that may have performance consequences that I do not fully understand. Third, if not implemented correctly, it may cause Radar behavior to depend on the original file type (MDV, NetCDF, Sigmet, etc.), which we definitely don't want.
P.S. I fully agree with using NetCDF4, as it is the official library, but are you aware that it is several times slower than its competitors, including scipy.io, ScientificNetcdf, Nio and h5py?
I've looked at some of the other NetCDF Python libraries. From what I remember, NetCDF4 was the only one that seemed to support both NetCDF3 and NetCDF4 files (scipy.io and ScientificNetCdf only support 3, h5py only 4, not sure about PyNIO). In addition, NetCDF4 can automatically apply the scaling, offset and masking of data, which is very helpful. NetCDF4 is also relatively easy to install (it is included in Anaconda) whereas some of the other packages are not. For me at least these benefits made up for any performance issues. That said, I've never found the package overly slow. I have seen that complex slicing operations on NetCDF4 Variables can be slow, so I typically pull all the data into memory using var[:] and then slice the ndarray object.
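The pattern described here, one full read via `var[:]` followed by slicing the in-memory ndarray, can be sketched with a stand-in for a file-backed variable; the `FileVariable` class below is illustrative, not the NetCDF4 implementation:

```python
import numpy as np

# Stand-in for a netCDF4 Variable: such objects implement slicing that is
# similar to, but not identical with, NumPy's indexing.
class FileVariable:
    def __init__(self, values):
        self._values = np.asarray(values)

    def __getitem__(self, key):
        return self._values[key]  # a real library would read from disk here

var = FileVariable(np.arange(12).reshape(3, 4))

# Pull the full array into memory once, then do all slicing on the
# in-memory ndarray, avoiding the slower file-backed slicing paths.
data = var[:]                  # one read of the whole variable
subset = data[::2, [0, 3]]     # complex slicing happens on the ndarray
print(subset.tolist())         # [[0, 3], [8, 11]]
```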
I should say I'm opposed to hiding the variable data behind a `getValue` function call. One of Py-ART's main goals is to make radar data easily available to users, and I think data hiding would hinder this goal.
I've also seen numerous ndarray-like objects (NetCDF4's Variable class, NioVariable, etc.) which all implement slicing in a manner that is similar to, but slightly different from, NumPy's slicing (I've even implemented my own...). What this means for development is that any time you perform a slicing operation you need to test the results against all of the objects that might be sent to the function and write code to support all the edge cases and differences. This can quickly become a heavy burden and a complicated mess, which using ndarray from the get-go avoids. The other way of getting around the different slicing behavior of array-like objects is to cast to ndarray before slicing, but this negates most of the reasons to use these objects.
An alternative solution which would still maintain most backwards compatibility and have the same slicing behavior would be to replace the dictionaries which contain the field data with a class which performs lazy loading of the data key when that key is requested. In this way memory would not be required until data from the field is needed at which time it would be read from the file/object/etc and cached so that further access to that data would not require an additional read step. To the user the Radar objects would still be independent of the source of the data, the magic of loading data from the file would happen behind the scenes. Creating pseudo-variables would still be possible using this method.
For implementing such a solution, I think there is a dict-like abstract base class in the standard library which could be used to create a LazyDictionary class. Functions would need to be written which load data on demand from the various supported files. For some formats these would be straightforward (NetCDF in particular); others would be more difficult. Perhaps making this an optional feature of reading, supported by only some file types, would be reasonable and allow incremental roll-out of this functionality. Pseudo-variables would be implemented as functions which transform data from other variables.
The downside of these lazy-load field dictionaries is that the objects in the Radar.fields attribute would no longer be dict objects, which some code may depend on, although I would not expect much. Also, the underlying file could not be closed until all fields have been read or all references to the Radar object are deleted, which may be hard to detect. Performance may be diminished slightly in cases where reading all fields is more efficient, but in cases where only a few fields are needed performance would be enhanced. The last downside is that this would still require loading data for the entire volume when any data, even a single sweep or ray, from the field is needed. If Sweep and Ray classes are introduced as proposed in #242 then sweep- and ray-aware lazy loading may be possible in this context.
I'll start a branch to test this out with NetCDF files when I have some free time.
I'll weigh in with my two cents. For a lot of workflows, keeping the file open can quickly become burdensome (I often find myself touching 1K+ files). Also, I personally prefer the readers to be more predictable. I prefer to know when I'll incur the IO/time hit for reading a file rather than lazy evaluation, unless it is optional (I can see good use cases for both). The question then becomes, if it is optional, does it add a significant maintenance burden to keep two different paths in the codebase?
As to replacing the fields with classes, I'm personally against it, though for reasons that may be more laziness than technical. In particular, right now it is easy to use `savez` and `savemat` to move data back and forth between different scripts and MATLAB/IDL (by just saving out the entire fields dict). If we move to classes, I imagine that would become significantly more challenging.
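As an illustration, round-tripping the data arrays from a fields-style dict with `savez` might look like the sketch below (the field name and values are made up; `scipy.io.savemat` works analogously for MATLAB/IDL):

```python
import os
import tempfile

import numpy as np

# A Radar.fields-style dict: field name -> dict with 'data' and metadata.
fields = {'reflectivity': {'units': 'dBZ',
                           'data': np.array([1.0, 2.0, 3.0])}}

# Pull out just the data arrays and dump them to a .npz archive.
arrays = {name: d['data'] for name, d in fields.items()}
path = os.path.join(tempfile.mkdtemp(), 'fields.npz')
np.savez(path, **arrays)

# Another script (or a .mat file in MATLAB/IDL) can read them back.
loaded = np.load(path)
print(loaded['reflectivity'].tolist())  # [1.0, 2.0, 3.0]
```

This works because the values behind `Radar.fields` are plain dicts of ndarrays; replacing them with custom classes would require extra conversion steps before saving.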
Also, while I fully support the option to only load certain variables for memory reasons, as Jonathan pointed out, it will not provide any (or only very minimal) actual loading or processing time benefits for the vast majority of the radar formats due to how they interleave the moments.
A better link is to the Collections Abstract Base Classes. The `MutableMapping` class would be a good choice for a LazyDictionary.
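A minimal sketch of such a class, assuming `MutableMapping` from `collections.abc` and a zero-argument loader function per lazy key (Py-ART's eventual implementation may differ):

```python
from collections.abc import MutableMapping


class LazyLoadDict(MutableMapping):
    """Dict-like object whose lazy keys are loaded on first access."""

    def __init__(self, dic, lazy_keys=None):
        self._dic = dict(dic)
        self._lazy = dict(lazy_keys or {})  # key -> zero-argument loader

    def __getitem__(self, key):
        if key in self._lazy:
            # First access: call the loader and cache the result.
            self._dic[key] = self._lazy.pop(key)()
        return self._dic[key]

    def __setitem__(self, key, value):
        self._lazy.pop(key, None)
        self._dic[key] = value

    def __delitem__(self, key):
        self._lazy.pop(key, None)
        del self._dic[key]

    def __iter__(self):
        return iter(set(self._dic) | set(self._lazy))

    def __len__(self):
        return len(set(self._dic) | set(self._lazy))


# Metadata is available immediately; 'data' is read only when requested.
field = LazyLoadDict(
    {'units': 'dBZ'},
    lazy_keys={'data': lambda: [10, 20, 30]},  # stands in for a file read
)
print(sorted(field))   # ['data', 'units'] -- listing keys does not load
print(field['data'])   # [10, 20, 30] -- loaded and cached on first access
```

`MutableMapping` supplies `keys`, `items`, `update`, etc. for free once the five abstract methods above are defined, which keeps the class close to a real dict for existing code.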
PR #286 adds support for delayed loading of field data using a dictionary which lazily loads the data key. For NEXRAD Level 2 Archive files I found that accessing a single field in this manner took about half the time of loading all the field data. A single field from a CfRadial file can be read even faster.
Is something like this desired for other file formats? Adding support for MDV, NEXRAD Level 2 CDM and files read using RSL should be relatively short tasks. Sigmet files are much harder. I don't see a reason for this in NEXRAD Level 3 files and cannot comment on the time required for CSU Chill files.
The MdvFile class already has this functionality, so it's really a short task. I may even add it to PR #277 since I'm already working on that part of the code.
I've just realized this functionality is useless in the case of CfRadial when `ngates_vary=True` (i.e. `ray_n_gates` exists) because `_unpack_variable_gate_field_dic` is executed. I don't think we should solve this, but I wonder if it would be worth adding a note in the documentation.
Added a note in the read_cfradial documentation concerning delayed field loading and ngates_vary in PR #301.
@nguy does the functionality of the `delay_field_loading` parameter sufficiently meet the need for faster reading of a single field, or is there still a desire for additional enhancements?
@jjhelmus I'll look at this shortly. Been crazy busy and we just restructured ARTView so I need a bit to dig into this.
@jjhelmus I just looked at this with an older version of ARTView and it works just fine. I'll check on the newest version soon, but I expect it to work without issue. Thanks!
@nguy just found a bug: `read` is passing the `delay_field_loading` argument directly to the specific read function, and, as not all of them have this option, it is raising an error. I see 3 solutions:

1. make all read functions accept the `delay_field_loading` option, even if not implemented
2. have `pyart.io.read` filter this option out before passing it on
3. add `**kwargs` to the specific read functions

@gamaanderson @jjhelmus, my vote would be for Solution 3. I think Solution 2 would require changes to the `read` function as more file types are added (and I'm sure there are more formats lurking in the shadows out there). Option 1 is reasonable as well, but maybe not as elegant? Just FYI, this came up in ARTView Pull Request #33.
Solution 3 sounds good to me also; it will allow individual `read_foo` functions to provide file-specific keyword arguments that can be safely passed through the `pyart.io.read` function. It does open up the possibility of arguments silently being ignored without informing the user; for example, passing `delay_field_loading` to `read_sigmet` does nothing but the user will not know. A warning could be issued when a non-empty kwargs dictionary is found to relay this information to the user, but I think this would clutter up the code a bit more than is needed.
I'll create a PR adding `**kwargs` to all `read_foo` functions this afternoon unless someone gets to it before me.
agree, @jjhelmus I will get to it
I'm adding the following test just after the docstring of all `read_foo` functions:

```python
# test for non empty kwargs
if not kwargs:
    import warnings
    warnings.warn('Unexpected arguments: %s' % kwargs.keys())
```
Make that a function in io/common.py, that way we can change the warning in one place if needed.
ok,

```python
def test_arguments(dic):
    if not dic:
        import warnings
        warnings.warn('Unexpected arguments: %s' % dic.keys())
```

and

```python
# test for non empty kwargs
test_arguments(kwargs)
```

is that good?
Also, the check should be `if kwargs:`. Empty dictionaries evaluate to False and non-empty ones evaluate to True.
Make it `_test_arguments` so that it is private.
I'm testing whether it is empty, so it must be not False => True
We want the warning to be issued when the dictionary is not empty. When it is empty, no arguments were collected by kwargs. Write a unit test and see how it works; I could be wrong.
yes, of course, you are right!
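With the truthiness check corrected, the helper discussed above might look like this sketch (the `catch_warnings` self-test is only for illustration):

```python
import warnings


def _test_arguments(dic):
    """Warn when unexpected keyword arguments were collected by **kwargs.

    A non-empty dict is truthy, so `if dic:` fires only when extra
    arguments are actually present.
    """
    if dic:
        warnings.warn('Unexpected arguments: %s' % dic.keys())


# Quick check of both cases: empty kwargs stay silent, extras warn.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    _test_arguments({})                    # empty: no warning
    _test_arguments({'bogus_option': 1})   # non-empty: warns
print(len(caught))  # 1
```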
Also adding it to `io_aux`; one less thing to do if they are moved into `io`.
`**kwargs` added to `read_foo` functions in PR #322
I'm assuming with this fix, ARTView should work better and this issue can be closed?
Yes, as @nguy already said he has tested it and so did I, this can be closed.
I just posted the ARTView package, and one thing I can't get around is that it takes a while to read in an entire radar volume file (due to the large amount of data). It would be great if an alternative IO path were allowed so that one could read a single variable and still use the Py-ART mapping. I think this is the bottleneck when redisplaying a file in a window that already exists. Preliminary, basic tests indicate that the IO is the slowest part.
I'd still like to be able to pull over all the other information, including a list of variable names, but just a single variable at a time.
Thanks!