Open davidhassell opened 1 year ago
Not all data are equal! It may be that data for metadata constructs (coordinates, etc.) could/should be treated differently to the field construct data. This is because:
In `f + 2`, the field data changes, but the coordinate data does not.
One thing that can be done now, and ought to be a good idea in any event, is to cache the first, second and last element values after they have been created during a `str` operation, making them available to subsequent calls. These cached values can also be set during `cf.read`, when the file is open and the values are available at almost no cost. Any operation that changes the data would cause the cached values to be removed, forcing them to be recalculated during the next `str`.
This approach does not prejudice what else we decide to do, but does speed up a `str(d)` command by a factor of 1000 (8 µs vs. 8 ms on my laptop) when the cached values are present (which they would be for all data just instantiated from a file).
PR for this to follow.
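A minimal sketch of the caching idea described above, for illustration only. It is not the cf-python implementation: the class `CachedData`, its `cached_elements` argument and the `compute_count` counter are all hypothetical names, and a plain NumPy array stands in for the real (possibly lazy) data.

```python
import numpy as np


class CachedData:
    """Toy data container (hypothetical, not cf.Data)."""

    def __init__(self, array, cached_elements=None):
        self._array = np.asarray(array)
        # Maps an element position (0, 1 or -1) to its value. A file
        # reader could pre-populate this while the file is still open.
        self._cache = dict(cached_elements or {})
        # Counts how often an element had to be fetched from the array,
        # standing in for an expensive computation.
        self.compute_count = 0

    def _element(self, index):
        """Return one element of the flattened array, via the cache."""
        if index not in self._cache:
            self.compute_count += 1
            self._cache[index] = self._array.reshape(-1)[index].item()
        return self._cache[index]

    def __str__(self):
        n = self._array.size
        if n == 0:
            return "[]"
        first = self._element(0)
        if n == 1:
            return f"[{first}]"
        last = self._element(-1)
        if n == 2:
            return f"[{first}, {last}]"
        if n == 3:
            return f"[{first}, {self._element(1)}, {last}]"
        return f"[{first}, ..., {last}]"

    def __iadd__(self, other):
        # Any operation that changes the data drops the cached values,
        # so the next str() has to recalculate them.
        self._array = self._array + other
        self._cache.clear()
        return self
```

With this sketch, a second `str(d)` on unchanged data does no further element look-ups, while `d += 2` clears the cache and forces the next `str(d)` to fetch the elements again.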
We also have to think about this in the context of a field construct `print` or `dump`, which displays the data of various types of construct. In particular, the `dump` shows the first and last elements of the field construct's data.
In [2]: import cf; f = cf.example_field(0)
In [3]: print(f)
Field: specific_humidity (ncvar%q)
----------------------------------
Data : specific_humidity(latitude(5), longitude(8)) 1
Cell methods : area: mean
Dimension coords: latitude(5) = [-75.0, ..., 75.0] degrees_north
: longitude(8) = [22.5, ..., 337.5] degrees_east
: time(1) = [2019-01-01 00:00:00]
In [4]: f.dump()
----------------------------------
Field: specific_humidity (ncvar%q)
----------------------------------
Conventions = 'CF-1.10'
project = 'research'
standard_name = 'specific_humidity'
units = '1'
Data(latitude(5), longitude(8)) = [[0.007, ..., 0.013]] 1
Cell Method: area: mean
Domain Axis: latitude(5)
Domain Axis: longitude(8)
Domain Axis: time(1)
Dimension coordinate: latitude
standard_name = 'latitude'
units = 'degrees_north'
Data(latitude(5)) = [-75.0, ..., 75.0] degrees_north
Bounds:units = 'degrees_north'
Bounds:Data(latitude(5), 2) = [[-90.0, ..., 90.0]] degrees_north
Dimension coordinate: longitude
standard_name = 'longitude'
units = 'degrees_east'
Data(longitude(8)) = [22.5, ..., 337.5] degrees_east
Bounds:units = 'degrees_east'
Bounds:Data(longitude(8), 2) = [[0.0, ..., 360.0]] degrees_east
Dimension coordinate: time
standard_name = 'time'
units = 'days since 2018-12-01'
Data(time(1)) = [2019-01-01 00:00:00]
This opens up the possibility that the field construct's `print` or `dump` might want to specify different behaviours for the different `Data` objects that it contains.
E.g. we might want to say that, unless there are cached values, don't show the field's data values, something like:
In [4]: f.dump()
----------------------------------
Field: specific_humidity (ncvar%q)
----------------------------------
Conventions = 'CF-1.10'
project = 'research'
standard_name = 'specific_humidity'
units = '1'
Data(latitude(5), longitude(8)) = [[??, ..., ??]] 1
Cell Method: area: mean
Domain Axis: latitude(5)
Domain Axis: longitude(8)
Domain Axis: time(1)
Dimension coordinate: latitude
standard_name = 'latitude'
units = 'degrees_north'
Data(latitude(5)) = [-75.0, ..., 75.0] degrees_north
Bounds:units = 'degrees_north'
Bounds:Data(latitude(5), 2) = [[-90.0, ..., 90.0]] degrees_north
<snip>
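As a rough sketch of how such a placeholder display could be produced: the function `format_data_summary` and its arguments are hypothetical and only illustrate the idea of falling back to `??` when no cached first/last values are available. It assumes `ndim` is at least 1.

```python
def format_data_summary(ndim, cached=None):
    """Format a first/last element summary for an array of `ndim` dimensions.

    `cached` is an assumed mapping of element position (0 for the
    first element, -1 for the last) to an already-known value.
    Positions that are missing are shown as "??" rather than being
    computed.
    """
    cached = cached or {}
    first = cached.get(0, "??")
    last = cached.get(-1, "??")
    # One pair of brackets per dimension, as in the dump output.
    return f"{'[' * ndim}{first}, ..., {last}{']' * ndim}"


# No cached values: nothing is computed, placeholders are shown.
print(format_data_summary(2))
# Cached values present (e.g. set when the file was read).
print(format_data_summary(1, {0: -75.0, -1: 75.0}))
```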
> is to cache first, second and last element values after they have been created during a `str` operation, making them available to subsequent calls. These cached values can also be set during `cf.read`, when the file is open and the values are available at almost no cost. Any operation that changes the data would cause any cached values to be removed, forcing them to be recalculated during the next `str`.
Nice idea, for starters!
I'm reading all of your comments as they come in, and generally having a think about this as a whole.
I'm happy to close this, now that #494 is in, on the grounds that preserving the API (i.e. that first/second/last values are printed) is important, and the cached values will fix the performance issues in many cases. That OK with you, @sadielbartholomew?
Hi @davidhassell, generally I am happy to consider this resolved, at least with respect to doing the 3.14 release (we may or may not want to tweak the approach for later releases, depending on how users find it).
I think that all of the points you've mentioned below, and from our discussions on this generally, are covered well and sensibly now after your update in #494 and otherwise. The only point I think we've possibly not covered is the one you highlight in https://github.com/NCAS-CMS/cf-python/issues/493#issuecomment-1310644099, namely that we are currently treating data and metadata as equal with respect to this issue, as far as I can see. Do we want to do something to provide flexibility, say provide a means to only report and/or cache the metadata (not the data itself)? Or maybe we could consider that after 3.14, if it is worthwhile at all?
I'd forgotten about the possibility of the data/metadata split. I'm inclined to push it to after 3.14.0, if we find it worth it, because we've enough on our plate for our release deadline as it is. Let's leave this issue open in the meantime ....
> I'm inclined to push it to after 3.14.0, if we find it worth it, because we've enough on our plate for our release deadline as it is.
I agree. That sounds like a wise plan. We must have our pie (and eat it - that works as a saying, right?) on (or before) January 31st!
This issue is for discussing ways in which `Data.__str__` can be made to perform nicely when its data is stored in a `dask` array.
Exposition
The string representation of a `Data` object is currently inherited from cfdm, and looks like:
I.e. it prints the first and last elements (and the second element if there are only 3 of them).
With `dask` representing the data, and using the code inherited from cfdm with no changes, printing these elements could