NCAS-CMS / cf-python

A CF-compliant Earth Science data analysis library
http://ncas-cms.github.io/cf-python
MIT License

Discussion: Managing `Data.__str__` with `dask` #493

Open davidhassell opened 1 year ago

davidhassell commented 1 year ago

This issue is for discussing ways in which Data.__str__ can be made to perform nicely when its data is stored in a dask array.

Exposition

The string representation of a Data object is currently inherited from cfdm, and looks like:

>>> import cf
>>> d = cf.example_field(0).data
>>> str(d)
[[0.007, ..., 0.013]] 1

I.e. it prints the first and last elements (and the second element if there are only 3 of them).

With dask representing the data, and using the code inherited from cfdm with no changes, printing these elements could

  1. trigger an expensive and slow computation
  2. require the reading from disk of an entire dask chunk per element printed. If each chunk has the default size of 128 MiB, then that could entail reading 256 MiB from disk just to print two numbers.
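
The arithmetic in point 2 can be checked with a rough, pure-Python sketch. It assumes the chunk layout is given as a tuple of per-axis chunk sizes (the form that dask exposes as `Array.chunks`) and that reading any element means reading its whole chunk. The helper names `chunk_index` and `bytes_read_to_print` are hypothetical and are not part of cf-python or dask.

```python
from bisect import bisect_right
from itertools import accumulate

MiB = 2**20


def chunk_index(chunk_sizes, i):
    """Return the position along one axis of the chunk that holds element i."""
    ends = list(accumulate(chunk_sizes))  # exclusive end index of each chunk
    return bisect_right(ends, i)


def bytes_read_to_print(chunks, itemsize):
    """Bytes read if printing the first and last elements loads whole chunks.

    `chunks` is a tuple of per-axis chunk-size tuples; `itemsize` is the
    number of bytes per element (8 for float64).
    """
    shape = tuple(sum(c) for c in chunks)
    first = tuple(0 for _ in shape)
    last = tuple(n - 1 for n in shape)

    # Distinct chunks touched by the two printed elements
    touched = {
        tuple(chunk_index(c, i) for c, i in zip(chunks, index))
        for index in (first, last)
    }

    total = 0
    for position in touched:
        nbytes = itemsize
        for c, p in zip(chunks, position):
            nbytes *= c[p]
        total += nbytes
    return total


# Two float64 chunks of 4096 x 4096 elements, i.e. 128 MiB each
two_chunks = ((4096, 4096), (4096,))
print(bytes_read_to_print(two_chunks, 8) // MiB, "MiB")

# A small array held in a single chunk only needs that one chunk
one_chunk = ((5,), (8,))
print(bytes_read_to_print(one_chunk, 8), "bytes")
```

When the first and last elements fall in the same chunk, that chunk is only counted once, which is why small metadata arrays are cheap in this model.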
davidhassell commented 1 year ago

Not all data are equal! It may be that data for metadata constructs (coordinates, etc.) could/should be treated differently to the field construct data. This is because:

  1. metadata construct data is typically (but not always) a lot smaller
  2. metadata construct data is a lot less likely to have a significant delayed computation. E.g. when you do f + 2, the field data changes, but the coordinate data does not.
davidhassell commented 1 year ago

One thing that can be done now, and ought to be a good idea in any event, is to cache the first, second and last element values after they have been created during a str operation, making them available to subsequent calls. These cached values can also be set during cf.read, when the file is open and the values are available at almost no cost. Any operation that changes the data would cause the cached values to be removed, forcing them to be recalculated during the next str.

This approach does not prejudice what else we decide to do, but does speed up a str(d) command by a factor of 1000 (8 µs vs. 8 ms on my laptop) when the cached values are present (which they would be for all data just instantiated from a file).
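
A minimal sketch of the caching idea, assuming a toy list-backed class rather than the real `Data` object. The class name `CachedElements`, its `fetches` counter and the `cached` argument are all hypothetical and only illustrate the pattern; this is not the code from the PR.

```python
class CachedElements:
    """Toy stand-in for a Data-like object (hypothetical, not cf-python)."""

    def __init__(self, values, cached=None):
        self._values = list(values)
        # Element index -> value. May be seeded at creation time, in the
        # way that a file read could supply values it already has to hand.
        self._cache = dict(cached or {})
        self.fetches = 0  # counts the "expensive" element look-ups

    def _fetch(self, index):
        # Stands in for a costly computation or a read from disk
        self.fetches += 1
        return self._values[index]

    def _element(self, index):
        if index not in self._cache:
            self._cache[index] = self._fetch(index)
        return self._cache[index]

    def __setitem__(self, index, value):
        self._values[index] = value
        # Any change to the data invalidates all cached elements
        self._cache.clear()

    def __str__(self):
        n = len(self._values)
        first = self._element(0)
        if n == 1:
            return f"[{first}]"
        last = self._element(-1)
        if n == 2:
            return f"[{first}, {last}]"
        if n == 3:
            return f"[{first}, {self._element(1)}, {last}]"
        return f"[{first}, ..., {last}]"


d = CachedElements([0.007, 0.01, 0.02, 0.013])
print(str(d), d.fetches)  # the first str fetches two elements
print(str(d), d.fetches)  # the second str reuses the cache
d[0] = 1.0                # changing the data clears the cache
print(str(d), d.fetches)  # so both elements are fetched again
```

Seeding the cache at construction, e.g. `CachedElements(values, cached={0: 0.007, -1: 0.013})`, mimics values set at read time: the first `str` then needs no fetch at all.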

PR for this to follow.

davidhassell commented 1 year ago

We also have to think about this in the context of a field construct print or dump, which displays the data of various types of construct. In particular, the dump shows the first and last elements of the field construct's data.

In [2]: import cf; f = cf.example_field(0)

In [3]: print(f)
Field: specific_humidity (ncvar%q)
----------------------------------
Data            : specific_humidity(latitude(5), longitude(8)) 1
Cell methods    : area: mean
Dimension coords: latitude(5) = [-75.0, ..., 75.0] degrees_north
                : longitude(8) = [22.5, ..., 337.5] degrees_east
                : time(1) = [2019-01-01 00:00:00]

In [4]: f.dump()
----------------------------------
Field: specific_humidity (ncvar%q)
----------------------------------
Conventions = 'CF-1.10'
project = 'research'
standard_name = 'specific_humidity'
units = '1'

Data(latitude(5), longitude(8)) = [[0.007, ..., 0.013]] 1

Cell Method: area: mean

Domain Axis: latitude(5)
Domain Axis: longitude(8)
Domain Axis: time(1)

Dimension coordinate: latitude
    standard_name = 'latitude'
    units = 'degrees_north'
    Data(latitude(5)) = [-75.0, ..., 75.0] degrees_north
    Bounds:units = 'degrees_north'
    Bounds:Data(latitude(5), 2) = [[-90.0, ..., 90.0]] degrees_north

Dimension coordinate: longitude
    standard_name = 'longitude'
    units = 'degrees_east'
    Data(longitude(8)) = [22.5, ..., 337.5] degrees_east
    Bounds:units = 'degrees_east'
    Bounds:Data(longitude(8), 2) = [[0.0, ..., 360.0]] degrees_east

Dimension coordinate: time
    standard_name = 'time'
    units = 'days since 2018-12-01'
    Data(time(1)) = [2019-01-01 00:00:00]

This opens up the possibility that the field construct's print or dump might want to specify different behaviours for the different Data objects that it contains.

E.g. we might want to say that, unless there are cached values, don't show the field's data values, something like:

In [4]: f.dump()
----------------------------------
Field: specific_humidity (ncvar%q)
----------------------------------
Conventions = 'CF-1.10'
project = 'research'
standard_name = 'specific_humidity'
units = '1'

Data(latitude(5), longitude(8)) = [[??, ..., ??]] 1

Cell Method: area: mean

Domain Axis: latitude(5)
Domain Axis: longitude(8)
Domain Axis: time(1)

Dimension coordinate: latitude
    standard_name = 'latitude'
    units = 'degrees_north'
    Data(latitude(5)) = [-75.0, ..., 75.0] degrees_north
    Bounds:units = 'degrees_north'
    Bounds:Data(latitude(5), 2) = [[-90.0, ..., 90.0]] degrees_north

<snip>
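
The placeholder behaviour suggested above could be sketched as follows. It assumes the cached values are held in a plain dictionary with `"first"` and `"last"` keys; the function `format_data` and that dictionary layout are hypothetical and not taken from cf-python.

```python
def format_data(cached, ndim=1, units=""):
    """Format a one-line data summary without triggering any computation.

    `cached` may hold "first" and "last" values. Any value that is not
    cached is shown as "??" rather than being computed or read from disk.
    """
    first = cached.get("first", "??")
    last = cached.get("last", "??")
    body = f"{first}, ..., {last}"
    summary = "[" * ndim + body + "]" * ndim
    return f"{summary} {units}".rstrip()


# Field data with no cached values: show placeholders
print("Data(latitude(5), longitude(8)) =", format_data({}, ndim=2, units="1"))

# Coordinate data with cached values: show them as usual
print(
    "Data(latitude(5)) =",
    format_data({"first": -75.0, "last": 75.0}, units="degrees_north"),
)
```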
sadielbartholomew commented 1 year ago

is to cache the first, second and last element values after they have been created during a str operation, making them available to subsequent calls. These cached values can also be set during cf.read, when the file is open and the values are available at almost no cost. Any operation that changes the data would cause the cached values to be removed, forcing them to be recalculated during the next str.

Nice idea, for starters!

I'm reading all of your comments as they come in, and generally having a think about this on the whole.

davidhassell commented 1 year ago

I'm happy to close this, now that #494 is in, on the grounds that preserving the API (i.e. that first/second/last values are printed) is important, and the cached values will fix the performance issues in many cases. That OK with you, @sadielbartholomew?

sadielbartholomew commented 1 year ago

Hi @davidhassell, generally I am happy to consider this resolved, at least with respect to doing the 3.14 release (I guess we may or may not want to tweak the approach to some extent for later releases, depending on how users find it).

I think that all of the points you've mentioned below, and from our discussions on this generally, are covered well and sensibly now after your update in #494 and otherwise. The only point I think we've possibly not covered is the one you highlight in https://github.com/NCAS-CMS/cf-python/issues/493#issuecomment-1310644099, namely that we are currently treating data and metadata as equal with respect to this issue, as far as I can see. Do we want to do something to provide flexibility, say provide a means to only report and/or cache the metadata (not the data itself)? Or maybe we could consider that after 3.14, if it is worthwhile at all?

davidhassell commented 1 year ago

I'd forgotten about the possibility of the data/metadata split. I'm inclined to push it to after 3.14.0, if we find it worth it, because we've enough on our plate for our release deadline as it is. Let's leave this issue open in the meantime ...

sadielbartholomew commented 1 year ago

I'm inclined to push it to after 3.14.0, if we find it worth it, because we've enough on our plate for our release deadline as it is.

I agree. That sounds like a wise plan. We must have our pie (and eat it - that works as a saying, right?) on (or before) January 31st!