NCAS-CMS / cf-python

A CF-compliant Earth Science data analysis library
http://ncas-cms.github.io/cf-python
MIT License
119 stars 19 forks source link

I am always struggling with dimensions, axes, and coordinates in cf-python - it seems that ChatGPT does too! #788

Open bnlawrence opened 2 months ago

bnlawrence commented 2 months ago

I want to find which axis in the field is associated with time, but I also want to know where it is in the chunk shape matrix.

Concrete example: I know I can do this: t_axis = t = ff.coordinate('T'), but I don't know (programmatically) where that might be in an array shape of (720, 1920, 2560). (The answer is 0, but it could be 1 or 2).

I see that t_axis.identity() gives me time. Good. As it happens, I asked ChatGPT how to do this: "I am using cf-python to read some netcdf data. How do I find out which axis of the data is the time axis (if there is a time axis)" and ChatGPT suggested that I "Iterate through the field constructs to identify which one represents the time axis.".

Great. I thought, given the above. ChatGPT even gave me this bit of code:

 time_axes = field.domain_axes(todict=True)
    for key, domain_axis in time_axes.items():
        if 'T' in domain_axis.identity():
            print(f"Field: {field.identity()}")
            print(f"Time Axis: {key}, {domain_axis}")

That looked like it should work. But I didn't notice the difference between a domain axis and a coordinate (and neither did ChatGPT), which I expect explains that this is what the identities returned in that loop actually are: ncdim%time, ncdim%latitude, ncdim%longitude.

So that didn't work. At this point I am thinking, this is too hard.

Yuck. I think civilians find these distinctions impossible to work with.

bnlawrence commented 2 months ago

So ChatGPT also offers this:

        for axis in field.domain_axes(todict=True).values():
            coord = field.coordinate(axis)

which gets one deeper into the mire, as that one raises a stack dump. I guess because an axis is not a coordinate.

What I find interesting is that the LLM is make the same sort of leaps between things it understands based on the rest of the corpus of knowledge about NetCDF that I do ... and getting the same wrong answers. I think most non-CF experts do, which is the problem I have with the API ... there has to be a way to distinguish better between the underlying data model and what people want to do.

bnlawrence commented 2 months ago

(And yes, I realise the solution is to loop over f.coordinates(), but that's not my point.)

sadielbartholomew commented 2 months ago

Good. As it happens, I asked ChatGPT how to do this

My main response is that you (anyone!) should be consulting the documentation before consulting ChatGPT! (One would hope that such AI tools as the latter can be trained with, and sufficiently understand, documentation especially the API reference and tutorial, but who knows how well it can interpret it sufficiently well to be able to provide practical solutions...)

I imagine you probably looked at the documentation (just didn't state this explicitly in your comment), so probably this isn't the response you want to hear, but I highlight it because then the the underlying issue is that our documentation not in a format where you could find what you needed to know quickly, which is a larger problem to solve, but on the agenda for our summit. An issue which perhaps need to be given high priority!

I want to find which axis in the field is associated with time, but I also want to know where it is in the chunk shape matrix.

(And yes, I realise the solution is to loop over f.coordinates(), but that's not my point.)

As for your difficulty at hand...

In fact, don't listen to ChatGPT! There is no need to loop over anything (though that would also work, just is less direct and elegant). There are of course many ways to do it (against the Python mantra, but then again Python itself doesn't follow it's own mantra :smile: ) but I recommend you check out the 'Filtering Metadata Constructs' part of the tutorial of the documentation, which lists methods for filtering on various different objects you might want to query on.

In this case, use filter_by_axis("T") or equivalently filter_by_axis("time") or the same arguments but with filter_by_identity on the field's constructs to get the time axis and you can query it with various methods as shown in the __dir__() list of possibilities (obviously ignore the underscore ones which aren't intended for user usage), including to get the data axes, array values, etc. including the filtering methods also being available on that:

>>> f = cf.example_field(1)
>>> t = f.constructs.filter_by_axis("T")
>>> t
<CF Constructs: dimension_coordinate(1)>
>>> print(t)
Constructs:
{'dimensioncoordinate3': <CF DimensionCoordinate: time(1) days since 2018-12-01 >}
>>> t.__dir__()
['_field_data_axes', '_key_base', '_array_constructs', '_non_array_constructs', '_construct_axes', '_construct_type', '_constructs', '_ignore', '_prefiltered', '_filters_applied', '__module__', '__doc__', '__repr__', '_matching_values', '_flip', 'close', '_filter_by_identity', '_short_iteration', '__call__', '__contains__', '__copy__', '__deepcopy__', '__docstring_package_depth__', '__docstring_substitutions__', '__getitem__', '__init__', '__iter__', '__len__', '__str__', '_atol', '_axes_to_constructs', '_check_construct_type', '_clear', '_component_filter', '_construct_dict', '_construct_type_description', '_custom', '_default', '_del_component', '_del_construct', '_del_data_axes', '_dictionary', '_equals', '_equals_cell_method', '_equals_coordinate_reference', '_equals_domain_axis', '_equals_preprocess', '_filter_by_axis', '_filter_by_cell', '_filter_by_connectivity', '_filter_by_data', '_filter_by_key', '_filter_by_measure', '_filter_by_method', '_filter_by_naxes', '_filter_by_ncdim', '_filter_by_ncvar', '_filter_by_property', '_filter_by_size', '_filter_by_type', '_filter_convert_to_domain_axis', '_filter_parse_args', '_filter_preprocess', '_get_component', '_has_component', '_iter', '_package', '_pop', '_rtol', '_set_climatology', '_set_component', '_set_construct', '_set_construct_data_axes', '_update', '_view', 'clear_filters_applied', 'construct_type', 'construct_types', 'copy', 'data_axes', 'domain_axes', 'domain_axis_identity', 'equals', 'filter', 'filter_by_axis', 'filter_by_cell', 'filter_by_connectivity', 'filter_by_data', 'filter_by_identity', 'filter_by_key', 'filter_by_measure', 'filter_by_method', 'filter_by_naxes', 'filter_by_ncdim', 'filter_by_ncvar', 'filter_by_property', 'filter_by_size', 'filter_by_type', 'filters_applied', 'get', 'get_data_axes', 'inverse_filter', 'items', 'key', 'keys', 'new_identifier', 'ordered', 'replace', 'shallow_copy', 'todict', 'unfilter', 'value', 'values', '__doc_template__', '__dict__', '__weakref__', '__new__', '__hash__', '__getattribute__', '__setattr__', '__delattr__', '__lt__', '__le__', '__eq__', '__ne__', '__gt__', '__ge__', '__reduce_ex__', '__reduce__', '__getstate__', '__subclasshook__', '__init_subclass__', '__format__', '__sizeof__', '__dir__', '__class__']
>>> t.data_axes()
{'dimensioncoordinate3': ('domainaxis3',)}
>>> t.value().array
array([31.])
bnlawrence commented 2 months ago

That's not actually what I wanted to do. I wanted to know the index of the time axis in the shape of the data array ... so filtering is no good if it doesn't tell me that.

Frankly, most people are no longer going to consult documentation beyond page one (I don't any more, I get a much better idea MUCH FASTER from ChatGPT. Yes it's sometimes wrong, but it's informatively wrong). What we need are the docs to be good enough for ChatGPT (and friends) to get it right ... and in fact, it did help, because I was able to recognise that the loop and check was what I wanted to do, it was just that ChatGPT (and me) looped over the wrong thing first ... but it was much easier for me to work out the right thing from there, than it was in the docs, from scratch.

But in this case, my point is that the API is not intuitive. I was just amused that ChatGPT finds it un-intuitive in exactly the same way that I do.

sadielbartholomew commented 2 months ago

I wanted to know the index of the time axis in the shape of the data array ... so filtering is no good if it doesn't tell me that.

OK fair, I misunderstood what you wanted. I appreciate the specific question is not so much the point of this thread/Issue, but FYI, we have the indices method for that - I doubt anyone could argue that name isn't intuitive for such a method where you want to know the indices (plural form necessary since that captures the more general case)?

For example:

>>> f = cf.example_field(1)
>>> i = f.indices(T=slice(None))  # slice representing the whole time/T array
>>> i
(slice(None, None, None), slice(None, None, None), slice(None, None, None))
>>> f[i]
<CF Field: air_temperature(atmosphere_hybrid_height_coordinate(1), grid_latitude(10), grid_longitude(9)) K>

In this case we are talking more about sub-spacing than filtering, but either way API-wise you need a way to specify you want to know about the time axis and not anything else, right?

Frankly, most people are no longer going to consult documentation beyond page one (I don't any more, I get a much better idea MUCH FASTER from ChatGPT. Yes it's sometimes wrong, but it's informatively wrong). What we need are the docs to be good enough for ChatGPT (and friends) to get it right ... and in fact, it did help, because I was able to recognise that the loop and check was what I wanted to do, it was just that ChatGPT (and me) looped over the wrong thing first ... but it was much easier for me to work out the right thing from there, than it was in the docs, from scratch.

Well, I would say ChatGPT wasn't that helpful in this case because we offer means to do it without a loop, which should be used over a loop ideally because the user-side iteration is not necessary. But that's not to say it isn't helpful generally. I just aspire to having documentation that can be a first port of call for people, that they feel comfortable consulting and that feels like a natural and well-structured source of information and not intimidating, directing them to what they want to know without them having to wade through other information, because you are right that people won't read through anything that is too long or detailed. (Anyone can also use ChatGPT to get code for them, but that's a black box we can't control so doesn't merit much concern in my opinion).

But in this case, my point is that the API is not intuitive. I was just amused that ChatGPT finds it un-intuitive in exactly the same way that I do.

For the most part, I would disagree. I think it is fairly intuitive already, though there will absolutely be room for improvement in many specific areas. To me, it is a documentation issue mostly - the documentation needs to make it clearer how to do such things as filtering, getting indices, and so on - common operations that users will want to do, in a way that directs them to relevant pages covering examples of what they want (code snippets as ChatGPT would provide which are especially useful). I am thinking we could even have a flow chart to point people to the right locations in the (revamped) documentation, to help this.

But I am happy if you can try to persuade me otherwise about the intuitiveness or otherwise. We can discuss this further at the summit next week. I suspect David will have strong opinions :slightly_smiling_face:

davidhassell commented 2 months ago

Does this help ....

>>> f = cf.example_field(1)
>>> f  # No T dimension in the data, but there is an X dimension
<CF Field: air_temperature(atmosphere_hybrid_height_coordinate(1), grid_latitude(10), grid_longitude(9)) K>
>>> f.get_data_axes()
('domainaxis0', 'domainaxis1', 'domainaxis2')

>>> T = f.domain_axis('T', key=True)
>>> T
'domainaxis3'
>>> T in f.get_data_axes()
False

>>> X = f.domain_axis('X', key=True)
>>> X
'domainaxis2'
>>> print (f"X is at position {f.get_data_axes().index(X)}")
X is at position 2
davidhassell commented 2 months ago

Not answering the question directly, but there is some mention of this sort of thing in the tutorial: https://ncas-cms.github.io/cf-python/tutorial.html#data-axes

dwest77a commented 2 months ago

FYI the Xarray equivalent of getting the index of a specific (data) axis appears to be:

>>> import xarray as xr
>>>
>>> ds = xr.open_dataset('file1.nc')
>>> var = ds['var'] # Deal with a specific Xarray.DataArray
>>> var.dims
('time','lat','lon')
>>> var.get_axis_num('time')
0
>>> var.get_axis_num('lon')
2
davidhassell commented 2 months ago

Thanks, Dan!

We could of course create cf.Field.get_axis_position('T'), based on the cf-python code above.