DocOtak / gsw-xarray

Wrapper for gsw that will add CF attributes to xarray.DataArray outputs
https://gsw-xarray.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
23 stars 3 forks source link

Autodect parameters from dataset -- add dataset accessor #53

Closed rcaneill closed 1 year ago

rcaneill commented 2 years ago

This PR makes possible to give a dataset as argument to gsw functions, and based on cf standard name the cf_xarray wrapper will pick the proper arguments. Example:

import xarray as xr
import gsw_xarray as gsw

ds = xr.open_dataset('the_dataset.nc')
gsw.sigma0(ds.SA, ds.CT)
# Or you can now use
gsw.sigma0(ds=ds)
rcaneill commented 2 years ago

@DocOtak What do you think of the API (I haven't written the doc yet)? To use the option one can simply give a dataset as keyword argument, i.e. gsw.sigma0(ds=ds).

rcaneill commented 2 years ago

Many arguments don't have a cf standard name. It is possible to add custom criteria in cf-xarray, so we could write our own standard names (and open a request to add them in the cf-convention) https://cf-xarray.readthedocs.io/en/latest/custom-criteria.html. Or another option is to make the autodetect option only available for the parameters that have a standard name. What do you think?

I would be in favor of the 1st option (our own criteria), this will make things way easier for #30 (regarding the solving of the graph)

DocOtak commented 2 years ago

@rcaneill apologies for delayed responses, I have been on a long holiday and have been attempting to actually "relax"... I'll be back "full time" next week.

I'm not sure about the gsw.func(ds=ds) type interface, all the functions (that I can recall) which take an entire Dataset does something to all variables in the dataset. Are you aware of any non "built in" numpy ufuncs that take entire datasets?

Elegant data selection is a hard problem faced by my seagoing group at Scripps... the functions that I have seen which take entire pandas dataframes are usually very messy.

rcaneill commented 2 years ago

I see your point. Would you then not be in favor of having this kind of detection, or is it more an API problem (that we could replace by something like ds.gsw['sigma0'] or ds.gsw.sigma0())?

rcaneill commented 2 years ago

Hello @DocOtak , I'm pinging you about my last question :)

DocOtak commented 2 years ago

Hi @rcaneill, greetings from pre cruise quarantine in Tahiti.

I think that we should avoid changing the function signature of the wrapped upstream functions and instead rely on the "accessor" pattern as you suggest: ds.gsw.sigma0().

I like the ds.gsw["sigma0"] form for simplicity. However, also supporting the function call way might allow some selection of which variable to use for input properties:

ds.gsw.sigma0(SP="psal1")

Where psal1 is one of many practical salinity variables we have in hypothetical dataset ds. The func then calculates SA using this psal variable as an input for sigma0.

rcaneill commented 2 years ago

I like your idea of using function call to use variable selection. That would easily tackle the issue of setting which variable to use when multiple fields are present with the same standard name. I'll try to give a try to add an accessor (when I'll have time)

Enjoy your future field work :)

rcaneill commented 2 years ago

I had some (hopefully good) idea while cycling to university today:

rcaneill commented 2 years ago

At commit e81063c we have a working version of the accessor! (we can't use getitem with lists, and there is no default option yet) Could you have a look?

rcaneill commented 2 years ago

(We can now use lists in getitem)

rcaneill commented 1 year ago

(cf https://github.com/DocOtak/gsw-xarray/issues/57#issuecomment-1377041549)

Roadmap of this PR:

Optional:

rcaneill commented 1 year ago

Note on t argument:

Used in ice

used in sea-water

And so?

Good point for us, all the functions for sea-ice that use t have 'ice' in their names

rcaneill commented 1 year ago

Argo data don't have a reference_scale attribute for ITS-90 temperature (it is present in the long name). I am in favor in considering that all in situ temperature are ITS-90, unless the user explicitly uses the gsw.t90_from_t68 function

(attached is a screenshot of ncdump -h of an argo profile) image

rcaneill commented 1 year ago

What is left:

rcaneill commented 1 year ago
rcaneill commented 1 year ago

@DocOtak could you review? I think that it is soon ready to merge!

DocOtak commented 1 year ago

Sure thing! Have you tried this out on any "real world" datasets?

rcaneill commented 1 year ago

Sure thing! Have you tried this out on any "real world" datasets?

Not yet, I'll try with some ARGO profiles in the following days

rcaneill commented 1 year ago

While trying with some real ARGO data, I realized that ARGO still uses the old standard name for practical salinity (i.e. 'sea_water_salinity' instead of 'sea_water_practical_salinity'), so I added this case (tests are failing due to mismatch of poetry versions... I'll sort this)

rcaneill commented 1 year ago

I updated poetry to a newer version

rcaneill commented 1 year ago

The tests are failing as gsw has been updated and now includes gibbs and gibbs_ice functions. These 2 functions return different units depending on the order of the derivative. While we could parse the arguments to get the order of the derivatives, I open a new issue (#65 ) for this and leave it for later. I don't think that many people will use these 2 functions.

rcaneill commented 1 year ago

@DocOtak Ready to review! I added doc for the accessor in the README, and a notebook that uses some ARGO data to show all functionalities of gsw-xarray, including the new accessor.

DocOtak commented 1 year ago

First off, this is amazing work.

I played around with it a bit with my data from CCHDO. For CTD data, it worked pretty well since there tended to be only a single variable with a given standard name. The bottle data, being more complicated, wanted a bit more massaging to make work. I found myself wanting the set_non_cf_name to be an override, since it is only checked after failing to find a variable with the standard name, you will get an exception about multiple variables. For example, the argo example notebook you have. I somewhat expected this to work with the unfiltered argo data:

ds = xr.open_dataset('ARGO_example.nc')
with gsw.set_non_cf_name(p="pres_adjusted", SP="psal_adjusted", t="temp_adjusted"):
    ds.gsw.SA_from_SP()

Aside: An even crazier idea would be for this to accept some callable that gets passed the dataset and figured out the right variable. e.g. set_non_cf_name(p=some_function_that_knows_how_to_pick_the_pressure_in_argo_files)

What are your thoughts on this figuring out intermediate parameters not already present in the dataset? e.g. ds.gsw.CT_from_t() calculates SA from SP if not already present. Somewhat related, right now the get item syntax uses the function name: ds.gsw["SA_from_SP"] if there is something that can figure out the intermediate properties, it might be better to just have the property name: ds.gsw["SA"]. If you think this is too difficult, don't worry about it for this round of changes.

rcaneill commented 1 year ago

I found myself wanting the set_non_cf_name to be an override, since it is only checked after failing to find a variable with the standard name, you will get an exception about multiple variables.

The way it is done now was on purpose (cf my next comment).

ds = xr.open_dataset('ARGO_example.nc')
with gsw.set_non_cf_name(p="pres_adjusted", SP="psal_adjusted", t="temp_adjusted"):
   ds.gsw.SA_from_SP()

About this, I agree it would make sense to make it possible. However, I think that it would be better to force the users to use standard names when available, i.e. an option that would look more like:

ds = xr.open_dataset('ARGO_example.nc')
with gsw.set_cf_name_preference(
    sea_water_pressure="pres_adjusted",
    sea_water_practical_salinity="psal_adjusted",
    sea_water_temperature="temp_adjusted"
):
    ds.gsw.SA_from_SP()

Using standard names would tackle the issue that some arguments (e.g. t and p) exist for both sea water and sea ice.

However, as you mentioned, it makes sense that if the user sets set_non_cf_name he wants it to be an override. We can raise a warning if the user sets either p or t saying to be careful to not mix sea ice and sea water. We still need to prioritize the way we treat options: if one sets an option both with standard name and set_non_cf_name, I thin that option with standard name should have the priority, then set_non_cf_name, and finally our internal mapping.

rcaneill commented 1 year ago

What are your thoughts on this figuring out intermediate parameters not already present in the dataset? e.g. ds.gsw.CT_from_t() calculates SA from SP if not already present. Somewhat related, right now the get item syntax uses the function name: ds.gsw["SA_from_SP"] if there is something that can figure out the intermediate properties, it might be better to just have the property name: ds.gsw["SA"]. If you think this is too difficult, don't worry about it for this round of changes.

About this, it is issue #30 that I decided to split into 2 different PR. The 1st one (this PR) to create the accessor and possibility to use the standard names, and a 2nd one to have the autodiscovery (to be done, but I had a good draft of the algo here https://gist.github.com/rcaneill/0aa8b9e72112d079c4919e462a4bb378)

rcaneill commented 1 year ago

@DocOtak I updated the code with your comments. I let you try it with your data (or check the notebook about ARGO), but it should work smoothly :)

I use an internal function from cf_xarray for performance, I opened an issue there: https://github.com/xarray-contrib/cf-xarray/issues/465

rcaneill commented 1 year ago

I use an internal function from cf_xarray for performance, I opened an issue there: xarray-contrib/cf-xarray#465

Had an answer and problem solved!

DocOtak commented 1 year ago

Oh awesome, in that gist. I'm not familiar with that graph library. Is it able to tell you if cycles or loops exist in the graph?

rcaneill commented 1 year ago

Oh awesome, in that gist. I'm not familiar with that graph library. Is it able to tell you if cycles or loops exist in the graph?

Hm I don't remember but I guess that yes. As far as I remember (I did this notebook 1 year ago), in the end the easiest was not to find a path within the graph with one classical algorithm but simply brute force to find all possible variables that one can compute with a certain set of inputs.

DocOtak commented 1 year ago

I played with the latest commits, I really like it and think this is the right track. There are two things I noticed:

rcaneill commented 1 year ago

I played with the latest commits, I really like it and think this is the right track. There are two things I noticed:

  • Do you think the accessor API is stable? i.e. should we say it is "provisional" while we make sure it is ok in more "real world" testing. Part of this I think would be documenting our intent (vs the implementation that might not match due to bugs), I can take a first crack at it if you'd like.

I think that the API is stable if using the accessor + providing the name of each variable inside the dataset. If only relying on standard names (but practical salinity, as 2 standard names are used so it may raise issues) I think it is also robust. And the mixture of standard and non standard names using the options manager may need more focus. I tried to write good tests, but maybe we could write some more. We could raise a warning when the accessor is used, saying that it may be unable to autodetect the arguments in certain unusual cases (so versio 0.4.0 of gsw-xarray would have the warning, maybe also 0.4.1 if 0.4.1 adds the autodiscovery, and 0.4.2 would be considered as stable if we have enough feedback)

If you have time to document our intent, I'm pleased if you do so. I can try to write some more tests + resolve your comment about modules

rcaneill commented 1 year ago

This comment is to list all tests with should have for the accessor. Feel free to add more configurations if needed. I'll try to implement them tomorrow:

Basic tests

More advanced

With Pint

These tests should be done with the method ds.gsw.sigma0() and the getitem ds.gsw["sigmao"]

rcaneill commented 1 year ago
DocOtak commented 1 year ago

Any reason to not merge (so we can not let this prevent forward movement)?

rcaneill commented 1 year ago

I think it was ready, I was waiting for your review but it seems that I forgot to ask you... So if you can have a last check, if seems good you can merge afterward

DocOtak commented 1 year ago

We can do any cleanups in future PRs I think, tests are passing and I don't think we have broken any currently used public interfaces. Maybe we could mark the "auto detector" as experimental. I also suspect a refactor of things might be worth it eventually, there are some deep indentations that are indicative of a "code smell" but I am not worried about that for now.