Autodect parameters from dataset -- add dataset accessor

rcaneill commented 2 years ago

This PR makes possible to give a dataset as argument to gsw functions, and based on cf standard name the cf_xarray wrapper will pick the proper arguments. Example:

import xarray as xr
import gsw_xarray as gsw

ds = xr.open_dataset('the_dataset.nc')
gsw.sigma0(ds.SA, ds.CT)
# Or you can now use
gsw.sigma0(ds=ds)

rcaneill commented 2 years ago

@DocOtak What do you think of the API (I haven't written the doc yet)? To use the option one can simply give a dataset as keyword argument, i.e. gsw.sigma0(ds=ds).

rcaneill commented 2 years ago

Many arguments don't have a cf standard name. It is possible to add custom criteria in cf-xarray, so we could write our own standard names (and open a request to add them in the cf-convention) https://cf-xarray.readthedocs.io/en/latest/custom-criteria.html. Or another option is to make the autodetect option only available for the parameters that have a standard name. What do you think?

I would be in favor of the 1st option (our own criteria), this will make things way easier for #30 (regarding the solving of the graph)

DocOtak commented 2 years ago

@rcaneill apologies for delayed responses, I have been on a long holiday and have been attempting to actually "relax"... I'll be back "full time" next week.

I'm not sure about the gsw.func(ds=ds) type interface, all the functions (that I can recall) which take an entire Dataset does something to all variables in the dataset. Are you aware of any non "built in" numpy ufuncs that take entire datasets?

Elegant data selection is a hard problem faced by my seagoing group at Scripps... the functions that I have seen which take entire pandas dataframes are usually very messy.

rcaneill commented 2 years ago

I see your point. Would you then not be in favor of having this kind of detection, or is it more an API problem (that we could replace by something like ds.gsw['sigma0'] or ds.gsw.sigma0())?

rcaneill commented 2 years ago

Hello @DocOtak , I'm pinging you about my last question :)

DocOtak commented 2 years ago

Hi @rcaneill, greetings from pre cruise quarantine in Tahiti.

I think that we should avoid changing the function signature of the wrapped upstream functions and instead rely on the "accessor" pattern as you suggest: ds.gsw.sigma0().

I like the ds.gsw["sigma0"] form for simplicity. However, also supporting the function call way might allow some selection of which variable to use for input properties:

ds.gsw.sigma0(SP="psal1")

Where psal1 is one of many practical salinity variables we have in hypothetical dataset ds. The func then calculates SA using this psal variable as an input for sigma0.

rcaneill commented 2 years ago

I like your idea of using function call to use variable selection. That would easily tackle the issue of setting which variable to use when multiple fields are present with the same standard name. I'll try to give a try to add an accessor (when I'll have time)

Enjoy your future field work :)

rcaneill commented 2 years ago

I had some (hopefully good) idea while cycling to university today:

I like to keep the ability to use __getitem__ calls: ds.gsw['sigma0']. It makes it easy to e.g. loop through a list of strings of properties we want to compute
It makes sense to me to also allow to pass a list of strings to getitem: ds.gsw[['sigma0', 'sigma1']]
We keep the possibility of function call: ds.gsw.sigma0(SP="psal1")
To tackle the choice of e.g. default practical salinity (CTD, bottle, etc), we add a property/list/dict (as we prefer) on the gsw accessor with default variables to use:
```
ds.gsw.defaults['sea_water_practical_salinity'] = 'psal1'
```
(This point is more relevant to #30 ) We can cache intermediate variables used for computation (if a variable needs steps in between), cf xarray doc on accessor. This can be set as an option (for users with small datasets it is a nice option, but for big datasets it can be a bad option), either through context, settings, or arguments: ds.gsw.sigma0(use_cache=False)

rcaneill commented 2 years ago

At commit e81063c we have a working version of the accessor! (we can't use getitem with lists, and there is no default option yet) Could you have a look?

rcaneill commented 2 years ago

(We can now use lists in getitem)

rcaneill commented 1 year ago

(cf https://github.com/DocOtak/gsw-xarray/issues/57#issuecomment-1377041549)

Roadmap of this PR:

[x] Make it work for functions using cf standard names, and let it fail for other functions

Optional:

[x] add detection of reference scale t90 and t68 (we assume all in situ temperature are ITS-90)
[x] deal with t argument that is temperature of sea-ice or of sea-water depending on the function's name
[x] deal with p which is in situ pressure of either water or ice, minus 10.1325 dbar (we do same as for t, the standard name sea_water_pressure is used, as e.g. in ARGO this is what is used for the pressure induced by water, i.e. is 0 at surface. There is no standard name for sea ice pressure.)
[x] Create a way to the user to configure the variable's names (e.g. gsw.set_non_cf_name({'entropy':'entropy_name_in_ds', 'SA_seaice':'salinity_of_sea_ice_name_in_ds'}) or ds.gsw.defaults['sea_water_practical_salinity'] = 'psal1')

rcaneill commented 1 year ago

Note on t argument:

Used in ice

Helmholtz_energy_ice
adiabatic_lapse_rate_ice
alpha_wrt_t_ice
chem_potential_water_ice
cp_ice
enthalpy_ice
entropy_ice
gibbs_ice_part_t
internal_energy_ice
kappa_const_t_ice
kappa_ice
pressure_coefficient_ice
pt0_from_t_ice
pt_from_t_ice
rho_ice
sound_speed_ice
specvol_ice

used in sea-water

SA_freezing_from_t
SA_freezing_from_t_poly
CT_first_derivatives_wrt_t_exact
CT_from_t
C_from_SP
SP_from_C
alpha_wrt_t_exact
beta_const_t_exact
chem_potential_water_t_exact
cp_t_exact
dilution_coefficient_t_exact
enthalpy_t_exact
entropy_from_t
kappa_t_exact
pot_rho_t_exac
pt0_from_t
pt_from_t
rho_t_exact
sound_speed_t_exact
specvol_t_exact
t_deriv_chem_potential_water_t_exact
SP_salinometer
latentheat_evap_t

And so?

Good point for us, all the functions for sea-ice that use t have 'ice' in their names

rcaneill commented 1 year ago

Argo data don't have a reference_scale attribute for ITS-90 temperature (it is present in the long name). I am in favor in considering that all in situ temperature are ITS-90, unless the user explicitly uses the gsw.t90_from_t68 function

(attached is a screenshot of ncdump -h of an argo profile)

rcaneill commented 1 year ago

What is left:

[x] write doc
[x] fix doc and units
[x] there is problem: autocompletion doesn't work anymore in ipython. I may have broken something by rearranging the modules. I'll have a look

rcaneill commented 1 year ago

[x] Write doc for options

rcaneill commented 1 year ago

@DocOtak could you review? I think that it is soon ready to merge!

DocOtak commented 1 year ago

Sure thing! Have you tried this out on any "real world" datasets?

rcaneill commented 1 year ago

Sure thing! Have you tried this out on any "real world" datasets?

Not yet, I'll try with some ARGO profiles in the following days

rcaneill commented 1 year ago

While trying with some real ARGO data, I realized that ARGO still uses the old standard name for practical salinity (i.e. 'sea_water_salinity' instead of 'sea_water_practical_salinity'), so I added this case (tests are failing due to mismatch of poetry versions... I'll sort this)

rcaneill commented 1 year ago

I updated poetry to a newer version

rcaneill commented 1 year ago

The tests are failing as gsw has been updated and now includes gibbs and gibbs_ice functions. These 2 functions return different units depending on the order of the derivative. While we could parse the arguments to get the order of the derivatives, I open a new issue (#65 ) for this and leave it for later. I don't think that many people will use these 2 functions.

rcaneill commented 1 year ago

@DocOtak Ready to review! I added doc for the accessor in the README, and a notebook that uses some ARGO data to show all functionalities of gsw-xarray, including the new accessor.

DocOtak commented 1 year ago

First off, this is amazing work.

I played around with it a bit with my data from CCHDO. For CTD data, it worked pretty well since there tended to be only a single variable with a given standard name. The bottle data, being more complicated, wanted a bit more massaging to make work. I found myself wanting the set_non_cf_name to be an override, since it is only checked after failing to find a variable with the standard name, you will get an exception about multiple variables. For example, the argo example notebook you have. I somewhat expected this to work with the unfiltered argo data:

ds = xr.open_dataset('ARGO_example.nc')
with gsw.set_non_cf_name(p="pres_adjusted", SP="psal_adjusted", t="temp_adjusted"):
    ds.gsw.SA_from_SP()

Aside: An even crazier idea would be for this to accept some callable that gets passed the dataset and figured out the right variable. e.g. set_non_cf_name(p=some_function_that_knows_how_to_pick_the_pressure_in_argo_files)

What are your thoughts on this figuring out intermediate parameters not already present in the dataset? e.g. ds.gsw.CT_from_t() calculates SA from SP if not already present. Somewhat related, right now the get item syntax uses the function name: ds.gsw["SA_from_SP"] if there is something that can figure out the intermediate properties, it might be better to just have the property name: ds.gsw["SA"]. If you think this is too difficult, don't worry about it for this round of changes.

rcaneill commented 1 year ago

I found myself wanting the set_non_cf_name to be an override, since it is only checked after failing to find a variable with the standard name, you will get an exception about multiple variables.

The way it is done now was on purpose (cf my next comment).

ds = xr.open_dataset('ARGO_example.nc')
with gsw.set_non_cf_name(p="pres_adjusted", SP="psal_adjusted", t="temp_adjusted"):
   ds.gsw.SA_from_SP()

About this, I agree it would make sense to make it possible. However, I think that it would be better to force the users to use standard names when available, i.e. an option that would look more like:

ds = xr.open_dataset('ARGO_example.nc')
with gsw.set_cf_name_preference(
    sea_water_pressure="pres_adjusted",
    sea_water_practical_salinity="psal_adjusted",
    sea_water_temperature="temp_adjusted"
):
    ds.gsw.SA_from_SP()

Using standard names would tackle the issue that some arguments (e.g. t and p) exist for both sea water and sea ice.

However, as you mentioned, it makes sense that if the user sets set_non_cf_name he wants it to be an override. We can raise a warning if the user sets either p or t saying to be careful to not mix sea ice and sea water. We still need to prioritize the way we treat options: if one sets an option both with standard name and set_non_cf_name, I thin that option with standard name should have the priority, then set_non_cf_name, and finally our internal mapping.

rcaneill commented 1 year ago

What are your thoughts on this figuring out intermediate parameters not already present in the dataset? e.g. ds.gsw.CT_from_t() calculates SA from SP if not already present. Somewhat related, right now the get item syntax uses the function name: ds.gsw["SA_from_SP"] if there is something that can figure out the intermediate properties, it might be better to just have the property name: ds.gsw["SA"]. If you think this is too difficult, don't worry about it for this round of changes.

About this, it is issue #30 that I decided to split into 2 different PR. The 1st one (this PR) to create the accessor and possibility to use the standard names, and a 2nd one to have the autodiscovery (to be done, but I had a good draft of the algo here https://gist.github.com/rcaneill/0aa8b9e72112d079c4919e462a4bb378)

rcaneill commented 1 year ago

@DocOtak I updated the code with your comments. I let you try it with your data (or check the notebook about ARGO), but it should work smoothly :)

I use an internal function from cf_xarray for performance, I opened an issue there: https://github.com/xarray-contrib/cf-xarray/issues/465

rcaneill commented 1 year ago

I use an internal function from cf_xarray for performance, I opened an issue there: xarray-contrib/cf-xarray#465

Had an answer and problem solved!

DocOtak commented 1 year ago

Oh awesome, in that gist. I'm not familiar with that graph library. Is it able to tell you if cycles or loops exist in the graph?

rcaneill commented 1 year ago

Oh awesome, in that gist. I'm not familiar with that graph library. Is it able to tell you if cycles or loops exist in the graph?

Hm I don't remember but I guess that yes. As far as I remember (I did this notebook 1 year ago), in the end the easiest was not to find a path within the graph with one classical algorithm but simply brute force to find all possible variables that one can compute with a certain set of inputs.

DocOtak commented 1 year ago

I played with the latest commits, I really like it and think this is the right track. There are two things I noticed:

Do you think the accessor API is stable? i.e. should we say it is "provisional" while we make sure it is ok in more "real world" testing. Part of this I think would be documenting our intent (vs the implementation that might not match due to bugs), I can take a first crack at it if you'd like.
That one comment I made about missing name, I've had bad luck with making reviews actually visible to not me...

rcaneill commented 1 year ago

I played with the latest commits, I really like it and think this is the right track. There are two things I noticed:

Do you think the accessor API is stable? i.e. should we say it is "provisional" while we make sure it is ok in more "real world" testing. Part of this I think would be documenting our intent (vs the implementation that might not match due to bugs), I can take a first crack at it if you'd like.

I think that the API is stable if using the accessor + providing the name of each variable inside the dataset. If only relying on standard names (but practical salinity, as 2 standard names are used so it may raise issues) I think it is also robust. And the mixture of standard and non standard names using the options manager may need more focus. I tried to write good tests, but maybe we could write some more. We could raise a warning when the accessor is used, saying that it may be unable to autodetect the arguments in certain unusual cases (so versio 0.4.0 of gsw-xarray would have the warning, maybe also 0.4.1 if 0.4.1 adds the autodiscovery, and 0.4.2 would be considered as stable if we have enough feedback)

If you have time to document our intent, I'm pleased if you do so. I can try to write some more tests + resolve your comment about modules

rcaneill commented 1 year ago

This comment is to list all tests with should have for the accessor. Feel free to add more configurations if needed. I'll try to implement them tomorrow:

Basic tests

[x] give names
[x] with standard names (getitem)
[x] with standard names (method)

More advanced

[x] with standard names + option to choose from dataset which array to pick (getitem)
[x] with standard names + option to choose from dataset which array to pick (method)
[x] with non cf names + option set to say which array to pick (method)
[x] mixture of standard names + decide with array to pick + non cf names (method)

With Pint

[x] Test with a pint quantified dataset (method)
[x] Test with a pint quantified dataset (getitem)

These tests should be done with the method ds.gsw.sigma0() and the getitem ds.gsw["sigmao"]

rcaneill commented 1 year ago

[x] We should add in the doc that the names we provide to the accessor with getitem are the names or the underlying function, and not the name of the variables (this is confusing as many function have the same name as the output variable). This is probably clear when using the method, as it is an explicit call to a function.

DocOtak commented 1 year ago

Any reason to not merge (so we can not let this prevent forward movement)?

rcaneill commented 1 year ago

I think it was ready, I was waiting for your review but it seems that I forgot to ask you... So if you can have a last check, if seems good you can merge afterward

DocOtak commented 1 year ago

We can do any cleanups in future PRs I think, tests are passing and I don't think we have broken any currently used public interfaces. Maybe we could mark the "auto detector" as experimental. I also suspect a refactor of things might be worth it eventually, there are some deep indentations that are indicative of a "code smell" but I am not worried about that for now.

DocOtak / gsw-xarray