Closed rcaneill closed 1 year ago
@DocOtak What do you think of the API (I haven't written the doc yet)? To use the option one can simply give a dataset as keyword argument, i.e. gsw.sigma0(ds=ds)
.
Many arguments don't have a cf standard name. It is possible to add custom criteria in cf-xarray, so we could write our own standard names (and open a request to add them in the cf-convention) https://cf-xarray.readthedocs.io/en/latest/custom-criteria.html. Or another option is to make the autodetect option only available for the parameters that have a standard name. What do you think?
I would be in favor of the 1st option (our own criteria), this will make things way easier for #30 (regarding the solving of the graph)
@rcaneill apologies for delayed responses, I have been on a long holiday and have been attempting to actually "relax"... I'll be back "full time" next week.
I'm not sure about the gsw.func(ds=ds)
type interface, all the functions (that I can recall) which take an entire Dataset does something to all variables in the dataset. Are you aware of any non "built in" numpy ufuncs that take entire datasets?
Elegant data selection is a hard problem faced by my seagoing group at Scripps... the functions that I have seen which take entire pandas dataframes are usually very messy.
I see your point. Would you then not be in favor of having this kind of detection, or is it more an API problem (that we could replace by something like ds.gsw['sigma0']
or ds.gsw.sigma0()
)?
Hello @DocOtak , I'm pinging you about my last question :)
Hi @rcaneill, greetings from pre cruise quarantine in Tahiti.
I think that we should avoid changing the function signature of the wrapped upstream functions and instead rely on the "accessor" pattern as you suggest: ds.gsw.sigma0()
.
I like the ds.gsw["sigma0"]
form for simplicity. However, also supporting the function call way might allow some selection of which variable to use for input properties:
ds.gsw.sigma0(SP="psal1")
Where psal1 is one of many practical salinity variables we have in hypothetical dataset ds. The func then calculates SA using this psal variable as an input for sigma0.
I like your idea of using function call to use variable selection. That would easily tackle the issue of setting which variable to use when multiple fields are present with the same standard name. I'll try to give a try to add an accessor (when I'll have time)
Enjoy your future field work :)
I had some (hopefully good) idea while cycling to university today:
__getitem__
calls: ds.gsw['sigma0']
. It makes it easy to e.g. loop through a list of strings of properties we want to computeds.gsw[['sigma0', 'sigma1']]
ds.gsw.sigma0(SP="psal1")
ds.gsw.defaults['sea_water_practical_salinity'] = 'psal1'
ds.gsw.sigma0(use_cache=False)
At commit e81063c we have a working version of the accessor! (we can't use getitem with lists, and there is no default option yet) Could you have a look?
(We can now use lists in getitem)
(cf https://github.com/DocOtak/gsw-xarray/issues/57#issuecomment-1377041549)
Roadmap of this PR:
Optional:
t
argument that is temperature of sea-ice or of sea-water depending on the function's namep
which is in situ pressure of either water or ice, minus 10.1325 dbar
(we do same as for t
, the standard name sea_water_pressure
is used, as e.g. in ARGO this is what is used for the pressure induced by water, i.e. is 0 at surface. There is no standard name for sea ice pressure.)gsw.set_non_cf_name({'entropy':'entropy_name_in_ds', 'SA_seaice':'salinity_of_sea_ice_name_in_ds'})
or ds.gsw.defaults['sea_water_practical_salinity'] = 'psal1'
)Note on t
argument:
Good point for us, all the functions for sea-ice that use t
have 'ice' in their names
Argo data don't have a reference_scale attribute for ITS-90 temperature (it is present in the long name). I am in favor in considering that all in situ temperature are ITS-90, unless the user explicitly uses the gsw.t90_from_t68
function
(attached is a screenshot of ncdump -h
of an argo profile)
What is left:
@DocOtak could you review? I think that it is soon ready to merge!
Sure thing! Have you tried this out on any "real world" datasets?
Sure thing! Have you tried this out on any "real world" datasets?
Not yet, I'll try with some ARGO profiles in the following days
While trying with some real ARGO data, I realized that ARGO still uses the old standard name for practical salinity (i.e. 'sea_water_salinity' instead of 'sea_water_practical_salinity'), so I added this case (tests are failing due to mismatch of poetry versions... I'll sort this)
I updated poetry to a newer version
The tests are failing as gsw has been updated and now includes gibbs
and gibbs_ice
functions. These 2 functions return different units depending on the order of the derivative. While we could parse the arguments to get the order of the derivatives, I open a new issue (#65 ) for this and leave it for later. I don't think that many people will use these 2 functions.
@DocOtak Ready to review! I added doc for the accessor in the README, and a notebook that uses some ARGO data to show all functionalities of gsw-xarray, including the new accessor.
First off, this is amazing work.
I played around with it a bit with my data from CCHDO. For CTD data, it worked pretty well since there tended to be only a single variable with a given standard name. The bottle data, being more complicated, wanted a bit more massaging to make work. I found myself wanting the set_non_cf_name
to be an override, since it is only checked after failing to find a variable with the standard name, you will get an exception about multiple variables. For example, the argo example notebook you have. I somewhat expected this to work with the unfiltered argo data:
ds = xr.open_dataset('ARGO_example.nc')
with gsw.set_non_cf_name(p="pres_adjusted", SP="psal_adjusted", t="temp_adjusted"):
ds.gsw.SA_from_SP()
Aside: An even crazier idea would be for this to accept some callable that gets passed the dataset and figured out the right variable. e.g. set_non_cf_name(p=some_function_that_knows_how_to_pick_the_pressure_in_argo_files)
What are your thoughts on this figuring out intermediate parameters not already present in the dataset? e.g. ds.gsw.CT_from_t()
calculates SA from SP if not already present. Somewhat related, right now the get item syntax uses the function name: ds.gsw["SA_from_SP"]
if there is something that can figure out the intermediate properties, it might be better to just have the property name: ds.gsw["SA"]
. If you think this is too difficult, don't worry about it for this round of changes.
I found myself wanting the
set_non_cf_name
to be an override, since it is only checked after failing to find a variable with the standard name, you will get an exception about multiple variables.
The way it is done now was on purpose (cf my next comment).
ds = xr.open_dataset('ARGO_example.nc') with gsw.set_non_cf_name(p="pres_adjusted", SP="psal_adjusted", t="temp_adjusted"): ds.gsw.SA_from_SP()
About this, I agree it would make sense to make it possible. However, I think that it would be better to force the users to use standard names when available, i.e. an option that would look more like:
ds = xr.open_dataset('ARGO_example.nc')
with gsw.set_cf_name_preference(
sea_water_pressure="pres_adjusted",
sea_water_practical_salinity="psal_adjusted",
sea_water_temperature="temp_adjusted"
):
ds.gsw.SA_from_SP()
Using standard names would tackle the issue that some arguments (e.g. t
and p
) exist for both sea water and sea ice.
However, as you mentioned, it makes sense that if the user sets set_non_cf_name
he wants it to be an override. We can raise a warning if the user sets either p
or t
saying to be careful to not mix sea ice and sea water.
We still need to prioritize the way we treat options: if one sets an option both with standard name and set_non_cf_name
, I thin that option with standard name should have the priority, then set_non_cf_name
, and finally our internal mapping.
What are your thoughts on this figuring out intermediate parameters not already present in the dataset? e.g. ds.gsw.CT_from_t() calculates SA from SP if not already present. Somewhat related, right now the get item syntax uses the function name: ds.gsw["SA_from_SP"] if there is something that can figure out the intermediate properties, it might be better to just have the property name: ds.gsw["SA"]. If you think this is too difficult, don't worry about it for this round of changes.
About this, it is issue #30 that I decided to split into 2 different PR. The 1st one (this PR) to create the accessor and possibility to use the standard names, and a 2nd one to have the autodiscovery (to be done, but I had a good draft of the algo here https://gist.github.com/rcaneill/0aa8b9e72112d079c4919e462a4bb378)
@DocOtak I updated the code with your comments. I let you try it with your data (or check the notebook about ARGO), but it should work smoothly :)
I use an internal function from cf_xarray for performance, I opened an issue there: https://github.com/xarray-contrib/cf-xarray/issues/465
I use an internal function from cf_xarray for performance, I opened an issue there: xarray-contrib/cf-xarray#465
Had an answer and problem solved!
Oh awesome, in that gist. I'm not familiar with that graph library. Is it able to tell you if cycles or loops exist in the graph?
Oh awesome, in that gist. I'm not familiar with that graph library. Is it able to tell you if cycles or loops exist in the graph?
Hm I don't remember but I guess that yes. As far as I remember (I did this notebook 1 year ago), in the end the easiest was not to find a path within the graph with one classical algorithm but simply brute force to find all possible variables that one can compute with a certain set of inputs.
I played with the latest commits, I really like it and think this is the right track. There are two things I noticed:
I played with the latest commits, I really like it and think this is the right track. There are two things I noticed:
- Do you think the accessor API is stable? i.e. should we say it is "provisional" while we make sure it is ok in more "real world" testing. Part of this I think would be documenting our intent (vs the implementation that might not match due to bugs), I can take a first crack at it if you'd like.
I think that the API is stable if using the accessor + providing the name of each variable inside the dataset. If only relying on standard names (but practical salinity, as 2 standard names are used so it may raise issues) I think it is also robust. And the mixture of standard and non standard names using the options manager may need more focus. I tried to write good tests, but maybe we could write some more. We could raise a warning when the accessor is used, saying that it may be unable to autodetect the arguments in certain unusual cases (so versio 0.4.0 of gsw-xarray would have the warning, maybe also 0.4.1 if 0.4.1 adds the autodiscovery, and 0.4.2 would be considered as stable if we have enough feedback)
If you have time to document our intent, I'm pleased if you do so. I can try to write some more tests + resolve your comment about modules
This comment is to list all tests with should have for the accessor. Feel free to add more configurations if needed. I'll try to implement them tomorrow:
These tests should be done with the method ds.gsw.sigma0()
and the getitem ds.gsw["sigmao"]
Any reason to not merge (so we can not let this prevent forward movement)?
I think it was ready, I was waiting for your review but it seems that I forgot to ask you... So if you can have a last check, if seems good you can merge afterward
We can do any cleanups in future PRs I think, tests are passing and I don't think we have broken any currently used public interfaces. Maybe we could mark the "auto detector" as experimental. I also suspect a refactor of things might be worth it eventually, there are some deep indentations that are indicative of a "code smell" but I am not worried about that for now.
This PR makes possible to give a dataset as argument to gsw functions, and based on cf standard name the cf_xarray wrapper will pick the proper arguments. Example: