dankelley / oce

R package for oceanographic processing
http://dankelley.github.io/oce/
GNU General Public License v3.0
143 stars 42 forks source link

maybe as.oce() should get a 'renamer' parameter #2238

Closed dankelley closed 3 weeks ago

dankelley commented 2 months ago

I just noticed that it looks for SBE-type names (PRES forpressure, etc), which made sense when I thought those names might be so common as to be nearly universal. But, now that I'm looking at ERDDAP data, I see that there is a whole other suite of names.

Yesterday, I gave read.netcdf() a parameter called renamer, so that if you use that, then as.ctd() will work quite well.

But maybe as.ctd() ought to get a renamer argument as well. Its default could be for things to continue working as they are. But then, for example, a person might do

as.ctd(object, renamer=woce)
as.ctd(object, renamer=SBE)
as.ctd(object, renamer=DFO)
as.ctd(object, renamer=somethingElseMadeByTheUser)

I've not thought this through, but I see the pros as

  1. old behaviour is kept
  2. users can choose between different standard renamer vocabularies (we already have several in oce)
  3. users can easily make up their own vocabularies so that oce does not need future modifications to handle new naming conventions

As for cons, it's just a bit of programming. But it would remove some fairly ugly code that's in as.ctd() at the moment, and it should be easily verified to work.

Perhaps @richardsc and I can discuss this in person, sometime.

dankelley commented 1 month ago

I am just thinking out loud here, recording notes in a place that's easy to find.

My first thought had been of interceding in the read* code to do the renaming before the original-name mapping and flag-name mappings are set up. I will need to go into the code to remove that stuff, as a conversion is done to the new system. But now, I realize that the best way to do the renaming is at the very end of the processing within the read* functions, and to do that with a new function that users could call themselves.

Suppose that new function is called renamer(). I am not worried on the exact name right now. The idea would be that the user might read data in whatever way seems fit -- maybe reading csv, maybe with readLines, maybe with ncdf4, whatever -- to create an object likely with weird variable names taken from the original file. Let's call the user's function F(), say. then they might do e.g.

F(filename) |> renamer("SBE") |> as.ctd()

where renamer() recognizes that string as a request to do the seabird mapping. Notice that as.ctd() does not need a new argument.

Renamer would have a bunch of built-in schemes (like "SBE", above) but the user could also give it the name of a yaml (or similar file) or a list.

Here's an example working process.

First, read the data somehow into an oce object.

Second, do summary() to see var names. Notice that something called bark didn't get renamed.

Third, do

myReadFunction(filename) |>
    renamer(list("dogSound"="bark")) |>
    otherFunctionsInChain()

Or, with more than a few translations, put them in a yaml (or similar) file and provide that filename to renamer().

dankelley commented 1 month ago

Possibly a useful resource

$ cioos-siooc_data_transform git:(odf_transform) pwd
/Users/kelley/git/cioos-siooc_data_transform

$ cioos-siooc_data_transform git:(odf_transform) git grep bodc|grep ZZ
cioos_data_transform/cioos_data_transform/OceanNcVar.py:                bodc_code = "PSLTZZ01"
cioos_data_transform/cioos_data_transform/OceanNcVar.py:                bodc_code = "DOXYZZ"
cioos_data_transform/cioos_data_transform/OceanNcVar.py:                bodc_code = "DOXMZZ"
cioos_data_transform/cioos_data_transform/OceanNcVar.py:                bodc_code = "CNDCZZ"
dankelley commented 1 month ago

I'm fiddling with this a bit today. The work is being done in another repo (https://github.com/dankelley/oce-development/) because if I do it in sandbox/dk within the oce repo, then the whole package gets rebuilt and retested on every git-push operation, which seems like a silly waste of electricity.

Below is a checklist for one of the ODF/nc files I'll be using to start. Note that at https://www.youtube.com/watch?v=x1l6BJO9Kos you can see how I made this checklist in neovim ... also, you can see how lazyvim handles "sessions", which is a great feature that I never knew about before.

dankelley commented 1 month ago

I've had this in mind for a month, and no longer think it makes sense for a lot of functions to gain a new renamer argument. The better scheme would be e.g.

d <- file |> read.netcdf() |> rename() |> as.ctd()

where this rename() function is half-written in the www.github.com/dankelley/oce-development repo. This means that read and as functions will not need alteration. As for rename(), or whatever it gets called, it would let the user supply a built-in or user-defined dictionary (or a combination!) that gets used in renaming variables, flags, and units (plus other things I might be forgetting at the moment, writing this).

A key thing is that those dictionaries will be in an easily edited format. This would make it easier for users who want to submit patches if they see missing things in built-in dictionaries, without having to wait for a new CRAN release.

Over time, some read functions could evolve to use dictionaries, removing some of the special-purpose coding. But there are subtleties in such things. For example, in some conventions, the variable name species units, but the data file also gives units. So, code will have to allow for units being given or not, and for unit contradictions or not, and so forth.

I won't go into more details, except to say that there are more details to think about.

The simplest attack surface, as above, is read.netcdf(), which I'm having to edit to handle files on erddap servers (etc.). Everywhere I turn, there is a new naming scheme. And the connection between variable names and flag names can be very confusing (although, as discussed with @richardsc, this might be a problem of my looking at file formats that are still in development, as institutions set up erddap servers).

dankelley commented 3 weeks ago

This is superseded by #2251, so I'm closing it so that discussion can be more focussed.