dankelley / oce

R package for oceanographic processing
http://dankelley.github.io/oce/
GNU General Public License v3.0
142 stars 42 forks source link

add renamer() #2251

Open dankelley opened 3 weeks ago

dankelley commented 3 weeks ago

This relates to several other issues in which I've been sort of thinking out loud about how to address variable renaming (see e.g. #2238, #2239). It's confusing having so much discussion in so many issues, particularly because quite a lot of time has passed since the ideas first came to the fore. Therefore I am starting this new issue, and will be closing the others. (The links in sentence 1 will still work for the others, of course).

In a new branch called "rename", I am adding a new function called rename(). I do not intend to merge this stuff into the "develop" until it "feels right" to me. The general plan is as follows. Note that GH top-level views of issues only shows checklist completion if the list is in the first comment, so I will be revisiting the present issue over time, adding and subtracting content.

I will click items as I do them, and only push to GH when I have checked results.

Note that I am adding to the test suite as I do this work, to prevent problems of breaking thing N when working on thing N+1.

dankelley commented 3 weeks ago

Commit 6945773ad232e1042e794abb15f81b71c3dcbb3a of the "rename" branch seems to work reasonably well, on built-in tests and on the example. For the latter, which includes a CIOOS case, run the help for ?rename. (This is not in the online docs because they are attached only to the develop branch.)

dankelley commented 3 weeks ago

Below shows the "example(rename)" output. (I am putting this here because the online docs only get rebuilt when I push to either "main" or "develop", not to "rename", which is the branch in which I'm doing this work.

rename.md

dankelley commented 2 weeks ago

Today, I'm translating the approx 300-line conditional block in R/ctd.sbe.R into the new dictionary form (where it will become approx 100 lines). I am testing this in the oce-development repo. I want to be sure that the new scheme works, before I consider switching that block out for the dictionary method. (I certainly do not want to maintain both schemes.)

So far, with maybe a dozen lookups, things seem ok. But there are a LOT of regular expressions to get right.

And there is one nasty thing that will block me for a while: I have to get a regexp for that accented-e problem in some SBE files (including one in the test suite). In the present code, this is handled by not trying to match that character, but instead going past the name to the description of the name. But that's just because I got frustrated, trying to figure out a regexp. Surely there is a way. I am fiddling with this in isolation -- nowhere near the oce code.

dankelley commented 2 weeks ago

For more on the accented e problem, see https://github.com/dankelley/oce/issues/1977 (which is pretty long and I still don't really understand all the nuances).

dankelley commented 2 weeks ago

I'm going to go with what you see below. I don't think this will match any other sigma-ish things, and it prevents the problems of #1977, i.e.

  1. we cannot have non-ASCII characters in code
  2. we are told not to use useBytes=TRUE
  3. we need this to work on windows machines, with a wide range of encodings (and I think Canadians don't use the European encodings on the R test machines)
  4. the underlying R seems to have changed over time on these matters

And, with this, we can use the dictionary-style to rename things, without having to look in more detail at other things that appear in a * name = line in a CNV file.

PS. none of the test code is pushed to GH. I won't likely push until Sunday afternoon, and that will only be in the "rename" branch. I do want to get this working so I can forget about it during a busy time for classes, but I have no intention of changing read.ctd.sbe() until I am really sure I like this. And, even when I do, there will be a scheme with a new argument that will use the old code unless you set that new argument. I won't get into details here.

$ grep sigmaTheta ~/git/oce/inst/extdata/dictionary_sbe.csv
sigma-[^0:9]00,sigmaTheta,kg/m^3,
dankelley commented 1 week ago

I've added a test to the test suite, which shows that read.ctd.sbe(f) and read.oce(f)|>rename("sbe") yield the same results for a test CNV file that comes with the package.

Next I will run tests (and fix code as needed) to work with all the CNV files on my computer. I've got through 528 files so far but found a problem on the 529th, so I'll look into that. (It is a file that was made by a partial SBE toolchain.)

dankelley commented 1 week ago

Oh, this is because the file did not contain salinity, but read.oce() was adding it. Therefore, the object had salinity, but the names discovered by rename("sbe") did not include "salinity". (Remember, I said this was an intermediate file from an early step in the toolchain.)

dankelley commented 1 week ago

This is because read.oce() recognizes the file type and then calls read.ctd.sbe(), which notices that p, T and C are present but not S, and so creates S and inserts it with oceSetData(). But I think oceSetData() is not adding an entry to metadata$dataNamesOriginal.

dankelley commented 1 week ago

Looking at this again, I see that read.oce(filename, rename=FALSE) cannot make salinity, because it has no idea (with the whacky names) that it has something that stores temperature, etc. So I've altered my test code and will rebuild/rerun. Results in 1/2 hour.

dankelley commented 1 week ago

I've fixed the problem of column mismatch in the oce "rename" branch commit commit 2b538b5e06f0378a12d76655bf06411709688bd7; the oce-issues "main" branch commit f9b80b7366ba20da8180b11cda043873e21b9d03 exercises this on 1015 CNV files for data collected in the Beaufort Sea by IOS researchers. The test (in the 2241b.R file on the above-named oce-issues repo) is as below. It is pretty rigorous.

Also, I've fixed up some spelling problems, checked the URLs, and run the whole thing through the build-check cycle on my macOS machine, plus tested it remotely on the win-release machine used by devtools.

```R library(oce) #source("~/git/oce/R/oce.R") #source("~/git/oce/R/ctd.sbe.R") dir <- "~/data/arctic/beaufort/" nfiles <- 0L nfilesBad <- 0 if (dir.exists(dir)) { subdirs <- list.dirs(dir) for (subdir in subdirs) { files <- list.files(subdir, "*.cnv", full.names = TRUE) if (length(files) > 0) { cat(subdir, "\n") for (file in files) { nfiles <- nfiles + 1L d1 <- read.ctd.sbe(file) d2 <- try(read.oce(file, rename = FALSE)) if (inherits(d2, "try-error")) { cat("read.oce(\"", file, "\") failed\n") nfilesBad <- nfilesBad + 1 } else { d3 <- d2 |> rename("sbe") # Some IOS Beaufort Sea files (from an intermediate stage in pipeline) lack salinity addedSalinity <- FALSE if (!"salinity" %in% names(d3[["data"]])) { d3 <- oceSetData(d3, "salinity", swSCTp(d3[["conductivity"]] / 42.914, d3[["temperature"]], d3[["pressure"]]), unit = list(unit = expression(), scale = "PSS-78") ) addedSalinity <- TRUE } ok1 <- identical(d1[["metadata"]], d3[["metadata"]]) ok2 <- identical(d1[["data"]], d3[["data"]]) cat(sprintf( " %4d %s: metadata %s, data %s %s\n", nfiles, gsub(".*/", "", file), if (ok1) "ok" else "bad", if (ok2) "ok" else "bad", if (addedSalinity) "(after adding salinity column)" else "" )) if (!ok1 || !ok2) { nfilesBad <- nfilesBad + 1 cat("\n --- d1 summary ---\n") summary(d1) cat("\n --- d3 summary ---\n") summary(d3) stop("mismatch (remove this when all are ok)") } } } } } cat(sprintf("Of %d files examined; %d (%f %%) had rename() problems\n", nfiles, nfilesBad, 100 * nfilesBad/nfiles)) } ```