dankelley / oce

R package for oceanographic processing
http://dankelley.github.io/oce/
GNU General Public License v3.0
142 stars 42 forks source link

top 100m of water column missing in section plots #1472

Closed bck243 closed 5 years ago

bck243 commented 5 years ago

Short summary of problem

oce version 1.0.1 fails to plot top 100m of water column when plotting a section with ztype= "image"

Details (optional)

I'm using R version 3.4.1

What you did

I ran the example code:

library(oce)
data(section)
sg <- sectionGrid(section)
## 1. start of section, default fields.
plot(head(section))
## 2. Gulf Stream
GS <- subset(section, 109<=stationId&stationId<=129)
GSg <- sectionGrid(GS, p=seq(0, 2000, 100))
plot(GSg, which=c(1, 99), map.ylim=c(34, 42))
par(mfrow=c(2, 1))
plot(GS, which=1, ylim=c(2000, 0), ztype='points',
     zbreaks=seq(0,30,2), pch=20, cex=3)
plot(GSg, which=1, ztype='image', zbreaks=seq(0,30,2))

par(mfrow=c(1, 1))

What you expected to happen

When run with oce version 0.9.21 on R version 3.3.2, the section plots normally

What happened

When run on updated version, top 100 meters are blank on section diagram (see image) top100_blank

How urgent is this?

Non-urgent, would be nice to have fix in a few months

Output from sessionInfo()

sessionInfo() R version 3.4.1 (2017-06-30) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: macOS High Sierra 10.13.6

Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] oce_1.0-1 gsw_1.0-5 testthat_2.0.1

loaded via a namespace (and not attached): [1] compiler_3.4.1 magrittr_1.5 R6_2.2.2 tools_3.4.1 Rcpp_0.12.17 rlang_0.2.1

dankelley commented 5 years ago

Thanks for the bug report. I'll have a look sometime this week, or perhaps on Saturday.

dankelley commented 5 years ago

I think it's in the gridding, not the plotting, because

 GSg[["station"]][[1]]

(and similar for station 2, 3, ...) yields

ctd object has data as follows.
   pressure:   0, 100, ..., 1900, 2000 (length 21)
   temperature:     NA, 21.517, ..., 3.9063, 3.8674 (length 21)
   salinity:    NA, 36.69, ..., 34.951, 34.958 (length 21)
   salinityBottle:   NA, 36.7, ..., 34.956, 34.961 (length 21)
   oxygen:     NA, 204.93, ..., 272.31, 272.08 (length 21)
   silicate:     NA, 1.7152, ..., 14.501, 15.047 (length 21)
   nitrite:       NA, 0.078397, ..., NA, NA (length 21)
   NO2+NO3:      NA, 0.48026, ..., 17.068, 17.213 (length 21)
   phosphate:       NA, 0.011403, ..., 1.0337, 1.0427 (length 21)
dankelley commented 5 years ago

I've made a test file at https://github.com/dankelley/oce-issues/blob/master/14xx/1472/1472a.R and when I've fixed this bug, that file will run cleanly, first checking the top point in each station, and doing the plots only if these are all non-NA.

dankelley commented 5 years ago

PS. question for @richardsc and @clayton33 at the end...

The problem is that the top pressure in these CTD casts is not zero, but rather a positive number. This matters because the default value for the method argument of the sectionGrid() function is method="approx", which means that ctdDecimate() gets called with that same value, which in turn means that approx() is used to do the interpolation of salinity, temperature, etc., up to the surface at pressure=0. And, the default action of approx() is to return NA as "y" outside the range of the "x" that is provided. (I'm sorry if this is a convoluted explanation.)

A solution will be for me to alter how approx() gets called by ctdDecimate(), so that the value nearest the surface is used to extrapolate to the surface. I can make it so that the top value gets repeated all the way up to the surface, forming a fake mixed layer. However, I am not convinced that I like this as a default action for sectionGrid(), partly because it would be changing something that oce has done for many years. I will instead cook up a new name, perhaps method="approxWithMixedLayer" (although that's way too long, I know). That way, people who don't like seeing a thin blank layer at the top of a section would have a way to get around it.

QUESTION TO OTHER DEVELOPERS: do you like this proposed solution? If so, can you think of a nicer name for method?

bck243 commented 5 years ago

Thanks for the quick response, Dan! I wonder what the solution was to this problem in oce version 0.9.21? Here is an image of how the plot looks in that version. It seems to plot the actual value of T rather than a mixed layer approximation.

old_oce_version rplot

dankelley commented 5 years ago

Excellent question -- I think we used to use method="boxcor" as the default, in the older versions. You can see that by issuing the following in a terminal, once you've cloned oce:

git diff 2ce86a481f8e2562c379c5bd80ba2a7877e7bcd3 R/ctd.R

part of the result is below

#' @family things related to \code{ctd} data
-ctdDecimate <- function(x, p=1, method="boxcar", e=1.5, debug=getOption("oceDebug"))
+ctdDecimate <- function(x, p=1, method="boxcar", rule=1, e=1.5, debug=getOption("oceDebug"))
 {

Note, though, that I am not quite sure when I changed from 0.9.21 to what we have now, but

git diff 2ce86a481f8e2562c379c5bd80ba2a7877e7bcd3 DESCRIPTION

contains

Version: 0.9-21
-Date: 2016-11-18
+Type: Package
 Title: Analysis of Oceanographic Data
-Authors@R: c(person(given="Dan", family="Kelley", email="Dan.Kelley@Dal.Ca", role=c("aut", "cre")),
- person(given="Clark", family="Richards", email="clark.richards@gmail.com", role=c("aut")),
- person(given="Chantelle", family="Layton", email="chantelle.layton@dal.ca", role=c("ctb"), comment="curl() coauthor"),
- person(given="British Geological Survey", role=c("ctb","cph"), comment="magnetic-field subroutine"),
- person(given="Pablo", family="Valdés", role=c("ctb"), comment="Spanish translation of messages"))
+Version: 1.0-2.1

so the change must have been sometime between commit 2ce86a481f8e2562c379c5bd80ba2a7877e7bcd3 and now.

dankelley commented 5 years ago

In commit 9320d7b75a4ce21f4029f29be943ba26fd55c66d of the "develop" branch of oce, the test code

library(oce)
data(section)
GS <- subset(section, 109<=stationId&stationId<=129)
GSg <- sectionGrid(GS, p=seq(0, 2000, 100), debug=3, method="approxML")
stations <- GSg[["station"]]

for (istation in seq_along(stations)) {
    cat("station ", istation, "\n")
    station <- stations[[istation]]
    expect_true(is.finite(station[["temperature"]][1]))
    expect_true(is.finite(station[["salinity"]][1]))
    cat("  ... top level is non-NA, so this station passes the test\n")
}

if (!interactive()) png("1472a.png")
par(mfrow=c(2, 1))
plot(GS, which=1, ylim=c(2000, 0), ztype='points',
     zbreaks=seq(0,30,2), pch=20, cex=2)
plot(GSg, which=1, ztype='image', zbreaks=seq(0,30,2))
if (!interactive()) dev.off()

produces as shown below. The idea is to use method="approxML" now, to get constant-extrapolation from the uppermost data point to the uppermost grid level. I did not make this the default, because I don't think the default should be to create data (which is what we are doing, in assigning a value at p=0).

Note that the help page for sectionGrid has an example showing how to do this. Thus, an easy way to see the new scheme is to look at the final two-panel graph created by example(sectionGrid).

To the reporter: please look this over, at your convenience of course, and either add a comment asking for more help, or otherwise add a comment and then close the issue. (In the oce group, we avoid having the developers close issues ... this is a way to make it more likely that the reporter has got an answer that is helpful.)

Thanks for pointing this out, and for the great detective work in identifying when it used to work. 1472a

bck243 commented 5 years ago

Sorry, I still don't understand why the solution would need to extrapolate from the mixed layer when there is data for the top 100m. Maybe in the example data the first datapoint is at 100m (If I understand the ctd pressure output you posted properly), but when I consider my own data, I have hundreds of temperature observations in the top 100m starting at 1.9 dbar, but the top 100m still don't plot:

library(oce)
d <- read.oce('https://cchdo.ucsd.edu/data/12735/33RO20161119_hy1.csv')
plot(d, which='temperature', xtype='distance', ztype='image', ylim = c(200,0))

temp rplot

In my case, would the fix extrapolate from 1.9 dbar or still from 100m? For my purposes, the surface layer is important (I have several genetics samples above the mixed layer for which accurate metadata matters) so I'll have to find another means of plotting sections if this is the only solution.

Thanks again for all your help!

dankelley commented 5 years ago

This is because the default action for sectionGrid() is to call ctdDecimate(..., method="approx"), which ends up using the built-in function approx() to do linear approximation between ungridded y=y(x) data to a gridded scheme yg=yg(xg). Linear approximation to some particular xg requires data on each side of xg. That's the problem. There are no data "above" p=0.

If you were to set the p argument to sectionGrid(), instead of letting it choose a default based on the deepest station, you could make the whited-out zone be 50m, or 10m, or whatever you like ... but it would never extrapolate above the shallowest sample for a given station.

What this new scheme, used by sectionGrid(..., mthod="approxML") does, is to go outside the data range, using constant values. Thus, if your top data point is T=10deg at p=8dbar, it would extend that value, extrapolating to give you 10deg at the surface where p=0dbar.

To get this new action, you need to grid explicitely by calling sectionGrid(). Your example, in the just-previous comment, of course is using a defaulted call to sectionGrid(). And, as I said, my choice is not to alter that default call. I just don't like defaults that invent data. If I were you, I would do e.g.

dg <- sectionGrid(d, p=c(2, 5, 10, 15, 20, seq(50, 5000, 50))

or something (maybe even p="levitus"). The point of this would be to have a fine grid near the surface. I would prefer a ragged white-out zone at the surface, rather than doing the mixed-layer scheme that I just added, because then I could see that we had failed data in the top 50m in some station or another, etc. This depends a lot on the type of data. For example, physical data are very often on 1m resolution, demanding one approach, but chemistry (bottle) data might only have a handful of datus in the whole profile, demanding another approach.

I'm not sure if I'm explaining this very well. Some of the relevant concepts of e.g. approx(), and a little about sections, are in my book [Dan E. Kelley, Oceanographic Analysis with R (New York: Springer-Verlag, 2018)], but of course the book doesn't have this new feature 😄

bck243 commented 5 years ago

The sectionGrid() solution works great, thank you! And thanks for the additional explanation!

dankelley commented 5 years ago

I've enhanced standardDepths() so that it can do subgridding, and included this as an example in the help provided by ?sectionGrid. Now, the final graph produced by example(sectionGrid) is as shown. Note that using p=standardDepths(5) is may be option for a lot of datasets that result from bottle samples (i.e. coarse datasets), because it will show a ragged whiteout region, respecting the data better than the method="approxML" scheme that I added yesterday. (I think both have merits, though, and a wide range of schemes is likely good because of a wide range of data, including bottle and ctd data, tow-yow data, glider data, etc.)

screen shot 2019-01-09 at 11 15 14 am
bck243 commented 5 years ago

Great, thank you!