dankelley / oce

R package for oceanographic processing
http://dankelley.github.io/oce/
GNU General Public License v3.0

consider supporting seabird .btl files #1681

Closed clayton33 closed 4 years ago

clayton33 commented 4 years ago

There has been recent (order of hours) interest in using the .btl files that are output during data conversion when running the SBE data processing software. The .btl files summarize data recorded when bottles are fired. I'm not sure how much these files are used by others, but having the ability to read them in will eventually help hook up water sample data with the CTD data. These files are a part of standard seabird output when water sampling is involved, so I don't foresee this file format going anywhere anytime soon.

Documentation of this file format is probably included in the seabird data processing manual that you have, but I also have access to older versions. It seems like seabird is only providing a manual for the most recent version of the processing software (https://www.seabird.com/software). The .btl files that I have access to were not created using the most recent version, so I can send that manual along.

If other developers think this is worthwhile, then I can send along a number of .btl files, as well as the manual, in an e-mail.

(tagging @Xandac and @atcogswell so they can follow along)

dankelley commented 4 years ago

This looks very useful, thanks. I'd be happy to have a go at reading the files in oce.

The first step would be if someone could send me some data files. (Also, please email me the manual, of course.)

clayton33 commented 4 years ago

Sent an e-mail with a handful of .btl files from the past few years from various cruises, along with older manuals. The files were created using Seasave Version 7.26.6.26

dankelley commented 4 years ago

I got the files. The docs do not mention a format, so I have started looking at the files. Actually, read.ctd.sbe() reads the header part OK, but it chokes when it comes to the actual data. This is because it discovers column names from header lines like

# name 0 = scan: scan number
# name 1 = timeS: time [s]
# name 2 = pr: pressure [db]
# name 3 = depS: depth, salt water [m]

etc, and no such lines are present in these .btl files. However, there is some information on column contents in the two lines before the columns, as e.g. from a sample file

    Bottle        Date Sbeox0ML/L Sbeox1ML/L      Sal00      Sal11 Potemp068C Potemp168C  Sigma-é00  Sigma-é11       Scan      TimeS       PrDM      T068C      C0S/m      T168C      C1S/m       AltM    Par/log    Sbeox0V    Sbeox1V    FlSPuv0       FlSP         Ph TurbWETbb0       Spar   Latitude  Longitude
  Position        Time                                                                                                                                                                                                                                                                                              
      1    Apr 07 2019     7.0638     6.0483    30.9447    30.9292     0.8312     0.8306    24.7946    24.7822      15914    663.042     65.012     0.8335   2.664024     0.8329   2.662773       6.55 1.0708e-01     2.7315     2.1285     4.5351 1.0022e+00      8.063 1.6605e-03 3.2912e+00   44.69454  -63.64062 (avg)
              11:51:11                                                                                                 14      0.595      0.032     0.0001   0.000016     0.0001   0.000022       0.02 1.6951e-04     0.0003     0.0570     0.2750 2.0383e-01      0.001 2.4408e-04 1.0988e-03    0.00000    0.00000 (sdev)

I can adjust read.ctd.sbe() to work with this. I don't really think it should be a whole new function, since the rest of the header seems to be read acceptably by read.ctd.sbe().
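For reference, the way column names can be recovered from `# name` header lines in a .cnv file can be sketched as below. This is only an illustration of the idea, not the code inside read.ctd.sbe(); the variable names (`header`, `nameLines`, `originalNames`) and the regular expression are assumptions made for this example.

```r
# Sketch only: pull column names out of .cnv-style "# name" header lines.
# This is NOT the oce implementation; names and regex are illustrative.
header <- c("# name 0 = scan: scan number",
            "# name 1 = timeS: time [s]",
            "# name 2 = pr: pressure [db]",
            "# name 3 = depS: depth, salt water [m]")
# Find the lines that declare a column.
nameLines <- grep("^# name [0-9]+ = ", header, value=TRUE)
# Keep the token between "= " and the first ":".
originalNames <- sub("^# name [0-9]+ = ([^:]+):.*$", "\\1", nameLines)
print(originalNames) # "scan" "timeS" "pr" "depS"
```

A .btl header has no lines matching this pattern, so `nameLines` would come back empty, which is why the column names have to be taken from the two lines above the data instead.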

The thing that makes it a bit tricky is that the header is two lines, and each data element is also two lines (scroll the above to see that they have lined things up by spaces -- jeeze Louise, what kind of person does that?)

dankelley commented 4 years ago

I had a look at https://github.com/castelao/seabird which is python code that is said to read cnv and btl files, and I see there an assumption that the column names are 11 chars wide.

That's the pattern I'm seeing in my sample files, except that (as in the sample file for https://github.com/castelao/seabird) for the first item, "Bottle", character 11 is a space after the name rather than the last letter of it. So, anyway, it looks as though I may be safe to assume 11 chars per column, so long as I rename the "Bottle" column.

Oh, and I'm not going to use that "Position" name anyway, nor the "Time" name, because the second line of names is the same in all the files I've examined, so I am disinclined to write 5 lines of code to figure it out.
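A minimal sketch of the fixed-width idea, assuming 11 characters per column and two lines per bottle (averages, then standard deviations), might look like the following. The helper `splitFixed()` and the tiny synthetic record are hypothetical, made up for this illustration; this is not the code that went into read.ctd.sbe().

```r
# Sketch only, assuming 11-character columns as seen in the sample files.
width <- 11
# Synthetic .btl-like lines (hypothetical values), built with sprintf so
# that every field is exactly 11 characters wide.
nameLine <- paste0("    Bottle ", sprintf("%11s", "Date"),
                   sprintf("%11s", "Sal00"), sprintf("%11s", "T068C"))
avgLine  <- paste0(sprintf("%-11s", "      1"), sprintf("%11s", "Apr 07 2019"),
                   sprintf("%11s", "30.9447"), sprintf("%11s", "0.8335"), " (avg)")
sdevLine <- paste0(sprintf("%11s", ""), sprintf("%11s", "11:51:11"),
                   sprintf("%11s", "0.0010"), sprintf("%11s", "0.0001"), " (sdev)")

# Hypothetical helper: cut a line into n fields of fixed width, trimming blanks.
splitFixed <- function(line, n, width=11) {
    starts <- seq(1, by=width, length.out=n)
    trimws(substring(line, starts, starts + width - 1))
}

n <- nchar(nameLine) %/% width           # number of columns in the name line
colNames <- splitFixed(nameLine, n)
avg <- splitFixed(avgLine, n)            # trailing " (avg)" is ignored
sdev <- splitFixed(sdevLine, n)          # trailing " (sdev)" is ignored

isData <- !(colNames %in% c("Bottle", "Date"))
res <- as.list(as.numeric(avg[isData]))
names(res) <- colNames[isData]
sdevPart <- as.list(as.numeric(sdev[isData]))  # blank fields become NA
names(sdevPart) <- paste0(colNames[isData], "_sdev")
res <- c(list(bottle=as.integer(avg[1]), date=avg[2], time=sdev[2]), res, sdevPart)
str(res)
```

The second name line ("Position", "Time") is skipped here, in line with the comment above that it is the same in every file examined.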

dankelley commented 4 years ago

I've done something that is worth trying, in "develop" commit ea934e2e84aabff34f4c04d55acbedc72b5129c0. The docs explain how it works; see the item on the newly-added btl argument for read.ctd.sbe().

Below is some test code.

library(oce)
# list.files() takes a regular expression, not a shell glob
files <- list.files(pattern="\\.BTL$")   # e.g. "001A001.BTL"
for (i in seq_along(files)) {
    d <- read.ctd.sbe(files[i], btl=TRUE)
    summary(d)
    # ratio of standard deviation to average, per bottle
    cat("\nT068C sdev/avg: ", d[["T068C_sdev"]] / d[["T068C"]], "\n")
}

Note that read.ctd.sbe() is not (for now, at least) trying to rename the data items. What it does is to use the names that are in the file. For the standard-deviation part of the data, it tacks on a _sdev to the column name. You can see this in action in the cat() call.

@clayton33 is this useful, do you think? (I am not sure what the purpose is.) I can probably use existing code to change those variable names into e.g. "salinity" and so forth, if you really want that, but I think it may be best to stick to the names that are in the seabird file, anyway. Note that you can always use [[ on a ctd object with either the "nice" name, like "temperature", or the original one, as in

library(oce)
data(ctd)
ctd[["t068"]]

but notice there that the original name does not match the name in your file. In your file, the initial letter is "T", not "t", and also the name in your file ends with a "C" but the one in data(ctd) does not. This might be because of a change in SBE format -- I don't know.

clayton33 commented 4 years ago

Thanks, I had a feeling it would be similar to read.ctd.sbe(). I'll get back to you on the file format after the long weekend, but for now it looks pretty good. I'll close for now and re-open if necessary. Might be good to code the variable names into "nice" names. I'll check if the names in these files are standard seabird output names or if it is something on our end.

clayton33 commented 4 years ago

A handful of us have compared .cnv files to .btl files and it appears that the only difference between the header names is that the first letter in the .btl file header row is capitalized. Not sure why this is; maybe they haven't compared these files in a while to identify the differences. So it might be worth a test to see how well cnvName2oceName() does when the first letter is changed to lowercase? It might make sense to do this for consistency, since it is a seabird file and it's being done for .cnv files.
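As a rough sketch of the lower-casing step, the first letter of each .btl column name could be changed as below. The helper `lowerFirst()` is hypothetical, and the example names are taken from the sample header shown earlier in this thread; whether the results match the .cnv names exactly is the thing that would need testing.

```r
# Sketch only: lower-case the first letter of each .btl column name.
# lowerFirst() is a hypothetical helper, not part of oce.
lowerFirst <- function(x) paste0(tolower(substring(x, 1, 1)), substring(x, 2))
btlNames <- c("Sal00", "T068C", "PrDM", "Sbeox0ML/L")
print(lowerFirst(btlNames)) # "sal00" "t068C" "prDM" "sbeox0ML/L"
```

The converted names could then be compared against the names found in a matching .cnv header before handing them to any renaming code.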

However, if you think that the original names should stay, then disregard this, close it, and call me crazy.

dankelley commented 4 years ago

Actually, @clayton33, my main concern is not the difficulty with creating the names. Rather, it has to do with duplicated fields. It's common to e.g. have two temperature sensors. What read.ctd.sbe() does is to assign the first one to appear in the header (which is the first column, reading columns from left to right) as temperature and the second one as temperature2, and similarly for any field that gets duplicated.

So, if we were sure that the .btl file always listed all the same fields, in the same order, then the temperature from the CNV would correspond to the temperature in the BTL. But what if the BTL temperature corresponds to the second temperature in the CNV file, i.e. to temperature2 in the ctd object made for the CNV file? Then things would be very confusing for the user.

However, with a ctd object, you can always access using the "originalName", as e.g.

library(oce)
data(ctd)
ctd[["t068"]]

and so my feeling on this is to leave the BTL output with these original names, because then there is no way for the user to get confused.

I'm not sure if this makes sense. Does it?

Also, if we know that the BTL file always lists the same things, in the same order -- something I'd need a manual for, to be sure -- then we can certainly convert using cnvName2oceName(), perhaps with lower-casing.

It seems like I'm resisting, but my goal is to avoid getting users into a trap because of names that don't correspond between CNV and BTL files.

On another matter, can you tell me what you actually do with the BTL data? I assume it's for some sort of calibration or something?

clayton33 commented 4 years ago

Ah ok, ok. I see, I see. I'll try to see if there is any documentation on the above. Might take me a few days.

The .btl data is going to be used to hook up water sample data with what the ctd was sampling at the time that the bottle was fired.

\begin{DFO BIO history lesson} For some time, after a CTD profile was taken, a .QAT file was created and captured similar things to a .btl file, but without all the metadata, which was recently discovered. The goal is to start to use these .btl files instead of continuing with the .QAT files as variable names in the .btl files are standardized across all the various CTD setups, so it would make processing a bit easier and faster and is going to help us standardize things across various cruises. Right now we're trying to streamline some things, and this is our first step. \end{DFO BIO history lesson}

dankelley commented 4 years ago

Thanks for the lesson -- that's very helpful. Maybe you and @richardsc could talk about this issue of whether to rename things read from BTL files.

For me, too, no rush.

clayton33 commented 4 years ago

I had a closer look at the difference between the .cnv variables and the .btl variables. There's actually a convention for naming primary and secondary temperature sensors, e.g. t068 is the primary and t168 is the secondary, so there should never be a mix-up if we were to rename them, and I'm guessing it's the same thing that you do when you read in a .cnv file?

dankelley commented 4 years ago

Hm. Right now, read.ctd.sbe() names things by the order in the header, not by whether the first digit after the characters is 0, 1 etc. The relevant lines in code are R/ctd.sbe.R line 819 et seq., viz.

    for (iline in seq_along(nameLines)) {
        nu <- cnvName2oceName(lines[nameLines[iline]], columns, debug=debug-1)
        ##newname <- unduplicateName(nu$name, colNamesInferred)
        ##colNamesInferred <- c(colNamesInferred, newname)
        ## dataNamesOriginal[[newname]] <- nu$nameOriginal
        if (nu$name %in% namesUsed) {
            trial <- 2
            while (paste(nu$name, trial, sep="") %in% namesUsed) {
                trial <- trial + 1
                ##message("trial=", trial)
                if (trial > 10)
                    break
            }
            ## message("** REUSING NAME '", nu$name)
            nu$name <- paste(nu$name, trial, sep="")
            ##message("  -> '", nu$name, "'")
        }
        namesUsed <- c(namesUsed, nu$name)
        dataNamesOriginal[[nu$name]] <- nu$nameOriginal
        colUnits[[iline]] <- nu$unit
        colNamesInferred <- c(colNamesInferred, nu$name)
        ##message("SBE name=", nu$name, "; nameOriginal=", nu$nameOriginal, "; unit='", as.character(nu$unit$unit),"'")
    }
    colNamesInferred <- unduplicateNames(colNamesInferred)

where unduplicateNames() is in R/misc.R (but I think it likely doesn't do anything, because the code above won't have duplicates ... so I think the unduplicateNames() call is a remnant).

Anyway, if (that's if) the header lists e.g. t068 before t168 then what you say is true: the first will be called temperature and the second will be called temperature2.

But what if there is a t090C entry, in addition to t068 and t168 entries? Which do we call temperature, and which become temperature2 and temperature3? In the code at the moment, we solve that question in a simple way: the first one named in the header is temperature, the second is temperature2 and the third is temperature3. The naming follows the header.

And that's the worry I have. If the BTL file is "synched up" with the CNV file, meaning that the variables are listed in the same order, then things will be fine. But, if the order differs, the present naming scheme will lead to mismatches. I think that would be confusing to the user. Of course, there would be no problems if the user worked with original names, and not "oce" names. That's why I lean towards only using original names for BTL data. My assumption is that BTL data will get used by experts, who know that e.g. t068 is on one scale, and t090 is on a different scale.
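The mismatch can be illustrated with a simplified stand-in for the order-based naming. The function `nameByOrder()` below is hypothetical and much simpler than the loop quoted above, and the swapped column order for the BTL file is an invented scenario, not something seen in the sample files.

```r
# Sketch only: a simplified stand-in for order-based naming (not the oce code).
# The i-th temperature-like column becomes "temperature", "temperature2", ...
nameByOrder <- function(originalNames, oceName="temperature") {
    suffix <- c("", as.character(seq_along(originalNames)[-1]))
    setNames(originalNames, paste0(oceName, suffix))
}

cnv <- nameByOrder(c("t068", "t168"))    # order as in a CNV header
btl <- nameByOrder(c("T168C", "T068C"))  # hypothetical BTL with the order swapped

print(cnv[["temperature"]]) # "t068"  (primary sensor)
print(btl[["temperature"]]) # "T168C" (secondary sensor)
```

In this invented case, "temperature" refers to the primary sensor in one object and to the secondary sensor in the other, whereas the original names remain unambiguous.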

I am not sure if this is clear or not. At this stage, I think you (and maybe @richardsc) are likely to know a lot more about the form of these data than I do. In case it would be easier than typing, I could set up a zoom meeting at 10AM (maybe with @richardsc also?)

clayton33 commented 4 years ago

Ah, sorry, I missed this. I recently haven't been good about keeping a gmail tab open! Let's switch over to e-mail and figure out a time.

clayton33 commented 4 years ago

After our virtual f2f meeting, the decision was to retain the original variable names.