dankelley / oce

R package for oceanographic processing
http://dankelley.github.io/oce/
GNU General Public License v3.0

Reading file from BaseX2 CTD #1924

Closed — ashleystanek closed this issue 2 years ago

ashleystanek commented 2 years ago

Hello Dr. Kelley and Dr. Richards, I am trying to load data from my ctd into oce but am running into some issues getting through the first step.

I have a BaseX2 from AML Oceanographic with temperature, pressure, and conductivity sensors. Using the software that comes with the instrument (SeaCast), I can export the data in several formats, but when trying to import them using read.oce or read.ctd, I receive an error saying the filetype is "unknown", and I can't find any mention of these filetypes in the documentation for oce. I can export to the following formats: 1) a csv that includes the same header as in the attached file, but with the data columns in any order, 2) PDS2000 (.txt), 3) Kongsberg (.asvp), 4) CARIS (.svp), 5) HYPAK (.vel), 6) HYPAK 2015 (.vel), 7) HiPap (.usr), 8) Sonardyne (.pro), 9) QINSy (.csv).

I have attached two file types of the same dataset (I've removed a chunk of the rows so it doesn't contain the whole cast), but run into the same issue with both.
Custom export 026043_2021-07-21_17-36-45.txt
Exported format 026043_2021-07-21_17-36-45.csv

library(oce)

filefolder <- ""  # Add filepath here

# These first two lines read the file that was on the instrument (Original 
# format):
# When I call read.oce I receive an error saying the filetype is "unknown"
read.oce(paste0(filefolder, "Original format 026043_2021-07-21_17-36-45.csv"))

# When trying read.ctd it says it cannot determine the file type in the first row
read.ctd(paste0(filefolder, "Original format 026043_2021-07-21_17-36-45.csv"))

# These next lines read the file after the data was read by the software used
# by the instrument, SeaCast, and then exported as a .txt (Exported format):
read.oce(paste0(filefolder, "Custom export 026043_2021-07-21_17-36-45.txt"))
read.ctd(paste0(filefolder, "Custom export 026043_2021-07-21_17-36-45.txt"))

Output from sessionInfo():

R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252

system code page: 65001

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] oce_1.5-0 gsw_1.0-6

loaded via a namespace (and not attached):
[1] compiler_4.1.2 tools_4.1.2    Rcpp_1.0.8

Thank you for the help, Ashley

richardsc commented 2 years ago

Thanks for using oce @ashleystanek!

We can definitely take a look. Briefly, there are typically two kinds of manufacturer files that we encounter: ones that are predictable, contain suitable metadata, and are documented well enough that we don't need to make (dangerous) guesses about what things are; and ones that are just crude exports of the data-only portion of whatever was recorded. Generally I advise against trying to adapt oce to read the latter, because such code can be difficult to maintain and challenging to get right. What I usually recommend instead is a "custom read function" that the user can use to read the relevant parts of the file into an oce object, without actually importing such a function into the package.
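To make that concrete, a custom read function can be quite short. Here is a minimal sketch (the file name, skip count, and column layout are hypothetical; adjust them to match the actual export):

```R
library(oce)

# Minimal sketch of a "custom read function". Everything specific here
# (skip=, the column names, the conductivity unit) is an assumption to be
# adapted to the file at hand.
readMyCtd <- function(file) {
    d <- read.csv(file, skip=10, header=FALSE,
        col.names=c("temperature", "conductivity", "pressure"))
    S <- swSCTp(conductivity=d$conductivity, temperature=d$temperature,
        pressure=d$pressure, conductivityUnit="mS/cm", eos="unesco")
    as.ctd(salinity=S, temperature=d$temperature, pressure=d$pressure)
}
# usage: ctd <- readMyCtd("cast.csv"); summary(ctd); plot(ctd)
```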

That being said, from a quick look at the file exports you sent, it looks like the AML files would likely fall into the former category. There is quite a bit of metadata, and it appears as though the data columns are described, with names and units.

It appears that the csv file is almost identical to the txt file, but only the txt file actually identifies the column names (in the very first line), e.g.

Date,Time,Conductivity (mS/cm),Temperature (C),Depth (m),Battery (V)

DISPLAY OPTIONS                 
[Instrument]                    
Type=Base.X2                    
EmulationMode=disabled                  
UseCustomHeader=yes                 
...

So, I think it's worth some discussion about whether we include a read.ctd.aml() function in oce. If I were voting, I'd probably vote +0.75: the file format looks reasonable (though not completely trivial, owing to the need to grep for lots of data/metadata things), and I actually have an AML that I use from time to time. However, I'm not 100% confident that the txt export format won't change in the future.

It's too late for me now to try coding this up, but since I know @dankelley will be up early I've put your two data files into the oce-issues repo at https://github.com/dankelley/oce-issues/tree/main/19xx/1924 😂

dankelley commented 2 years ago

I'll be a copycat and vote +0.75 as well. (@ashleystanek, we use a voting scheme in which votes are numbers between -1 and +1, so that a decision is found simply by adding votes.) I wrote most of the existing oce code, so it makes sense that I look at this. I'll do so today, perhaps before the Bedford Institute of Oceanography seminar.

dankelley commented 2 years ago

I've had a look, and put a test file into the github directory that Clark mentioned. @ashleystanek you can clone that directory from git@github.com:dankelley/oce-issues.git, if you want to keep abreast of things. There's no strong need to, because I'm putting the gist below (click the Details word to see it all).

Please note that I am not yet reading any of the metadata beyond the first line (from which I infer column names). Also, I am not parsing those column names fully, e.g. not allowing for the (small, I hope) possibility that temperature might be measured in degF or something wacky like that, or that depth might not be in metres. A proper function would have checks on such things, e.g. maybe the user asked for pressure to be saved, not depth.
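To illustrate the sort of check meant above, here is a sketch (a hypothetical helper, not part of oce) that refuses a temperature column whose unit is not degC:

```R
# Sketch of a unit guard; column names in these files look like
# "Temperature (C)", so inspect the parenthesized part.
checkTemperatureUnit <- function(colName) {
    unit <- sub(".*\\((.*)\\).*", "\\1", colName)
    if (unit != "C")
        stop("temperature unit is '", unit, "', but only degC is handled")
    invisible(unit)
}
checkTemperatureUnit("Temperature (C)")  # passes silently
```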

Ashley, if this seems promising, I can expand on this (e.g. looking into the metadata more) and rework this into a proper oce function, named (as Clark suggests) read.ctd.aml().

# R file

```R
library(oce)
f <- "Custom.export.026043_2021-07-21_17-36-45.txt"
lines <- readLines(f, encoding="UTF-8-BOM", warn=FALSE)
lines[1]
col.names <- strsplit(lines[1], ",")[[1]]
#Comments=
#
#2021-07-21,17:36:46.17,03.107,05.867,0.21,007.80
#2021-07-21,17:36:46.21,06.010,05.986,0.25,007.80
CommentsLine <- grep("^Comments=", lines)
CommentsLine
data <- read.csv(f, skip=CommentsLine+1, header=FALSE, col.names=col.names)
# check
head(data, 2)
T <- data[["Temperature..C."]]
C <- data[["Conductivity..mS.cm."]]
p <- swPressure(data[["Depth..m."]])
# check
head(data.frame(T, C, p), 2)
S <- swSCTp(conductivity=C, temperature=T, pressure=p, conductivityUnit="mS/cm", eos="unesco")
# check
head(data.frame(T, C, p, S), 2)
# If you want to use the new Gibbs equation of state, you need
# to put longitude and latitude into the object. But these data look
# to be very fresh, so I sort of doubt that this EOS makes sense. Anyway,
# I'm inserting a North-Atlantic location, which is likely wrong for you.
ctd <- as.ctd(S, T, p, longitude=-60, latitude=40)
# Let's get an idea of what's in the file. (My particular interest
# is whether the S makes sense.)
summary(ctd)
if (!interactive()) png("01dk.png")
plot(ctd)
if (!interactive()) dev.off()
```

# Text output (please check)

```
[1] "Date,Time,Conductivity (mS/cm),Temperature (C),Depth (m),Battery (V)"
[1] 99
        Date        Time Conductivity..mS.cm. Temperature..C. Depth..m. Battery..V.
1 2021-07-21 17:36:46.17                3.107           5.867      0.21         7.8
2 2021-07-21 17:36:46.21                6.010           5.986      0.25         7.8
      T     C         p
1 5.867 3.107 0.2117184
2 5.986 6.010 0.2520457
      T     C         p        S
1 5.867 3.107 0.2117184 2.619186
2 5.986 6.010 0.2520457 5.271249
CTD Summary
-----------
* Location:            40N 60W
* Data Overview

                                 Min.   Mean   Max. Dim. NAs OriginalName
    scan                            1  162.5    324  324   0            -
    salinity [PSS-78]          2.6192 20.599 25.578  324   0            -
    temperature [°C, ITS-90]    1.494  4.909   6.21  324   0            -
    pressure [dbar]           0.21172 1.4004 2.4499  324   0            -

* Processing Log

    - 2022-03-04 11:45:18.264 UTC: `create 'ctd' object`
    - 2022-03-04 11:45:18.264 UTC: `as.ctd(salinity = S, temperature = T, pressure = p, longitude = -60, latitude = 40)`
null device
          1
```

# Plot (again, please check)

![01dk](https://user-images.githubusercontent.com/99469/156757904-653d43a7-3053-44dc-892e-3f7d23587481.png)
richardsc commented 2 years ago

Considering that @ashleystanek is in Alaska, and you've inferred the location to be the western North Atlantic, I think you missed the lines in the file that give lon/lat:

Latitude=70.3184                                        
Longitude=-147.7644   

😄 . Or, maybe you just didn't bother trying to read that yet?

I like that the AML ctds have an integrated GPS, actually.

dankelley commented 2 years ago

There's new code in the repo now, called 02dk.R (below). It reads some metadata. I think we need some advice from Ashley as to which metadata are worth reading.

I noticed that longitude and latitude appear twice, each. I guess that's the start and stop of a profile, or something. Anyway, for now, I just use the first instance of each.

I also decode the time of observation, which I guess must be the start time.

I don't see any real problem in inserting this into oce this morning, and I'll do so unless Clark objects. (We like to vote on including things.) Of course, I'd write up a bit of documentation, etc.

NOTE: I am not trying to handle the cases of different conductivity units. Nor am I trying to decode the blocks that relate to the "Slot"s. There is a whole lot of stuff in there.

I'm putting below the details, as before. It's actually almost as much work formatting this stuff as coding, so I think from now on I'll assume that Ashley and Clark will just look in the oce-issues repo for things, and that they will run 'make' there to reproduce results.

# R code

```R
# 02dk.R: read.ctd.aml() initial trial. NEXT: PI test; document; insert into oce.
library(oce)

file <- "Custom.export.026043_2021-07-21_17-36-45.txt"

read.ctd.aml <- function(file, debug=getOption("oceDebug"))
{
    oceDebug(debug, "read.ctd.aml() {\n", unindent=1, style="bold")
    if (!missing(file) && is.character(file) && 0 == file.info(file)$size)
        stop("empty file")
    filename <- ""
    if (is.character(file)) {
        filename <- fullFilename(file)
        file <- file(file, "r")
        on.exit(close(file))
    }
    if (!inherits(file, "connection"))
        stop("argument `file' must be a character string or connection")
    if (!isOpen(file)) {
        open(file, "r")
        on.exit(close(file))
    }
    getMetadataItem <- function(lines, name, numeric=TRUE)
    {
        l <- grep(paste0("^", name, "="), lines)
        if (length(l) > 0L) {
            # NOTE: we take the first definition, ignoring any others
            res <- strsplit(lines[l[1]], "=")[[1]][2]
            if (numeric)
                res <- as.numeric(res)
            else
                res <- trimws(res)
        } else {
            res <- NULL
        }
        res
    }
    lines <- readLines(file, encoding="UTF-8-BOM", warn=FALSE)
    # FIXME: add other relevant metadata here. This will require some
    # familiarity with the typical contents of the metadata. For example,
    # I see 'SN' and 'BoardSN', and am inferring that we want to save
    # the first, but maybe it's the second...
    longitude <- getMetadataItem(lines, "Longitude")
    latitude <- getMetadataItem(lines, "Latitude")
    serialNumber <- getMetadataItem(lines, "SN")
    Date <- getMetadataItem(lines, "Date", numeric=FALSE)
    Time <- getMetadataItem(lines, "Time", numeric=FALSE)
    time <- as.POSIXct(paste(Date, Time), tz="UTC")
    col.names <- strsplit(lines[1], ",")[[1]]
    CommentsLine <- grep("^Comments=", lines)
    oceDebug(debug, "CommentsLine=", CommentsLine, "\n")
    data <- read.csv(text=lines, skip=CommentsLine+1, header=FALSE, col.names=col.names)
    T <- data[["Temperature..C."]]
    C <- data[["Conductivity..mS.cm."]]
    p <- swPressure(data[["Depth..m."]])
    S <- swSCTp(conductivity=C, temperature=T, pressure=p, conductivityUnit="mS/cm", eos="unesco")
    rval <- as.ctd(S, T, p, longitude=longitude, latitude=latitude,
        time=time, serialNumber=serialNumber)
    rval@metadata$filename <- filename
    oceDebug(debug, "} # read.ctd.aml()\n", unindent=1, style="bold")
    rval
}

ctd <- read.ctd.aml(file)
summary(ctd)
if (!interactive()) png("02dk.png")
plot(ctd)
if (!interactive()) dev.off()
```

# R output

```
> ctd <- read.ctd.aml(file)
> summary(ctd)
CTD Summary
-----------
* File:     "/Users/kelley/git/oce-issues/19xx/1924/Custom.export.026043_2021-07-21_17-36-45.txt"
* Location: 70.318N 147.76W
* Time:     2021-07-21 17:36:45
* Data Overview

                                 Min.   Mean   Max. Dim. NAs OriginalName
    scan                            1  162.5    324  324   0            -
    salinity [PSS-78]          2.6192 20.599 25.578  324   0            -
    temperature [°C, ITS-90]    1.494  4.909   6.21  324   0            -
    pressure [dbar]           0.21172 1.4004 2.4499  324   0            -

* Processing Log

    - 2022-03-04 13:11:29.619 UTC: `create 'ctd' object`
    - 2022-03-04 13:11:29.619 UTC: `as.ctd(salinity = S, temperature = T, pressure = p, time = time, serialNumber = serialNumber, longitude = longitude, latitude = latitude)`
> if (!interactive()) png("02dk.png")
> plot(ctd)
> if (!interactive()) dev.off()
null device
          1
```

# Plot

![02dk](https://user-images.githubusercontent.com/99469/156770471-b7b41f83-7d94-4127-992f-71a21fd64aff.png)
dankelley commented 2 years ago

Update: 03dk.R now has a rough draft of documentation. It is ready for inclusion into oce, and I plan to do that, unless I get a "hold off" message from @richardsc before noon today, Halifax time.

dankelley commented 2 years ago

Update: 04dk.R now stores the whole header in the metadata, so a user can access any of that information for themselves.

richardsc commented 2 years ago

A couple of quick notes while I pull the above to check:

    T <- data[["Temperature..C."]]
    C <- data[["Conductivity..mS.cm."]]
    p <- swPressure(data[["Depth..m."]])

The header does actually specify units, so a "smart" approach would be to parse them and enter them in the [['units']] field accordingly. But I'm not sure if the software allows export of different units, in which case there would be no point.
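For illustration, the parsing might go as in the sketch below, assuming SeaCast always writes column names in the Name (unit) form seen in the sample files:

```R
# Sketch: split header names like "Temperature (C)" into name and unit.
header <- "Date,Time,Conductivity (mS/cm),Temperature (C),Depth (m),Battery (V)"
cols <- strsplit(header, ",")[[1]]
units <- ifelse(grepl("\\(", cols), sub(".*\\((.*)\\).*", "\\1", cols), NA)
data.frame(name=sub(" *\\(.*$", "", cols), unit=units)
```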

The other point is that depth ≠ pressure, so we should be sure that we are storing the correct field in the ctd object (which prefers to have pressure, but will create it from depth if necessary).
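(For reference, that depth-to-pressure conversion is a one-liner with oce's swPressure(), which also accepts a latitude; the values below are made up.)

```R
library(oce)
depth <- c(0.21, 0.25, 2.40)            # metres (made-up values)
p <- swPressure(depth, latitude=70.3)   # dbar; latitude has a small effect
```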

dankelley commented 2 years ago

I plan to do the following and then put the new function into oce before 1400h (run and lunch in the meantime):

  1. parse units for conductivity, etc.
  2. see if pressure is given, and use that; otherwise (as now) use depth. (Note that the code presently computes pressure from this depth.)
richardsc commented 2 years ago

Oh, you're right about computing pressure -- I looked too quickly and didn't see the swPressure() call. Oops! Time for another coffee ...

dankelley commented 2 years ago

I'm just out to finish that run that I couldn't complete in the snow. After that I'll get lunch and then do those changes ... but if you say "hold on" I won't incorporate it into oce.

richardsc commented 2 years ago

No, fine to incorporate. We'll make changes after it's in there anyway.

Have a good run!

dankelley commented 2 years ago

Quick Q: is there any pattern in the filename that I could use within oceMagic()? I could do a combination:

  1. filename ends in .txt
  2. and the first line starts with Date,Time. I suppose this might conflict with other data files, though. I'm wondering whether the exporter perhaps always makes a filename starting with Custom.export or something.

Hm, maybe I ought to insist that the first line be as at 2 above, but also that the third line contains DISPLAY OPTIONS.
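In code, that insistence might look like the following sketch (hypothetical; the check that ends up in oceMagic() may differ):

```R
# Sketch of a recognition test: demand a "Date,Time" first line *and*
# "DISPLAY OPTIONS" in the third line, to reduce false positives.
looksLikeAmlTxt <- function(file) {
    first <- readLines(file, n=3, warn=FALSE)
    length(first) >= 3 &&
        grepl("^Date,Time", first[1]) &&
        grepl("DISPLAY OPTIONS", first[3])
}
```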

NOTE: I think either Clark or Ashley might be able to answer this. Clark will know why I like to have oceMagic() work, and also why I am quite cautious about adding new possibilities to it, for fear of breaking the recognition of existing file patterns.

dankelley commented 2 years ago

PS. I would prefer to fix this stuff before incorporating because then testing is much faster.

dankelley commented 2 years ago

Also ... is read.ctd.aml() the right name, or maybe read.ctd.seacast() or something? I know I could look through emails and do web searches to find such things, but likely Clark can answer quickly. Offline for an hour now.

richardsc commented 2 years ago

I'm not sure about the filename/file patterns, but this is something we could look into more. I have to confess that when we use our AML ctd, it isn't used as a "proper CTD" where we download the data to look at profiles and other details. It's more of a spot-check on water density related to our glider ballasting.

As for the name, I think that read.ctd.aml() is the right approach, because it's consistent with other read.ctd*() formats, like:

read.ctd.itp         read.ctd.odv         read.ctd.woce
read.ctd.odf         read.ctd.sbe         read.ctd.woce.other

where the suffix is either a manufacturer or organization or instrument type.

ashleystanek commented 2 years ago

Wow! Thank you for jumping in to make this work! I glanced through the conversation but will have to take time this afternoon to try to understand it.

Regarding the text file - it is from a custom export format where you can specify the column order, and I believe you can have different units for the variables as well. I'll dig into this this afternoon and clarify some of the options. I don't have any understanding of other CTDs so maybe this is normal, but with the AML device, you can have a host of other instruments attached, so the data section is likely to be very different depending on each setup. I like the custom text format because it specifies the data columns in the first row.

I'll send along another export option, a "Seacast" csv file, which to me seems like the most obvious export option from their list. However, I didn't include it initially because it doesn't include the data column names or units. When I started bringing the data into R, I had to add them manually, which seems very odd and prone to error.

GPS - the second set of coordinates under the DataX header are the ones that should be used. I don't know what is different between them, but AML confirmed to use the second set. I had issues this summer where the second set of coordinates were blank because the device couldn't get a lock so if any function were to read them, it should be possible to add them in later manually. Yep, I'm in Alaska and this data is from our work up in the Beaufort Sea!

Thank you! Cheers, Ashley

dankelley commented 2 years ago

> However, I didn't include it initially because it doesn't include the data column names or units.

I think it's best to stick with formats that have those things. I'm going to add what I have to oce. To access it, you'll need to be set up to build R packages from source. The README on the oce website has some hints on that. There are web resources, too (including some written by @richardsc) ... basically, your system needs to have compilers for Fortran, C and C++. Stay tuned. It will be in oce within about half an hour.
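(One common route, assuming you have the compilers set up plus the remotes package, is something like the following.)

```R
# Sketch: install the development branch of oce from source (assumes the
# 'remotes' package and a working toolchain with Fortran/C/C++ compilers).
install.packages("remotes")
remotes::install_github("dankelley/oce", ref="develop")
```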

dankelley commented 2 years ago

@ashleystanek can I ask you to read through the attached (trial docs for the new function) to see if it's clear? As you can see, I've coded something that is very demanding on file formats. This can be relaxed later, but first I wanted to get something that you can try out for your actual case. Dealing with a wide range of possible data types is challenging, because we need lots of tests on things like column names, units, and so forth. Coding for such things is labour-intensive, and all the more so if there is no documentation to tell us which things are possible.

If you are ever in a pinch, you can copy the gist of what I have in read.ctd.aml(): find the "Comments=" line, skip one more line, then read the columns, using names as given by the first line in the file. Then do whatever you want, with as.ctd() or some other function.
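In code, that gist is roughly the following sketch (the file name is hypothetical; the column names follow R's renaming of the header shown earlier in this thread):

```R
library(oce)
f <- "cast.txt"  # hypothetical file name
lines <- readLines(f, encoding="UTF-8-BOM", warn=FALSE)
col.names <- strsplit(lines[1], ",")[[1]]
# the data start just after the "Comments=" line
d <- read.csv(f, skip=grep("^Comments=", lines)+1, header=FALSE, col.names=col.names)
T <- d$Temperature..C.
p <- swPressure(d$Depth..m.)
S <- swSCTp(conductivity=d$Conductivity..mS.cm., temperature=T, pressure=p,
    conductivityUnit="mS/cm", eos="unesco")
ctd <- as.ctd(S, T, p)
```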

Attached is the trial documentation. It's pretty rough, but perhaps you can have a look and make edits on it. Then I can just edit the source.

read.ctd.aml.pdf

ashleystanek commented 2 years ago

Fantastic! I'll give it a try once I learn to compile the package.

dankelley commented 2 years ago

Update: 06dk.R uses the "develop" branch of oce as of commit ffe5c1c4db96b41b759022e542e320586eaa42da.

@ashleystanek: you'll need to build oce from source to run this code. If you do, you'll see that it now stores times as well as temperature, etc.

A couple of other points:

  1. I edited 2 comments you made a while back. It looks like you commented by replying to an email from github. I recommend against doing that, because then the comment becomes quite confusing, with quotes of previous comments, etc. Also, you had provided contact information in your email, and I don't think it's a great idea to do that on github.
  2. I have not made oceMagic() recognize this file because I don't know what patterns I can use to do that. Therefore, you have to call read.ctd.aml() directly. (If oceMagic() knows a file pattern then read.oce() can be used. This is set up for dozens of file types. But, to set this up, I'd really need some documentation about the file format ... otherwise I'm just guessing, and the problem with that is the possibility of false positives, breaking code for other users.)
ashleystanek commented 2 years ago

@dankelley I was able to get the develop version of the package loaded from GitHub and can read in data files with read.ctd.aml()! I spent a bit of time today looking through SeaCast for options that may complicate reading files, such as unit preferences and automatic conversion of dbar to depth. I've attached a screenshot of the main program settings and also the settings for the custom export options. When I exported files again, the default CSV format now included the data column headers and units. If I can figure out why it didn't work before, I would think it better to use a format that doesn't have the variability that comes with the custom .txt file.

I'll work through comments on the documentation and the specific formatting questions next week, and dig into some of the oce functions.

Thank you! Ashley

SeaCast Settings SeaCast Export Settings

dankelley commented 2 years ago

It would help to get precise statements here about how things change if you change those settings. The files are textual (either .txt or .csv) and so it will be easy for you to do.

Making the function handle both .txt and .csv seems like a hassle, to be honest. For example, I have a sub-function that acquires metadata, and it works by reading a line of text, finding a "=" character, taking the text to the left of that as the metadata name, and the text to the right as the metadata value. So, that function would need a new argument telling it whether the file is csv or txt, and it would have to act differently in the two cases. This is not difficult, but I just point it out as an example that handling different file formats imposes a coding burden.
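To make the point concrete, such a sub-function might grow a format argument, as in this sketch (the comma-separated metadata layout for csv is an assumption, not something verified against real files):

```R
# Sketch (hypothetical): a metadata getter that splits "Name=value" lines in
# .txt exports, or "Name,value" lines in .csv exports (the latter layout is
# an assumption).
getMetadataItem <- function(lines, name, format=c("txt", "csv")) {
    format <- match.arg(format)
    sep <- if (format == "txt") "=" else ","
    l <- grep(paste0("^", name, sep), lines)
    if (length(l) == 0L)
        return(NULL)
    trimws(strsplit(lines[l[1]], sep, fixed=TRUE)[[1]][2])
}
```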

The same goes for the unit possibilities. Consider the depth/pressure entry. Presumably there is something that tells whether the user selected "freshwater" or "seawater" formula. So that must be (or should be!) buried somewhere in the metadata, which is another thing to consider and have to code for. My advice: always export pressure, never depth. The instrument measures pressure, and that is always what's used in oceanographic research. I don't think we support depth in any other read.ctd functions, for example, and no formula I'm aware of uses depth. Depth only comes up in making plots, and even then, I would not use it except in making a graph for the layperson.

The good thing is that I don't see anything in the interface about switching units of conductivity, or using [ versus ( around units. This reduces the programming burden quite a lot. Also, it doesn't look like you can make it call something "Temp" or "T" or whatever, instead of "Temperature", and that helps a lot. (For the more popular seabird instruments, a lot of oce code is dedicated to interpreting column names, which have hundreds of possibilities.)

Here are some tasks I'd like you to do, to help with this. I am hoping to cover a fair bit of ground with these, and to do so in an organized way that will not require a lot of back-and-forth comments, which are hard given our 5-timezone separation. Please create the following files, named exactly as shown, and stored in a zip file. Github lets you upload zip files to comments.

  1. a.txt the sample from before. Please tell me in the comment which button you clicked on depth.
  2. b.txt as a.txt but with the other depth style. This will let me see if there's something in the metadata that tells us.
  3. c.txt as a.txt but click for pressure, not depth. This will tell me what the system uses for pressure. (I am assuming (dbar) but I need to know the string precisely, to match it. For example, if they put spaces around dbar I will need to code to recognize that.)
  4. a.csv as a.txt but csv format
  5. b.csv as b.txt but csv format
  6. c.csv as c.txt but csv format

I think if I get these 6 files, I'll be able to code two variants, likely named read.ctd.aml.txt() and read.ctd.aml.csv(). I don't think there is any point in my trying to make oceMagic() recognize these files, because I fear false positives: if I try to infer these aml files from some text contained within them, I may break other users' code, because there may be an already-recognized type that also happens to have that string.

Many manufacturers are more clever than aml, and they insert some text at the start of files that quite clearly shows what the data are. If I were coding the file format, I'd start with a line like AML BaseX2 CTD or something like that. For example, a very popular SeaBird data format has a particular filename (ending in .cnv) and a first line starting with the string * Sea-Bird, so it's pretty easy for oceMagic() to detect that filetype, with little chance of an accidental match. These aml files do have something that I think I could use (line 5 of the .txt file you sent and line 3 of the csv file you sent), and so if I had some confidence that these would always be in those lines, I could try to make oceMagic() work. Why is this of interest? Because, for dozens of file types for dozens of instruments, the user just has to write

d <- read.oce(FILENAME)

and it will figure out what the data are (maybe CTD, maybe ADCP,....) and then do the right thing. Then a useful plot can be created with

plot(d)

and a useful summary with

summary(d)

and so forth. Notice that these three lines do something useful regardless of the filetype. This use of "generic" functions is a real power of R.

dankelley commented 2 years ago

Followup to @ashleystanek -- it would also help if you could give us permission to include those test files in the oce test suite.

If you agree, then I will trim the data to maybe 3 data points, and then the test suite can check that we can read the data. I'll also blank out any private-looking things, like IP addresses and (if any) contact info for the operator, or what-not.

The advantage of having test files is that users will be protected against accidental errors brought about by what seems like simple code changes. We need such things for code that deals with various special cases to handle 2 different file types, 3 different vertical coordinates, and so forth.

The more gets added to read.ctd.aml(), the more we need a test suite to be sure things continue to work. To give you an idea of the scope of things, oce has about 7,000 lines of test code, which is about 10% of the code-base of about 70,000 lines of R, plus 8,000 lines of fortran/C/C++. (Test suites are one of the reasons why R packages are trusted.)

dankelley commented 2 years ago

I've put in some code to auto-recognize txt files. And, on the assumption that @ashleystanek will let us insert some (trimmed) files into our test suite, I've started writing tests. Below is an example that uses just 3 lines. My proposal would be that I insert into the test suite this file (which is already semi-public by virtue of having been uploaded to github), but with only 3 data lines. Three is enough to be sure we start reading at the right spot, that we decode columns correctly, etc.

Just to repeat the point (since I think @ashleystanek might be new to this sort of thing), the idea is that if in future the code reads any of the numbers incorrectly, then R won't pass build-test suites. That means that (at least on the tested files) oce cannot "slide back" from correct functioning. It is a sort of safeguard against introducing errors by recoding.

I don't have a test for voltage, because I'm not including voltage in the dataset. I might do that, and if I do, then I'll add a test. But the first concern is to get some test files, as requested in https://github.com/dankelley/oce/issues/1924#issuecomment-1059744336 above. I want those so I can fiddle with the code to decide what to do about things like the 3 possible choices for the vertical coordinate, etc.

# Preparation for tests within oce tests/testthat/test_ctd.R
# Requires local sources *or* an up-to-date oce from "develop".
library(oce)
library(testthat)
if (file.exists("~/git/oce/R/ctd.aml.R")) {
    source("~/git/oce/R/ctd.aml.R")
    source("~/git/oce/R/oce.R")
}

file <- "Custom.export.026043_2021-07-21_17-36-45.txt"
expect_equal("aml/txt", oceMagic(file))
ctd <- read.oce(file)
expect_equal(head(ctd[["temperature"]], 3),
    c(5.867, 5.986, 6.058))
expect_equal(head(ctd[["salinity"]], 3),
    c(2.61918633678843, 5.27124897467692, 8.10531077140948))
expect_equal(head(ctd[["pressure"]], 3),
    c(0.21171839076346, 0.252045727972423, 0.252045727972423))
expect_equal(head(ctd[["conductivity"]], 3),
    c(3.107, 6.01, 8.992))
expect_equal(head(ctd[["time"]], 3),
    as.POSIXct(c("2021-07-21 17:36:46.17", "2021-07-21 17:36:46.21", "2021-07-21 17:36:46.25"),
    tz="UTC"))
dankelley commented 2 years ago

I would need some convincing to support the csv format.

Below is a snapshot of the two test files I have so far. The txt format looks superior to me, as I think it did to @richardsc. Here are some reasons:

  1. The txt has a field called Comments: but the csv one does not. That seems like it could be a problem in some cases.
  2. The txt one has times in an internationally-recognized format, but the csv one has that odd American format, which is only recognizable because the day exceeds 12; otherwise we'd have to guess whether the month or the day comes first. I think Americans use one order and Canadians another. But the problem of writing dates was solved long ago: the answer is to use the ISO standard yyyy-mm-dd. Maybe the csv is designed to be read by Excel (which I think would have to be on a computer set up for the US locale, not the Canadian one...).
  3. The csv one has no column names or units (although it seems that @ashleystanek has found a way to get those to appear).
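(To illustrate point 2: reading the American style in R requires stating the format explicitly, whereas the ISO form is unambiguous.)

```R
as.Date("2021-07-21")                     # ISO 8601: no guesswork needed
as.Date("07/21/2021", format="%m/%d/%Y")  # US style: format must be declared
```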

PS. the ^I characters are tabs. I think the GUI permits using tabs or other schemes, so that's yet another aspect that the code will have to handle (which I think it does already, because I use trimws to clean up metadata).

Screen Shot 2022-03-05 at 11 03 34 AM
dankelley commented 2 years ago

After discussion with @richardsc (who is amassing some sample files) I've decided that I cannot find the start of data by looking for a

Comments:

line. My new plan is to count the number of commas in the first line and then scan down until I find another line with that number of commas. Whether this will be robust is anybody's guess. But we are not seeking robustness here, but rather something that will work under restricted conditions that will be detailed in the docs.
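In outline, the scan is as in this sketch (the in-package version may add more guards):

```R
# Sketch of the count-the-commas method: line 1 holds the column names, so
# the first later line with the same comma count is taken as the data start.
findDataStart <- function(lines) {
    ncommas <- lengths(regmatches(lines, gregexpr(",", lines, fixed=TRUE)))
    which(ncommas[-1] == ncommas[1])[1] + 1  # +1 converts back to an index into 'lines'
}
```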

dankelley commented 2 years ago

I'm fiddling with this count-the-commas method. It works on the following test files (in https://github.com/dankelley/oce-issues/tree/main/19xx/1924)

clarks_data/Custom 025430_2022-02-23_16-18-35_export_allfields_noheader.txt
clarks_data/Custom 025430_2022-02-23_16-18-35_export_csv_allfields.txt
clarks_data/Custom 025430_2022-02-23_16-18-35_export_depth.txt
./Custom.export.026043_2021-07-21_17-36-45.txt

based on the test code 10dk.R in that directory. (This code finds files that meet the demands of the present-moment read.ctd.aml() function.)

dankelley commented 2 years ago

For fun, below is a snapshot for the first file in the list given in the previous comment. I'm graphing the difference between the salinity in the file and the salinity I compute from the conductivity etc. in the file, shown as a function of pressure.

I am inclined to retain the salinity from the file, under the name "Salinity (PSU)", but also to compute salinity, under the name "salinity". I'm not sure I like that scheme, though. I think we usually just discard salinities from files, on the assumption that they might be computed incorrectly, and that it's best to compute from the things actually measured, which will yield self-consistent results.

Anyway, what you see is that mostly the differences are less than 0.001 units, which makes sense because the reported salinity has 3 digits after the decimal place. (We have no way of knowing whether a manufacturer would round up or down, or just truncate.)

There are some higher differences over on the left of the graph. I'm not too sure what to make of those. My snapshot shows the data and the graph, as an extra clue. Maybe this is a result of the fact that conductivities are so low in the upper waters, so that last-digit-rounding issues are causing the scatter. I'm not going to bother with that for now, though, since I do prefer to use our own formulae for S=S(C,T,p) and so forth, given that we have many-digit tests for them in our check suite.
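For anyone wanting to reproduce the comparison, the gist is as below (a sketch: the file name is hypothetical, and the Salinity..PSU. column name assumes R's usual renaming of a "Salinity (PSU)" header):

```R
library(oce)
f <- "cast_with_salinity.txt"  # hypothetical file name
lines <- readLines(f, encoding="UTF-8-BOM", warn=FALSE)
col.names <- strsplit(lines[1], ",")[[1]]
d <- read.csv(f, skip=grep("^Comments=", lines)+1, header=FALSE, col.names=col.names)
p <- swPressure(d$Depth..m.)
Scalc <- swSCTp(conductivity=d$Conductivity..mS.cm., temperature=d$Temperature..C.,
    pressure=p, conductivityUnit="mS/cm", eos="unesco")
# difference between computed and as-in-file salinity, versus pressure
plot(Scalc - d$Salinity..PSU., p, ylim=rev(range(p)),
    xlab="Salinity difference (oce - file)", ylab="Pressure [dbar]")
```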

Screen Shot 2022-03-08 at 8 19 22 AM
dankelley commented 2 years ago

Below is the density difference (computed with oce, vs stored in file). The file is named at the top of the plot. Again, I don't see any reason for further action, since the (trusted) oce values are not much different from the values in the file. I might try adding 1 to the final digits of some base quantities, to see what changes that makes.

Screen Shot 2022-03-08 at 8 56 55 AM
dankelley commented 2 years ago

These tests show that altering the last-reported (i.e. 3rd) digit of C can alter the 3rd digit of S, depending on rounding. I am not seeing how to get a change of 0.005 in these tests, though.

> swSCTp(43.414,21.233,0.84,conductivityUnit="mS/cm")
[1] 30.44289
> swSCTp(43.414+0.001,21.233,0.84,conductivityUnit="mS/cm")
[1] 30.44367
> swSCTp(0.985,21.231,0.80,conductivityUnit="mS/cm")
[1] 0.5269657
> swSCTp(0.985+0.001,21.231,0.80,conductivityUnit="mS/cm")
[1] 0.5275216
richardsc commented 2 years ago

Hm, this does seem a bit odd. 0.005 is pretty large in the grand scheme of things ... I'll take a look (but should maybe do it in a new issue)

dankelley commented 2 years ago

@richardsc I agree that this 0.005 disagreement ought to be in a new issue. I got a bit lost in the files, to be honest.

@ashleystanek I'm not too sure of the advantage of csv over txt. @richardsc might have some ideas, since I know he has produced a bunch of files with different settings. I notice that none of his csv test files has header information. (Pull the github/dankelley/oce-issues repo to see his files.)

ashleystanek commented 2 years ago

I'm waiting to hear back on sharing the data file (I'm sure it won't be an issue) and will upload the corresponding files for these notes (and the comparison of text and csv files you requested earlier) at that time, but I wanted to send out this information in the meantime.

There are three files that we should consider as options for oce to read. Given what I've learned from working through this software and your questions, I'd vote for option 2, the exported csv file, but you guys are the experts, so I'll leave the decision to you.

1. Original CSV - The original cast file that is downloaded to the computer upon connecting the ctd to the computer and viewing the cast in Seacast.

2. Exported CSV - the file that is generated when selecting Export As... Seacast (.csv).

3. Exported .txt - This is generated when selecting Export As... Custom

Other notes:

dankelley commented 2 years ago

Thanks @ashleystanek. Q: for your preferred option (number 2), could you maybe send me a private email with a sample file? (I ask because we have so many hours between our zones that tests are slow.) I won't be able to look at this on Thu, and likely not on Fri either.

dankelley commented 2 years ago

The reason I want to see the file is to see the format in the [data] block. The sample files I have so far, from you and from @richardsc, don't have the column names in the [data] block. Regardless of any format's other merits, if we cannot infer the column names, we are lost.

dankelley commented 2 years ago

Re your question "If they were present in the dataset, would they override the ones you calculate in making the data file a ctd object?" The answer is "no". But the user will be able to obtain the salinity as given in the file, if they want. For example

library(oce)
d <- read.ctd.aml("some file name")
plot(d[["salinity"]], d[["Salinity..PSU."]])

what's happening in the plot is that the x axis will be salinity as calculated by oce, and the y axis will be salinity as given in the data file. When naming columns, R changes odd characters (in this case, spaces and parentheses) into "." characters.
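(The renaming rule is R's make.names(), which read.csv applies by default.)

```R
make.names("Salinity (PSU)")  # yields "Salinity..PSU."
```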

In the above, you are also seeing the use of [[ as defined by oce, which has the approximate meaning "try to find the named item in the object". The cool thing is that [[ is not a direct lookup, for it will also compute things, if they can be inferred from the object's contents. For example, you could do d[["sound speed"]] and it would use a formula to compute that. If you load up an object named d, try d[["?"]] to see what you can access, and then try names(d@data) to see the smaller list of what's actually stored in the object.

I'm not sure if you already know all this stuff, so I won't get too deep into the weeds.

dankelley commented 2 years ago

PS. on file sharing permission, I'm pretty sure I can get @richardsc to volunteer a file snippet. For testing -- which is why I want this -- we only need maybe 10 data points; more just wastes space in the package.

ashleystanek commented 2 years ago

You can go ahead and use the files I sent, both for posting here and for documentation, as would be useful. Feel free to trim them as necessary.

dankelley commented 2 years ago

Thanks @ashleystanek. I have 2 more favours to ask:

  1. is it OK to name you in the docs, in a line like "Data file donated to the oce project by [name]", where you'd specify how you want your name written? I guess you could also put your agency, but frankly I would not advise that, just in case some lawyer comes along some day and says that a user who got wrong results can sue you or something.
  2. can you post a screenshot of the settings? What I'm hoping is to write up a sentence or so about the settings, and then as a test, ask @richardsc to follow those directions, as written, to ensure that he gets a file of similar format to yours.

I'm pressing things a bit because you seem to be online at this moment, and the faster I can settle the code, the faster you (and others) will have an oce function that is helpful.

ashleystanek commented 2 years ago

Thanks! I haven't actually had a chance to learn how to use oce since we started this discussion. Now that I can get some data read in, I'm looking forward to digging into it.

My name can be there, but there's no need for a special note about it. And you are correct: this cast/dataset isn't meaningful as-is, and hasn't been checked for accuracy (it is just being used to check the formatting, not the content).

Below are screenshots of the instrument and program settings within Seacast. If I eventually find other options that change how the Seacast csv is formatted, I'll be sure to let you know. I will note that there is another csv format in the export options, the QINSy format; it isn't the same and shouldn't be used for importing to oce. The manual for Seacast is available within the program and at https://www.subseatechnologies.com/files/800/ and it describes some of the settings and formats available.

My last note for the moment is regarding the coordinates: the second set of coordinates, under the [Data.x] header, are the ones that should be used. This summer we had a bunch of casts that didn't get a lock on the location, but it looks like it should be straightforward to assign coordinates to a ctd object manually. If coordinates weren't recorded, the field says "no-lock".

thumbnail_image thumbnail_image

dankelley commented 2 years ago

Q for @ashleystanek: you said that in your "option 2" file, the column names are in lower-case. I don't see that in the file you sent; I see what's shown in the screenshot below. Can you clarify?

PS. I know this seems like a detail, but it's definitely not, because oce is set up for precise checks on things; e.g., at the code linked below, I'm checking for "Date" as a precise match. Precise matches are very helpful in complex data files (although this data file is not complex, of course).

https://github.com/dankelley/oce/blob/d4f0e2782c574e400a977783ef1e39ade7020b8d/R/ctd.aml.R#L121

Screen Shot 2022-03-08 at 4 20 22 PM
dankelley commented 2 years ago

Re "no-lock" on coordinates: I will code so that if there are no sensible coordinates, a NA is saved. In R, that means a missing value. With that set, some computations will not work (because the new equation of state requires location to compute density, etc... it's too complicated to explain here though).

ashleystanek commented 2 years ago

That was my mistake: all the labels for the header content are in lower-case, as opposed to CamelCase in the text file. The header content appears to remain consistent across the different file formats, as you've shown. Thank you for checking.

dankelley commented 2 years ago

I've written code to read either Ashley's "type 2" format or one of the .txt formats from Clark.

I have also constructed a sample file (after trimming Ashley's file, and also zeroing out the IP address, serial numbers, and the WEP code).

My next step will be to add tests for this sample file. This is a very important step because it "freezes" functionality. That means that further tweaks to the code will be required to be backwards-compatible with this file format.

These things will get pushed to GH early this afternoon.

dankelley commented 2 years ago

You'll be able to learn about the dataset with

?ctd_aml.csv
dankelley commented 2 years ago

@ashleystanek please click 'Details' below to see a snapshot of the ?ctd_aml.csv docs, to see if what I've written seems OK. Note that I have zeroed out the IP address, the WEP code, and the serial numbers. Users don't need to know those things, and I don't want anybody trying to hack into your instrument.

Screen Shot 2022-03-09 at 9 28 06 AM
dankelley commented 2 years ago

I have pushed to github, in "develop" commit 58331bc1ed86a1e6805f7e01f754c2a1c40e85ec. I've started some test builds but I won't know the results for a while since I have a meeting coming up.

I ask that @richardsc and @ashleystanek take a look at the docs for ?read.ctd.aml to see whether they describe the format correctly. (Actually, the whole doc is only a page or so, and it would be great if you could read the whole thing.) What I want is for users to see how to set up their AML/SeaCast software to generate the right sort of data.

PS. I just clicked on https://www.subseatechnologies.com/media/files/page/032e50ac/seacast-4-2-user-manual-sti.pdf and I see that it's called SeaCast and not Seacast, so I'll modify that throughout. I also plan to at least skim that manual this afternoon. I want to see whether my assumptions on things like the case of "Longitude" vs "latitude" are proper, relative to the format parameter. Right now, the code demands that things be as I've seen in the sample files I have available, but that's a poor plan in general.

dankelley commented 2 years ago

@ashleystanek and @richardsc: I see (click Details for a screen snapshot of the manual) that the AML docs say the words are lat and lon, not the full-word forms that I'm seeing in our data files.

I plan to make the code accept either short or long, and either all lower-case or title-case.

Screen Shot 2022-03-09 at 10 08 19 AM
dankelley commented 2 years ago

@ashleystanek and @richardsc -- I think I have this working now, in the "develop" branch. My new test code at https://github.com/dankelley/oce-issues/blob/main/19xx/1924/12dk.R runs some files from both of you.

I'd be interested to hear whether this version works for practical applications. And, of course, I am keen to know whether my docs make sense with respect to the settings to use in the AML software.

By the way, Clark, one of your data files states its location as just off the coast of Florida. On spring break, buddy?

PS. I will not see emails tomorrow but should be back online over the weekend.