Closed jessecusack closed 3 years ago
A question for either @jessecusack or @richardsc: can you supply a sample file or, if you are @richardsc, can you assign yourself to this issue and take care of it?
I know we have likely corresponded before about sample data, but I do not keep old emails, and I tend to delete files once issues are closed, for data privacy. Pairing datasets with issues is the first step to solving those issues.
The code change is likely to be simple, and I'm very happy to do it, but it's very hard to work without a sample dataset. We are talking about hundreds (maybe thousands) of lines of code that might be activated by a given dataset. Finding the 3 or so lines of relevant code is obviously a lot easier with a sample file, since then I can insert tracing code to discover the flow of control as the code is run.
Many thanks!
I think this may work now in the "develop" branch, commit d08fce190afc2121ff400784e84dce219637711f
Below is a reprex, created with reprex::reprex()
using the built-in dataset, which does not have a vertical beam. Perhaps @richardsc can take a peak at the code diffs (and @jessecusack can see them too, by clicking on the link at the end of the previous paragraph) to see if what I've done makes sense. Note that I have also altered some nearby existing code, to use the same scheme (which uses fewer lines and should be easier to maintain). Because of this alteration, the new fifth-beam code is also used for the other beams, and that makes me think my solution might be correct ... but only an actual sentinel-V test will confirm that.
@dankelley I installed the develop branch and did a quick subset test. It now appears to work as expected! Great!
I can ask if the dataset I'm working with could be used as a sample (I am not the owner), however it is currently an active research project so unlikely to be made public soon. It is also rather large (600 MB).
Oh, thanks -- I won't need a dataset for this issue, if you think the fix has worked.
And if you think that it has worked, please close the issue -- that's the convention we have in oce.
Thanks for the quick response!
I did a few more checks with subset that justify reopening this issue.
Subsetting by distance e.g.
subset(adp, distance < 100)
does not subset the vertical beam data. However, vertical beam data seem to be linked with another distance variable vdistance
. For my dataset vdistance
and distance
are not identical (but are very close to identical, one is offset from the other by about 3 cm).
Subsetting by vdistance
does not throw an error, but instead just subsets by distance... perhaps something to do with how the code interprets the logic string? I'm not sure, but it seems like a bug.
I think several different behaviours might be reasonable. One would be for subsetting by distance to also apply to the vertical beam data, but using instead the vdistance
variable. The other would be to require a second use of subset, applied to vdistance
which only changes the vertical data e.g. vv
etc. I don't know which is better.
All of these subsets are individually coded, so I'm not surprised that we have continuing problems. I see why there is a problem with "distance"
, because the code we have for recognizing a subset by distance (see https://github.com/dankelley/oce/blob/1da5964623d48c9b10b64218e4b815b85a0061a3/R/adp.R#L1058) is searching for the string "distance"
, and that also captures "vdistance"
. I can fix that by checking for "vdistance"
first, in a block above this one.
Any chance you could post a comment in which you display the "distance"
and "vdistance"
data? Just do e.g.
d <- read.oce("your data file")
d[["distance"]]
d[["vdistance"]]
in a reprex. (Yes, really, please use reprex::reprex()
since it makes things a whole lot easier, even if I don't have a data file. Otherwise, when a bug is reported, the code authors have to also write code to make the error happen.)
This is going to be a bit slow, without local test data. I hope you can be patient. Please build from github, and try running the distance
subset. It should print a bunch of stuff out.
Please examine code screenshot below (which is the code you'll be using after your rebuild from github). If control gets to this spot, then we'll know that this is where I can fix your problem. If not, I'll need to add more print statements throughout the code to try to discover how the problem flow is working. (This R file is 4000 lines long. The file that reads your data is 2000 lines long. There are lots of ways control can meander through such code terrain and the only way I can discover the flow remotely is by printing.)
Notice that line 1078 prints a line you can easily recognize. I'm keen to see the lines above that. This will tell me (a) that I am catching the condition and (b) the vdistance
values. I also need to see the distance
values (see previous comment). From these things I ought to be able to get subset by distance working. It should likely only take a few more tests like this, so, at perhaps half a day per test, we might get a solution by week's end.
I've got kids in the shower and am also trying to clean the kitchen, but just remembered I have some SentinelV data. Hopefully this helps/works for what you need. I'll try and build/run/debug later this evening if I can.
Got it -- thanks, Clark. I'll look this morning.
If I try for subset(adp, distance)
, that is without making the user supply a vdistance
variant, I'll need to figure out how to alter expressions, for the tricky business of evaluating the stated subset partly in the user's context, so e.g. if they have a variable d0
defined, they can do subset(adp,distance<d0)
, and also in the context of the adp
object. This is the work done in lines like
keep <- eval(expr=substitute(expr=subset, env=environment()), envir=x@data, enclos=parent.frame(2))
which are a bit tricky, e.g. they changed something in R a couple of years ago that forced me to change all those calls. (There are 35 such lines, at least 1 for each data class.)
I do think that a one-step process will be simpler for the user, but if this turns out to be a hard coding task, I'll fall back to a two-step one, where the user is forced to subset on distance
and vdistance
separately.
Hm. Here is what I get for Clark's file. Note that read.oce()
is not reading vdistance
for that file.
@jessecusack can you supply a reprex that indicates to me why you know that vdistance
is present in the object as you read it? (Just read it, and use names()
twice, like I've done below.). I want to verify that before I can decide what to do next. It may be that the file Clark sent is from a slightly different format, so read.oce()
is not reading vdistance
. Or maybe, @jessecusack, you used read.adp()
? Or, maybe you used read.adp.rdi()
? As you can see, providing a reprex prevents a lot of questions later. If I can get a file that is suitable, this should not be a hard thing to fix, I think. But I need such a file, otherwise it's a matter of my making a guess, waiting a day or two for a test, then making a new test, etc.
Note that we don't need (or want) a full file. Just 1M or so should be enough to get several profiles. Likely they are in the air, and therefore not even useful values. Unix and unix-like machines have nice ways to chop out part of a file.
Meantime, I will alter the code from what I did yesterday, because it will break on lots of files, the way it is now. (I was guessing that if vv
was defined, then vdistance
would be defined. Not so.)
setwd("~/git/oce-issues/18xx/1837")
library(oce)
#> Loading required package: gsw
#> Loading required package: testthat
#> Loading required package: sf
#> Linking to GEOS 3.8.1, GDAL 3.2.1, PROJ 7.2.1
if (file.exists("~/git/oce/R/adp.R"))
source("~/git/oce/R/adp.R")
if (!file.exists("sentinel_adcp_example.pd0"))
stop("cannot test this, since 'sentinel_adcp_example.pd0' is not present")
d <- read.oce("sentinel_adcp_example.pd0")
#> Got to end of data while trying to read an RDI file (cindex=998294; last7f7f=998290)
#> Warning in read.adp.rdi(file, processingLog = processingLog, ...): skipping the first ensemble (a temporary solution that eases reading of SentinelV files)
#> Warning in read.adp.rdi(file, processingLog = processingLog, ...): A list of unhandled segment codes follows. Several Teledyne RDI manuals
#> describe such codes; see e.g. Table 33 of Teledyne RD Instruments, 2014.
#> Ocean Surveyor/Ocean Observer Technical Manual.
#> P/N 95A-6012-00 April 2014 (OS_TM_Apr14.pdf)
#> Code 0x01 0x0f occurred 444 times
#> Code 0x00 0x70 occurred 444 times
#> Code 0x01 0x70 occurred 444 times
#> Code 0x02 0x70 occurred 444 times
#> Code 0x00 0x32 occurred 444 times
#> Code 0x04 0x70 occurred 444 times
#> Warning in read.adp.rdi(file, processingLog = processingLog, ...): length(time)=444 exceeds length(profileStart)=443 so trimming time
#> length(time)=444 exceeds length(profileStart)=443 so trimming time
cat(vectorShow(d[["distance"]])) # 95 elements, highest value 95.92
#> d[["distance"]] [1:95]: 1.92, 2.92, ..., 94.92, 95.92
# Q: where is 'vdistance'?
sort(names(d@data))
#> [1] "a" "ambientTemp" "attitude"
#> [4] "attitudeTemp" "contaminationSensor" "depth"
#> [7] "distance" "g" "heading"
#> [10] "headingStd" "pitch" "pitchStd"
#> [13] "pressure" "pressureMinus" "pressurePlus"
#> [16] "pressureStd" "q" "roll"
#> [19] "rollStd" "salinity" "soundSpeed"
#> [22] "temperature" "time" "v"
#> [25] "va" "vdistance" "vg"
#> [28] "vq" "vv" "xmitCurrent"
#> [31] "xmitVoltage"
sort(names(d@metadata))
#> [1] "beamAngle" "beamConfig" "beamPattern"
#> [4] "bin1Distance" "binMappingUsed" "bytesPerEnsemble"
#> [7] "cellSize" "codes" "coordTransform"
#> [10] "cpuBoardSerialNumber" "dataOffset" "depthMean"
#> [13] "ensembleInFile" "ensembleNumber" "errorVelocityMaximum"
#> [16] "falseTargetThresh" "filename" "firmwareVersion"
#> [19] "firmwareVersionMajor" "firmwareVersionMinor" "flags"
#> [22] "frequency" "haveActualData" "headingAlignment"
#> [25] "headingBias" "instrumentSubtype" "instrumentType"
#> [28] "latitude" "longitude" "lowCorrThresh"
#> [31] "manufacturer" "numberOfBeams" "numberOfCells"
#> [34] "numberOfCodeReps" "numberOfDataTypes" "numberOfSamples"
#> [37] "oceBeamUnspreaded" "oceCoordinate" "orientation"
#> [40] "originalCoordinate" "percentGdMinimum" "pingsPerEnsemble"
#> [43] "profilingMode" "sensorsAvailable" "sensorSource"
#> [46] "serialNumber" "systemBandwidth" "systemConfiguration"
#> [49] "threeBeamUsed" "tiltUsed" "transducerDepth"
#> [52] "transformationMatrix" "transmitLagDistance" "units"
#> [55] "vBeamHeader" "velocityMaximum" "velocityResolution"
#> [58] "wpRefLayerAverage" "xmitPulseLength"
# Next fails, as there is no 'vdistance'
ds <- subset(d, distance < 50)
#> should handle va, vg, vq and vv now. But we need to debug this first...
#> next is vdistance:
#> Error in print(vdistance): object 'vdistance' not found
Created on 2021-06-15 by the reprex package (v2.0.0)
Oh, I take that back. I was looking in the wrong place for vdistance
. NOTE: on Clark's test file, vdistance
is identical to distance
, except for a 4.00cm difference. And that's on a range that goes up to 95.92. Therefore, if I cannot figure out the trick to use vdistance
, I would not be too bothered about using distance
.
I think I have this working in "develop" commit 36fcedcfa94c5218f8f8cd3f6ea643bd4227a7bb; see the test code by clicking 'Details' below. I have not looked in great detail at the output, but at least the summary()
reveals that the dimensions are correct, and the code is relatively self-evident (at least to those who understand eval()
, substitute()
, environments, etc.).
'
to a "
, because I use regexps to find things and I search for double-quoted strings.)PS. Following my usual habit, I've retained some commented-out code that I'll delete later. (I like to keep such things to save time, if the modifications don't prove successful.)
@dankelley I tested your latest changes. It looks like subsetting by distance and pressure is now working correctly. The one remaining thing is all the other metadata variables like xmitCurrent
etc. are left unaltered by the subset, is this intended behaviour?
Reprex below.
Thanks for the reprex, and the question. I'm a bit too tired (as a morning person in an eastern timezone) to go into the code again. I am not actually too sure what a lot of the other things are, to be frank.
Your comment has two issues, I think, one for subset by pressure and another for subset by distance. These are very different things, of course, and when I look at this tomorrow I'll try to see what's going on in each.
In the meantime, my notes are as follows ... restricted to the subset-by-pressure problem.
I don't think this will take long to fix. I'll be working with the test file that Clark provided, so I can have a think-edit-check cycle that is measured in seconds, not hours.
I see the problem: I am working through data items that were known to us when we wrote the original code, many years ago. The snapshot illustrates. I'll change it so that I go through all the items in the data
slot. I may need to think a bit about this, because e.g. suppose we have N
profiles. For a subset by pressure, I create a logical called keep
of length N
. But we can only apply that to things that are keyed to profile. I think most things are, but I'm not completely sure of that. I think what I'll do is e.g.
This is for pressure
subsets. A similar thing will be needed for a time
subset (and there are likely other subsets that are similar).
Once I get them done, I'll look at the distance
subsets.
I have tackled the subset-by-pressure case in commit 94ebb08dcb8d94778ebf270f7c51ac902f17807a of the "develop" branch. A test code is 1837a.R at https://github.com/dankelley/oce-issues/tree/main/18xx/1837, and it's results are shown in the Details below. (Readers should git-clone the oce-issues repository, so they can reproduce this test.)
I will look at subset-by-distance next. (Here, as so often, I really wish GH had threaded comments, but I ask that readers use thumbs-up or thumbs-down here, to indicate whether they think this worked. If you make a new issue referring to this, please include a link to this comment, so at least readers can see what you're referring to.)
NOTE: the new code is heavily instrumented with oceDebug() calls, so using e.g. subset(d,pressure<median(d[["pressure"]], debug=1) will produce information on each data item, as it is examined.
In the Details, notice that wherever 443 appears in the summary for 'd', 188 appears in the summary for 'dsp'. If that were not true, we'd know that the subset() was failing. (By logic, the reverse need not be true, but examination of the code by clicking the link at the top of this issue comment should help readers to judge the correctness of the update.)
Oh, hang on. Now that I'm looking at subset-by-distance, I see that these data can have e.g. @metadata$flags$v
, so I'll need to assure that this works throughout. (Lots of this code is over a decade old, and so the data structure is not in my head!)
PS. I have asked @richardsc for permission to put his sample file into the oce repo (but not the CRAN tarball) so that I can hardwire tests on things like this. (A code test is self-evident, unlike one of dozens of comments in one of hundreds of closed issues...)
Oh, hang on^2. I see that my test dataset does not have any @metadata$flags
content, so I won't be able to test with that. Therefore, I am not going to bother with this ... non test, nulla codice
I think things work now for subset-by-pressure (see Details)
I think things work now for subset-by-distance (see Details)
The test suite now incorporates the tests that I showed in the previous two issue comments. These tests will not be done in CRAN, but this should not matter, because they will be activated whenever a developer builds and tests oce from the repo source.
This issue has been automatically marked as stale because no comments have been made in two weeks. Unless comments are made in the next two weeks (or the 'pinned' label is applied), this issue will be closed automatically in a week. Please note that this is not a reflection on the quality or importance of the issue, but merely a reminder of the march of time. A simple comment is all it takes to keep this issue from being closed automatically, two weeks from now.
Term is approaching and I'd like to clean things up. From a quick look at the comments, it seems that this issue has been addressed.
Perhaps, @jessecusack or @richardsc would consider either closing the issue or adding comments indicating what remains to be done?
Subsetting now works for 5th beam data and all the other variables in a sentinal V dataset, so I'm happy to close the issue! Thanks @dankelley for all the time and effort you put in!
Short summary of problem
The subset function doesn't apply to the vertical beam data from sentinel V ADCP such as
vv
,vq
,va
etc.How urgent is this?
Not very...
Reproducible Example