dankelley / oce

R package for oceanographic processing
http://dankelley.github.io/oce/
GNU General Public License v3.0
142 stars 42 forks source link

support Nortek ad2cp format #1219

Closed dankelley closed 5 years ago

dankelley commented 7 years ago

This will only make sense if there is sufficient user demand, because file format seems to be quite different to the Nortek format(s) that we support as of now.

EDIT: I am moving a checklist from the bottom of this thread to the top, so the status can be seen in the overview page.

richardsc commented 7 years ago

I'm definitely interested in being able to read this format. I have several files that I have just not found the time to try reading at all, and am also looking at the capabilities of the instrument for some future deployments.

richardsc commented 7 years ago

Also, note that there is currently an open issue relating to this, at #1057

I'm not sure if these "Signature" ad2cp files are the same as the ones you're looking at above, but I suspect that they are.

Note that the other issues has a link to the IMOS toolbox which can read the files: https://github.com/aodn/imos-toolbox/search?utf8=%E2%9C%93&q=ad2cp

dankelley commented 7 years ago

I'm not entirely clear on whether this format is intended for all Nortek instruments, or whether it's specific to some instruments. I suspect the former, and if that's the case, then we definitely need this. @richardsc do you have a file that you and I can share privately? I don't like handling third-party files unless that party has requested our help, which is not the case here, so far.

I prefer to err on the side of privacy whenever possible, because it's hard to know at the outset just what aspects are or will become private, so it's easy in an issue comment to say something that then cannot be "unsaid". For example, a student working on a dataset for a thesis would not appreciate the world knowing of difficulties in reading data, where maybe a transformation matrix is wrong at some stage in analysis, so someone might think the student's work is suspect. Students (and others) should be able to make mistakes privately, sharing only when they think things are ready. Otherwise we can have the climate-gate sort of issues.

racheel commented 7 years ago

I can't speak to if Nortek plans on applying this (ad2cp) file format to future instruments, but I do know that it is supposed to be used for all of their signature series. It wouldn't surprise me if they decide to use this format for future instrumentation.

If @richardsc can't easily get an open data file, I should be able to obtain/generate one quite easily. Certainly an "in-air" one any time.

dankelley commented 7 years ago

@racheel I think it would be great to get a small in-air file. Smallness helps in testing and sharing. It would also be enormously helpful if you could provide some kind of a nortek dump of the file, because that can help in seeing whether things are getting read correctly.

My impression (so far) is that the ad2cp manual is less clear then the older ones, e.g. (a) they don't list bytes by offset and (b) they don't seem to say anywhere what the overall file structure is (e.g. do we have headers interspersed with data throughout, etc). In the test file I've been looking at, the first two records of are the same code type, and that code type includes text and something else. Well, the first record is definitely text but the second is (so far) a mystery to me.

dankelley commented 7 years ago

I'm making some progress on this, but will not be posting here again until Tuesday. I suspect that we'll need to do some thinking about how to organize the data structure, given the flexibility of the data we can expect. In the test dataset provided by @racheel I see the following chunks:

  1. a single string data record (id 0xA0=160)
  2. a single burst data record (id 0x16=21)
  3. a single average data record (id 0x17=22)
  4. 8 burst data records
  5. a single average data record
  6. as number 4
  7. as number 5 etc, with a repeated sequence of (pattern 4 followed by 5).

OK, well that first one (in this case, anyway) is clearly information about the dataset. (I already knew that from direct inspection of the data). But the whole point of this setup is that the user has a lot of flexibility. So it will be important to figure out how to represent the data. It's definitely not going to be in the simple matrix form of the existing adp data structure.

The point of this comment is to get @richardsc thinking about how we might set up a structure. Perhaps he and I and @racheel can chat on Tuesday.

PS. I am not pushing my tests to GH, and nobody should be trying to pull and build until I say so. It's all on the ad2cp branch anyway, so there's no danger of normal users getting hosed by e.g. memory leaks (which are easy to do when you're fiddling with C, as I'm doing here).

PPS. I still don't have the checksums working on the data part, although they are ok on the header part.

dankelley commented 7 years ago

I think I pretty much have code to read the one test file I've been looking at. It's not in oce yet, but rather in the issue repo https://github.com/dankelley/oce-issues/tree/master/12xx/1219, in two .R files. There's lots still to do, including reading a lot of variants of a lot of data types, but I'd prefer to start with data types in files that are of interest to colleagues. And the first order of business will be to set up a suitable object format ... only once that gets figured out will this code start being inserted into oce.

dankelley commented 7 years ago

I will likely code something over the weekend. My feeling is that I'll use an event-based object model, rather than a matrix-based model. That will mean it will be totally flexible, e.g. if nortek supports having 20 different burst patterns in a dataset, or a 1000 patterns, things will still work (we won't need to set up matrices for each possible pattern). And it will handle conditional sampling, which is where I'd be going, if I were nortek.

Of course, an event-based model will break plot, because it wants to use matrices. As a workaround, I can set up subset to create matrix-based objects, for appropriate values of the second argument. (This will require some more reading of the nortek docs, to find whether there is a key buried in there somewhere that can be used to find families of data. We cannot just ask for "burst" data, for example, because even today it's possible to interleave two burst types and two average types.)

No matter what, there will be a warning in the docs, and displayed by the function, that will tell about limitations. The point of this is to get some code into oce, so that users can try doing work, and I can base the code on what comes up in real life. Ultimately, the user will not need to know about internal storage because they should use [[ to access data, etc.

richardsc commented 7 years ago

I'm not close enough to this right now to have a strong opinion, but I see the advantages of both the "event-based" model and the "matrix" model. Of course, if the accessors (e.g. [[ etc) are coded appropriately, then it should be possible to easily extract specific "event" types in a form that would be expected if they were the only data type in the instrument (e.g. for a classic adp object).

dankelley commented 7 years ago

This weekend, I will likely read the nortek doc (not just the data part) to try to learn more about how flexible things are right now. I know of a local file that has two burst modes and two averaging modes, interleaved. So that would be at least 4 matrix groups (4 matrices for velo, 4 matrices for signal strength, etc) and they would need to get sensible names or some way of naming them through metadata, etc. Also, I really suspect there is a flag somewhere in the data that indicates "families" of data, because the whole point of the new instruments seems to be that they can be a sort of hub of meta-instruments, letting one PI have a particular sampling scheme, and another PI have another, etc, with no problems so long as there are no conflicts in time (including cross-beam interference, which forces extra time limitations so the centre beam doesn't wreck the other data)

dankelley commented 7 years ago

I see from the following that the first header/data chunk will always be a textual item containing the configuration, so that wasn't just a coincidence of commonality in the test files I've seen http://www.nortek-as.com/en/knowledge-center/forum/system-integration-and-telemetry/443043193

dankelley commented 6 years ago

copied from #1057, which we are now closing...

Not likely going to be able to do this in the next few weeks, but want to document information as I find it.

I have an example file, but it's very large (~1GB). Might be good to try finding a smaller one to start.

dankelley commented 6 years ago

The discussion of this issue seems to have ended, and the developers will likely close it soon, since it seems to have been addressed. This goes against the usual oce policy of hoping that reporters will close issues, so this is really a request to the original reporter to either (a) close the issue or (b) indicate that further action is desirable.

PS. This is a standardized reply.

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because two weeks have passed with no activity. Adding the 'pinned' label will prevent this from ever going stale.

dankelley commented 6 years ago

I am pinning this issue because I'd like to have a look at it again, sometime, ideally after @richardsc and I can talk about whether he has datafiles of this type.

dankelley commented 5 years ago

@richardsc, above at https://github.com/dankelley/oce/issues/1219#issuecomment-293448407, you note that you have some files in ad2cp format that you'd like to read. I just did a test (with a private datafile that I think you and I both have) on the code in https://github.com/dankelley/oce-issues/tree/master/12xx/1219 and that code at least does something, although none of this is fresh in my head anymore. So I tried some tests with oce and got as follows

> library(oce)
Loading required package: testthat
Loading required package: gsw
> oceMagic("labtestsig3_00.ad2cp")
[1] "adp/nortek/ad2cp"

so oce knows what the file type is. But oce cannot read it yet

> d<-read.oce("labtestsig3_00.ad2cp",debug=10)
read.oce("labtestsig3_00.ad2cp", ...) inferred type="adp/nortek/ad2cp"
  read.adp.nortek(...,from= 1 ,to= (missing) ...)
  fileSize: 211456612 
  first 10 bytes in file: 0xa5 0x0a 0xa0 0x10 0x89 0x0f 0xea 0xf4 0x44 0xd5
  headerSize: 10 
  ID: 0xa0 (NB: 0x15=burst data record; 0x16=avg data record; 0x17=bottom track record; 0x18=interleaved data record; 0xa0=string data record, e.g. GPS NMEA, comment from the FWRITE command
  dataSize: 3977 
  buf[1+headerSize+dataSize=3988]=0xa5 (expect 0xa5)
IMPORTANT: this code returns a list, not an adp object; see docs
Error in do_ldc_ad2cp_in_file(filename, from, to, by) : 
  argument "to" is missing, with no default

This is a very old issue, and I wonder whether I ought to do some work on it after Christmas. What do you think, Clark?

PS. by "work", I mean that the first step would be for me to look in the code ... I don't even know if any of the hard-wired stuff in https://github.com/dankelley/oce-issues/tree/master/12xx/1219 has made its way into oce ❗️

dankelley commented 5 years ago

Oh, I just read the comment I just made, and did a second test. Things look a bit better, and I'm starting to get a recollection of where we are on this -- it looks like I did some first tiny steps and then let this issue slip off my radar.

> library(oce)
Loading required package: testthat
Loading required package: gsw
> d<-read.oce("labtestsig3_00.ad2cp",from=1,to=10,by=1)
IMPORTANT: this code returns a list, not an adp object; see docs
WARNING: at cindex=10, data checksum is 18666 but it should be 62698
> plot(d)
Error in xy.coords(x, y, xlabel, ylabel, log) : 
  'x' is a list, but does not have components 'x' and 'y'
> str(d)
List of 4
 $ buf   : raw [1:211456612] a5 0a a0 10 ...
 $ index : int [1:10] 10 3997 5171 7721 8895 10069 11243 12417 13591 14765
 $ length: int [1:10] 3977 1164 2540 1164 1164 1164 1164 1164 1164 1164
 $ id    : int [1:10] 160 21 22 21 21 21 21 21 21 21
dankelley commented 5 years ago

I'll play in https://github.com/dankelley/oce-issues/tree/master/12xx/1219 first, e.g. I just made 1219c.R which tries to select some velocity profiles. But I think this file might be for an instrument turned on in the air in the lab, so it's not a great test when it comes to seeing whether the data seems to have been read correctly.

dankelley commented 5 years ago

The code in https://github.com/dankelley/oce-issues/tree/master/12xx/1219 does some tests against matlab, so I think what I'm reading there is probably correct. There are definitely some issues to sort out, but certainly I think it's time for me to make read.adp.ad2cp return an oce object, not just a list. I'll do some of that today. Since the whole oce idea is that users shouldn't have to worry about exactly what is in an object (because [[, plot, subset and so forth are supposed to do that worrying), I may take some of the code from https://github.com/dankelley/oce-issues/tree/master/12xx/1219/1219c.R and put it into the plot generic. I see that I have a branch called ad2cp, so I'll first update that, and put some of this there ... but I hate having complicated sidebranches so I may port it into the develop branch at any moment, because no users ought to be trying this code anyway.

dankelley commented 5 years ago

I have a provisional form of [[ working (see https://github.com/dankelley/oce-issues/tree/master/12xx/1219/1219d.R#L26). This constructs a matrix from a subset of the list that is stored as x@data$v. At https://github.com/dankelley/oce/blob/ad2cp/tests/testthat/test_ad2cp.R (i.e. a test file in the "ad2cp" branch), you can see this in action, and also see how using oceGetData(d,"v") can be used to retrieve the whole list.

I am quite uncertain on the best way to frame the UI, because this data type can hold several streams, and I think we need a way to make it easy for users to extract different streams. So, I need to read a bit more about the format, but also talk with some people who are using these machines in practice, to find out what users might like to do. It would be mistake to construct a UI that makes it easy to do things that users won't want, and hard to do the things they do want!

dankelley commented 5 years ago

Preview: I seek advice from anyone familiar with AD2CP data formats...

Based on personal communication with @racheel, I now have the impression that, within a given AD2CP data file, all 'burst' data will share a certain geometry (e.g. bin sizes, etc), and all 'average' data will share another (or possibly the same) geometry.

If true, this has implications for the storage within R objects, and also for the UI of data extraction and plotting. For example, if all bursts are similar within a given file, I would be able to create a list within the data slot for burst data, and another one for average data, etc., and that would simplify a lot of the coding for plots, data extraction, object summaries, object subsets, etc. But I will need to know for certain that this is true, before I'll be comfortable in changing the internal structure that I've cobbled together so far, which is based on a list of heterogeneous data-record types, interleaved in an order that stems from order within the data file (i.e. time).

So, if anyone reading this knows where in nortek docs I can find a statement about all bursts within a given data file being of similar geometrical properties (and the same for averages), please let me know. I'll also be doing some reading of nortek docs, and will post a new comment here if I get a conclusion, so that anyone following the issue will see that this has been settled.

Thanks!

dankelley commented 5 years ago

No, I am wrong on the idea that are bursts are alike within a given file. In the file N3015-002 SignatureOperatoins2505001000C.pdf (I'd give a link but I don't know it ... this is just a nortek file that I happen to have available), in Figure 13, reproduced below, it's clear that simply knowing whether a data record is 'burst' or 'average' is insufficient. We need to know also which 'plan' it is within. The diagram shows 2 plans, but I don't know whether the instruments permit more than 2 plans at the moment (and am guessing that 3+ will come at some time).

Therefore, I cannot just make a tree like @data$burst for burst-related data and @data$average for average-related data.

This is tricky because I don't think I've seen in the integrator guide any part of the data stream that says which 'plan' a data record belongs to.

screen shot 2018-11-23 at 4 16 12 pm
dankelley commented 5 years ago

In the file N3015-002 SignatureOperatoins2505001000C.pdf I see as I've pasted at the end of this comment. This leads me to believe, as of now, that AD2CP files can contain only two "plan"s. And, from the comment just above, I think (but am not certain) that any single "plan" cannot have more than 2 burst schemes or more than 2 average schemes.

Therefore, I think I can set up the R object data slot as e.g.

data <- list(plan = list(burst=list(), average=list()),
    plan1 = list(burst=list(), average=list())

which is hard-coded for plan name, as stated in the nortek docs. Maybe the better would be e.g.

data<-list(plan=list())
data$plan[[1]]<-list(burst=list(),average=list())
data$plan[[2]]<-list(burst=list(),average=list())

This second scheme seems more sensible to me, and we don't have to worry about users having to remember to type all the brackets because they should be using e.g. [[ instead of going directly into data structures.

Now, some words on how users might access. Perhaps they could do e.g.

d[["time", "plan/burst"]] # burst data in first plan
d[["time", "plan1/burst"]] # burst data in second plan
d[["time", "plan/average"]] # average data in first plan
d[["time", "plan1/average"]] # average data in second plan

and similar for "v" and "distance" accesses. Defaults could be set up for e.g. a file with only a single plan and only either burst or average mode. All of this would have to work also for plotting, which might be quite a lot trickier to code (owing mainly to the wordiness of that code).

I suppose another approach would be to leave the accessors and plotting as they are, and simply to teach subset() how to isolate one "plan" from another, and to select either "average" or "burst" data streams. This would be much simpler to code (which has advantages in terms of bug generation). I think it might also be easier for the user, e.g. switching an analysis from burst to average would mean changing just a single subset() call, not possibly a lot of [[ and plot calls.

I am partly thinking by typing, here. As is often the case, the last thing I type is close to my preference, or, put another way, is my proposal to co-developers and other interested parties.

Thoughts from others?

screen shot 2018-11-24 at 9 19 20 am
dankelley commented 5 years ago

I've got myself turned around on the 'blanking' in the test file. I wonder if someone who has that file, and can run matlab on it, can tell me the value of 'blanking' in this file, for the 'burst' type and for the 'average' type?

racheel commented 5 years ago

The burst type has a blanking distance of 2.8, all other modes in this file are 0.1

dankelley commented 5 years ago

To co-devels, RC and I just had a conversation, and I have some new (private) test data. I'll likely return to this after a few days. My tasks will be as listed in the checklist that has been added to the originating comment of this thread.

dankelley commented 5 years ago

The "ad2cp" branch now handles this format, at last provisionally. It is well-tested, using matlab as a reference, for a particular file in my possession, and I think this shows that the average and burst data-types are handled properly. I also have two other test files, one with only burst data and the other with both burst and interleavedBurst records, but in these two test cases, I do not have matlab output with which to compare, and can only say that the reading proceeds without error.

The code is only minimally documented, because I wanted to play with various syntaxes, first. The file tests/testthat/test_ad2cp.R demonstrates how to use it.

I would like to get some more test files before closing this issue. That will require some face-to-face discussion with people who know more about this, so I think it will be delayed until sometime in January, perhaps a day when CR is at Dalhousie. I would particularly like to know whether there are any echosounder datasets that I could use for testing, because the Nortek docs are not very clear on that data type. (Really, they are not very clear on quite a lot of things. For example, can a data record that is of type 'altimeter' also have velocity, amplitude, and quality? I cannot find anything in the docs on this sort of issue, but RC has shown me the setup GUI, and some combinations are possible.)

dankelley commented 5 years ago

I feel that this "ad2cp" branch has been isolated for long enough. Therefore, my plan is to clean up the read.adp.ad2cp docs a bit today, and to merge "ad2cp" into "develop" tomorrow morning. We have a pretty good test suite against matlab-inferred values, and so I am reasonably sure that read.adp.ad2cp works at least on files I have available. As for other types of files, e.g. those with echosounder data in the stream, I don't see any reason for updates to sit in the "ad2cp" branch as opposed to the "develop" branch, because they will of course be isolated within read.adp.ad2cp() and therefore won't affect e.g. people working with RDI data.

Naturally, if another developer gives a thumbs-down or a full comment asking me not to merge tomorrow, I'll hold off (for as long as they want -- this is more a spring-cleaning exercise than something I need in my own work).

richardsc commented 5 years ago

I have not had a chance to play with this, but I trust the work you've done to consider it "stable enough" for merging in to develop.

dankelley commented 5 years ago

I've done this (see below) and am closing the issue because it (as defined by the title) has been addressed. Bugs and new features belong in new issues.

commit 2456d89ad314bc4c990971c3f74c59b1ee73d17b Merge: 8c7f32bb 1731c9a1 Author: dankelley kelley.dan@gmail.com Date: Sat Feb 9 12:42:17 2019 -0400

Merge branch 'ad2cp' into develop