dankelley / oce

R package for oceanographic processing
http://dankelley.github.io/oce/
GNU General Public License v3.0

Support for .arg raw files from Multi-Cell Doppler Current Profilers #1637

Closed hhourston closed 4 years ago

hhourston commented 4 years ago

I have a raw ADCP file in .arg format that came from a Multi-Cell Doppler Current Profiler, which I attempted to read into R using read.adp(). An adp object would not be created -- I received an error message saying the file type is unknown. Is support for .arg files something that may be implemented in the future?

This concerns archived data so this is not an urgent issue.

dankelley commented 4 years ago

I've not heard of this file type before. If

  1. other users are also interested in the format (measured by thumbs-up icons below, and possibly by comments as well, especially if they are along the lines of "I need this for my work")
  2. you have docs specifying the file format
  3. you have a sample file you can share
  4. the oce coding seems not to be too arduous

then it could be considered. There are several high-priority items for 'oce' slated for the solstice holiday, though.

richardsc commented 4 years ago

Hi Hana,

As Dan said, a manual that outlines the data format and a sample file would help (email to kelley.dan@gmail.com and clark.richards@gmail.com).

A quick google suggests that it is probably a Sontek ADP, like this one:

https://www.sontek.com/argonaut-xr

(though maybe an older model).

I also found a potential manual (though not on the Sontek site ...) here:

https://eng.ucmerced.edu/snsjho/files/San_Joaquin/Sensors_and_Loggers/SonTek/SonTek_Argonaut/ArgonautXR.pdf

Page 87 has the binary data format.

However, because of a past project we actually did write code to read one type of Sontek ADP. I don't know if it was the same kind (pretty sure it wasn't "multi-cell"), and I don't know if the file format is the same.

You could try a more direct approach by doing:

d <- read.adp.sontek(...)

to see if that gets you anywhere.
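If that call fails, it helps to capture the error message rather than letting it stop the session. A minimal sketch (the filename is hypothetical, and the helper name is mine, not part of oce):

```r
# Try the Sontek reader on a candidate file, capturing any error
# (e.g. an unknown-file-type complaint) instead of aborting.
tryRead <- function(filename) {
    tryCatch(
        oce::read.adp.sontek(filename),
        error = function(e) conditionMessage(e)
    )
}
# result <- tryRead("file.arg")   # either an adp object or an error string
```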

dankelley commented 4 years ago

The existing code will not handle this instrument. This is because at https://github.com/dankelley/oce/blob/4e59dc0b5eeb6cac3ffc9c26e85cccd9a5c61a59/R/adp.sontek.R#L104 we require byte sequence 0x10 0x02 at the start of the file, but the docs that Clark so kindly pointed out indicate that this file type starts with 0x40 0x02; see page 87 of that documentation, under the heading 'server configuration structure'.

I've not looked at the rest of the docs, and it seems possible that read.adp.sontek() could be made to work, after changing this and other things. The first few entries following 0x40 0x02 look similar to what we have, so the first thing I'd try (once we have a data file) is permitting 0x40 0x02 at the start, to see how far our code can get. I am heartened by the fact that both the existing code and the documentation that Clark found say the first block is 96 bytes, so perhaps a lot of things are the same. But the table covers 2+ pages in the docs, so there is a lot of stuff to be checked line by line, and possibly modified. And then I think it's possible that we have deeper issues, with things like the beam transformation matrix, that might make it hard for existing code to handle this format.

It could be a couple of hours of work, or a couple of days. We'd need a sample file to say more. And, of course, we'd also need sontek docs that match the sample file, because it would be counter-productive to do a lot of coding only to find that the data differ and that we have no pertinent documentation.
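A quick preliminary check, once a file is in hand, would be to compare its first two bytes against the 0x40 0x02 mentioned above. A sketch (the helper name and filename are hypothetical):

```r
# Read the first two bytes of a candidate .arg file; per the manual
# Clark found, an Argonaut file should start 0x40 0x02, whereas
# read.adp.sontek() currently insists on 0x10 0x02.
firstTwoBytes <- function(filename) {
    con <- file(filename, "rb")
    on.exit(close(con))
    readBin(con, "raw", n = 2)
}
# identical(firstTwoBytes("file.arg"), as.raw(c(0x40, 0x02)))
```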

hhourston commented 4 years ago

I just sent an email containing the file. The file did indeed come from a Sontek Argonaut XR instrument, so the manual Clark found looks right -- I have been unsuccessful so far in finding a copy of the manual elsewhere.

I will also confirm that read.adp.sontek(...) did not work with the file.

dankelley commented 4 years ago

I tried making read.adp.sontek() permit the first byte being 0x40 (in addition to the present 0x10) and it gets somewhere at least. I'll look into this some more tomorrow. (I renamed your file for privacy, and have redacted your serial number for the same reason.) A few things to check:

> library(oce)
Loading required package: gsw
Loading required package: testthat
> d<-read.adp.sontek("~/Dropbox/oce-issue-1637/file.arg",debug=3)
    read.adp.sontek(...,from= 1 ,to= (missing) ,by= 1 type= adp pcadp ...)
    file /Users/kelley/Dropbox/oce-issue-1637/file.arg has 3013196 bytes
    have a header, but ignoring it for now
    serialNumber= (redacted) 
    adp.type= 18 
    frequency= 
    nbeams= 3 
    beam.geometry= 1 ; 0 (2 beams); 1 (3 beams), 2 (4 beams with 1 vertical), 3 (4 beams, Janus)
    slant.angle= 25 
    orientation= up 
Error in do_ldc_sontek_adp(buf, 0, 0, 0, 1, -1) : 
  cannot determine #beams or #cells, based on first 1000 bytes in buffer
hhourston commented 4 years ago

I was unable to find the beam angle and number of beams in the text file either, although from pictures of the instrument it looks like it has 3 beams. I'm currently waiting to get the Sontek viewing software ViewArgonaut downloaded onto my computer, which will hopefully allow me to confirm that information. I also checked the manual Clark shared and while it listed the number of beams and beam angle for Argonaut SL and ADV instruments, it did not for XR instruments, unfortunately.

richardsc commented 4 years ago

Looks to me like the Argonaut-XR is the newer version of the Sontek ADP that read.adp.sontek() was written to read (10 years ago!). From looking at the product online I think that it has 3 beams only. I haven't found documentation of a beam angle, but 25 seems reasonable to me, since it is designed for relatively shallow water.

dankelley commented 4 years ago

Hm, Clark's comment about the argonaut-XR being a newer version of an instrument we used before makes me think that oce ought to handle this new format. With the two user manuals open side-by-side, I can see that it should be feasible to support the argonaut-XR. (I sure wish Sontek would give byte counts in their docs, though, as other manufacturers do, so I didn't have to count bytes. Basically, I think all the manufacturers are copying RDI equipment, and it would have been nice if they had also copied RDI's way of documenting formats...)

I've made a new branch called "argonaut", and I'll fiddle with that over the solstice break ... speaking of which, below is today's analemma from http://emit.phys.ocean.dal.ca/~kelley/analemma/

dankelley commented 4 years ago

(Question for @hhourston at the end. I don't want to get too far into the weeds without checking.)

Hm, I'm starting to code the header recognition and getting a problem near the start. The screenshot shows this. According to the docs I have (dated 2001), the "system type" is a single character (i.e. byte) that should have low nibble either 0 or 1, and high nibble either 0, 1, or 2.

As a reminder, a nibble is half a byte, or 4 bits. The snapshot shows the bits as e.g. 00 for 0 and 01 for 1, so the nibbles I see are 0100 and 1000. So maybe their notation is reversing the bits within the nibbles. Sheesh, I dunno. I would read those as 4 and 8, but those do not fit the allowed pattern (in my first paragraph) so I guess we need to read them in reverse order, i.e. we get

0100 -> 0010 -> 2
1000 -> 0001 -> 1

which, given paragraph 1 here, means that the first nibble must be the "high" nibble. So that means the system type is SL and the frequency is 1.5MHz.
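The bit gymnastics above can be made concrete with a small sketch (the byte 0x48, bits 0100 1000, is my reading of the screenshot; the helper names are mine):

```r
# Split a byte into its high and low nibbles, then reverse the 4 bits
# within a nibble, matching the reading above: 0x48 has nibbles 4 (0100)
# and 8 (1000), which bit-reverse to 2 (0010) and 1 (0001).
nibbles <- function(b) {
    v <- as.integer(b)
    c(high = v %/% 16L, low = v %% 16L)
}
reverseNibble <- function(n) {
    bits <- as.integer(intToBits(n)[1:4])  # least-significant bit first
    sum(bits * 2L^(3:0))                   # read them back in reverse
}
```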

Question for @hhourston do you know if this is type SL, and freq 1.5MHz?

[Screenshot: the manual's description of the "system type" byte]
dankelley commented 4 years ago

Oh, I'm an idiot. I was looking at the docs that @richardsc told us about at https://github.com/dankelley/oce/issues/1637#issuecomment-567737667 but those are for an argonaut adv, not an adp.

So most of what I've said above is likely wrong.

I'm back to square 1, and I'll ask @hhourston: if you are looking at these data in some official way, then I imagine someone must have the manual at your site. If you can get it in PDF format, please send it. Or if it's only in paper format, please scan and send. We won't be able to make any progress without a manual that reveals the data format, byte by byte.

Two bytes forward, one byte back -- that's how I'll assess the progress.

richardsc commented 4 years ago

Hm. I find it weird that in searching I can't find an "official" manual for the Argonaut-XR, but I am 99% sure that the manual I linked to is for an ADP and not an ADV.

For one thing, an ADV functions using a completely different geometry that is not "beam" based like an ADP (different transmit and receive transducers).

What makes you think that manual is wrong?

dankelley commented 4 years ago

Hm, let me look at the manual again (page refs are the PDF pages, not the print pages)

  1. p7: the title page of the doc says "current meter"
  2. p107 (print page 89): the vel is stored in 6 bytes and, indeed, the data structure is of fixed size

It was item number 2 that gave me the clue. I was working through the docs, decoding this byte and then that byte, when I got to that part. Then I looked in more detail and actually read the cover page, instead of just noticing the word 'argonaut'.

I googled around to try to find docs, but nothing turned up. (A lot of the returned results were for oce, actually.) That is why I'm hoping that @hhourston can find a document; whoever bought the machine likely has one.

dankelley commented 4 years ago

Hm, unless the CellBegin and CellEnd mean there is one data structure per cell. But then I'm left with the confusing feature that the file contains only 10 or so pairs of 0xB0 0x26 (for a 'LONG' record) or 0xB1 0x16 (for a 'SHORT' one).

Oh, and also the 'SHORT' one does not have a CellBegin and CellEnd, so my theory about one-record-per-cell does not make sense.

And the title page ... that's a kicker.

You can see my byte detection in the "argonaut" branch, commit b8081e42ef08a5a88b0294e4ec399208ff130716; maybe my quick coding is wrong on the byte detection. (Note that you find the second header byte by using as.raw on the record byte count.)
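The parenthetical note is easy to check in R: the second header byte is just the record byte count as a raw value (record sizes taken from the docs discussed above):

```r
# A 38-byte 'LONG' record is tagged 0xb0 0x26 and a 22-byte 'SHORT'
# record 0xb1 0x16, because as.raw() of the byte count gives the tag.
longTag  <- as.raw(38)   # 0x26
shortTag <- as.raw(22)   # 0x16
```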

dankelley commented 4 years ago

I thought it might be interesting to analyze the bytes in the file. I'll let my code and the results speak for themselves. I am looking at the mean and std-dev of the spacing between recurrences of each possible byte in a binary file. If data chunks are of equal length, then their starting bytes ought to show a low std-dev of spacing.

Code

# BUF holds the raw bytes of the file (e.g. from readBin(f, "raw", n)).
# For each possible byte value, report the mean and standard deviation
# of the spacing between occurrences; a sync byte starting fixed-length
# records should show a low sd.
sink("~/bytes.txt")
for (i in 1:255) {
    J <- which(BUF == as.raw(i))
    cat(sprintf("%3d 0x%02x %6.1f %6.1f\n",
        i, i, mean(diff(J)), sd(diff(J))))
}
sink()
d <- read.table("~/bytes.txt", header=FALSE,
                col.names=c("i", "byte", "mean", "sd"))
plot(d$i, d$sd, type="s")

Results

[Plot: sd of the spacing between byte recurrences, versus byte value]

Snippets of the output file:

 48 0x30  197.0  305.6
 49 0x31  116.1  179.1
 50 0x32  101.7  157.4
 51 0x33  118.4  197.9
 52 0x34  147.7  227.6

159 0x9f  748.8  875.9
160 0xa0  794.8  923.0
161 0xa1  152.7  156.0
162 0xa2  869.8 1010.0
163 0xa3  822.2  925.6

174 0xae  989.7 1165.6
175 0xaf  959.9 1067.4
176 0xb0  158.2  161.6
177 0xb1 1078.7 1267.0
178 0xb2 1123.4 1273.8

My thoughts

The bytes with the biggest local anomaly in terms of sd(diff) are 0xa1 and 0xb0. Also, 0x32 is low. Note that argonaut ADV (not ADP ... we know nothing about ADP) velocity records have a trigger byte of 0xb0. This might be an indication that a trigger byte in this file is also 0xb0.

However, to identify data segments, we need to know the byte after the trigger, and then we need to know where the checksum is after that. So the test I'm reporting on here is mainly for interest, and proves little more than that the bytes are not white noise.

dankelley commented 4 years ago

Oh, and also

> J <- which(BUF==as.raw(0xb0))
> hist(as.integer(BUF[J+1]),255)
> summary(as.integer(BUF[J+1]))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0   161.0   161.0   156.8   161.0   255.0 
> median(diff(J))
[1] 161

[Histogram of the byte following each 0xb0 occurrence]

This is pretty good evidence that a key byte is 0xb0, and that the byte after is the length of the data record. That still does not tell us what's in all 161 bytes, though. I think that two will be the checksum code, and two will be the actual checksum. So, that means we have 155 or so bytes to be guessed at. Not promising!

PS. doing similar for 0xa1 gives a hair-comb sort of histogram, which is neat to look at but that doesn't mean much to me, so I'm not including it.
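If 0xb0 really is a sync byte with the record length right after it, the buffer could be walked record by record. A hedged sketch (BUF stands for the raw file bytes, as elsewhere in this thread; this is exploratory, not a parser):

```r
# Assuming 0xb0 is a sync byte and the following byte gives the record
# length, collect the starting offset of each record, stopping at the
# first position that does not look like one.
recordOffsets <- function(BUF, sync = as.raw(0xb0)) {
    off <- 1L
    starts <- integer(0)
    while (off + 1L <= length(BUF) && BUF[off] == sync) {
        len <- as.integer(BUF[off + 1L])
        if (len < 2L) break          # guard: a sane record spans >= 2 bytes
        starts <- c(starts, off)
        off <- off + len             # length byte counts the whole record
    }
    starts
}
```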

dankelley commented 4 years ago

Hm, the manual I got from sontek was for an argonaut current meter (like the manual @richardsc found). But the first comment in this issue called it a "multicell doppler current profiler", and if that's right, then I am back at square -1 (not yet at square 0, I reckon!).

I guess I ought to ask Sontek again for a manual. I had asked for information on "argonaut-xr" files. I wonder whether that's the right word?

I really don't think that this is what's described in the manual that Clark sent (or that I got from Sontek, which seems to be the same). I say that because e.g.

> buf<-readBin("~/Dropbox/oce-issue-1637/file.arg","raw",500000)
> is0xb1<-which(buf==0xb1)
> is0xb0<-which(buf==0xb0)
> length(buf) / length(is0xb1)
[1] 1152.074
> length(buf) / length(is0xb0)
[1] 161.4466

shows that we do not have what's expected for data chunks, in this 500K file fragment. (If we had such chunks, we would get numbers like 22 for the 0xb1 case and 38 for the 0xb0 case, since those are the number of bytes in the "short" and "long" data segments, coded by 0xb1 and 0xb0.) Also, if I look for cases where these magic bytes are followed by the expected values (i.e. 0xb0 followed by 0x26 or 0xb1 followed by 0x16) then I get only 0 or 1 hits (I didn't bother keeping track, because this file is clearly formatted differently than in the docs).
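The pair counting just described can be packaged as a one-off helper (a sketch; 'buf' is the raw fragment read with readBin as above, and the helper name is mine):

```r
# Count positions where byte 'a' is immediately followed by byte 'b',
# e.g. countPairs(buf, 0xb0, 0x26) for the 'LONG' record tag.
countPairs <- function(buf, a, b) {
    i <- which(buf == as.raw(a))
    i <- i[i < length(buf)]       # a match on the last byte has no successor
    sum(buf[i + 1L] == as.raw(b))
}
```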

dankelley commented 4 years ago

I've added code ("argonaut" branch commit f36061ff9bf6911a8fb3ccbc3ee953faf59f94f8) to read a fair fraction of the header info. You can see what it's decoding at "output", below. (If you run the command, you'll see that there is a lot of info printed after what I am showing here. All of that should be ignored; it's mostly testing code that I have not removed because it held at least a tiny bit of value.)

I don't think I'm out by one byte here and another byte there, because the stuff I'm printing looks correct in the cases where I can check, and plausible otherwise. (Reading random bytes gives crazy answers, which I'm not seeing.)

HOWEVER what I am not seeing in the docs is any indication of how the actual data are stored. Obviously, the file has a lot of stuff past the header. But what is it? The docs are simply not telling me.

I see in the .txt file that there are supposed to be 15568 profiles. Taking the file size minus the header length and dividing by that profile count,

> (3013196 - 418) / 15568
[1] 193.5238

and the .txt says we have 52 channels, so we have

 (3013196 - 418) / 15568 / 52
[1] 3.721611

which I guess could be 4 bytes per channel, if I'm doing something slightly wrong. So, does that make sense? Well, some of the channels (from the .txt file) seem to be dates etc but a lot are 3 velocity values and a backscatter value, and these are described as "R4" and "I", respectively. Assuming that the I is a 2-byte integer, that suggests that, roughly, the average channel size ought to be a little larger than

> (3*4+2)/4
[1] 3.5

and so I guess that makes some sense.

Things to tackle next

My biggest problems seem to be:

  1. I cannot find anything in the headers that gives record number.
  2. My times do not match those in the .txt file.
  3. I cannot find anything in the docs that tells me what these 52 "channels" are. Indeed, they may just be something created by the sontek post-measurement software.

Output

> d<-read.adp.sontek("~/Dropbox/oce-issue-1637/file.arg",debug=4)
read.adp.sontek(...,from=1,to=(missing),type=(missing),...) {
  file /Users/kelley/Dropbox/oce-issue-1637/file.arg has 3013196 bytes
  read.adp.sontek() recognized argonaut_adp type
  after checking within file (if 'type' not given), infer type='argonaut_adp'
exported bufEXPORT
  about to read 'Argonaut sensor configuration structure' (96 bytes)
  ConfigTime: 2017-05-08 08:35:41
  serialNumber: E5131  [expect E5131 for issue 1637]
  systemType bits:  00 01 00 00 01 00 00 00 
  lowNibble [1:4]: 0, 0, 0, 1
  frequency: 1.5  MHz
  highNibble [1:4]: 0, 0, 1, 0
  systemType: SL
  nbeams: 3
  beam.geometry: 1 ; 0 (2 beams); 1 (3 beams), 2 (4 beams with 1 vertical), 3 (4 beams, Janus)
  slant.angle: 25
  orientation: up
  skipping several things in 'Argonaut sensor configuration structure'
  about to see if we have 'Argonaut operation configuration structure' (64 bytes)
  found 2-byte preface to 'Argonaut operation configuration structure' (64 bytes)
  ConfigTime: 2017-05-08 08:35:41
  NpingsPerBeam: 1
  SampInterval: 4
  Lag: 1
  ProfilingMode: 1  (0=no, 1=yes) [expect 1 for issue 1637]
  Ncells: 10  [expect 11 for issue 1637]
  CellSize: 350  (cm) [expect 350 for issue 1637]
  about to check for a 'User setup parameters structure' (258 bytes)
  found 'User setup configuration structure' (258 bytes)
  ConfigTime: NA
  BlankDistance: 100  cm [expect ?? for issue 1637]
  CellSize: 200  cm [expect 350 for issue 1637]
  DeploymentName: MAS01
  BeginDepoymentTime: 2017-10-03 18:57:33
  CommentLine1: Deployment in Masset Channel
  CommentLine2: 54 00' 07.81'' N, 132 09' 15.74'' W
  CommentLine3: October 2017
  CellBegin: 2700  [expect ??? for issue 1637]
  CellEnd: 3200  [expect ??? for issue 1637]
  WaterDepth: 3200  [expect ??? for issue 1637]
  done with 'User setup configuration structure' (258 bytes); off= 418 
dankelley commented 4 years ago

The code now reads quite a lot of the headers. There are still some things I don't understand (see above, and see the output from running it on the test file in question with debug=3) but I thought the next step would be to look at what comes after the headers.

The code (temporarily) exports BUF and OFF. I won't get into a lot of details, since the code speaks for itself, and since these comments are basically a diary for me to keep track of things, rather than a full explanation of my exploratory procedures.

Doing this

n <- 1e6
# decode 2*n bytes after the header (at offset OFF) as little-endian
# two-byte integers
d <- readBin(BUF[OFF + seq(1, 2*n)], "integer", size=2, n=n, endian="little")
plot(d, cex=1/5, xlim=c(0, 1000))

gives the plot shown below. Clearly, the bytes are not random, since (a) they are not uniform in the range from 0 to 255 and (b) the probability density function varies in a semi-systematic way over time.

Note that if what we have is headers followed by just data, then things are tricky because velocity values are stored in 2 bytes and backscatter values are stored in 1 byte. That means that some of the patterning might be a consequence of looking at the mass of data as 2-byte values. But, again, that means we certainly have patterns here. The question is -- what patterns?

In the graph, there are patches starting at about 350, 599 and 860. The patches have length about 140 (inferred roughly from calling locator() and clicking on the graph), with quiet zones of length about 110 between. These numbers are not understood by me. I think there are 10 bins (from the header, although the text file says 11). But in either case, we have 3 beams and velocity is 2 bytes per datum, so I could understand a pattern with length 10*3*2=60 or maybe 11*3*2=66, but neither is 140 or 110, so I don't know whether I'm seeing something relating to my understanding, or whether I'm looking for patterns like a character in the book/film "Contact".

This is all a guessing game. The docs I have (from @richardsc and from Sontek) are not clear about what happens after the headers. One byte in one of the headers suggests that we have so-called "Long" data chunks, but that would mean we should have a lot of two-byte pairs that we clearly do not have.

The sontek docs I possess are 19 years old. My guess is that the format may have changed. Or, perhaps, this doc is simply irrelevant, applying to adv and not adp. Or ... I dunno. I'm basically out of ideas that seem to hold much promise.

[Plot: the decoded two-byte integers versus index, for the first 1000 values]

hhourston commented 4 years ago

Without full documentation of the .arg file structure, this may not be useful, but I also have a .mat file version of the .arg file I sent before. If the .mat file structure might be more easily decoded, then I am happy to share a copy and open a new issue if that would be appropriate.

dankelley commented 4 years ago

If you have .mat you could try constructing an adp object with as.adp(). My guess is that as.adp() will need a fair bit of new functionality to be useful, though. It may not even work. Generally, the as. functions in oce are very primitive because the real intention is to read raw data files, which we can expect will not be corrupted. With a .mat file, all bets are off; there's no way to know if the contents are right or wrong.

It seems to me that the present issue should be closed. If, some day, we get a manual, it can be reopened. But I use issues as a to-do list, and there is nothing to be done, for now at least.

In oce, we ask users to close issues, so I'll let you decide, @hhourston.

If you try as.adp() and find it lacking, please open an issue. But I would be very surprised if it works, because the adp object stores data in a compact form (at the byte level) to match manufacturers' native formats, and my guess is that the matlab file is using 8-byte numeric values instead of e.g. 1-byte "raw" values.

If all you're doing is drawing images, your best plan is

# read the .mat file here; say the velocity ends up in a 3-D array called 'v'
imagep(v[,,1])

to draw an image of the first beam. I have very little expectation that much progress will be made in this sort of thing, though. Someone in the team must have matlab code. You ought to just use that, since we have no way of getting the data into R in a verifiable way.

hhourston commented 4 years ago

I was unsuccessful using as.adp() to open the .mat file. Python has some support for .mat files so I will try that route. Thank you for all your efforts on this -- I will close the issue.

dankelley commented 4 years ago

Oh, the idea was to use read.mat to read the matlab file, then do whatever you want with it. But I don't think as.adp() is going to help after reading it, for the reasons explained above. I'd just read it, and then "roll my own" analysis to do whatever is wanted, whether that be a graph, a calculation, a hypothesis test, or whatever.

I think there may be several packages for reading matlab files in R. Some require a particular (oldish) version of the matlab file format.

dankelley commented 4 years ago

An example of reading matlab files:

d <- R.matlab::readMat("TempLoc80.mat") 
hhourston commented 4 years ago

Ah, I see. Thanks!