dankelley / ocencdf

Interface from oce to netcdf
https://dankelley.github.io/ocencdf/

metadata attribute ought to be in JSON (was YAML) format #13

Closed: dankelley closed this issue 1 year ago

dankelley commented 1 year ago

The advantage of this is that I would be very surprised if, e.g., Python, Julia, or any other analysis language could not translate YAML into something analogous to R's list object.

The following shows how this can work. Note that as.yaml() doesn't handle expression objects (which I discovered by trial and error).

I'll do this tomorrow.

library(oce)
library(yaml)
data(ctd)
metadata <- ctd@metadata
# as.yaml does not handle 'expression' objects, as in units
for (i in seq_along(metadata$units))
    metadata$units[[i]]$unit <- as.character(metadata$units[[i]]$unit)
yaml <- as.yaml(metadata)
cat(yaml)

Output:

units:
  scan:
    unit: []
    scale: ''
  timeS:
    unit: s
    scale: ''
  pressure:
    unit: dbar
    scale: ''
  depth:
    unit: m
    scale: ''
  temperature:
    unit: degree * C
    scale: IPTS-68
  salinity:
    unit: []
    scale: PSS-78
  flag:
    unit: []
    scale: ''
flags: []
pressureType: sea
deploymentType: unknown
waterDepth: .na
dataNamesOriginal:
  scan: scan
  timeS: timeS
  pressure: pr
  depth: depS
  temperature: t068
  salinity: sal00
  flag: flag
model: '25'
header:
- '* Sea-Bird SBE 25 Data File:'
- '* FileName = C:\SEASOFT3\BASIN\BED0302.HEX'
- '* Software Version 4.230a'
- '* Temperature SN = 1140'
- '* Conductivity SN = 832'
- '* System UpLoad Time = Oct 15 2003 11:38:38'
- '* Command Line = seasave '
- '** Ship:      Divcom3'
- '** Cruise:    Halifax Harbour'
- '** Station:   Stn 2'
- '** Latitude:  N44 41.056'
- '** Longitude: w63 38.633'
- '* Real-Time Sample Interval = 1.000 seconds'
- '# nquan = 7'
- '# nvalues = 773                     '
- '# units = metric'
- '# name 0 = scan: scan number'
- '# name 1 = timeS: time [s]'
- '# name 2 = pr: pressure [db]'
- '# name 3 = depS: depth, salt water [m]'
- '# name 4 = t068: temperature, IPTS-68 [deg C]'
- '# name 5 = sal00: salinity, PSS-78 [PSU]'
- '# name 6 = flag:  0.000e+00'
- '# span 0 = 1, 773                       '
- '# span 1 = 0.000, 772.000               '
- '# span 2 = -0.378, 163.899              '
- '# span 3 = -0.375, 162.504              '
- '# span 4 = 2.3237, 99.0000              '
- '# span 5 = 0.3276, 99.0000              '
- '# span 6 = 0.000e+00, 0.000e+00         '
- '# interval = seconds: 1                           '
- '# start_time = Oct 15 1903 11:38:38'
- '# bad_flag = -9.990e-29'
- '# sensor 0 = Frequency 0  temperature, 1140, 13 Mar 03'
- '# sensor 1 = Frequency 1  conductivity, 832, 13 Mar 03, cpcor = -9.5700e-08'
- '# sensor 2 = Pressure Voltage, 145033, 17 Mar 03, cpcor = -9.5700e-08'
- '# sensor 3 = Stored Volt  0  transmissometer'
- '# datcnv_date = Oct 15 2003 13:46:47, 4.230a'
- '# datcnv_in = BED0302.HEX BED0301.CON'
- '# datcnv_skipover = 0'
- '# file_type = ascii'
- '*END*'
type: SBE
hexfilename: c:\seasoft3\basin\bed0302.hex
serialNumber: ''
serialNumberTemperature: '1140'
serialNumberConductivity: '832'
systemUploadTime: 1.0662179e+09
ship: Divcom3
scientist: ''
institute: ''
address: ''
cruise: Halifax Harbour
station: Stn 2
date: 1.0662179e+09
startTime: 1.0662323e+09
recoveryTime: .na
latitude: 44.6842667
longitude: -63.6438833
sampleInterval: 1.0
sampleIntervalUnits: s
filename: /Users/kelley/git/oce/create_data/ctd/ctd.cnv
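
For my notes, the reverse direction is also easy. This is just a minimal sketch, assuming the `yaml` string created in the chunk above:

```r
# Sketch: read the YAML string from above back into an R list, then restore
# a unit (stored as a character string) to an expression object.
metadata2 <- yaml::yaml.load(yaml)
metadata2$units$temperature$unit <- parse(text = metadata2$units$temperature$unit)
metadata2$units$temperature$unit
# expression(degree * C)
```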
dankelley commented 1 year ago

Matrices will also have to be handled specially (on reconstitution), because YAML has no way to represent matrices (or, at least, yaml::as.yaml() has no way to do that, and my web searching suggests that YAML itself has no way, either).

The workaround is to construct the matrix at the reconstitution phase. This will be necessary only for certain files (e.g. adp files can have a rotation matrix), so the author of reconstitution code will need to be aware of it. But those users ought to have some skill (we are not talking about using Excel here), and I think it will be sufficient to (1) document the special-case items and (2) insert a class-specific explanation as a global attribute in the NetCDF file.

dankelley commented 1 year ago

Another possibility is to use JSON. I just did some checking and, like YAML, it cannot handle the expression type, so a tweak will be needed for units. However, it can handle matrices. Well, with an exception: it doesn't seem to handle matrices of raw values:

> library(oce)
> data(adp)
> madp <- adp@metadata
> madp$codes
     [,1] [,2]
[1,]   7f   7f
[2,]   00   00
[3,]   80   00
[4,]   00   01
[5,]   00   02
[6,]   00   03
[7,]   00   04
> toJSON(madp$codes)
Error in dim(m) <- dim(x) : 
  dims [product 14] do not match the length of object [1]
> toJSON(as.integer(madp$codes))
[127,0,128,0,0,0,0,127,0,0,1,2,3,4] 
> C <- madp$codes
> C <- as.integer(madp$codes)
> dim(C) <- dim(madp$codes)
> toJSON(C, pretty=TRUE)
[
  [127, 127],
  [0, 0],
  [128, 0],
  [0, 1],
  [0, 2],
  [0, 3],
  [0, 4]
] 
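
The reverse direction also works. A sketch, assuming the toJSON() above comes from jsonlite (its pretty= argument suggests it does):

```r
# Sketch of the reverse direction: fromJSON() rebuilds the integer matrix
# from the nested arrays, and as.raw() recovers the original storage mode
# if that is ever wanted. ('C' and 'madp' are from the session above.)
library(jsonlite)
C2 <- fromJSON(toJSON(C, pretty = TRUE))
codes2 <- as.raw(C2)
dim(codes2) <- dim(C2)
identical(codes2, madp$codes) # TRUE (assuming the original has no dimnames)
```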
dankelley commented 1 year ago

I am now leaning towards JSON, rather than YAML. Here's why:

  1. JSON is likely more easily understood by more programmers/users.
  2. YAML will require code to do special things based on knowledge of the data. For example, a converter will need to know that transformationMatrix is, indeed, a matrix. And its dimensions will have to be inferred from other knowledge, or from an additional item called, say, transformationMatrixDimension. But then the code would have to use that second item to reshape the first, and then remember not to include the second item in the results (see the sketch after this list).
  3. I don't see any real problem in converting that raw matrix to an integer one. It can always be converted back in user code. And oce doesn't use it, anyway (IIRC).
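
To make point 2 concrete, here is the kind of reconstruction a YAML-based reader would need. All of the names below are hypothetical (nothing in the package writes a transformationMatrixDimension item); the point is that the reader has to know about the pairing.

```r
# Hypothetical YAML-based reconstruction: rebuild the matrix from the flat
# vector and its companion dimension item, then drop the companion so it
# does not linger in the result.
m <- yaml::yaml.load(yamlText)  # 'yamlText' stands for YAML read from a file
m$transformationMatrix <- matrix(
    unlist(m$transformationMatrix),
    nrow = m$transformationMatrixDimension[1],
    ncol = m$transformationMatrixDimension[2]
)
m$transformationMatrixDimension <- NULL
```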
dankelley commented 1 year ago

I've written some test code (not pushed; it will be in a new branch called JSON), and it seems to work (see below for CTD, and the next comment for ADP).

Results for `data("ctd")` metadata.

```JSON
{ "units": { "scan": { "unit": [], "scale": [""] }, "timeS": { "unit": ["s"], "scale": [""] }, "pressure": { "unit": ["dbar"], "scale": [""] }, "depth": { "unit": ["m"], "scale": [""] }, "temperature": { "unit": ["degree * C"], "scale": ["IPTS-68"] }, "salinity": { "unit": [], "scale": ["PSS-78"] }, "flag": { "unit": [], "scale": [""] } }, "flags": [], "pressureType": ["sea"], "deploymentType": ["unknown"], "waterDepth": [null], "dataNamesOriginal": { "scan": ["scan"], "timeS": ["timeS"], "pressure": ["pr"], "depth": ["depS"], "temperature": ["t068"], "salinity": ["sal00"], "flag": ["flag"] }, "model": ["25"], "header": ["* Sea-Bird SBE 25 Data File:", "* FileName = C:\\SEASOFT3\\BASIN\\BED0302.HEX", "* Software Version 4.230a", "* Temperature SN = 1140", "* Conductivity SN = 832", "* System UpLoad Time = Oct 15 2003 11:38:38", "* Command Line = seasave ", "** Ship: Divcom3", "** Cruise: Halifax Harbour", "** Station: Stn 2", "** Latitude: N44 41.056", "** Longitude: w63 38.633", "* Real-Time Sample Interval = 1.000 seconds", "# nquan = 7", "# nvalues = 773 ", "# units = metric", "# name 0 = scan: scan number", "# name 1 = timeS: time [s]", "# name 2 = pr: pressure [db]", "# name 3 = depS: depth, salt water [m]", "# name 4 = t068: temperature, IPTS-68 [deg C]", "# name 5 = sal00: salinity, PSS-78 [PSU]", "# name 6 = flag: 0.000e+00", "# span 0 = 1, 773 ", "# span 1 = 0.000, 772.000 ", "# span 2 = -0.378, 163.899 ", "# span 3 = -0.375, 162.504 ", "# span 4 = 2.3237, 99.0000 ", "# span 5 = 0.3276, 99.0000 ", "# span 6 = 0.000e+00, 0.000e+00 ", "# interval = seconds: 1 ", "# start_time = Oct 15 1903 11:38:38", "# bad_flag = -9.990e-29", "# sensor 0 = Frequency 0 temperature, 1140, 13 Mar 03", "# sensor 1 = Frequency 1 conductivity, 832, 13 Mar 03, cpcor = -9.5700e-08", "# sensor 2 = Pressure Voltage, 145033, 17 Mar 03, cpcor = -9.5700e-08", "# sensor 3 = Stored Volt 0 transmissometer", "# datcnv_date = Oct 15 2003 13:46:47, 4.230a", "# datcnv_in = BED0302.HEX BED0301.CON", "# datcnv_skipover = 0", "# file_type = ascii", "*END*"], "type": ["SBE"], "hexfilename": ["c:\\seasoft3\\basin\\bed0302.hex"], "serialNumber": [""], "serialNumberTemperature": ["1140"], "serialNumberConductivity": ["832"], "systemUploadTime": ["2003-10-15 11:38:38"], "ship": ["Divcom3"], "scientist": [""], "institute": [""], "address": [""], "cruise": ["Halifax Harbour"], "station": ["Stn 2"], "date": ["2003-10-15 11:38:38"], "startTime": ["2003-10-15 15:38:38"], "recoveryTime": [null], "latitude": [44.6843], "longitude": [-63.6439], "sampleInterval": [1], "sampleIntervalUnits": ["s"], "filename": ["/Users/kelley/git/oce/create_data/ctd/ctd.cnv"] }
```
dankelley commented 1 year ago

Results for `data("adp")` metadata.

```JSON
{ "units": { "v": { "unit": ["m/s"], "scale": [""] }, "distance": { "unit": ["m"], "scale": [""] }, "pressure": { "unit": ["dbar"], "scale": [""] }, "salinity": { "unit": [], "scale": ["PSS-78"] }, "temperature": { "unit": ["degree * C"], "scale": ["ITS-90"] }, "soundSpeed": { "unit": ["m/s"], "scale": [""] }, "heading": { "unit": ["degree"], "scale": [""] }, "pitch": { "unit": ["degree"], "scale": [""] }, "roll": { "unit": ["degree"], "scale": [""] }, "headingStd": { "unit": ["degree"], "scale": [""] }, "pitchStd": { "unit": ["degree"], "scale": [""] }, "rollStd": { "unit": ["degree"], "scale": [""] }, "attitude": { "unit": ["degree"], "scale": [""] }, "depth": { "unit": ["m"], "scale": [""] } }, "flags": [], "oceCoordinate": ["enu"], "orientation": ["upward", "upward", "upward", "upward", "upward", "upward", "upward", "upward", "upward", "upward", "upward", "upward", "upward", "upward", "upward", "upward", "upward", "upward", "upward", "upward", "upward", "upward", "upward", "upward", "upward"], "instrumentType": ["adcp"], "instrumentSubtype": ["workhorse"], "firmwareVersionMajor": [16], "firmwareVersionMinor": [28], "firmwareVersion": ["16.28"], "bytesPerEnsemble": [1832], "systemConfiguration": ["11001011-01000001"], "frequency": [600], "beamAngle": [20], "beamPattern": ["convex"], "beamConfig": ["janus"], "numberOfDataTypes": [6], "dataOffset": [18, 77, 142, 816, 1154, 1492], "codes": [ [127, 127], [0, 0], [128, 0], [0, 1], [0, 2], [0, 3], [0, 4] ], "numberOfBeams": [4], "numberOfCells": [84], "pingsPerEnsemble": [20], "cellSize": [0.5], "transducerDepth": [0], "profilingMode": [1], "lowCorrThresh": [0], "numberOfCodeReps": [2], "percentGdMinimum": [0], "errorVelocityMaximum": [5000], "coordTransform": ["00000111"], "originalCoordinate": ["beam"], "tiltUsed": [true], "threeBeamUsed": [true], "binMappingUsed": [true], "headingAlignment": [0], "headingBias": [0], "sensorSource": ["01111111"], "sensorsAvailable": ["00111101"], "bin1Distance": [2.23], "xmitPulseLength": [1.35], "wpRefLayerAverage": [1281], "falseTargetThresh": [50], "transmitLagDistance": [86], "cpuBoardSerialNumber": [158, 0, 0, 3, 1, 160, 95, 9], "systemBandwidth": [0], "serialNumber": ["(redacted)"], "haveActualData": [true], "ensembleNumber": [5041, 5401, 5761, 6121, 6481, 6841, 7201, 7561, 7921, 8281, 8641, 9001, 9361, 9721, 10081, 10441, 10801, 11161, 11521, 11881, 12241, 12601, 12961, 13321, 13681], "manufacturer": ["teledyne rdi"], "filename": ["(redacted)"], "longitude": [-69.7343], "latitude": [47.8813], "ensembleInFile": [9243361, 9903601, 10563841, 11224081, 11884321, 12544561, 13204801, 13865041, 14525281, 15185521, 15845761, 16506001, 17166241, 17826481, 18486721, 19146961, 19807201, 20467441, 21127681, 21787921, 22448161, 23108401, 23768641, 24428881, 25089121], "velocityResolution": [0.001], "velocityMaximum": [32.768], "numberOfSamples": [25], "oceBeamUnspreaded": [false], "depthMean": [38.792], "transformationMatrix": [ [1.4619, -1.4619, 0, 0], [0, 0, -1.4619, 1.4619], [0.266, 0.266, 0.266, 0.266], [1.0337, 1.0337, -1.0337, -1.0337] ], "headSerialNumber": ["(redacted)"], "deploymentName": ["(redacted)"], "comments": ["sample ADP file"] }
```
dankelley commented 1 year ago

Since I'm using this issue to take notes, the following shows how to reconstitute expressions.

> ctd[["temperatureUnit"]]$unit
expression(degree * C)
> as.character(ctd[["temperatureUnit"]]$unit)
[1] "degree * C"
> parse(text=as.character(ctd[["temperatureUnit"]]$unit))
expression(degree * C)
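
A tiny helper (hypothetical, not an ocencdf function) covers the empty units seen earlier (e.g. scan and salinity, whose unit is []):

```r
# Hypothetical helper: rebuild a unit expression from its character form,
# returning an empty expression when no unit was recorded.
reconstituteUnit <- function(s) {
    if (length(s) == 1 && nzchar(s)) parse(text = s) else expression()
}
reconstituteUnit("degree * C")   # expression(degree * C)
reconstituteUnit(character(0))   # expression()
```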
dankelley commented 1 year ago

To the JSON branch, I've added new low-level functions, including tests in the test suite for the ctd and adp built-in datasets. I will try some other datasets, to find remaining special cases. So far, the special cases involve POSIX times (which in JSON become simple character strings), unit expressions, and the codes matrix from read.adp.rdi().
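
For the POSIX times, reconstitution is a one-liner; a sketch (the "UTC" timezone here is an assumption):

```r
# Sketch of the POSIX-time special case: times arrive from JSON as plain
# character strings and can be converted back with as.POSIXct().
as.POSIXct("2003-10-15 11:38:38", tz = "UTC")
```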

commit 6f88923b12ee143f30adc4ea16581100f4388d39
Author: dankelley <kelley.dan@gmail.com>
Date:   Sun Jun 18 10:21:43 2023 -0300

add json2metadata() and metadata2json()

These will be used by oce2ncdf() and ncdf2oce(), respectively.
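
A minimal usage sketch of the two functions named in that commit (the argument details are my assumption; see the package documentation for the actual interface):

```r
# Sketch (argument names are assumptions): convert a metadata list to a
# JSON string and back again.
library(oce)
library(ocencdf)
data(ctd)
json <- metadata2json(ctd@metadata)
metadata <- json2metadata(json)
```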
dankelley commented 1 year ago

This work has been completed in 'main' commit b7d7dd65cd3fa81d8f498e715a628d3dfae71a3b:

  1. The test suite has both low- and high-level checks.
  2. ?ocencdf explains some minor post hoc conversions that are required for full recovery of the metadata. (These are done in the package, but are unlikely to be wanted by code written in other languages.)
  3. There is now a global attribute in the NetCDF file that explains those post hoc conversions.
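
As a sketch of how a reader could inspect that global attribute from R (the file and attribute names below are placeholders, not the ones the package actually writes):

```r
# Sketch: read a global attribute from a NetCDF file with the ncdf4 package.
library(ncdf4)
nc <- nc_open("ctd.nc")                          # placeholder file name
ncatt_get(nc, 0, "metadata_explanation")$value   # placeholder attribute name
nc_close(nc)
```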