aodn / imos-toolbox

Graphical tool for QC'ing and NetCDF'ing oceanographic datasets
GNU General Public License v3.0
46 stars 31 forks source link

RBR engineering file import cleanup. #556

Open sspagnol opened 5 years ago

sspagnol commented 5 years ago
sspagnol commented 5 years ago

Ok. I thought genvarname would work that same on 64-bit Windows and Linux Matlab, it doesn't. On Windows you would get RBRduet0xB3, and on Linux I'm getting RBRduet0xFFFD. So I don't know what to do in this case.

sspagnol commented 5 years ago
header.model = genvarname(char(unicode2native(tkns{1}{1}, 'US-ASCII')));

gives the same result of 'RBRduet0x1A' so then just search-replace '0x1A' with '3'? But then if RBR release a '4' version would have to add another search-replace.

ocehugo commented 5 years ago

@sspagnol , good refactoring there.

Ok. I thought genvarname would work that same on 64-bit Windows and Linux Matlab, it doesn't. On Windows you would get RBRduet0xB3, and on Linux I'm getting RBRduet0xFFFD. So I don't know what to do in this case. header.model = genvarname(char(unicode2native(tkns{1}{1}, 'US-ASCII')));

This is bad - now I understand why there are so many comparisons with escaped sequenced strings.

IMO, this is a noisy conversion to get a string right. I look into the genvarname help and it's supposedly marked for removal in future releases. did you tried the matlab.lang.makeUniqueStrings or matlab.lang.makeValidName? they return the proper names without escaping sequences in my limited trials.

Nonetheless, let's stop depending on this function and matches with escape sequences...it's bad and as above, can cause architecture-dependent bugs.

I need 3 things to merge this:

  1. Is this ready to merge?
  2. A test file for checking previous code and this one (better if you include an old file and the new one that requires this code).
  3. Please provide an expected structure after reading if possible - the struct is usually very dense and hard to compare item wise.
  4. data-wise the only change is the ability to include pres26 or there are new fields on the roadmap?
sspagnol commented 5 years ago
  1. would change line to

    modelstring = matlab.lang.makeValidName(tkns{1}{1}, 'ReplacementStyle','hex');
    % known model string unicode replacements
    modelstring = regexprep(modelstring, '0xB3$', '3'); % subscripted 3
    header.model = modelstring;

    The matlab.lang.makeValidName construct with hex replacement gave me identical results on R2019a WIN64 and R2018a GLNXA64. Whether you code the string replacements as just a series of regexprep or as an action if a certain substring is found is up to you. Assumes RBR doesn't go crazy and uses unicode characters above 0xFFFF.

  2. Attached short example file for RBRduet3 082533_20190525_0515_eng_rbrduet3.zip and RBRsolo 076019_20160617_0239_rbrsolo_example.zip

  3. I just loaded up the file and looked at the metadata

    filename =  '082533_20190525_0515_eng_rbrduet3.txt'
    sample_data = XRParse({filename}, 'timeSeries')
    sample_data.meta.instrument_model
    ans =
    'RBRduet³'

    whereas I wanted RBRduet3, I also think the subscripted 3 also wasnt' being shown properly in toolbox (dropdowns/plot legends) and I think also not passed properly into the exported netcdf file.

  4. Yes includes new pressure variable name. As instrument software is written by RBR I've know idea if there are 'new fields on the roadmap'.

sspagnol commented 5 years ago

The matlab.lang.makeValidName does have a downside. A model string of 'TDR-2050' would be converted to 'TDR0x2D2050' (because you can't have '-' in a Matlab variable name).

So one alternative is to just code for the exceptions like

modelstring = matlab.lang.makeValidName(tkns{1}{1}, 'ReplacementStyle','hex');
% known model string unicode replacements
modelstring = regexprep(modelstring, '0xB3$', '3'); % subscripted 3
modelstring = regexprep(modelstring, '^TR0x2D', 'TR-'); % 'TR-'
modelstring = regexprep(modelstring, '^TDR0x2D', 'TDR-'); % 'TDR-'
header.model = modelstring;

Another would be to only convert non-asci characters to their hex equivalents, then do clean up.

sspagnol commented 5 years ago

Another annoying update : I am still getting inconsistent results between Windows + Linux + version versions of Matlab. Think that the fgetl call to read in the text data is the cause. From the manual

fgetl reads characters using the encoding scheme associated with the file. To specify the encoding scheme, use fopen.

So (on my installation) Windows it is opening as 'windows-1252' encoding, and 'UTF-8' under linux .

So choices could be either use the unicode2native call like and convert to say 'US-ASCII'

modelstring = matlab.lang.makeValidName(char(unicode2native(tkns{1}{1}, 'US-ASCII')), 'ReplacementStyle','hex');
% known model string unicode replacements
modelstring = regexprep(modelstring, '0x1A$', '3'); % subscripted 3
modelstring = regexprep(modelstring, '^TR0x2D', 'TR-'); % 'TR-'
modelstring = regexprep(modelstring, '^TDR0x2D', 'TDR-'); % 'TDR-'
header.model = modelstring;

or alternatively use fopen with encoding set to 'US-ASCII' (or 'UTF-8') and modify regexprep accordingly.

Which do you want to do?

petejan commented 5 years ago

Surly matlab.lang.makeValidName is not really what you want here,

if you must what about,

modelstring = regexprep(modelstring, '0x2D', '-'); % convert 0x2D to its character

what about

modelstring = char(unicode2native(tkns{1}{1}, 'US-ASCII')) modelsring(modelstring < ' ' && modelstring > 'f') = ''

have not checked if that works

ocehugo commented 5 years ago

@sspagnol,

I think I found the problem here.Looks like matlab fopen do not detect the correct encoding in auto-mode:

>> fid=fopen(file);
>> [fname,fperm,fformat,fenc] = fopen(fid);
>> fenc

fenc =

    'UTF-8'

but the file is not UTF-8, otherwise the string below would be correctly printed:

>> firstline=textscan(fid,'%s\n');
>> fstr=firstline{1}{1}

fstr =

    'Model=RBRduet�'

>>

Apparently, we shouldn't rely on Matlab automatic encoding detection. I was surprised not to have an error when reading in UTF-8 mode.

I had to loop through the encodings to see what was the correct one. it's some ISO-8859 variant or windows-12** variant.

Model=RBRduet� with Big5
Model=RBRduet³ with ISO-8859-1
Model=RBRduetณ with windows-874
Model=RBRduet� with Big5-HKSCS
Model=RBRduetł with ISO-8859-2
Model=RBRduet� with windows-949
Model=RBRduet with CP949
Model=RBRduet³ with ISO-8859-3
Model=RBRduetł with windows-1250
Model=RBRduet with EUC-KR
Model=RBRduetŗ with ISO-8859-4
Model=RBRduetі with windows-1251
Model=RBRduet with EUC-JP
Model=RBRduetГ with ISO-8859-5
Model=RBRduet³ with windows-1252
Model=RBRduet with EUC-TW
Model=RBRduet� with ISO-8859-6
Model=RBRduet³ with windows-1253
Model=RBRduet� with GB18030
Model=RBRduet³ with ISO-8859-7
Model=RBRduet³ with windows-1254
Model=RBRduet with GB2312
Model=RBRduet³ with ISO-8859-8
Model=RBRduet³ with windows-1255
Model=RBRduet� with GBK
Model=RBRduet³ with ISO-8859-9
Model=RBRduet³ with windows-1256
Model=RBRduet│ with IBM866
Model=RBRduetณ with ISO-8859-11
Model=RBRduet³ with windows-1257
Model=RBRduetЁ with KOI8-R
Model=RBRduet³ with ISO-8859-13
Model=RBRduet³ with windows-1258
Model=RBRduetЁ with KOI8-U
Model=RBRduet³ with ISO-8859-15
Model=RBRduet� with US-ASCII
Model=RBRduet≥ with Macintosh
Model=RBRduet� with UTF-8
Model=RBRduetウ with Shift_JIS

Hence, your problem of not showing the string correctly in plots is because the encoding used is wrong. For example, If I use the correct encoding (say ISO), I can get the string right in plots:

test

Nonetheless, I don't see a reason to use genvarname (and variants) to transform the string to a valid variable name if this will never be used as a varname.

The correct string is also written correctly to netCDF:

>> nccreate('nctest.nc','test');
>> ncwriteatt('nctest.nc','test','instrument',fstr)
>> !ncdump -h nctest.nc
netcdf nctest {
variables:
    double test ;
        test:instrument = "Model=RBRduet³" ;
}
sspagnol commented 4 years ago

And just an FYI the newer versions of Ruskin (have only tested v2.11.1, not sure when in the v2.x series this occured) no longer has legacy Rtext engineering export option. It now exports a zipped of set of text files consisting of (file == serial number/datetime identifier)

file_metadata.txt: yaml file describing the data.

file_data.txt: csv file of data, one header line, format described in metadata.txt

file_events.txt: log of some event(s), not exactly sure what yet.

Have asked for descriptions of new format. Have attached an example. 203097_20201003_0127.zip

So maybe need to look at #444 (utilizing RBR RSKtools for MATLAB within imos-toolbox) again.

ocehugo commented 4 years ago

IMO, If RBR is using sqlite file format, than we may also read straight from it instead of relying on another text exported format (json+csv apparently). Unless, of course, these exported files are the only way to touch any processed datastream created by their software.

It will also be likely faster than anything else, with rich metadata, and probably without (hopefully) many breaking changes, since any breaks will follow up on their software stack as well.

Let's discuss details over #444, maybe we need to arrange some discussion with rbr users and see what is most portable approach here.