Open sspagnol opened 5 years ago
Ok. I thought genvarname would work that same on 64-bit Windows and Linux Matlab, it doesn't. On Windows you would get RBRduet0xB3, and on Linux I'm getting RBRduet0xFFFD. So I don't know what to do in this case.
header.model = genvarname(char(unicode2native(tkns{1}{1}, 'US-ASCII')));
gives the same result of 'RBRduet0x1A' so then just search-replace '0x1A' with '3'? But then if RBR release a '4' version would have to add another search-replace.
@sspagnol , good refactoring there.
Ok. I thought genvarname would work that same on 64-bit Windows and Linux Matlab, it doesn't. On Windows you would get RBRduet0xB3, and on Linux I'm getting RBRduet0xFFFD. So I don't know what to do in this case. header.model = genvarname(char(unicode2native(tkns{1}{1}, 'US-ASCII')));
This is bad - now I understand why there are so many comparisons with escaped sequenced strings.
IMO, this is a noisy conversion to get a string right. I look into the genvarname help and it's supposedly marked for removal in future releases. did you tried the matlab.lang.makeUniqueStrings or matlab.lang.makeValidName? they return the proper names without escaping sequences in my limited trials.
Nonetheless, let's stop depending on this function and matches with escape sequences...it's bad and as above, can cause architecture-dependent bugs.
I need 3 things to merge this:
would change line to
modelstring = matlab.lang.makeValidName(tkns{1}{1}, 'ReplacementStyle','hex');
% known model string unicode replacements
modelstring = regexprep(modelstring, '0xB3$', '3'); % subscripted 3
header.model = modelstring;
The matlab.lang.makeValidName construct with hex replacement gave me identical results on R2019a WIN64 and R2018a GLNXA64. Whether you code the string replacements as just a series of regexprep or as an action if a certain substring is found is up to you. Assumes RBR doesn't go crazy and uses unicode characters above 0xFFFF.
Attached short example file for RBRduet3 082533_20190525_0515_eng_rbrduet3.zip and RBRsolo 076019_20160617_0239_rbrsolo_example.zip
I just loaded up the file and looked at the metadata
filename = '082533_20190525_0515_eng_rbrduet3.txt'
sample_data = XRParse({filename}, 'timeSeries')
sample_data.meta.instrument_model
ans =
'RBRduet³'
whereas I wanted RBRduet3
, I also think the subscripted 3 also wasnt' being shown properly in toolbox (dropdowns/plot legends) and I think also not passed properly into the exported netcdf file.
Yes includes new pressure variable name. As instrument software is written by RBR I've know idea if there are 'new fields on the roadmap'.
The matlab.lang.makeValidName does have a downside. A model string of 'TDR-2050' would be converted to 'TDR0x2D2050' (because you can't have '-' in a Matlab variable name).
So one alternative is to just code for the exceptions like
modelstring = matlab.lang.makeValidName(tkns{1}{1}, 'ReplacementStyle','hex');
% known model string unicode replacements
modelstring = regexprep(modelstring, '0xB3$', '3'); % subscripted 3
modelstring = regexprep(modelstring, '^TR0x2D', 'TR-'); % 'TR-'
modelstring = regexprep(modelstring, '^TDR0x2D', 'TDR-'); % 'TDR-'
header.model = modelstring;
Another would be to only convert non-asci characters to their hex equivalents, then do clean up.
Another annoying update : I am still getting inconsistent results between Windows + Linux + version versions of Matlab. Think that the fgetl
call to read in the text data is the cause. From the manual
fgetl reads characters using the encoding scheme associated with the file. To specify the encoding scheme, use fopen.
So (on my installation) Windows it is opening as 'windows-1252' encoding, and 'UTF-8' under linux .
So choices could be either use the unicode2native
call like and convert to say 'US-ASCII'
modelstring = matlab.lang.makeValidName(char(unicode2native(tkns{1}{1}, 'US-ASCII')), 'ReplacementStyle','hex');
% known model string unicode replacements
modelstring = regexprep(modelstring, '0x1A$', '3'); % subscripted 3
modelstring = regexprep(modelstring, '^TR0x2D', 'TR-'); % 'TR-'
modelstring = regexprep(modelstring, '^TDR0x2D', 'TDR-'); % 'TDR-'
header.model = modelstring;
or alternatively use fopen
with encoding set to 'US-ASCII' (or 'UTF-8') and modify regexprep accordingly.
Which do you want to do?
Surly matlab.lang.makeValidName is not really what you want here,
if you must what about,
modelstring = regexprep(modelstring, '0x2D', '-'); % convert 0x2D to its character
what about
modelstring = char(unicode2native(tkns{1}{1}, 'US-ASCII')) modelsring(modelstring < ' ' && modelstring > 'f') = ''
have not checked if that works
@sspagnol,
I think I found the problem here.Looks like matlab fopen do not detect the correct encoding in auto-mode:
>> fid=fopen(file);
>> [fname,fperm,fformat,fenc] = fopen(fid);
>> fenc
fenc =
'UTF-8'
but the file is not UTF-8, otherwise the string below would be correctly printed:
>> firstline=textscan(fid,'%s\n');
>> fstr=firstline{1}{1}
fstr =
'Model=RBRduet�'
>>
Apparently, we shouldn't rely on Matlab automatic encoding detection. I was surprised not to have an error when reading in UTF-8 mode.
I had to loop through the encodings to see what was the correct one. it's some ISO-8859 variant or windows-12** variant.
Model=RBRduet� with Big5
Model=RBRduet³ with ISO-8859-1
Model=RBRduetณ with windows-874
Model=RBRduet� with Big5-HKSCS
Model=RBRduetł with ISO-8859-2
Model=RBRduet� with windows-949
Model=RBRduet with CP949
Model=RBRduet³ with ISO-8859-3
Model=RBRduetł with windows-1250
Model=RBRduet with EUC-KR
Model=RBRduetŗ with ISO-8859-4
Model=RBRduetі with windows-1251
Model=RBRduet with EUC-JP
Model=RBRduetГ with ISO-8859-5
Model=RBRduet³ with windows-1252
Model=RBRduet with EUC-TW
Model=RBRduet� with ISO-8859-6
Model=RBRduet³ with windows-1253
Model=RBRduet� with GB18030
Model=RBRduet³ with ISO-8859-7
Model=RBRduet³ with windows-1254
Model=RBRduet with GB2312
Model=RBRduet³ with ISO-8859-8
Model=RBRduet³ with windows-1255
Model=RBRduet� with GBK
Model=RBRduet³ with ISO-8859-9
Model=RBRduet³ with windows-1256
Model=RBRduet│ with IBM866
Model=RBRduetณ with ISO-8859-11
Model=RBRduet³ with windows-1257
Model=RBRduetЁ with KOI8-R
Model=RBRduet³ with ISO-8859-13
Model=RBRduet³ with windows-1258
Model=RBRduetЁ with KOI8-U
Model=RBRduet³ with ISO-8859-15
Model=RBRduet� with US-ASCII
Model=RBRduet≥ with Macintosh
Model=RBRduet� with UTF-8
Model=RBRduetウ with Shift_JIS
Hence, your problem of not showing the string correctly in plots is because the encoding used is wrong. For example, If I use the correct encoding (say ISO), I can get the string right in plots:
Nonetheless, I don't see a reason to use genvarname (and variants) to transform the string to a valid variable name if this will never be used as a varname.
The correct string is also written correctly to netCDF:
>> nccreate('nctest.nc','test');
>> ncwriteatt('nctest.nc','test','instrument',fstr)
>> !ncdump -h nctest.nc
netcdf nctest {
variables:
double test ;
test:instrument = "Model=RBRduet³" ;
}
And just an FYI the newer versions of Ruskin (have only tested v2.11.1, not sure when in the v2.x series this occured) no longer has legacy Rtext engineering export option. It now exports a zipped of set of text files consisting of (file == serial number/datetime identifier)
file_metadata.txt: yaml file describing the data.
file_data.txt: csv file of data, one header line, format described in metadata.txt
file_events.txt: log of some event(s), not exactly sure what yet.
Have asked for descriptions of new format. Have attached an example. 203097_20201003_0127.zip
So maybe need to look at #444 (utilizing RBR RSKtools for MATLAB within imos-toolbox) again.
IMO, If RBR is using sqlite file format, than we may also read straight from it instead of relying on another text exported format (json+csv apparently). Unless, of course, these exported files are the only way to touch any processed datastream created by their software.
It will also be likely faster than anything else, with rich metadata, and probably without (hopefully) many breaking changes, since any breaks will follow up on their software stack as well.
Let's discuss details over #444, maybe we need to arrange some discussion with rbr users and see what is most portable approach here.