I worked on this some. The only issue is that currently our slocum reader leaves NaN wherever a sensor has no reading at a given sci_m_present_time. The version I wrote aligns everything so there is data at each value of sci_m_present_time, using linear interpolation. That is probably OK, but it loses some information about what data was measured versus what was interpolated. I'm not sure that we used such information in our processing, but maybe @hvdosser has an opinion based on what she has been working on recently.
> Currently pyglider has its own python-only dbd-numpy converter for slocums vendored from https://gitlab.oceantrack.org/ceotr-public/dinkum
> https://github.com/smerckel/dbdreader reads dinkum binary files (slocum)
>
> Pros:
> * fast; I didn't test, but I bet 10x faster
> * maintained by a maybe somewhat larger community
>
> Cons:
> * c-code: the code is pretty generic so should be easy to compile
The design idea behind dbdreader was to get a small list of parameters of interest out of the original binary Slocum glider data files as quickly as possible. Of the 2000+ parameters that are present in the binary data, usually only a handful are of interest. NaNs are not of interest either, so rather than reading all parameters into a regular grid full of NaNs, then selecting the parameters of interest, and then filtering out the NaNs, dbdreader just returns the values of the parameters asked for. For reasons of speed, the "hard work" is done in C, and at the time (14 years ago or so) it was a good excuse to figure out how to write a C-extension in the first place. Because of the assumption that only a few parameters are requested, the code takes a few short-cuts: the bytes for parameters that are not going to be used are simply skipped over, rather than being read, converted, and then dropped. The downside is that if all parameters are requested, dbdreader is probably less efficient, as the overhead to process each variable increases with the number of parameters requested, and the gain of the algorithm decreases. Then, if all parameters are required in a matrix with a single time vector, then all the NaNs are inserted again. This feature has been added later on request as it turns out that some people actually want the data like this.
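For illustration, a minimal sketch of that access pattern (the file name and parameter choices are made up; this is not taken from the dbdreader documentation):

```python
import dbdreader

# Hypothetical file name -- any Slocum flight file would do here.
dbd = dbdreader.DBD("01600000.dbd")

# Only the requested parameters are decoded; records where a parameter was
# not updated are dropped, so each comes back as a dense (time, value) pair
# on its own time base rather than a NaN-padded grid.
t, depth = dbd.get("m_depth")
(t_pitch, pitch), (t_roll, roll) = dbd.get("m_pitch", "m_roll")
```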
The "community that maintains dbdreader" is actually small, just me. However, I have received a number of big fixes over the years from other users. In our group it is used a lot, and I know it is integrated in the data processing chain of BODC (UK) and NOC (UK).
> * not sure if they have a conda recipe
> * not sure about ease on Windows? Users will need a compiler to install pyglider. Perhaps a pain for potential SeaExplorer-only users?
I don't use conda. If a conda recipe is essential, then that can be done. (Pull request :-)
Windows is kind of an issue. Windows users need to get a compiler, and in fact the same compiler as the one used to build Python itself. I don't have access to a Windows computer, but I borrowed one once from IT for this purpose, and the process was not so complicated. Again, if this is required, it is little effort to create binary wheels for pip, for example, which can be installed on Windows without the need for a compiler. I just need to get a Windows computer.
> A quick look indicates this would be pretty easy to implement. We could probably even have both backends, with the faster c backend as optional, if the C code turns out to be problematic.
When dbdreader was implemented, the binary reading was done in python first, as it was much easier (at least for me) to figure out how to decode these files that way than in C. I think this legacy code is still in there, probably outdated in some places, although most of the changes over time affected the python (interface) side only. The exception is the reading of the G3 data files, which have a different byte endianness. So it would take little effort to reinstate a pure python reader, in case compiling is not an option. But then again, if all parameters are wanted, the dinkum library may be more convenient.
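To illustrate the endianness point with a generic snippet (this is not the actual dbd record layout, just the mechanics a pure-python reader has to get right):

```python
import struct

raw = b"\x41\x20\x00\x00"                # the same four bytes on disk

as_big = struct.unpack(">f", raw)[0]     # read big-endian    -> 10.0
as_little = struct.unpack("<f", raw)[0]  # read little-endian -> a tiny denormal
print(as_big, as_little)
```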
I am happy to make changes to dbdreader if that would help your case.
Lucas
Thanks @smerckel...
> Then, if all parameters are required in a matrix with a single time vector, then all the NaNs are inserted again. This feature has been added later on request as it turns out that some people actually want the data like this.
Oh, is this really available? I didn't find that option (or are you referring to syncCTC?). I'm not sure that's what we really want to do, but it would be good to compare.
Now our options are:

1. put everything on sci_m_present_time and leave NaN where each science sensor doesn't sample, and linearly interpolate the engineering sensors. This is what we currently do, but would require some work to make work with dbdreader.
2. linearly interpolate everything onto sci_m_present_time - this over-interpolation is what I am suggesting. This way the CTD has no interpolation, but other critical sensors (notably O2) may get some interpolation.

Since both the CTD and O2 end up getting shifted by linear interpolations in later steps anyways, it seems to me that over-interpolating onto sci_m_present_time is fine. I think each sensor probably should get a "native dt" value stored in its metadata, and maybe the offset for the first sample, so the original data can be recovered if someone wants.
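To make option 2 concrete, a minimal sketch (the dbdreader calls and parameter names here are just for illustration, not the eventual pyglider code):

```python
import numpy as np
import dbdreader

# Hypothetical pattern pairing up the science/flight files of a deployment.
dbd = dbdreader.MultiDBD(pattern="binary/*.[de]bd")

# Each parameter comes back on its own subset of sci_m_present_time.
(t_ctd, temp), (t_oxy, oxy) = dbd.get("sci_water_temp", "sci_oxy4_oxygen")

# Option 2: the CTD times serve as the common grid; sparser sensors such as
# O2 get linearly (over-)interpolated onto it, and we stash the native dt
# so the original sampling could be recovered later.
oxy_interp = np.interp(t_ctd, t_oxy, oxy)
oxy_native_dt = float(np.median(np.diff(t_oxy)))
```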
Adopting a faster c library as the preferred backend decoder sounds like a good move to me. It would be fairly simple to check if dbdreader is present/functional on the user's computer and default to the current python-only version if not. In my experience, creating a conda recipe from a library with existing PyPI wheels is simple.
If the above issue with NaNs is resolved, I think this would make a good addition to pyglider.
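For what it's worth, a sketch of that backend check (the function name and fallback here are hypothetical, not pyglider's actual API):

```python
# Prefer the compiled dbdreader backend when it imports cleanly; otherwise
# fall back to the vendored pure-python dinkum converter.
try:
    import dbdreader
    HAVE_DBDREADER = True
except ImportError:
    HAVE_DBDREADER = False


def open_slocum_binaries(pattern):
    """Return a reader for the matching binary files, whichever backend is present."""
    if HAVE_DBDREADER:
        # thin wrapper around the C-backed reader
        return dbdreader.MultiDBD(pattern=pattern)
    # placeholder for the existing pure-python code path
    raise NotImplementedError("fall back to pyglider's vendored dinkum converter here")
```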
> > Then, if all parameters are required in a matrix with a single time vector, then all the NaNs are inserted again. This feature has been added later on request as it turns out that some people actually want the data like this.
>
> Oh, is this really available? I didn't find that option (or are you referring to syncCTC?). I'm not sure that's what we really want to do, but it would be good to compare.
It is not pretty, but you could do something like this to get a list of all variables:
```python
import dbdreader

dbd = dbdreader.DBD(...)
x = [v for _, v in dbd.get(*dbd.parameterNames, return_nans=True)]
```
or, in a dictionary:
```python
x = dict([(k, v) for k, (_, v) in zip(dbd.parameterNames, dbd.get(*dbd.parameterNames, return_nans=True))])
```
Of course you could have a method get_all() that does that, but I don't see the point of such a thing. I never had the need for that.
> Now our options are 1. put everything on sci_m_present_time and leave NaN where each science sensor doesn't sample, and linearly interpolate the engineering sensors. This is what we currently do, but would require some work to make work with dbdreader.
This would not be too hard though. It would presumably make sense only if you use MultiDBD. Then, if you have both dbd and ebd files, the parameters to interpolate are listed in parameterNames["eng"].
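Roughly along these lines, perhaps (a sketch under the assumptions above; the file pattern is made up, and the loop over every engineering parameter is just to show the mechanism):

```python
import numpy as np
import dbdreader

dbd = dbdreader.MultiDBD(pattern="binary/*.[de]bd")   # hypothetical pattern

# Use the CTD record times as the target science time base.
t_sci, temp = dbd.get("sci_water_temp")

# Engineering parameters (from the dbd files) are listed separately;
# linearly interpolate each one onto the science time base.
eng = {}
for name in dbd.parameterNames["eng"]:
    t, v = dbd.get(name)
    if len(t):                      # skip parameters that never reported
        eng[name] = np.interp(t_sci, t, v)
```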
Ah, I missed the return_nans option! That indeed makes it easy to replicate exactly what we currently have, if that is what we really want. Thanks.
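For comparison, a small sketch (assuming pandas, and a made-up file name; not pyglider code) of turning the return_nans=True output into the single NaN-padded table we currently produce:

```python
import pandas as pd
import dbdreader

dbd = dbdreader.DBD("01600000.ebd")          # hypothetical file name

params = dbd.parameterNames
data = dbd.get(*params, return_nans=True)    # one (time, values) pair per parameter

# With return_nans=True every parameter is reported on the same time vector
# (the "matrix with a single time vector" described earlier), so the pairs
# stack directly into one NaN-padded table.
t = data[0][0]
df = pd.DataFrame({name: values for name, (_, values) in zip(params, data)},
                  index=pd.Index(t, name="time"))
```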
@callumrollo and @hvdosser I'll add this as a new method to slocum.py that can be used instead of binary_to_rawnc, merge_rawnc, and raw_to_timeseries.
As an update, running a 4-month mission of our delayed-mode data took about 1100 s with the dbdreader backend; it took roughly 10x longer with pyglider's all-python backend (18 minutes vs 3 h). I feel dbdreader is also more robust than our decode-to-netcdf, paste-the-netcdfs-together, then make-a-timeseries approach. The pasting step has been the source of considerable headaches when the files don't quite match up, and those headaches have been pretty hard to debug.
@smerckel if you can handle a few more PRs on dbdreader, I had a couple of changes that helped with our processing.
On Tue, 2022-07-05 at 23:03 -0700, Jody Klymak wrote:
> As an update, [...] up, and those headaches have been pretty hard to debug. @smerckel if you can handle a few more PRs on dbdreader, I had a couple of changes that helped with our processing.
Sure, if you will find a non-immediate response acceptable. I may need some 10 days to find the time to have a look at it. Perhaps I can manage it before then.
I don't think there is any rush on any suggested changes - for internal work we can use our own branches until a decision is made....
Hi, just for the sake of completeness: the suggested features have been merged into dbdreader, and an updated version (0.4.10) has been pushed to PyPI.
Closed by #109