Interpolation and time base choices when processing slocum data

smwoodman commented 4 months ago

Hello - thanks for developing and sharing this awesome toolbox.

My group and I (at the NOAA/SWFSC) are looking to switch over to pyglider for our slocum glider data processing. I've been playing with outputs (from slocum.binary_to_timeseries) relative to our past outputs (from SOCIB Matlab toolbox), and I have a question about how to best use pyglider with our data. Also, off the top, happy to move this to a Discussion if that's more appropriate.

For a number of our past deployments, our science sampling strategy has been to only sample on dives. Thus, if we pass a sci variables (eg "sci_water_temp") to binary_to_timeseries() as 'time_base', then our timeseries only consists of the dives. Also, on these deployments the glider computer only samples every 4-5s, and so using eg "m_depth" as 'time_base' means we'd be downsampling the science data, which we'd like to avoid.

An additional factor is that we often have a variety of additional sensors on our gliders, often either a Nortek echosounder or shadowgraph camera in addition to oxygen/eco puck, and so we want to stay flexible to whatever sampling strategies we decide to use across instruments in the future. (A brief note recognizing that using sci_m_present_time for all sensors means slightly interpolating values from other sci sensors, eg oxygen, but as y'all have chosen we're ok with this)

Do you have thoughts/suggestions for how to use pyglider.slocum and 1) ensure the timeseries has both the glider's dives and climbs and 2) not downsample the science sensors?

Per the dbdreader backend thread, perhaps the best option for what we're looking for is to use something like @smerckel's suggestion from that thread of x=[v for _,v in dbd.get(*dbd.parameterNames, return_nans=True)], put the engineering and science variables are each put in their own data frames, and then do a pd.merge(df_eng, df_sci, "outer", left_index=True, right_index=True). If you agree, is this something you'd be open to as a pr?

jklymak commented 4 months ago

@smwoodman In this instance, I'd make multiple timeseries using multiple yaml files to select what you want in those timeseries, and then merge them in post if desired. I'd argue that is basically the same as making two data frames and then passing those to the user.

We don't merge them, but we for instance make an engineering timeseries that uses m_depth as a key, but doesn't try to include any science data.

I think it would be hard to merge data sampled at different rates in a way that all users would be happy with, though I admit the current code does that if you aren't paying attention.

I will also note that our group has noted some annoyances with multiple yaml files in that if you want the 'science.yaml' and the 'engineering.yaml' to have proper metadata, they need to be copied across.

jklymak commented 4 months ago

BTW I note a documentation deficiency in that time_base is not actually described!

jklymak commented 4 months ago

Thinking about this a bit more, I guess another approach would be to allow time_base to be specified in the YAML for each variable. That gets messier as each time co-ordinate would need to have a different name but netcdf allows you to have something like:

coordinates: time_sci(10_000), time_eng(740)
variables: temperature(time_sci), heading(time_eng)

But to me the only advantage of that is saving a bit of yaml copy and paste, and maybe making it more explicit what variable you are sampling onto.

smwoodman commented 4 months ago

BTW I note a documentation deficiency in that _timebase is not actually described!

Also, fyi slocum.binary_to_timeseries() doesn't show up in https://pyglider.readthedocs.io/en/latest/pyglider/pyglider.html :) I assume because it isn't part of __all__ here? Happy to open a new issue for this if you want

smwoodman commented 4 months ago

@jklymak thanks for the quick response!

Thinking about this a bit more, I guess another approach would be to allow _timebase to be specified in the YAML for each variable. That gets messier as each time co-ordinate would need to have a different name but netcdf allows you to have something like:
coordinates: time_sci(10_000), time_eng(740)
variables: temperature(time_sci), heading(time_eng)
But to me the only advantage of that is saving a bit of yaml copy and paste, and maybe making it more explicit what variable you are sampling onto.

I agree this seems like it would get real messy real quick, and add another layer for data users to have to understand.

@smwoodman In this instance, I'd make multiple timeseries using multiple yaml files to select what you want in those timeseries, and then merge them in post if desired. I'd argue that is basically the same as making two data frames and then passing those to the user.

We don't merge them, but we for instance make an engineering timeseries that uses m_depth as a key, but doesn't try to include any science data.

This makes sense, and we've thought about it too. Even when considering just science though, because of the variety of sensors/instruments we use we'd prefer that we didn't have to specify one sensor to which all of the other sensors would be limited/interpolated. Specifically, rather than MultiDBD.get_sync, we'd prefer to use MultiDBD.get so that we can use _returnnans=True. Thus, no time points would be 'lost', such as if sci_water_temp has an NaN but sci_oxy4_oxygen has a value, and we would retain as much resolution as possible across the sensors

Admittedly, I haven't gone digging to see what the biggest time gap is between where the CTD/oxygen/eco puck values are not NaN in our data, to see eg how many time points would be 'lost' and the actual time span over which they would be interpolated. And, yes we will be doing at least some level of interpolating later. But we would still prefer to start with the 'full span timeseries' from the start to stay the most flexible.

Maybe I can make a fork trying this (using/merging MultiDBD.get_sync(..., return_nans=True), and if I like where it goes then you can see if you want to consider incorporating it?

jklymak commented 4 months ago

I understand where you are coming from about preserving raw data, but I think pyglider is flexible enough to do that for you.

If you combine into one file, I think you'd find such a time series would be crazy large because none of the sensors exactly line up in time. We had this issue with the SeaExplorer where one sensor was sampling 12 Hz, and the others 1 Hz, and the file was huge for no good reason. Again, if I had 4 instruments that I wanted to keep the raw uninterpolated data for, then I'd make 4 files with the 4 instruments as the time base, and then do the interpolation outside of pyglider.

In general the CTD should be sampling faster than the O2 sensor or Flu can meaningfully sample anyways, so we have not been too bothered about being fussy about this. O2 sensors have a many-second temporal smoothing, so sampling them fast rarely helps anything, and optics are better heavily averaged as well. If you have other instruments that may be a poor assumption, but I still think pyglider is flexible enough to help.

jklymak commented 4 months ago

Thinking about this a bit more, I think if you wanted to propose a code path that did time_base='union' or something like that, that would be fine. I would be opposed to it being the default, however, and it should be pretty clearly documented. Still, if it were me, I'd make as many "raw" data files as I needed and then merge as scientifically relevant.

smwoodman commented 4 months ago

Ahh, I hadn't taken that step in my head to have a separate file for each instrument. Thanks.

I'm probably still going to explore some time_base='union'-type route. I'll reply here when I have more feelings about that.

smwoodman commented 4 months ago

@jklymak is https://github.com/smwoodman/pyglider/tree/return-nan a path that you'd be interested in for pyglider?

jklymak commented 4 months ago

@smwoodman hard to tell from a bare repo. Can you make a pull request so I can see the diffs?

smwoodman commented 4 months ago

Ha, yep will do. My bad, I'm new to this contributing thing.

smwoodman commented 2 days ago

Belatedly closing this issue, as we've moved forward with creating multiple (engineering and science) NetCDF files. Thanks for all of the thoughtful discussion.

jklymak commented 2 days ago

Thanks @smwoodman - hopefully the solution you came up with works for you and your group. If not, we can revive...

c-proof / pyglider

Interpolation and time base choices when processing slocum data #170