hvdosser commented 1 year ago

In testing the new code for this pull request, I found an issue with processing the delayed mode SeaExplorer time series data for missions for which certain sensors (oxygen in this case) are severely oversampled. These missions end up with delayed mode data files that contain fewer actual (non-nan) data points than the realtime files. In other words, we are losing data during the processing.

Currently, the dropna function is used to remove the oversampled oxygen data when converting the raw data. The dropna function is working correctly, however note that the resulting data has many nan values in it, for both the CTD and optics. These nan values will often not co-occur.

I think the problem in the processing is caused by using the GPCTD_TEMPERATURE as the default time base in seaexplorer.py. This variable contains nan values that are not all co-located with the nan values in the oxygen and optical variables. It's desirable to use the CTD as the time base, but we may need to do some interpolation to avoid losing data when the other variables are mapped onto this base.

callumrollo commented 1 year ago

We have had similar issues with our delayed mode datasets in the past. To get around it, we use NAV_LATITUDE as the timebase in all of our datasets, as this has a non-nan value for every line. Here's an example yml for one of our gliders with a GPCTD. https://github.com/voto-ocean-knowledge/deployment-yaml/blob/master/mission_yaml/SEA70_M15.yml

Would this solution work for you?

This also raises the broader point that perhaps we should have a test with delayed mode data, as currently all tests are run with nrt data

callumrollo commented 1 year ago

We use NAV_LATITUDE as timebase in conjunction with keep_variables that account for each of the sensors. In this way, all data pass the dropna function, but rows with no data from any of the sensors are dropped by the 'keep_variables' in ncvar block.

richardsc commented 1 year ago

I'm not sure if this is related, but I had previously noticed that the timestamps recorded for sensors that are known to be outputting at 1Hz (according to the sensor) are never exactly 1Hz, likely due to the time that they arrive at the payload computer -- in other words, timing information from the sensor is not recorded by the PLD but it assigns it's own. The result is that two different sensors sampling at the same rate (e.g. a CTD and an optical O2 sensor) end up with times that are different by microseconds from one sample to the next, and even though we expect them to be "simultaneous", in terms of the recorded time stamps, they aren't.

jklymak commented 1 year ago

For our data, we have the situation where indeed "Nav" is written every heartbeat; certainly though there is no new data in Nav. Some heartbeats have no data at all, some have CTD data (all variables), some have Optical, some have O2. Sometimes these line up, and currently we only save when the other sensors line up with the CTD sensors.

I think what needs to happen is that for each instrument we need a timeseries, and then we need to align it with the timebase. For our SeaExplorers, that time base is most naturally the CTD. The other samples may be offset from that a it, but the CTD is sampling at 2 Hz, and I don't think we care about the other sensors slight phase errors at less than 2 Hz.

I haven't explored how to do this with polars, but in xarray you would just do an interp operation before filling it into the parent. So in pseudocode:

time, ctd_temp = decode('ctd_temp', drop_na=True)
ds['time'] = ('time', time)
ds['temperature'] = ctd_temp
# etc other ctd variables
time_O2, O2_sat = decode('O2_sat', drop_na=True)
ds['O2'] = np.interp(time, time_O2, O2_sat)
...
# etc other O2 variables

I don't think this step would slow things down very much, and I think linear interpolation should be pretty OK from a data point of view.

A second option would be just to save the three instruments as separate polars arrays as raw output and then merge as a second step. That would allow double checking the raw data. However, I think the raw SeaExplorer data is simple enough that it's pretty usable as-is for any debugging.

jklymak commented 1 year ago

I'll ping about this. Is there a consensus about how to proceed?

richardsc commented 1 year ago

I think what needs to happen is that for each instrument we need a timeseries, and then we need to align it with the timebase. For our SeaExplorers, that time base is most naturally the CTD. The other samples may be offset from that a it, but the CTD is sampling at 2 Hz, and I don't think we care about the other sensors slight phase errors at less than 2 Hz.

I agree with this approach, with the clarification (not really important for the decision or discussion) that I'm 99% sure the GP-CTD samples internally at a max of 1Hz. That's not true for a legato, which can be programmed to sample much faster (up to 16Hz), though the SeaExplorer PLD can only be configured to sample at 1Hz or "as fast as possible" (which IIRC is something like 20Hz)

jklymak commented 1 year ago

@callumrollo any objections to this? I can take a crack at doing it the next few days.

@hvdosser, do we have a link to some data where this is problematic?

hvdosser commented 1 year ago

The delayed L0-timeseries (dfo-bb046-20200908_delayed.nc) for this mission is a nice example of the problem.

jklymak commented 1 year ago

Is there a time, or ideally a set of raw files where this problem occurred? I couldn't replicate, though I couldn't get it to work at all with the setup in that directory.

callumrollo commented 1 year ago

Sorry for the delay on this, just got back from vacation. The idea of aligning timebases sounds good to me. Should we linearly interpolate, or do a nearest neighbour? I'm always a little cautious about interpolating data. Especially as it would lead to some strange results with integer data like raw counts for chlorophyll etc. If we want to do it at the raw to raw nc step I can look at implementing it in polars tomorrow

jklymak commented 1 year ago

OK, I ran the whole data set, and can see the problem now.

@callumrollo if you wanted to do this, happy to let you. I'm less nervous about linear interpolation - nearest neighbour leads to phase errors; However, if there is a difference for other data types, maybe we need an entry in the yml metadata that says which interpolation is used?

callumrollo commented 1 year ago

The tests are currently failing on the main branch. Looks like this way caused by something in #129. That PR was only for slocum data, so I'll work from the last commit where all the tests cleared for this PR. I've downloaded the dataset that @hvdosser indicated and will use that as a test case for timestamp alignment

jklymak commented 1 year ago

@callumrollo #129 included sea explorer tests as well, so better to figure out why the tests are failing?

jklymak commented 1 year ago

@callumrollo I think the tests should be OK on main now (I hope)

I think we need to do some work on the test infrastructure. In particular, we should force rewriting of the files we compare with. Right now the way the processing works is it does incremental.

callumrollo commented 1 year ago

Thanks for working to get the tests passing again @jklymak. I'll start on resolving this Issue now.

I agree on forced reprocessing in the tests. would using the incremental=False flag suffice for this?

callumrollo commented 1 year ago

@hvdosser I think the keep_vars functionality present in pyglider already could solve this problem to first order. Is it something you're using?

I've put a demo together of the difference between using CTD as a timebase and using NAV_LATITUDE as a timebase then cutting down to the rows where at least one of the sensors has a sample. It's not perfect, but has been working pretty well for us so far.

https://github.com/callumrollo/keep_vars_experiment/blob/main/timebase_keep_experiment.ipynb

Sorry, it's a bit of a rushed Friday afternoon job!

jklymak commented 1 year ago

OK, I forgot about this, and I'm not sure it's documented. What does this do exactly?

hvdosser commented 1 year ago

I haven't had a chance to look in detail yet, but the first thing I'd check is whether this works for a delayed-mode dataset. We didn't see much of an issue with the realtime data.

jklymak commented 1 year ago

looking at it quickly it seems to keep the data if any of the listed sensors are present in a line.

I guess this is a philosophical thing - do we want all the raw data in a time series, which means any given sensor is riddled with NaN, or do we want time series where the sensors are time-aligned to one sensor's time base.

I guess I'll argue for the latter. If someone needs the time series with raw O2, for instance, they can rerun the processing with just that variable, and that variable as the timebase. Or they can load the raw parquet files. I think by the time we make the first time series netcdf, it should be time-aligned, and not full of NaN's.

This seems particularly apropos for the O2 and the optics on the SeaExplorer, which so far as I can tell are often ludicrously oversampled?

But happy to discuss further.

callumrollo commented 1 year ago

II agree we need a way to reduce the size of the final timeseries while keeping as much science data as possible.

I've been white-boarding this morning to represent the way pyglider currently does this for seaexplorer data, and what potential improvements could look like. Here's some ASCII art:

Key Time: PLD_REALTIMECLOCK nav: NAV_LATITUDE ctd: LEGATO_TEMPERATURE or GPCTD_TEMPERATURE oxy: e.g. AROD_FT_DO nitrate: e.g. NITRATE_CONCENTRATION or other rarely sampling sensor

X: original data -: no data I: interpolated data N: nearest neighbor data

Input pld1 file

Time	nav	ctd	oxy	nitrate
0	X	X	-	X
1	X	-	X	-
2	X	-	X	-
3	X	X	-	-
4	X	-	X	-
5	X	-	X	-
6	X	X	-	-
7	X	-	-	X
8	X	-	X	-
9	X	-	X	-
10	X	X	-	-

Current output options

timebase: ctd

Time	nav	ctd	oxy	nitrate
0	X	X	-	X
3	X	X	-	-
6	X	X	-	-
10	X	X	-	-

This illustrates @hvdosser's problem where oxygen is on a slightly offset timestamp to ctd so it is all dropped

timebase: nav, keep_vars: [ctd, oxy]

Time	nav	ctd	oxy	nitrate
0	X	X	-	X
1	X	-	X	-
2	X	-	X	-
3	X	X	-	-
4	X	-	X	-
5	X	-	X	-
6	X	X	-	-
8	X	-	X	-
9	X	-	X	-
10	X	X	-	-

With keep_vars, as @jklymak identified, any line with a non-nan value for one of the sensors in keep_vars is kept. Resultant files are big, particularly if one of the keep_vars has a high sampling rate.

This is the method we currently implement at VOTO, as the scientists we supply data to don't want any interpolation of data. Some of our delayed mode datasets are almost 10 GB though! Not ideal.

Potential improvements

timebase:ctd linear interpolation

Time	nav	ctd	oxy	nitrate
0	X	X	I	X
3	X	X	I	I
6	X	X	I	I
10	X	X	I	I

This solves the problem nicely for oversampling sensors like oxygen. However, this would be a severe distortion of e.g. a methane sensor that records a sample every 10 minutes and now has an apparent sample every second

timebase:ctd nearst neighbour

Time	nav	ctd	oxy	nitrate
0	X	X	N	X
3	X	X	N	-
6	X	X	N	-
10	X	X	N	N

This avoids over-interpolation of slow sampling sensors, but the downsampling of more fast sampling sensors may be less preferable. May also be more complex to implement.

Whatever we decide to do going forward, I recommend that it is either controllable from the yaml, like the keep_vars solution, or operates on the l0 timeseries as an extra step, so that an end user can get the final l0 timeseries without any interpolation if they want.

We should also explain these various options in the documentation. Perhaps using diagrams like the ones in this comment.

jklymak commented 1 year ago

Folks who don't want any interpolation or alignment of sensors have two options (that I agree should be documented)

1) I would argue that the raw merged parquet file is what the folks who don't want any interpolation are after. That, to my knowledge, doesn't drop any data or do any processing?

2) They can make a oxy.yaml that aligns everything to oxygen instead of the ctd, and then they have the best of both worlds - pure oxygen data, and interpolated quantities of everything else.

For interpolation options, totally fine with both linear and nearest. I'd never use nearest, but...

The original reason for using the CTD for our data sets was that indeed oxygen had much more data, but it was all repeated and was really being sampled at 1 Hz or 2 Hz or something, and seemed a silly error of Alseamars to be sampling it at 48 Hz or whatever they were doing. Of course if someone has a real data set that needs sampling at higher frequency than the CTD, they should be using that as the timebase.

callumrollo commented 1 year ago

This sounds like a good way to implement the interpolation. I've started working on a PR.

c-proof / pyglider

SeaExplorer delayed mode time series data loss #128

Input pld1 file

Current output options

timebase: ctd

timebase: nav, keep_vars: [ctd, oxy]

Potential improvements

timebase:ctd linear interpolation

timebase:ctd nearst neighbour