seaexplorer.raw_to_timeseries() outputs questionable navigation data

clayton33 commented 3 months ago

We're hoping to include heading, pitch, and roll in our data files. Below is an example figure comparing the processed data of these variables with the raw data.

processdVsRaw_navVars_731

I've deduced that the navigation data gets interpolated to the chosen timestamp using _interp_gli_to_pld. Is it attributed to the more coarse sampled navigation data being interpolated to the finer science data time ?

Also, to be honest, I'm not much of a python person, so I might be a bit useless to help if this is indeed an issue.

jklymak commented 3 months ago

That's certainly "interesting". Can you share the raw files that made those, and maybe your deployment.yaml file?

jklymak commented 3 months ago

Ping @callumrollo too in case this twigs anything for him. At some point we started using polars for the sea explorer and I didn't follow all the changes super closely.

callumrollo commented 3 months ago

Definitely not the behavior we would want! I remember some small changes in interpolated values when we switched from pandas to polars, but they were of the order 0.001 in the tests, nothing of this magnitude. I'll investigate this afternoon. The deployment.yaml file would be really useful, particularly the timebase

clayton33 commented 3 months ago

Just sent both of you a subset of the raw files, as well as the deployment.yaml in an email. Thanks again for having a look.

callumrollo commented 3 months ago

Hi @clayton33 thanks for sending the raw files. I think I've found the solution. As you described in your email, this mission did not sample science sensors on every row, so some data were discarded during processing. In the deployment yaml, you have:

  keep_variables:
  - conductivity
  - FREQUENCY_DOXY

so pyglider only keeps lines that have ctd or oxygen data. If you want it to keep data from the dives where no science was sampled, you can add one of the glider attitude parameters like roll

  keep_variables:
  - GLIDER_ROLL
  - conductivity
  - FREQUENCY_DOXY

When I made this change and reprocessed, it fixed the gaps in pitch/roll/heading, increading the timeseries file size by ~ 10 %

Does this solution work for you?

PS thanks for providing a fully reproducible example! It really sped up the troubleshooting process.

callumrollo commented 3 months ago

Using the following code to compare raw vs pyglider data:

import xarray as xr
import pandas as pd
import matplotlib.pyplot as plt

ds = xr.open_dataset('L0-timeseries/dfo-Mersey022-20210531_delayed.nc')
df = pd.read_csv('raw/sea022.66.gli.sub.445', sep=';')
df['datetime'] = pd.to_datetime(df['Timestamp'], format='%d/%m/%Y %H:%M:%S')

fig, ax = plt.subplots(3, 1, figsize=(8, 12), sharex='col')
ax = ax.ravel()
ax[0].plot(df.datetime, df.Heading,  lw=5, label='Raw')
ax[0].plot(ds.time, ds.GLIDER_HEADING,label='Processed')
ax[0].legend()
ax[0].set(title='Heading', xlim=[df.datetime.min(), df.datetime.max()])

ax[1].plot(df.datetime, df.Roll,  lw=5, label='Raw')
ax[1].plot(ds.time, ds.GLIDER_ROLL,label='Processed')
ax[1].legend()
ax[1].set(title='Roll', xlim=[df.datetime.min(), df.datetime.max()])

ax[2].plot(df.datetime, df.Pitch,  lw=5, label='Raw')
ax[2].plot(ds.time, ds.GLIDER_PITCH,label='Processed')
ax[2].legend()
ax[2].set(title='Pitch', xlim=[df.datetime.min(), df.datetime.max()])

With original deplyment.yaml

old

With GLIDER_ROLL added to keep_variables:

new

jklymak commented 3 months ago

@callumrollo Can this be better documented? For instance, I don't see any mention of keep_variables in any of the test yaml files, nor in the docs. I'll admit I'm somewhat flummoxed what this does - I think it is perhaps similar to what was being discussed for slocums, where folks want multiple time bases?

callumrollo commented 2 months ago

Yes it's not the most intuitive control. keep_variables works as an accessory to timebase, to give more fine-grained control over what lines pyglider keeps. I'm off on vacation today, but I'll make a note to update the docs to describe it when I get back

clayton33 commented 2 months ago

I now realize in my initial issue I should have stated the expected behavior (sorry ! I should know better).

I was expecting that the pitch, roll, and heading variables would be interpolated to the time of the data retained by the supplied keep_variables since . For my example data set (which I initially thought wasn't a good case, but I actually think it's the ideal one!), the goal is to keep as much science data (i.e. from the ctd and oxygen sensors). Is this possible, or is the only way to get sensible navigation data is to supply a variable to keep_variables (which is completely fine I think for now), and maybe do some post-processing (e.g. omit any data where there isn't science payload data).

jklymak commented 2 months ago

@clayton33 Your intituition about how it is supposed to work was the original intention - I'm not entirely sure why that has changed.

We had a similar discussion about the slocum gliders, and my recommendation to folks is rather than keeping two timebases, for instance if they want all the heading data even for times when science is off, is that they make two separate netcdf files.

@callumrollo I'd call this a bug the way it is behaving now. I'm happy to try and fix it, or wait until you are back fro vacation.

clayton33 commented 2 months ago

deployment.zip

@callumrollo I tried your suggestion (adding one of the nav variables to keep_variables) and I'm still getting non-sensical pitch, roll, and heading values. I made sure to update all packages, and incorporate all recent changes in my fork... but still no luck. I've attached the updated deployment.yml if it helps.

richardsc commented 2 months ago

Hi folks -- just jumping in here that I tried running @clayton33's files through the latest version of pyglider, installed by following the instructions at https://pyglider.readthedocs.io/en/latest/Install.html, i.e. I did:

conda create -n gliderwork
conda activate gliderwork
conda install -c conda-forge pyglider

Followed by:

git clone git@github.com:richardsc/c-proof/pyglider.git

and then:

pip install -e .

When I tried running her process_delayed.py script, it errored with a "no module named polars" error. Indeed, polars had not been installed to the environment. After adding it to the environment manually (conda install polars), everything ran:

Downloads/GLD2021SEA022M66/$ python process_delayed.py
Created./rawnc/
Created./delayed_rawnc/
Created./L0-timeseries/
Created./L0-profiles/
Created./L0-grid/

and after running @callumrollo's plot script I get the expected result:

So, I think the issue here is not with pyglider but with @clayton33's environment or installed version of pyglider.

I'll open another issue related to the install of polars (I thought that it would have been taken care of by the environment.yml file).

clayton33 commented 2 months ago

@richardsc was right, it was an issue with my environment, disregard my comment above. It's working perfectly fine for me now.

jklymak commented 2 months ago

Ok sorry for the packaging issues. I think the folks who use pyglider mostly use from source rather than relying on package managers because changes are still happening fast. I should update the docs and maybe try to make automated releases easier

richardsc commented 2 months ago

No worries Jody -- @clayton33 and I are both learning as we go, as we're not primarily "python people". We definitely are using the source versions of pyglider, but also relying on conda to install all the other dependencies, and sometimes things can get weird. The lesson we learned from this is "don't try and do everything in your base environment" 😄 .

I won't close the issue, because the last request for work was to update the docs and @callumrollo is currently on vacation. Or we could close this and open a new issue that better reflects what still needs to be done (as the original issue is completed).

c-proof / pyglider

seaexplorer.raw_to_timeseries() outputs questionable navigation data #188