c-proof / pyglider

glider software
https://pyglider.readthedocs.io/
Apache License 2.0
17 stars 25 forks source link

interpolate over nans of pld sensors #140

Closed callumrollo closed 1 year ago

callumrollo commented 1 year ago

This will fix Issue #128

This is a first draft. If interpolate is set to True in raw_to_timeseries, nans from variables will be interpolated over as the sensors are aligned with the timestamps of the variable designated as the timebase.

This should be tested with some delayed mode data, as the interpolation has no effect on nrt data, as these typically have no nans. @hvdosser @jklymak, do you have a delayed mode dataset you would like to use to test this method?

callumrollo commented 1 year ago

We can make the interpolate kwarg True by default, or control this behavior from the yaml

codecov[bot] commented 1 year ago

Codecov Report

Base: 78.48% // Head: 79.47% // Increases project coverage by +0.98% :tada:

Coverage data is based on head (6205c43) compared to base (827f7f7). Patch coverage: 94.59% of modified lines in pull request are covered.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #140 +/- ## ========================================== + Coverage 78.48% 79.47% +0.98% ========================================== Files 7 7 Lines 1292 1364 +72 ========================================== + Hits 1014 1084 +70 - Misses 278 280 +2 ``` | [Impacted Files](https://codecov.io/gh/c-proof/pyglider/pull/140?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=c-proof) | Coverage Δ | | |---|---|---| | [pyglider/seaexplorer.py](https://codecov.io/gh/c-proof/pyglider/pull/140?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=c-proof#diff-cHlnbGlkZXIvc2VhZXhwbG9yZXIucHk=) | `81.91% <75.00%> (+0.19%)` | :arrow_up: | | [tests/test\_pyglider.py](https://codecov.io/gh/c-proof/pyglider/pull/140?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=c-proof#diff-dGVzdHMvdGVzdF9weWdsaWRlci5weQ==) | `98.17% <100.00%> (+1.00%)` | :arrow_up: | Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=c-proof). Have a feature suggestion? [Share it here.](https://app.codecov.io/gh/feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=c-proof)

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

hvdosser commented 1 year ago

Hi @callumrollo, I'd recommend testing on this delayed mode dataset: https://cproof.uvic.ca/gliderdata/deployments/dfo-bb046/dfo-bb046-20200908/delayed_raw/ I would suggest controlling the behaviour from the yaml, so users really know what they are getting out of the processing, but I defer to @jklymak on this one.

callumrollo commented 1 year ago

Thanks @hvdosser. Can you send me a yaml file for the dataset you linked? I will then incorporate the nan interpolation with control from the yaml file.

Sorry for the delay on this, just back from vacation

hvdosser commented 1 year ago

Hi @callumrollo, no worries, thanks for figuring this out! Here's the link to the yaml: https://cproof.uvic.ca/gliderdata/deployments/dfo-bb046/dfo-bb046-20200908/deployment.yml

callumrollo commented 1 year ago

@hvdosser do these tests look sufficient? I've added some of the data you provided and tested it in default mode and with nan interpolation enabled. We're using GPCTD as the timebase.The nan interpolation adds non-nan values for the other sensors (oxygen and cdom) where they did not sample at the same time as the CTD.

Getting a nice increase isn test coverage from this too :)

jklymak commented 1 year ago

Ping @hvdosser for a review on this

jklymak commented 1 year ago

@hvdosser if you had a chance to make sure this works it would be appreciated...

callumrollo commented 1 year ago

@hvdosser do you have a timeframe for reviewing this? It's quite a substantial change, so I'd like to get it merged before addressing other Issues

hvdosser commented 1 year ago

Hi @callumrollo and @jklymak , sorry for the delay on this. I'll have time on Monday Feb 13th to review and merge.

callumrollo commented 1 year ago

I've merged in the latest changes from main. Is this good to go @hvdosser?

hvdosser commented 1 year ago

@callumrollo and @jklymak, I'll merge this pull request, but I do think we need to consider what happens when we have large gaps in the data and we interpolate over them. Particularly for sensors like the ECOpuck, which only samples a small portion of the water column. Am I correct in thinking that setting interpolate=True will result in new interpolated points everywhere the CTD has data? If that's the case, we'll need to figure something out to deal with these larger gaps before we can move forward.

jklymak commented 1 year ago

OK, I think xarray provides this functionality:

https://docs.xarray.dev/en/stable/generated/xarray.DataArray.interpolate_na.html#xarray-dataarray-interpolate-na

note the max_gap parameter, which should be perfect for this. Just set to some reasonable value (that probably changes depending on the sampling frequency) and big gaps would remain NaN...

callumrollo commented 1 year ago

Hi @hvdosser. The potential problem with slow sampling sensors was raised in the original discussion of Issue #128. Linear interpolation will result in an apparent measurement at every timestamp of the specified timebase. using keep_vars will ensure these measurements are retained. An alternative would be to implement a max_gap as Jody suggests, or to have the option of nearest neighbout interpolation of some kind. Either way, I feel this is best addressed in a separate PR, preferably with the use of a dataset which demonstrates the issue you want to solve as test data