variableOffsetPP bug - Githubissues

BecCowley commented 3 years ago

When re-processing some older data, I have an instrument which didn't switch on until some time after deployment. When I used the variableOffsetPP routine, it failed at line 194 for this instrument because timeForDiff is set to the time _deployment_start value.

Looking at the comments in this code, it does suggest that the timeForDiff value comes from 'differences over a day from ¼ of deployment'. So I'm not sure why the timeForDiff is then reset to time_deployment_start.

I have commented out lines 175-188. I think the time selection step for getting this information could be problematic in many instances when an instrument fails or has no data on that particular day.

The gui window that pops up with a summary of the min/max/mean and differences from adjacent instruments could do with some work. It is confusing (min and max and mean appear to be for the whole deployment while the diff is for just the day ¼ of the way in). I'm not sure of the value of this information from my perspective.

However, the routine is useful for correcting offsets and drifts.

We could avoid all potential bugs with the time period selection by not including it. How many others use this routine?

ocehugo commented 3 years ago

So I'm not sure why the timeForDiff is then reset to time_deployment_start.

The code clearly assumes time_deployment_start == time_turned_on. Maybe it was the case before. Your solution is fine, but will still break if the instrument is only turned on after 25% of the time series. Likely never the case but still possible to raise an error.

The gui window that pops up with a summary of the min/max/mean and differences from adjacent instruments could do with some work. It is confusing (min and max and mean appear to be for the whole deployment while the diff is for just the day ¼ of the way in). I'm not sure of the value of this information from my perspective.

I completely agree. I would ignore all the information in the UI window. The logic for the difference here doesn't make any sense. I wonder if we should get rid of it.

In particular, I can't fathom how one would set an offset/scale - for the whole time series - using as an informational basis the difference of two instrument values, over a day of sampling, which is arbitrarily set to a day after 25% of the series.

This is even worse because the UI is not interactive and if you play with the offset/values nothing happens, so no clue about the actual transformation outcomes until you got to the main window.

My rationale of why this routine exists is for the basic transformation of data or simple recalibration (e.g. CPHL), where people already fitted a model to the data and know well enough what they are doing (they already visualized, fitted a model, etc). Thinking a bit more, even the min/max/mean are useless here since they cannot be used alone to define any linear parameter. For example, even in the simplest case - say we know a variable cannot be negative -to just choose an offset to get rid of negative values is a dubious step. You need more information, more interactivity, more visualisation to make the decision.

Hence, most of the information in the window is incomplete at best.

BecCowley commented 3 years ago

I am using it to apply a depth offset to aquadopp instruments where I don't have enough other instruments with good pressure to infer a depth. I have to already know what the offset is that I want to apply. To do that just using the toolbox, I have to start it and get to the point where I can plot the instrument depth against planned depth, then I can get an offset. I then have to re-start the toolbox to apply the offsets to the PP routine.

Getting the timeDriftPP values right also requires a few restarts to make sure the correct values are applied in the PP routine.

ocehugo commented 3 years ago

I am using it to apply a depth offset to aquadopp instruments where I don't have enough other instruments with good pressure to infer a depth. I have to already know what the offset is that I want to apply. To do that just using the toolbox, I have to start it and get to the point where I can plot the instrument depth against planned depth, then I can get an offset. I then have to re-start the toolbox to apply the offsets to the PP routine.

Exactly. On one hand, when you need to use the routine, you already loaded the variables, and you already know the parameters by using/fitting the data somewhere else. In this case, the UI information is useless - you already made a decision. On the other, when you don't know yet that you need to scale the variable (first load), the information is useless too because you can't decide just on that information. This is even worse, given the data state is close to raw (e.g. include out of water data).

Getting the timeDriftPP values right also requires a few restarts to make sure the correct values are applied in the PP routine.

Yeap - I know your pain :)

We shouldn't allow data transformation before looking at the data or doing quality control. The workflow related to these PP routines is clearly broken. The right thing to do is:

[ ] All PP goes to the main window and you can redo/overwrite PP (medium effort)
[ ] You can fit arbitrary statistical models to any variable, using any other variable (linear/polynominal/etc), and generate new or overwrite variables (large effort).
[ ] All the pp functionality is adapted to that, some deprecated, other extended (probably a large effort).
[ ] All the visualisation related to it (plots with variable against variable, model fitting, etc) (medium effort).

I have raised this in the past, but it's always hard to commit to it given the backlog of things and the difficulty to estimate the effort required.

aodn / imos-toolbox

variableOffsetPP bug #734