thomas-muench commented 5 years ago

This is a summary issue listing the quality control information piccr is supposed to deliver, both in the stand-alone version and when used together with Cpt-Picarr.

This information, in more detail, is also available from the dataProcessingGuide.Rmd.

piccr should enable quality control of the measured and processed data on three levels: (1) on a per-injection basis, (2) on a per-sample basis, and (3) on a per-file basis.

Per-injection basis

On this level, the following quality control information is obtained directly from the raw Picarro measurement files:

- standard deviation of the injection value for each measured isotopic species over the measurement integration time in the cavity;
- water vapour level (mean and standard deviation) during the injection.

Per-sample basis

The final isotopic value of a sample or standard is obtained from averaging across a certain (settable) number of injections with the following quality control information:

- the mean water vapour level and its standard deviation for this sample/standard;
- the standard deviation of the mean isotope value for each isotopic species of this sample/standard;
- the deviation from the true isotope value (only for standards).
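As a rough illustration of the per-sample quantities, here is a minimal sketch in Python (not piccr's R code). The helper name `summarise_vial`, the choice to average the last `n_last` injections, the use of the sample SD across those injections, and the sign convention (measured minus true) are all assumptions made for this example.

```python
from statistics import mean, stdev


def summarise_vial(injections, true_value=None, n_last=3):
    """Summarise one vial from its injection values (hypothetical helper).

    Assumption: the final value is the mean of the last `n_last` injections.
    """
    if n_last < 2:
        raise ValueError("n_last must be at least 2 to compute an SD")
    used = injections[-n_last:]
    if len(used) < 2:
        raise ValueError("need at least two injections to compute an SD")
    avg = mean(used)
    return {
        "mean": avg,
        "sd": stdev(used),  # sample SD across the averaged injections
        # The deviation is only defined for standards with a known true value.
        "deviation": None if true_value is None else avg - true_value,
    }


# Made-up d18O values, for illustration only.
standard = summarise_vial([-33.1, -33.4, -33.5, -33.6], true_value=-33.8)
sample = summarise_vial([-20.2, -20.0, -20.1, -19.9])
```

For a sample without a known true value, the `deviation` entry is simply `None`.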

Per-file basis

On the per-file basis, piccr should provide the following quality control information:

- the specific deviation from the true value for the independent quality control standard;
- the root mean square deviation across the deviations of all measured standards from their true values;
- the pooled standard deviation across all measured samples and standards;
- quality control information from the memory correction process (memory coefficients: mean values and values for each analysed standard);
- quality control information from the calibration process (such as standard deviation of residuals, p-value of regression) [tbd];
- quality control information from the drift correction process (drift slopes, p-value of regression) [tbd].

Definitions

The root mean square deviation is defined as

$$\mathrm{RMSD} = \sqrt{\frac{1}{k} \sum_{i=1}^{k} \Delta_i^2}$$

where \Delta_i is the deviation of standard i from its true value and k is the number of standards.
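A minimal sketch of this quantity in Python (not piccr's R code), assuming the usual definition of the root mean square deviation as the square root of the mean squared deviation. The function name `rmsd` and the numbers are made up for illustration.

```python
import math


def rmsd(deviations):
    """Root mean square of the deviations of the standards from their true values."""
    if not deviations:
        raise ValueError("need at least one deviation")
    return math.sqrt(sum(d * d for d in deviations) / len(deviations))


# Made-up d18O deviations (measured minus true) for three standards.
d18o_rmsd = rmsd([0.1, -0.2, 0.2])
```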

The pooled standard deviation is the square root of the pooled variance, which is defined as

$$\sigma_p^2 = \frac{\sum_{i=1}^{N} (n_i - 1)\, \sigma_i^2}{\sum_{i=1}^{N} (n_i - 1)}$$

where \sigma_i is the standard deviation of the mean isotope value for vial i from averaging across n_i injections, and N is the total number of analysed vials.
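The pooled standard deviation could be computed along the following lines. This Python sketch (not piccr's R code) assumes the conventional pooled variance, in which each vial's variance is weighted by its degrees of freedom n_i - 1. The function name `pooled_sd` and the example values are hypothetical.

```python
import math


def pooled_sd(sds, n_injections):
    """Pooled SD across vials, weighting each variance by n_i - 1 (assumed form)."""
    if len(sds) != len(n_injections):
        raise ValueError("sds and n_injections must have the same length")
    dof = sum(n - 1 for n in n_injections)
    if dof <= 0:
        raise ValueError("need more than one injection per vial in total")
    weighted = sum((n - 1) * s ** 2 for s, n in zip(sds, n_injections))
    return math.sqrt(weighted / dof)


# Made-up values: two vials averaged over 3 and 5 injections.
example = pooled_sd([0.1, 0.3], [3, 5])
```

If every vial has the same SD, the pooled value equals that SD regardless of the number of injections, which is a quick sanity check for the weighting.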

Measurement uncertainty for project

The overall measurement uncertainty for a given measurement project, which contains M single measurements (i.e. Picarro files), can then be assessed with the root mean square deviation of the quality control standards,

$$\mathrm{RMSD}_{\mathrm{QC}} = \sqrt{\frac{1}{M} \sum_{i=1}^{M} \Delta_i^2}$$

where \Delta_i is the deviation of the quality control standard from its true value in measurement file i (Note: if each measurement file has several control standards, one can use each file's root mean square deviation from these control standards here).
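The note about several control standards per file can be sketched as a two-step calculation: a root mean square deviation per file, then a root mean square across the M files. This is a Python illustration under that reading, not piccr's R code. The names `rms` and `project_uncertainty` are made up.

```python
import math


def rms(values):
    """Square root of the mean of the squared values."""
    if not values:
        raise ValueError("need at least one value")
    return math.sqrt(sum(v * v for v in values) / len(values))


def project_uncertainty(control_deviations_per_file):
    """Combine control-standard deviations of M files into one uncertainty.

    Each inner list holds the deviations of the control standard(s) in one
    file; with a single control standard the per-file value is |deviation|.
    """
    per_file = [rms(devs) for devs in control_deviations_per_file]
    return rms(per_file)


# Made-up deviations: three files with one control standard each.
uncertainty = project_uncertainty([[0.1], [-0.1], [0.1]])
```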

thomas-muench commented 5 years ago

This issue is related to:

#15

EarthSystemDiagnostics/cpt-picarr#39 EarthSystemDiagnostics/cpt-picarr#13 EarthSystemDiagnostics/cpt-picarr#12 EarthSystemDiagnostics/cpt-picarr#11

twollnik commented 5 years ago

@thomas-muench Thank you for putting together this detailed summary.

It would be great if you could be involved in the implementation. Maybe you can take care of the piccr side of things (making sure that piccr outputs all required information). Then I could work on integrating the output into cpt-picarr. Does that work for you?

thomas-muench commented 5 years ago

We decided to first hold a short meeting to set up a general piccr output structure for the quality control information. Then I will work on the piccr implementation of this output, and @twollnik will prepare cpt-picarr to handle it.

thomas-muench commented 5 years ago

We decided on the following general output structure:

Output for individual data set

Data output

The measurement data output will include the following components:

- name: the file name of the data set [character vector];
- raw: the original measurement data before any processing was done [data frame];
- memoryCorrected: the data after applying the memory correction [data frame];
- calibrated: the data after applying only a calibration using first-block standards [data frame];
- calibratedAndDriftCorrected: the data after applying a linear drift correction and a calibration using first-block standards, or after applying a double calibration (with inherent drift correction) [data frame];
- processed: the final data from averaging across n injections [data frame].

Quality control output

The quality control output, as outlined above, splits into information delivered along with the raw or processed data, and into separate information:

data-delivered:
- water vapour level information
- SD of mean isotope values
separate information:
- deviationsFromTrue: data frame with the deviations from the true values for all measured standards; columns Identifier 1, block, d18OMeasured, d18OTrue, d18ODeviation, and the corresponding three columns for dD
- deviationOfControlStandard: named list (components d18O and dD) with the deviation from the true value for the quality control standard for d18O and dD
- rmsdDeviationsFromTrue: named list (components d18O and dD) with the rmsd of deviationsFromTrue for d18O and dD
- pooledSD: named list (components d18O and dD) with the pooled standard deviation for d18O and dD for the data set
- memoryCoefficients: data frame with memory coefficients = mean and individual values for each analysed standard = columns Inj No, mean, <standard-name1>, ..., each for d18O and dD
- calibrationParams: data frame with columns block = standard block used for calibration, pValue = p-value of the calibration regression, d18ORMSDOfResiduals, dDRMSDOfResiduals = RMSD of calibration regression residuals, d18OSlope, dDSlope, d18OIntercept, dDIntercept = slope and intercept of calibration for both isotope species, timeMean = mean measurement time since start for this block
- [only for calibration_method=1] driftParams: data frame for drift parameters with columns variable = name of the standard used for the estimate or the mean estimate, d18OAlpha, dDAlpha = estimated drift rates, pValue = p-value of the linear drift regression, d18ORMSDOfResiduals, dDRMSDOfResiduals = RMSD of drift regression residuals (p-value and RMSD values are NA for the mean estimate)

Output for M data sets

The output for M processed data sets is a list of length M, where each list element i is a list containing all of the above output.
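To make the shape of this output concrete, here is a minimal sketch using Python dictionaries in place of R lists (piccr itself is written in R). The component names are taken from the lists above. The flat layout of each element, the use of `None` for components that have not been filled, the helper name `empty_dataset_output`, and the file names are assumptions of this example.

```python
def empty_dataset_output(name):
    """Return the output skeleton for one data set (hypothetical helper)."""
    per_isotope = lambda: {"d18O": None, "dD": None}
    return {
        # data output
        "name": name,
        "raw": None,
        "memoryCorrected": None,
        "calibrated": None,
        "calibratedAndDriftCorrected": None,
        "processed": None,
        # separate quality control information
        "deviationsFromTrue": None,
        "deviationOfControlStandard": per_isotope(),
        "rmsdDeviationsFromTrue": per_isotope(),
        "pooledSD": per_isotope(),
        "memoryCoefficients": None,
        "calibrationParams": None,
        "driftParams": None,
    }


# Output for M = 2 data sets: a list with one element per file.
output = [empty_dataset_output(n) for n in ["fileA.csv", "fileB.csv"]]
```

Keeping every key present in every element means a consumer can look up, for example, `output[0]["pooledSD"]["d18O"]` without first checking which components exist.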

@twollnik Please have a look over this proposed structure and tell me if you are fine with it or if any changes (e.g. variable names) are necessary from your point of view.

twollnik commented 5 years ago

@thomas-muench Thank you for taking the time to write this. I think that you have captured everything that we talked about (and more). I have one question and some remarks.

The question: Under quality control output >> separate information you mention named vectors a few times. What names would you choose? (e.g. names(qualityControl) equals what?)

The remarks:

I suggest adding a component name that contains the file name of the raw dataset. That way we can output the actual names of the input datasets and not just file numbers to make the output more understandable.
I suggest sticking to our camelCase naming convention and not using underscores or points in the component names. (This applies to deviationsFromTrue, rmsd.DeviationsFromTrue, calibrationParams, and driftParams.)

twollnik commented 5 years ago

Things we should not forget

[x] writeDataToFile(..) needs to be updated to be able to work with the new format.
[x] outputSummaryFile(..) needs to be updated to work with the new format and to include more quality control information.
[x] The roxygen docstrings need to be updated for all functions that are changed.
[x] The tests should be updated continuously. They should always be green.

thomas-muench commented 5 years ago

@twollnik Thanks for your feedback.

I have updated the comment to adopt a consistent camelCase naming convention and included the name parameter.

Regarding your question: What I meant was in each case d18O and dD as names for the vectors, since the respective quantities are all one numeric value for each isotope species. Or should we instead use named lists to be more consistent?

twollnik commented 5 years ago

Thanks for integrating my feedback.

> Or should we instead use named lists to be more consistent?

Yes, good idea.

thomas-muench commented 5 years ago

Ok, thanks, I will edit the structure to also use lists for the respective quantities.

twollnik commented 5 years ago

@thomas-muench I suggest renaming the component qualityControl to controlStandardDeviation to be more precise.

twollnik commented 5 years ago

@thomas-muench I added the file piccrMockOutput to the cpt-picarr repository. You can download the file and then execute load("path/to/piccrMockOutput") to load the variable piccrMockOutput into your workspace. It contains example output in the format that cpt-picarr expects. (Note that some values are NULL.)

thomas-muench commented 5 years ago

@twollnik Thanks for the mock variable; this is helpful.

> @thomas-muench I suggest renaming the component qualityControl to controlStandardDeviation to be more precise.

I would not use this name since it sounds like the "control standard deviation (SD)" value. Instead, how about deviationControlStandard or deviationOfControlStandard?

twollnik commented 5 years ago

I like deviationOfControlStandard best.

thomas-muench commented 5 years ago

> I like deviationOfControlStandard best.

Updated accordingly.

twollnik commented 5 years ago

@thomas-muench

> the mean water vapour level and its standard deviation for this sample/standard;

At the moment this information is not included in the processed data. Would you prefer to include two columns H2O_Mean and H2O_SD in the processed data, or should this information be calculated by cpt-picarr?

thomas-muench commented 5 years ago

Good point; I would have cpt-picarr calculate this.

thomas-muench commented 5 years ago

@twollnik Should I initialize the output structure such that it always contains all possible elements but certain elements might be NULL when a specific processing step was not switched on (e.g. memoryCoefficients, ...)?

I would also suggest renaming pooledStdDev to pooledSD, since SD is the common abbreviation for the standard deviation.

twollnik commented 5 years ago

> @twollnik Should I initialize the output structure such that it always contains all possible elements but certain elements might be NULL when a specific processing step was not switched on (e.g. memoryCoefficients, ...)?

Yes, please.

> I would also suggest renaming pooledStdDev to pooledSD, since SD is the common abbreviation for the standard deviation.

I agree.

thomas-muench commented 5 years ago

Left to do here:

[x] output calibration parameter in calibrationParams
[x] output linear drift regression parameter in driftParams

thomas-muench commented 4 years ago

Finally completed and closed by #53.

EarthSystemDiagnostics / piccr

Overview of needed quality control information #17
