We base the analysis mostly on single-peak features, but also include features with multiple peaks per sample if those are present in less than 10% of the samples in which a peak was identified. In addition, we require features to be found in more than 30% of the samples of at least one sample group. This breaks down to:
| | count |
|---|---|
total_features | 10740 |
RBC > 50% | 4180 |
plasma > 50% | 4531 |
capillary > 50% | 4606 |
venous > 50% | 5329 |
The plot below shows the per-sample signal distribution of feature abundances in log2 scale.
Plasma samples (blue) have fewer valid values for features compared to samples from other sources. Some of the RBC (red) samples also have a considerably lower number of detected features. The distribution of feature intensities seems to be comparable between samples, sources and batches. The high similarity of average abundances between samples visible above might however be misleading, since only the signal from identified peaks is considered. Comparing signal intensities after filling in missing peak data might be better, because it also reflects the fact that some features are simply not present in the data.
After filling in missing peak values the proportion of detected features is highly comparable between samples, but more differences between abundances become visible. Plasma samples (blue) show, however, considerably lower signal intensities than all other samples, which reflects that fewer peaks were detected for these features in these samples (and their intensities were thus filled in).
Below we show relative log abundance (RLA) plots, which represent the difference of the (log) abundance of each feature in a sample to the median abundance of that feature in samples from the same sample group (source).
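A minimal sketch of how such RLA values can be computed, assuming a features x samples matrix `abund` of log2 abundances and a factor `source` with the sample group of each sample (both names hypothetical):

```r
## Relative log abundance: subtract, for each feature, the median (log2)
## abundance of that feature within the sample's group (source).
rla <- function(abund, group) {
    ## abund: features x samples matrix of log2 abundances (may contain NA)
    ## group: factor of length ncol(abund) with the sample group of each sample
    for (g in levels(factor(group))) {
        idx <- which(group == g)
        med <- apply(abund[, idx, drop = FALSE], 1, median, na.rm = TRUE)
        abund[, idx] <- abund[, idx] - med
    }
    abund
}

## Per-sample RLA boxplots, in injection order:
## boxplot(rla(log2(ints), source), outline = FALSE)
```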
The RLA plot above shows some potentially systematic drifts in signal intensities, including consistently lower average RLA values in the last samples of the first batch and consistently higher average RLA values in the last samples of the second batch. Next we plot the RLA plots after filling in missing peak data.
Just as an aside: the plot shows the distribution of the signal from detected (green) and filled-in (blue) peaks. Filled-in peaks can have quite high abundances, but they are still lower than those of the detected peaks.
The plot below shows the mean RLA (per sample) along the injection index, split by source.
The trend seen in the QC pools seems, to some extent, to be present also in venous blood and plasma samples. RBC and capillary samples show a higher variability. Overall it also seems that the samples from the second batch have slightly higher abundances.
The injection index dependent signal drift depends on the batch/run, as can be seen from the very low correlation between the slopes of the per-feature linear model fits that were performed separately for batch 1 and batch 2 below.
This means we will have to adjust the data with a model like `y ~ inj_idx * batch` (which allows a batch-specific signal drift) instead of `y ~ inj_idx + batch` (which assumes the signal drift to be the same in each batch). Note that this model adjusts the signal drift and the batch differences at the same time (for each individual feature).
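A minimal sketch of such a per-feature fit, assuming a numeric vector `y` with the (log2) abundances of one feature in the QC samples and corresponding vectors `inj_idx` and `batch` (all names hypothetical; batch is assumed to be a factor with levels "a" and "b"):

```r
## Per-feature model allowing a batch-specific signal drift: the interaction
## term gives each batch its own slope along the injection index.
fit <- lm(y ~ inj_idx * batch)

## For comparison, the common-slope model (same drift in both batches):
## fit_common <- lm(y ~ inj_idx + batch)

## Batch-specific slopes can be extracted from the coefficients:
cf <- coef(fit)
slope_a <- cf["inj_idx"]
slope_b <- cf["inj_idx"] + cf["inj_idx:batchb"]
```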
We fit the model `y ~ inj_idx * batch` to the abundances of each feature. For this estimation of the injection index dependent signal drift and the batch effect we use the QC samples without filled-in data (thus fitting only on real, detected peak data). The table below lists the number of features for which the model could be fitted. It could not be fitted if fewer than 4 data points were available, or if a feature had measurements in only one batch.
total_features | fitted | not_fitted |
---|---|---|
10740 | 6071 | 4669 |
Now, fitting only on the detected peaks can create problems: the signal drift estimation can become exaggerated. Some examples are shown below: filled circles are detected peak signals, open circles filled-in signals. Green are the QC samples on which the fit was performed.
All ~90 features with the steepest slopes show this same problem: the fit is performed through only a few QC samples. What I also realized is that the filled-in signal is actually rather good. So I'm now thinking of a way to include the filled-in signals in the fit, possibly down-weighting them.
Fitting on all QC values (including filled-in peaks) also has drawbacks, especially in cases where only filled-in signals are available. In the end we might use an approach that includes filled-in signals but down-weights them, and that skips fitting for features that do not have at least 2 detected and 6 filled-in peaks within each batch (these numbers will still have to be adjusted).
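One way to realize the down-weighting mentioned above is via the `weights` argument of `lm()`; a minimal sketch for a single feature, assuming vectors `y` and `inj_idx` for the QC samples and a logical `filled` marking filled-in values (names, the 0.1 weight and the inclusion criterion are illustrative only):

```r
## Detected peaks get full weight, filled-in values a reduced weight; the
## feature is skipped if not enough values of either kind are available.
w <- ifelse(filled, 0.1, 1)
if (sum(!filled) >= 2 && sum(filled) >= 6) {
    fit <- lm(y ~ inj_idx, weights = w)
}
```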
An update on filled-in peak signal: I calculated the ratio of abundances of detected against filled-in peak signal for each feature in QC samples:
For 75% of the features the detected peak signal is less than 4-fold higher than the filled-in peak signal. On average, the filled-in signal is about half of the detected peak signal. This suggests that filling in does to some extent recover signal from the ion, but mostly underestimates it.
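A sketch of how such per-feature ratios could be computed, assuming a features x QC-samples intensity matrix `ints_qc` and a logical matrix `is_filled` of the same dimensions marking filled-in values (names hypothetical):

```r
## For each feature: mean abundance of detected peaks divided by the mean
## abundance of filled-in peaks (NA if one of the two groups is empty).
ratio_det_fill <- vapply(seq_len(nrow(ints_qc)), function(i) {
    det  <- ints_qc[i, !is_filled[i, ]]
    fill <- ints_qc[i,  is_filled[i, ]]
    if (!length(det) || !length(fill))
        return(NA_real_)
    mean(det, na.rm = TRUE) / mean(fill, na.rm = TRUE)
}, numeric(1))

quantile(ratio_det_fill, na.rm = TRUE)
```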
Including the filled-in signals in the model fits is still not ideal, not even with down-weighting. They have too strong an influence if there are only a few detected peaks.
Since we are fitting per-batch models anyway, it is simpler to perform the within-batch adjustment completely separately in each batch. In each of the two batches we fit the model `y ~ inj_idx` on the QC samples (for each feature). The model is fitted on the detected peaks (not the filled-in peaks). No adjustment will be applied if:
These criteria will eventually be refined after looking through linear model fits for features that were excluded or not excluded.
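A minimal sketch of this per-batch, per-feature adjustment, assuming for one batch a vector `y` with the feature's abundances of all samples (in injection order), the corresponding `inj_idx`, and a logical `is_qc` marking QC samples with detected (not filled-in) signal (names hypothetical; keeping the mean level is just one possible way to apply the correction):

```r
## Estimate the drift on the detected QC signals of this batch ...
fit <- lm(log2(y[is_qc]) ~ inj_idx[is_qc])

## ... and, if the fit is not flagged, remove the estimated drift from all
## samples of the batch while keeping the overall signal level.
drift <- coef(fit)[1] + coef(fit)[2] * inj_idx
y_adj <- 2^(log2(y) - drift + mean(drift))
```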
The performance of the flagging criteria is evaluated visually for features with the largest slopes (i.e. those that would be adjusted the most). I am not completely happy with the criteria above.
The fits below (for batch b on the right) would be considered OK.
Common to such problematic fits is that the filled green points scatter quite considerably around the model fit, and individual measurements have large residuals. The residual-based flagging seems to be too robust against such cases; a cut-off based on the mean of the absolute residuals might work better.
The flagging based on the injection order range seems to be OK (see the example below, which was flagged because its values span too narrow an injection index range).
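As an illustration, such a range criterion could look like the snippet below, flagging the fit when the QC values used for it cover less than a given fraction of the batch's injection index range (variable names and the 0.5 cutoff are hypothetical):

```r
## Flag fits whose supporting QC values span too narrow an injection range.
idx_range_qc    <- diff(range(inj_idx[is_qc]))
idx_range_batch <- diff(range(inj_idx))
flag_range <- idx_range_qc < 0.5 * idx_range_batch
```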
What about, for each feature, comparing the within-batch correction between batches and flagging corrections that are outside the norm? In theory the majority of the within-batch variation should be somewhat consistent, as long as the instruments themselves are not the greatest source of the variation. Obviously this would only work for larger batches such as the CHRIS 5000 and maybe not for a set of two batches.
But we could also look at all runs for a specific method and a specific matrix (serum, whole blood, cells) as the data source to compare with, even if the batches in question are not related experimentally.
Another thought: in the last example that you showed (FT01431) you can also see that the feature is rather well behaved in all runs. There are outliers, but they should be rather easily detected. If you use the whole data set (not only the QCs) to determine (uncorrected) robust outliers in the QCs as an additional criterion, would that allow using more of the filled-in values?
For FT11767 something else is going on beyond just the stability of the feature itself. You have an initially high spread of all signals in batch b, and then they seem to settle down into something approximating the behaviour in batch a. For such a feature it would probably be better to look at the peaks it is detecting to see whether some of the noise comes from peak detection/integration, and if not, to try to figure out the reason for this "extra" noise.
And anything with a correction slope of more than 0.5 (or another user-defined cutoff) should be flagged no matter what, since that would indicate an unstable signal.
Sorry I keep coming up with new ideas. What if you compare the QC-based correction to a correction based on all samples? In the best case they should not differ much, and if they do, that could also be a reason to flag the feature.
> What about for each feature comparing the within batch correction between batches and flagging corrections that are outside the norm? In theory the majority of within batch variation should be somewhat consistent as long as the instruments themselves are not the greatest source of the variation. Obviously this would only work for larger batches such as the CHRIS 5000 and maybe not for a set of two batches.
Nope, I don't think that works. Almost all features show different drifts between the two batches, and different features show completely different drifts. This all makes me question what the source of the drift might be. I was wondering whether some metabolic process might still be going on (more or less) in the tubes, or something like that? This would IMHO also explain why some metabolites show increasing levels.
Your suggestions @SiggiSmara are well taken. Looking back at the actual identified peaks is however not an option, because we have about 10,000 features. In the end I want an approach that can be applied to all features, accepting that some adjustments might be wrong, but overall the quality should improve.
The take home messages for now are:
Based on the figures above (and after looking through ~200 others) I tuned the flagging criteria, specifically the one based on the model residuals. Model fits are now flagged if one of the following criteria applies:
This results in:
| | batch_a | batch_b |
|---|---|---|
total features | 10740 | 10740 |
large residuals | 825 | 119 |
low inj idx range | 929 | 607 |
valid model fits | 4422 | 5405 |
About half of all detected features can/will be adjusted for a signal drift.
To evaluate the performance of the within-batch normalization we use the abundances (from detected peaks) of replicated samples. We calculate the MRA (maximum ratio of abundances), which is simply the ratio between the maximum and the minimum abundance of a feature measured in replicated samples.
For each replicate pair the 75% quantile of all MRAs of (adjusted) features is calculated; 75% of the features of a given replicate pair thus have an MRA smaller than this value. These 75% MRAs for the raw and the adjusted data are then compared to evaluate the impact of the within-batch normalization.
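A sketch of this calculation for one replicate pair, assuming a features x samples intensity matrix `ints` and the column indices `r1` and `r2` of the two replicated samples (names hypothetical):

```r
## Maximum ratio of abundances (MRA) per feature for one replicate pair:
## ratio of the larger to the smaller of the two measured abundances.
a <- ints[, r1]
b <- ints[, r2]
mra <- pmax(a, b) / pmin(a, b)

## 75% of the features of this pair have an MRA below this value:
mra_75 <- quantile(mra, probs = 0.75, na.rm = TRUE)
```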
The plot of the raw and adjusted 75% MRAs is shown below. Each point represents one replicate pair.
The 75% MRA of replicated samples is between 1.2 and 2.5, which is actually quite good, meaning that most replicated measurements differ by less than 2-fold. The within-batch normalization had, however, no big impact on the MRA. This might be explained by the relatively moderate signal drift, present in only a few features (for most features the estimated signal drift slope was about 0, suggesting an almost absent signal drift).
The results are further summarized in the boxplot below which shows a tiny improvement of the 75% MRAs, especially considering the range and the 75% quantile.
After the last meeting, some discussions and a lot of thinking, I think we should first remove any sample-processing related variance and only then the signal drift etc. (see also issue #14). RUV might be a nice choice here, but since it is based on measurements of internal standards, and these were added to the sample mix after sample processing, I'm not quite sure whether it will be able to remove sample-processing related variance (see also issue #13).
Approaches I'm going to compare are:
- DESeq2 (Anders et al.)
- edgeR (Robinson et al.)

Update: the normalization factors for each sample are estimated on the detected signals only, as that seems to outperform the estimation that also includes the filled-in data (not shown).
To evaluate the between-sample normalization we can use the differences in abundances between the two replicated samples (which are independent of batch effects and, to some degree, of the injection order dependent signal drift). I express the difference between replicates as the 75% quantile of the MRA values of each replicate pair; 75% of the features of a replicate pair thus have an MRA (maximum ratio of abundances) smaller than this value. In addition I count, for each replicate pair, the number of features with a more than two-fold difference in abundances.
The table below summarizes the results:
| | raw | sum | median | MRM | TMM | RUV.rand | RUVIII | NOMIS |
|---|---|---|---|---|---|---|---|---|
0% | 1.232 | 1.287 | 1.239 | 1.238 | 1.271 | 1.32 | 1.51 | 1.286 |
25% | 1.413 | 1.429 | 1.387 | 1.39 | 1.44 | 1.519 | 2.952 | 1.622 |
50% | 1.609 | 1.606 | 1.583 | 1.603 | 1.624 | 1.787 | 5.612 | 2.001 |
75% | 1.826 | 1.8 | 1.784 | 1.767 | 1.842 | 2.2 | 17.74 | 2.644 |
100% | 2.579 | 2.739 | 2.5 | 2.496 | 2.7 | 4.903 | 222.9 | 7.634 |
count M > 1 | 541.1 | 544.7 | 512.9 | 512.8 | 551.6 | 792.2 | 2167 | 1006 |
The first 5 rows represent the quantiles of the 75% quantile MRAs (the 50% row is thus the median of the 75% MRA quantiles across all replicate pairs). The last row is the mean number of features with a more than two-fold difference in abundances.
What this essentially tells us:
The final strategy is now defined: 1) normalize between-sample differences (due to sample collection, processing etc.); here we use the median scaling, which performed best (MRM performed similarly well, but all methods depending on internal standards failed badly). 2) Adjust the injection order dependent signal drift. 3) Adjust differences between batches. This is done per metabolite or globally (if the variance in QC samples is too large or too few valid measurements are available).
The improvement of the data is only very marginal, but I guess that's in part because the raw data is already quite OK.
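For reference, a minimal sketch of one way such a per-sample median scaling could be implemented, assuming a features x samples matrix `ints` of detected-only intensities with missing values as `NA` (names hypothetical; not necessarily the exact code used here):

```r
## Scale each sample so that its median (detected) feature abundance matches
## the overall median across samples.
med_sample <- apply(ints, 2, median, na.rm = TRUE)
scale_fct  <- med_sample / median(med_sample)
ints_norm  <- sweep(ints, 2, scale_fct, FUN = "/")
```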
After normalization we want to average the normalized feature intensities of the replicates. Along with the averaged intensities we can also report a weight for each measurement:
- `1`: signal from 2 detected peaks present and their difference is less than 50%. The mean of the two is reported.
- `0.9`: 1 detected and 1 filled-in signal present and their difference is less than 50%. The mean of the two is reported.
- `0.75`: signal from two detected peaks, but their difference is more than 50%. The mean of the two is reported.
- `0.5`: signal from 1 detected peak present (independently of whether a filled-in signal is present for the other replicate). Only the value for the detected peak is reported.
- `0.25`: only filled-in signal is present.

@SiggiSmara, @cvolani, any comments?
Instead of requiring the difference between abundances to be below/above 50% we could also simply use an RSD below/above 30%.
Makes sense to me.
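Just to relate the two cutoffs (assuming "difference" means the difference relative to the mean of the two values): for two values the standard deviation is |a - b| / sqrt(2), so an RSD of 30% corresponds to a difference of roughly 42% of the mean.

```r
## RSD vs. relative difference for two values
a <- 78.8; b <- 121.2
sd(c(a, b)) / mean(c(a, b))     # ~0.30 (30% RSD)
abs(a - b) / mean(c(a, b))      # ~0.42 (42% of the mean)
```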
I am now favouring flagging using the RSD. It does not make much sense on only two values, but metabolomics people seem to be used to the RSD, so there is not much to explain there.
I was thinking the same (not much sense in RSD for two values) but either way is fine IMO.
The final weight definition is:
- `1`: signal from 2 detected peaks present and the RSD of their abundances is < 30%. The mean of the two is reported.
- `0.9`: 1 detected and one filled-in signal present and their RSD is < 30%. The mean of the two is reported.
- `0.75`: signal for two detected peaks but their RSD is >= 30%. The mean of the two is reported.
- `0.5`: signal from 1 detected peak present (independently of whether a filled-in signal is present for the other replicate). Report only the value for the detected peak.
- `0.4`: only filled-in signal is present, but their RSD is < 30%. The mean of the values is reported.
- `0.25`: only filled-in signal is present. Mean or single value reported.

The average number of features (across all samples) with a specific weight is shown below:
0 | 0.25 | 0.4 | 0.5 | 0.75 | 0.9 | 1 |
---|---|---|---|---|---|---|
225.4 | 2815 | 1905 | 1455 | 1037 | 646 | 2954 |
The weight 1 group is the largest single group of features, which is actually not bad.
Just as a thought after I have had my morning coffee.
Given that we know that the signal detection is not always so great at actually detecting peaks, would it not make more sense to have a higher weight on two filled-in signals with an RSD below 30%? Even to the point of it being higher than two detected signals with an RSD above 30%?
And sorry for adding potentially more work to your plate @jotsetung, but is there any way to visualize the overall behaviour of these groups (e.g. a PCA of the dataset) to see whether they behave somewhat similarly in terms of the grouping? That might help to decide the weights, but you'll have to tell us whether this is feasible or not.
Re weights: yup. agree here. So I'll go for:
- `1`: signal from 2 detected peaks present and the RSD of their abundances is < 30%. The mean of the two is reported.
- `0.9`: 1 detected and one filled-in signal present and their RSD is < 30%. The mean of the two is reported.
- `0.8`: only filled-in signal, but their RSD is < 30%. The mean of the two is reported.
- `0.7`: signal for two detected peaks but their RSD is >= 30%. The mean of the two is reported.
- `0.5`: signal from 1 detected peak present (independently of whether a filled-in signal is present for the other replicate). Report only the value for the detected peak.
- `0.25`: only filled-in signal is present. Either mean or single value reported.

Re PCA, I did simply plot the data. There is some sort of grouping by source/matrix, but nothing really spectacular. Plasma samples do however show a higher percentage of 0 weights, which means that no signal was present for some features.
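For completeness, a minimal R sketch of how these final weights could be assigned to one replicate pair (function and variable names are hypothetical; whether a value stems from a detected or a filled-in peak is assumed to be known):

```r
## Assign the final replicate weight for one feature, given the two abundances
## (NA if no signal) and whether each value stems from a detected peak.
replicate_weight <- function(x, detected) {
    has    <- !is.na(x)
    n_det  <- sum(detected & has)
    n_fill <- sum(!detected & has)
    rsd <- if (sum(has) == 2) sd(x) / mean(x) else NA_real_
    if (n_det == 2)                return(if (rsd < 0.3) 1   else 0.7)
    if (n_det == 1 && n_fill == 1) return(if (rsd < 0.3) 0.9 else 0.5)
    if (n_det == 1)                return(0.5)
    if (n_fill == 2)               return(if (rsd < 0.3) 0.8 else 0.25)
    if (n_fill == 1)               return(0.25)
    0   # no signal at all
}
```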
Closing this issue for now. We will next have to preprocess and normalize also the negative polarity data.
> Re PCA, I did simply plot the data. There is some sort of grouping by source/matrix. But nothing really spectacular. Plasma samples do however show a higher percentage of 0 weights, which means that no signal was present for some features.

Ah... no, I was thinking not by source/matrix but by these weights, i.e. 1 vs 0.9 or 0.8 vs 0.7. Theoretically one would expect them to get somewhat noisier or less informative the further down in weight you go, or at least that would be the ideal situation.
Perform feature data normalization on the positive polarity data. This should include: