We base the analysis mostly on single-peak features, but also include features with multiple peaks per sample if those are present in less than 10% of the samples in which a peak was identified. In addition, we require features to be found in more than 30% of the samples of at least one sample group. This breaks down to:
| | count |
|---|---|
total_features | 10740 |
RBC > 50% | 4180 |
plasma > 50% | 4531 |
capillary > 50% | 4606 |
venous > 50% | 5329 |
The plot below shows the per-sample signal distribution of feature abundances in log2 scale.
Plasma samples (blue) have fewer valid values for features compared to samples from other sources. Some of the RBC (red) samples also have a considerably lower number of detected features. The distribution of feature intensities seems to be comparable between samples, sources and batches. The high similarity of average abundances between samples visible above might however be misleading, since only the signal from identified peaks is considered. Comparing signal intensities after filling in missing peak data might be better, because it also reflects the fact that some features are simply not present in the data.
After filling in missing peak values the proportion of detected features is highly comparable between samples, but more differences between abundances become visible. Plasma samples (blue) show, however, considerably lower signal intensities than all other samples, which reflects that fewer peaks were detected for these features in these samples (and their intensities were thus filled in).
Below we show relative log abundance (RLA) plots, which represent the difference of the (log) abundance of each feature in a sample to the median abundance of that feature in samples from the same sample group (source).
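A minimal sketch of how such RLA values can be computed, assuming a features x samples matrix `abund` of log2 abundances and a factor `source` with the sample group of each sample (both names hypothetical):

```r
## Relative log abundance: subtract, for each feature, the median (log2)
## abundance of that feature within the sample's group (source).
rla <- function(abund, group) {
    ## abund: features x samples matrix of log2 abundances (may contain NA)
    ## group: factor of length ncol(abund) with the sample group of each sample
    for (g in levels(factor(group))) {
        idx <- which(group == g)
        med <- apply(abund[, idx, drop = FALSE], 1, median, na.rm = TRUE)
        abund[, idx] <- abund[, idx] - med
    }
    abund
}

## Per-sample RLA boxplots, in injection order:
## boxplot(rla(log2(ints), source), outline = FALSE)
```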
The RLA plot above shows some potentially systematic drifts in signal intensities, including consistently lower average RLA values in the last samples of the first batch and consistently higher average RLA values in the last samples of the second batch. Next we plot the RLA plots after filling in missing peak data.
Just as an aside: the plot shows the distribution of the signal from detected (green) and filled-in (blue) peaks. Filled-in peaks can have quite high abundances, but they are still lower than those of the detected peaks.
The plot below shows the mean RLA (per sample) along the injection index, split by source.
The trend seen in the QC pools seems, to some extent, to be present also in venous blood and plasma samples. RBC and capillary samples show a higher variability. Overall it also seems that the samples from the second batch have slightly higher abundances.
The injection index dependent signal drift depends on the batch/run, as can be seen from the very low correlation between the slopes of the per-feature linear model fits that were performed separately for batch 1 and batch 2 below.
This means we will have to adjust the data with a model like `y ~ inj_idx * batch` (which allows a batch-specific signal drift) instead of `y ~ inj_idx + batch` (which assumes the signal drift to be the same in each batch). Note that this model adjusts the signal drift and the batch differences at the same time (for each individual feature).
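A minimal sketch of such a per-feature fit, assuming a numeric vector `y` with the (log2) abundances of one feature in the QC samples and corresponding vectors `inj_idx` and `batch` (all names hypothetical; batch is assumed to be a factor with levels "a" and "b"):

```r
## Per-feature model allowing a batch-specific signal drift: the interaction
## term gives each batch its own slope along the injection index.
fit <- lm(y ~ inj_idx * batch)

## For comparison, the common-slope model (same drift in both batches):
## fit_common <- lm(y ~ inj_idx + batch)

## Batch-specific slopes can be extracted from the coefficients:
cf <- coef(fit)
slope_a <- cf["inj_idx"]
slope_b <- cf["inj_idx"] + cf["inj_idx:batchb"]
```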
We fit the model `y ~ inj_idx * batch` to the abundances of each feature. For this estimation of the injection index dependent signal drift and the batch effect we use the QC samples without filled-in data (thus fitting only on real, detected peak data). The table below lists the number of features for which the model could be fitted. It could not be fitted if fewer than 4 data points were available, or if a feature had measurements in only one batch.
total_features | fitted | not_fitted |
---|---|---|
10740 | 6071 | 4669 |
Now, fitting only on the detected peaks can create problems: the signal drift estimation can become exaggerated. Some examples are shown below: filled circles are detected peak signals, open circles filled-in signals. Green are the QC samples on which the fit was performed.
All ~90 features with the steepest slopes show this same problem: the fit is performed through only a few QC samples. What I also realized is that the filled-in signal is actually rather good. So I'm now thinking of a way to include the filled-in signals in the fit, possibly down-weighting them.
Fitting on all QC values (including filled-in peaks) also has drawbacks, especially in cases where only filled-in signals are available. In the end we might use an approach that includes filled-in signals but down-weights them, and that skips fitting for features that do not have at least 2 detected and 6 filled-in peaks within each batch (these numbers will still have to be adjusted).
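One way to realize the down-weighting mentioned above is via the `weights` argument of `lm()`; a minimal sketch for a single feature, assuming vectors `y` and `inj_idx` for the QC samples and a logical `filled` marking filled-in values (names, the 0.1 weight and the inclusion criterion are illustrative only):

```r
## Detected peaks get full weight, filled-in values a reduced weight; the
## feature is skipped if not enough values of either kind are available.
w <- ifelse(filled, 0.1, 1)
if (sum(!filled) >= 2 && sum(filled) >= 6) {
    fit <- lm(y ~ inj_idx, weights = w)
}
```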
An update on filled-in peak signal: I calculated the ratio of abundances of detected against filled-in peak signal for each feature in QC samples:
For 75% of the features the detected peak signal is less than 4-fold higher than the filled-in peak signal. On average, the filled-in signal is about half of the detected peak signal. This suggests that filling in does to some extent recover signal from the ion, but mostly underestimates it.
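A sketch of how such per-feature ratios could be computed, assuming a features x QC-samples intensity matrix `ints_qc` and a logical matrix `is_filled` of the same dimensions marking filled-in values (names hypothetical):

```r
## For each feature: mean abundance of detected peaks divided by the mean
## abundance of filled-in peaks (NA if one of the two groups is empty).
ratio_det_fill <- vapply(seq_len(nrow(ints_qc)), function(i) {
    det  <- ints_qc[i, !is_filled[i, ]]
    fill <- ints_qc[i,  is_filled[i, ]]
    if (!length(det) || !length(fill))
        return(NA_real_)
    mean(det, na.rm = TRUE) / mean(fill, na.rm = TRUE)
}, numeric(1))

quantile(ratio_det_fill, na.rm = TRUE)
```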
Including the filled-in signals in the model fits is still not ideal, not even with down-weighting. They have too strong an influence if there are only a few detected peaks.
Since we are fitting per-batch models anyway, it is simpler to perform the within-batch adjustment completely separately in each batch. In each of the two batches we fit the model `y ~ inj_idx` on the QC samples (for each feature). The model is fitted on the detected peaks (not the filled-in peaks). No adjustment will be applied if:
These criteria will eventually be refined after looking through linear model fits for features that were excluded or not excluded.
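A minimal sketch of this per-batch, per-feature adjustment, assuming for one batch a vector `y` with the feature's abundances of all samples (in injection order), the corresponding `inj_idx`, and a logical `is_qc` marking QC samples with detected (not filled-in) signal (names hypothetical; keeping the mean level is just one possible way to apply the correction):

```r
## Estimate the drift on the detected QC signals of this batch ...
fit <- lm(log2(y[is_qc]) ~ inj_idx[is_qc])

## ... and, if the fit is not flagged, remove the estimated drift from all
## samples of the batch while keeping the overall signal level.
drift <- coef(fit)[1] + coef(fit)[2] * inj_idx
y_adj <- 2^(log2(y) - drift + mean(drift))
```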
The performance of the flagging criteria is evaluated visually for features with the largest slopes (i.e. those that would be adjusted the most). I am not completely happy with the criteria above.
The fits below (for batch b on the right) would be considered OK.
Common to such problematic fits is that the filled green points scatter quite considerably around the model fit, and individual measurements have large residuals. The residual-based flagging seems to be too robust against such cases; a cut-off based on the mean of the absolute residuals might work better.
The flagging based on the injection order range seems to be OK (see the example below, which was flagged because its values span too narrow an injection index range).
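As an illustration, such a range criterion could look like the snippet below, flagging the fit when the QC values used for it cover less than a given fraction of the batch's injection index range (variable names and the 0.5 cutoff are hypothetical):

```r
## Flag fits whose supporting QC values span too narrow an injection range.
idx_range_qc    <- diff(range(inj_idx[is_qc]))
idx_range_batch <- diff(range(inj_idx))
flag_range <- idx_range_qc < 0.5 * idx_range_batch
```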
What about, for each feature, comparing the within-batch correction between batches and flagging corrections that are outside the norm? In theory the majority of the within-batch variation should be somewhat consistent, as long as the instruments themselves are not the greatest source of the variation. Obviously this would only work for larger batches such as the CHRIS 5000 and maybe not for a set of two batches.
But we could also look at all runs for a specific method and a specific matrix (serum, whole blood, cells) as the data source to compare with, even if the batches in question are not related experimentally.
Another thought: in the last example that you showed (FT01431) you can also see that the feature is rather well behaved in all runs. There are outliers, but they should be rather easily detected. If you use the whole data set (not only the QCs) to determine (uncorrected) robust outliers in the QCs as an additional criterion, would that allow using more of the filled-in values?
For FT11767 something else is going on beyond just the stability of the feature itself. You have an initially high spread of all signals in batch b, and then they seem to settle down into something approximating the behaviour in batch a. For such a feature it would probably be better to look at the peaks it is detecting to see whether some of the noise comes from peak detection/integration, and if not, to try to figure out the reason for this "extra" noise.
And anything with a correction slope of more than 0.5 (or another user-defined cutoff) should be flagged no matter what, since that would indicate an unstable signal.
Sorry I keep coming up with new ideas. What if you compare the QC-based correction to a correction based on all samples? In the best case they should not differ much, and if they do, that could also be a reason to flag the feature.
> What about for each feature comparing the within batch correction between batches and flagging corrections that are outside the norm? In theory the majority of within batch variation should be somewhat consistent as long as the instruments themselves are not the greatest source of the variation. Obviously this would only work for larger batches such as the CHRIS 5000 and maybe not for a set of two batches.
Nope, I don't think that works. Almost all features show different drifts between the two batches, and different features show completely different drifts. This all makes me question what the source of the drift might be. I was wondering whether some metabolic process might still be going on (more or less) in the tubes, or something like that? This would IMHO also explain why some metabolites show increasing levels.
Your suggestions @SiggiSmara are well taken. Looking back at the actual identified peaks is however not an option, because we have about 10,000 features. In the end I want an approach that can be applied to all features, accepting that some adjustments might be wrong, but overall the quality should improve.
The take home messages for now are:
Based on the figures above (and after looking through ~200 others) I tuned the flagging criteria, specifically the one based on the model residuals. Model fits are now flagged if one of the following criteria applies:
This results in:
| | batch_a | batch_b |
|---|---|---|
total features | 10740 | 10740 |
large residuals | 825 | 119 |
low inj idx range | 929 | 607 |
valid model fits | 4422 | 5405 |
About half of all detected features can/will be adjusted for a signal drift.
To evaluate the performance of the within-batch normalization we use the abundances (from detected peaks) of replicated samples. We calculate the MRA (maximum ratio of abundances), which is simply the ratio between the maximum and the minimum abundance of a feature measured in replicated samples.
For each replicate pair the 75% quantile of all MRAs of (adjusted) features is calculated; 75% of the features of a given replicate pair thus have an MRA smaller than this value. These 75% MRAs for the raw and the adjusted data are then compared to evaluate the impact of the within-batch normalization.
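A sketch of this calculation for one replicate pair, assuming a features x samples intensity matrix `ints` and the column indices `r1` and `r2` of the two replicated samples (names hypothetical):

```r
## Maximum ratio of abundances (MRA) per feature for one replicate pair:
## ratio of the larger to the smaller of the two measured abundances.
a <- ints[, r1]
b <- ints[, r2]
mra <- pmax(a, b) / pmin(a, b)

## 75% of the features of this pair have an MRA below this value:
mra_75 <- quantile(mra, probs = 0.75, na.rm = TRUE)
```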
The plot of the raw and adjusted 75% MRAs is shown below. Each point represents one replicate pair.
The 75% MRA of replicated samples is between 1.2 and 2.5, which is actually quite good, meaning that most replicated measurements differ by less than 2-fold. The within-batch normalization had, however, no big impact on the MRA. This might be explained by the relatively moderate signal drift, present in only a few features (for most features the estimated signal drift slope was about 0, suggesting an almost absent signal drift).
The results are further summarized in the boxplot below which shows a tiny improvement of the 75% MRAs, especially considering the range and the 75% quantile.
After the last meeting, some discussions and a lot of thinking, I think we should first remove any sample-processing related variance and only then the signal drift etc. (see also issue #14). RUV might be a nice choice here, but since it is based on measurements of internal standards, and these were added to the sample mix after sample processing, I'm not quite sure whether it will be able to remove sample-processing related variance (see also issue #13).
Approaches I'm going to compare are:
- DESeq2 (Anders et al.)
- edgeR (Robinson et al.)

Update: the normalization factors for each sample are estimated on the detected signals only, as that seems to outperform the estimation that also includes the filled-in data (not shown).
To evaluate the between-sample normalization we can use the differences in abundances between the two replicated samples (which are independent of batch effects and, to some degree, of the injection order dependent signal drift). I express the difference between replicates as the 75% quantile of the MRA values of each replicate pair; 75% of the features of a replicate pair thus have an MRA (maximum ratio of abundances) smaller than this value. In addition I count, for each replicate pair, the number of features with a more than two-fold difference in abundances.
The table below summarizes the results:
| | raw | sum | median | MRM | TMM | RUV.rand | RUVIII | NOMIS |
|---|---|---|---|---|---|---|---|---|
0% | 1.232 | 1.287 | 1.239 | 1.238 | 1.271 | 1.32 | 1.51 | 1.286 |
25% | 1.413 | 1.429 | 1.387 | 1.39 | 1.44 | 1.519 | 2.952 | 1.622 |
50% | 1.609 | 1.606 | 1.583 | 1.603 | 1.624 | 1.787 | 5.612 | 2.001 |
75% | 1.826 | 1.8 | 1.784 | 1.767 | 1.842 | 2.2 | 17.74 | 2.644 |
100% | 2.579 | 2.739 | 2.5 | 2.496 | 2.7 | 4.903 | 222.9 | 7.634 |
count M > 1 | 541.1 | 544.7 | 512.9 | 512.8 | 551.6 | 792.2 | 2167 | 1006 |
The first 5 rows represent the quantiles of the 75% quantile MRAs (the 50% row is thus the median of the 75% MRA quantiles across all replicate pairs). The last row is the mean number of features with a more than two-fold difference in abundances.
What this essentially tells us:
The final strategy is now defined: 1) normalize between-sample differences (due to sample collection, processing etc.); here we use the median scaling, which performed best (MRM performed similarly well, but all methods depending on internal standards failed badly). 2) Adjust the injection order dependent signal drift. 3) Adjust differences between batches. This is done per metabolite or globally (if the variance in QC samples is too large or too few valid measurements are available).
The improvement of the data is only very marginal, but I guess that's in part because the raw data is already quite OK.
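For reference, a minimal sketch of one way such a per-sample median scaling could be implemented, assuming a features x samples matrix `ints` of detected-only intensities with missing values as `NA` (names hypothetical; not necessarily the exact code used here):

```r
## Scale each sample so that its median (detected) feature abundance matches
## the overall median across samples.
med_sample <- apply(ints, 2, median, na.rm = TRUE)
scale_fct  <- med_sample / median(med_sample)
ints_norm  <- sweep(ints, 2, scale_fct, FUN = "/")
```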
After normalization we want to average the normalized feature intensities of the replicates. Along with the averaged intensities we can also report a weight for each measurement:
- `1`: signal from 2 detected peaks present and their difference is less than 50%. The mean of the two is reported.
- `0.9`: 1 detected and 1 filled-in signal present and their difference is less than 50%. The mean of the two is reported.
- `0.75`: signal from two detected peaks, but their difference is more than 50%. The mean of the two is reported.
- `0.5`: signal from 1 detected peak present (independently of whether a filled-in signal is present for the other replicate). Only the value for the detected peak is reported.
- `0.25`: only filled-in signal is present.

@SiggiSmara, @cvolani, any comments?
Instead of requiring the difference between abundances to be below/above 50% we could also simply use an RSD below/above 30%.
Makes sense to me.
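Just to relate the two cutoffs (assuming "difference" means the difference relative to the mean of the two values): for two values the standard deviation is |a - b| / sqrt(2), so an RSD of 30% corresponds to a difference of roughly 42% of the mean.

```r
## RSD vs. relative difference for two values
a <- 78.8; b <- 121.2
sd(c(a, b)) / mean(c(a, b))     # ~0.30 (30% RSD)
abs(a - b) / mean(c(a, b))      # ~0.42 (42% of the mean)
```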
I am now favouring flagging using the RSD. It does not make much sense on only two values, but metabolomics people seem to be used to the RSD, so there is not much to explain there.
I was thinking the same (not much sense in RSD for two values) but either way is fine IMO.
The final weight definition is:
- `1`: signal from 2 detected peaks present and the RSD of their abundances is < 30%. The mean of the two is reported.
- `0.9`: 1 detected and one filled-in signal present and their RSD is < 30%. The mean of the two is reported.
- `0.75`: signal for two detected peaks but their RSD is >= 30%. The mean of the two is reported.
- `0.5`: signal from 1 detected peak present (independently of whether a filled-in signal is present for the other replicate). Report only the value for the detected peak.
- `0.4`: only filled-in signal is present, but their RSD is < 30%. The mean of the values is reported.
- `0.25`: only filled-in signal is present. Mean or single value reported.

The average number of features (across all samples) with a specific weight is shown below:
0 | 0.25 | 0.4 | 0.5 | 0.75 | 0.9 | 1 |
---|---|---|---|---|---|---|
225.4 | 2815 | 1905 | 1455 | 1037 | 646 | 2954 |
The weight 1 group is the largest single group of features, which is actually not bad.
Just as a thought after I have had my morning coffee.
Given that we know that the signal detection is not always so great at actually detecting peaks, would it not make more sense to have a higher weight on two filled-in signals with an RSD below 30%? Even to the point of it being higher than two detected signals with an RSD above 30%?
And sorry for adding potentially more work to your plate @jotsetung, but is there any way to visualize the overall behaviour of these groups (e.g. a PCA of the dataset) to see whether they behave somewhat similarly in terms of the grouping? That might help to decide the weights, but you'll have to tell us whether this is feasible or not.
Re weights: yup. agree here. So I'll go for:
- `1`: signal from 2 detected peaks present and the RSD of their abundances is < 30%. The mean of the two is reported.
- `0.9`: 1 detected and one filled-in signal present and their RSD is < 30%. The mean of the two is reported.
- `0.8`: only filled-in signal, but their RSD is < 30%. The mean of the two is reported.
- `0.7`: signal for two detected peaks but their RSD is >= 30%. The mean of the two is reported.
- `0.5`: signal from 1 detected peak present (independently of whether a filled-in signal is present for the other replicate). Report only the value for the detected peak.
- `0.25`: only filled-in signal is present. Either mean or single value reported.

Re PCA, I did simply plot the data. There is some sort of grouping by source/matrix, but nothing really spectacular. Plasma samples do however show a higher percentage of 0 weights, which means that no signal was present for some features.
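For completeness, a minimal R sketch of how these final weights could be assigned to one replicate pair (function and variable names are hypothetical; whether a value stems from a detected or a filled-in peak is assumed to be known):

```r
## Assign the final replicate weight for one feature, given the two abundances
## (NA if no signal) and whether each value stems from a detected peak.
replicate_weight <- function(x, detected) {
    has    <- !is.na(x)
    n_det  <- sum(detected & has)
    n_fill <- sum(!detected & has)
    rsd <- if (sum(has) == 2) sd(x) / mean(x) else NA_real_
    if (n_det == 2)                return(if (rsd < 0.3) 1   else 0.7)
    if (n_det == 1 && n_fill == 1) return(if (rsd < 0.3) 0.9 else 0.5)
    if (n_det == 1)                return(0.5)
    if (n_fill == 2)               return(if (rsd < 0.3) 0.8 else 0.25)
    if (n_fill == 1)               return(0.25)
    0   # no signal at all
}
```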
Closing this issue for now. We will next have to preprocess and normalize also the negative polarity data.
> Re PCA, I did simply plot the data. There is some sort of grouping by source/matrix. But nothing really spectacular. Plasma samples do however show a higher percentage of 0 weights, which means that no signal was present for some features.

Ah... no, I was thinking not by source/matrix but by these weights, i.e. 1 vs 0.9 or 0.8 vs 0.7. Theoretically one would expect them to get somewhat noisier or less informative the further down in weight you go, or at least that would be the ideal situation.
Perform feature data normalization on the positive polarity data. This should include: