EuracBiomedicalResearch / VAMS_vs_intraveinous

Comparison of Volumetric Absorbtive Microsampling to venous blood sampling
0 stars 0 forks source link

Chromatographic peak detection, POS #5

Closed jorainer closed 6 years ago

jorainer commented 6 years ago

Perform the chromatographic peak detection on the positive polarity data. This should also include some validation and quality assessment of the identified peaks.

jorainer commented 6 years ago

@SiggiSmara's centWave parameters were:

To specifically define the ppm and peakwidth settings we will evaluate the data of known compounds in the QC samples.

SiggiSmara commented 6 years ago

Some things to think about in terms of the centWave parameters, all are based on exploration of the data and impressions (no statistics behind) but also looking at known compounds:

jorainer commented 6 years ago

Update on the mzdiff parameter. This is used in the peak post-processing, i.e. after all peaks have been identified in a sample, the function iterates through all peaks and for peaks defined to be overlapping in rt and mz dimension (depending on this parameter) keeps only the one with the highest signal.

SiggiSmara commented 6 years ago

Hmm.. ok and what is the definition of overlap in rt then? Does that mean that it is better to allow mz overlap in terms of being more inclusive?

jorainer commented 6 years ago

honestly - no idea. I need to dig into the undocumented C-code for that...

SiggiSmara commented 6 years ago

holy guacamolee, that deep? github has no emoticon for the appropriate emotion expression .. maybe :astonished: or :scream_cat: or :see_no_evil: :hear_no_evil: :speak_no_evil:

jorainer commented 6 years ago

Now, step by step: ppm parameter. For each known compound in all QC samples:

The table below lists all considered compounds and the mean and 75% quantile ppm (columns "mean_ppm" and "75_quant_ppm".

  name mz rt_min rt_max mean_ppm 75_quant_ppm
kc_1 Glycine 76.04 167 173 32.22 46.35
kc_3 Alanine 90.05 161 167 39.22 61.73
kc_7 Dimethylglycine 104.1 161 167 63.36 93.35
kc_10 Serine 106 178 183 23.68 33.5
kc_13 Proline 116.1 155 167 61.65 80.33
kc_18 Betaine 118.1 155 162 17.87 19.8
kc_19 Valine 118.1 155 163 60.82 76.54
kc_21 Threonine 120.1 168 175 71.95 102.1
kc_22 Phenylethylamine 122.1 31 35 18.58 23.88
kc_23 Niacinamide 123.1 40 53 81.12 97.99
kc_24 Taurine 126 166 171 68.17 87.3
kc_26 Pipecolic acid 130.1 155 160 58.67 77.11
kc_28 Hydroxyproline 132.1 169 175 91.93 112
kc_30 Creatine 132.1 156 165 63.78 83.64
kc_31 Leucine 132.1 142 153 30.41 40.26
kc_32 Isoleucine 132.1 142 153 30.41 40.26
kc_34 Asparagine 133.1 181 186 33.26 48.71
kc_35 Ornithine 133.1 194 202 49.82 62.77
kc_36 L-Aspartic Acid 134 184 192 47.46 68.79
kc_43 Glutamine 147.1 176 182 33.1 49.39
kc_44 Lysine 147.1 191 198 20.8 28.36
kc_46 L-Glutamic Acid 148.1 170 176 44.22 60.58
kc_52 Histidine 156.1 191 198 34.33 46.44
kc_54 Succinylacetone 159.1 26 29 35.74 52.18
kc_56 alpha-Aminoadipic acid 162.1 165 169 27.09 37.18
kc_61 Phenylalanine 166.1 153 157 5.73 6.745
kc_67 1-Methylhistidine 170.1 185 197 49.96 65.7
kc_68 3-Methylhistidine 170.1 185 197 49.96 65.7
kc_71 N-Acetylornithine 175.1 148 154 62.72 94.88
kc_72 Arginine 175.1 189 197 71.84 92.97
kc_75 Citrulline 176.1 180 185 48.55 64.9
kc_79 Glucosamine 180.1 163 170 62.12 80.82
kc_81 Fructose 203.1 158 168 16.44 21.09
kc_82 Mannose 203.1 158 168 16.44 21.09
kc_85 Tyrosine 182.1 159 166 50.34 65.07
kc_86 1-Methyluric acid 183.1 145 149 57.58 91.89
kc_88 galactitol 183.1 159 165 84.79 109.5
kc_90 Phosphorylcholine 184.1 190 200 67.14 80.92
kc_95 Caffeine 195.1 30 34 55.79 71.69
kc_100 ADMA 203.2 178 186 56.69 73.23
kc_103 Tryptophan 205.1 145 154 47.84 65
kc_106 Phosphocreatine 212 183 189 31.01 44.49
kc_122 Sphingosine 300.3 28 31 7.841 10.66
kc_123 Acetylneuraminic Acid 332.1 165 170 63.45 78.58
kc_125 Sucrose 365.1 183 193 90.98 109

The ppm values are surprisingly high. The mean of them is 63. I will thus use a ppm = 70 in the initial analysis and evaluate then the identified peaks for the known compounds. In addition I will compare it with a lower ppm setting.

SiggiSmara commented 6 years ago

Ok now I am confused. Are you taking all m/z values in the window of extraction? and are you using the theoretical m/z value as a divisor?

Maybe I am misunderstanding what you are doing, but in my mind you would need to do some sort of determination on a run by run basis what m/z range to use if you want to generate statistics on the m/z values found in your search range. Otherwise you might be including all sorts of noise in the data.

jorainer commented 6 years ago

What I'm doing: 1) in each file, each compound: get all m/z values within the rt-window and theoretical m/z +/- 0.01. 2) the ppm is then the diff(range(mz)) * 1e6 / mz_comp, mz_comp being the theoretical m/z of the compound. This can (because of the +/- 0.01) include some noisy signal - but I hope that this is removed in the peak refinement step.

so yes, I'm using the theoretical m/z as divisor - hope that's OK as I always get confused with this ppm.

SiggiSmara commented 6 years ago

I just have a reall problem with 70ppm being the spread of the m/z values in a peak given that what I have seen so far is closer to 10ppm or lower.

Maybe a picture can help us.
theory

In order for you not to inflate the calculated ppm spread for each run you would have to limit the range calculation to the box that is drawn by the blue and red small dotted lines.

Any value outside of that box is only going to increase the ppm error artificially, in particular since your m/z window is rather big. And since the rt window you use is necessarily larger than any individual peak in any individual run you will add quite a lot of noise in your data.

Actually I have some R script I could run on the VAMS data to see if what I'm writing here is right... let me see if I can hack some results...

SiggiSmara commented 6 years ago

Sorry again, deleted my previous post. Found an error in my script that makes all of the results I had seen invalid. I'm done for the day, let's talk tomorrow.

jorainer commented 6 years ago

Be aware that the ppm does not define the final m/z width of the peak. This is just for the initial identification of the ROIs. Peaks are then identified in those using centWave and these are further refined. I'll dig here into the (again undocumented C-code) to understand what exactly ppm does.

I'll then compare the results from different ppm settings - but that will require some time, since I have to fix something in xcms first...

jorainer commented 6 years ago

That's how the ppm parameter is used: wcm0095

So, a ROI is extended if there is an m/z value that is closer to the ROIs mean m/z than defined by the ppm parameter. I'll update the documentation in xcms to make that clear.

SiggiSmara commented 6 years ago

Yes that is true,but the correct way of thinking about this parameter in my opinion is that it should somehow reflect the reality of the data, i.e. be based on the real ROI data. As a mass spec guy to see that the ppm width of the known compounds is 70 ppm is startling. Also since the reality is probably closer to 10-20ppm. I already didn't like the fact that I used 50 ppm in my settings, if we now go to 70 ppm that is an even bigger window and the possibility of artificial overlap increases.

SiggiSmara commented 6 years ago

I'm generating some data using the known compounds that I hope is reflecting this in a better way. So far I'm happy with cafeine (my test compound). It gives me a mean ppm width (based on standard deviation) of close to 15.

jorainer commented 6 years ago

You have to consider that the rt region I manually define needs to accommodate the full chromatographic peak across all samples - and most are not well aligned and I'm thus accepting a lot of noise in there (going manually through chromatograms of 100 compounds on 200 samples to define a more realistic rt range is simply not something I would like to do...). In the end I have the hope that the refinement step reduces the m/z width - something I'm going to check for different settings of ppm anyways.

jorainer commented 6 years ago

In the end there we'll have to agree on a tradeoff anyway - using a too low ppm will lead the ROIs to lack some data points and the chromatographic detection will have in such cases trouble identifying a well-shaped peak - in worst case splitting the peak in half or even worse just reporting a part of it.

jorainer commented 6 years ago

Update: now I'm reporting the maximal difference between m/z values from consecutive scans, which is a better proxy for the ROI definition:

  name mz rt_min rt_max mean_ppm 75_quant_ppm
kc_1 Glycine 76.04 167 173 41.78 62.71
kc_3 Alanine 90.05 161 167 46.37 76.68
kc_7 Dimethylglycine 104.1 161 167 63.37 84.81
kc_10 Serine 106 178 183 22.36 30.39
kc_13 Proline 116.1 155 167 53.81 70.53
kc_18 Betaine 118.1 155 162 13.16 8.813
kc_19 Valine 118.1 155 163 39.69 58.54
kc_21 Threonine 120.1 168 175 55.39 82.75
kc_22 Phenylethylamine 122.1 31 35 11.54 10.21
kc_23 Niacinamide 123.1 40 53 70.01 96.11
kc_24 Taurine 126 166 171 69.54 103.7
kc_26 Pipecolic acid 130.1 155 160 37.88 50.46
kc_28 Hydroxyproline 132.1 169 175 80.88 115.4
kc_30 Creatine 132.1 156 165 56.09 69.46
kc_31 Leucine 132.1 142 153 19.73 37.66
kc_32 Isoleucine 132.1 142 153 19.73 37.66
kc_34 Asparagine 133.1 181 186 19.88 31.72
kc_35 Ornithine 133.1 194 202 38.47 59.49
kc_36 L-Aspartic Acid 134 184 192 28.5 50.49
kc_43 Glutamine 147.1 176 182 21.16 30.84
kc_44 Lysine 147.1 191 198 13.32 18.74
kc_46 L-Glutamic Acid 148.1 170 176 24.42 28.73
kc_52 Histidine 156.1 191 198 21.79 34.13
kc_54 Succinylacetone 159.1 26 29 17.82 25.99
kc_56 alpha-Aminoadipic acid 162.1 165 169 17.04 20.65
kc_61 Phenylalanine 166.1 153 157 2.352 3.007
kc_67 1-Methylhistidine 170.1 185 197 32.7 46.48
kc_68 3-Methylhistidine 170.1 185 197 32.7 46.48
kc_71 N-Acetylornithine 175.1 148 154 41.16 64.1
kc_72 Arginine 175.1 189 197 42.6 70.89
kc_75 Citrulline 176.1 180 185 26.72 36.22
kc_79 Glucosamine 180.1 163 170 32.91 45.05
kc_81 Fructose 203.1 158 168 4.472 4.833
kc_82 Mannose 203.1 158 168 4.472 4.833
kc_85 Tyrosine 182.1 159 166 29.29 45.09
kc_86 1-Methyluric acid 183.1 145 149 22.51 47.2
kc_88 galactitol 183.1 159 165 51.68 77.89
kc_90 Phosphorylcholine 184.1 190 200 36.88 45.72
kc_95 Caffeine 195.1 30 34 18.3 20.21
kc_100 ADMA 203.2 178 186 27.07 36.13
kc_103 Tryptophan 205.1 145 154 20.99 25.21
kc_106 Phosphocreatine 212 183 189 9.342 12.14
kc_122 Sphingosine 300.3 28 31 2.503 3.395
kc_123 Acetylneuraminic Acid 332.1 165 170 19.95 23.31
kc_125 Sucrose 365.1 183 193 25.14 30.9

The average of the 75% quantile is now 44 - so a ppm = 50 was actually not bad :)

SiggiSmara commented 6 years ago

That is definitely better! I understand your approach, I'm just not entirely convinced on having a global retention time window in this type of data generation :smile: It will give you results relatively simply but the quality of those results is a bit questionable in my mind. Hence I'm hacking a way to do peak detection on the chromatograms, let's see if I am successful in that.

jorainer commented 6 years ago

Now, the results from the peak detection with ppm = 70 are not too bad:

  name rt_width mz_width mz_width_ppm mz_ppm multi_pk_count all venous RBC capillary plasma
kc_1 Glycine 3.753 0.0002509 3.299 9.94 0 100 100 100 100 100
kc_3 Alanine 3.652 0.0002983 3.312 5.399 0 100 100 79.55 97.73 100
kc_7 Dimethylglycine 2.869 0.001147 11.02 2.362 0 95.83 93.18 18.18 20.45 56.82
kc_10 Serine 3.256 0.0005667 5.343 1.997 0 100 100 100 100 100
kc_13 Proline 4.1 0.0003432 2.957 2.042 98 100 100 97.73 100 100
kc_18 Betaine 4.366 0.0003454 2.925 1.55 52 100 100 61.36 38.64 100
kc_19 Valine 4.434 0.0003547 3.004 1.546 54 100 100 68.18 38.64 100
kc_21 Threonine 3.87 0.0002339 1.948 1.397 0 100 100 100 100 100
kc_22 Phenylethylamine 4.22 0.0006302 5.161 1.772 0 33.33 15.91 22.73 29.55 22.73
kc_23 Niacinamide 7.145 0.001107 8.998 4.297 22 91.67 97.73 97.73 81.82 0
kc_24 Taurine 3.782 0.0004699 3.729 2.66 3 100 100 100 100 100
kc_26 Pipecolic acid 3.648 0.001155 8.881 1.881 3 87.5 34.09 47.73 45.45 34.09
kc_28 Hydroxyproline 3.472 0.001956 14.81 10.92 38 100 97.73 95.45 100 100
kc_30 Creatine 4.193 0.0005438 4.118 4.144 6 100 86.36 29.55 40.91 25
kc_31 Leucine 4.218 0.0003974 3.008 1.632 156 100 100 100 100 100
kc_32 Isoleucine 4.218 0.0003974 3.008 1.632 156 100 100 100 100 100
kc_34 Asparagine 3.49 0.0003877 2.914 1.403 0 100 100 100 100 100
kc_35 Ornithine 3.91 0.0002308 1.734 1.494 0 100 100 100 100 100
kc_36 L-Aspartic Acid 3.892 0.0005754 4.293 1.686 9 100 100 81.82 9.091 100
kc_43 Glutamine 4.034 0.0002848 1.937 1.687 0 100 100 100 100 100
kc_44 Lysine 4.233 0.0002997 2.037 1.221 0 100 100 100 100 100
kc_46 L-Glutamic Acid 4.102 0.0003387 2.288 1.326 1 100 100 100 100 100
kc_52 Histidine 4.376 0.0002603 1.668 1.278 0 100 100 100 100 100
kc_54 Succinylacetone 1.978 0.003825 24.05 3.268 0 58.33 27.27 43.18 40.91 34.09
kc_56 alpha-Aminoadipic acid 3.332 0.0008565 5.284 1.814 0 100 100 93.18 100 100
kc_61 Phenylalanine 4.397 0.0005309 3.197 13.78 0 100 100 43.18 68.18 95.45
kc_67 1-Methylhistidine 4.026 0.0006932 4.076 1.435 159 100 100 100 100 100
kc_68 3-Methylhistidine 4.026 0.0006932 4.076 1.435 159 100 100 100 100 100
kc_71 N-Acetylornithine 4.268 0.0006216 3.55 16.53 2 100 100 59.09 93.18 93.18
kc_72 Arginine 4.008 0.0003901 2.228 1.136 0 100 100 100 100 100
kc_75 Citrulline 3.651 0.0004082 2.318 1.278 0 100 100 100 100 100
kc_79 Glucosamine 3.799 0.0009793 5.438 1.526 15 91.67 56.82 27.27 54.55 84.09
kc_81 Fructose 4.26 0.0006844 3.371 2.482 181 100 100 100 100 100
kc_82 Mannose 4.26 0.0006844 3.371 2.482 181 100 100 100 100 100
kc_85 Tyrosine 3.718 0.001131 6.213 4.702 1 100 100 38.64 54.55 100
kc_86 1-Methyluric acid 2.601 0.000955 5.217 52.36 0 87.5 84.09 79.55 84.09 79.55
kc_88 galactitol 3.4 0.00198 10.82 15.73 0 100 65.91 47.73 59.09 15.91
kc_90 Phosphorylcholine 5.55 0.001941 10.54 2.101 1 100 100 97.73 100 6.818
kc_95 Caffeine 3.068 0.0006805 3.488 1.042 0 100 100 100 100 100
kc_100 ADMA 4.692 0.0005263 2.591 1.232 0 100 100 100 100 100
kc_103 Tryptophan 3.997 0.0006472 3.156 1.302 90 100 100 79.55 100 100
kc_106 Phosphocreatine 3.767 0.00126 5.944 1.716 0 100 90.91 95.45 100 4.545
kc_122 Sphingosine 3.279 0.000666 2.218 1.962 0 100 97.73 100 100 100
kc_123 Acetylneuraminic Acid 3.537 0.00144 4.335 3.347 0 100 100 63.64 97.73 100
kc_125 Sucrose 4.288 0.002523 6.91 2.576 86 91.67 77.27 95.45 100 25

Summarizing, not bad. I'll repeat with a ppm = 50 and eventually also ppm = 30.

SiggiSmara commented 6 years ago

I think was to be expected (the peak refinement) otherwise xcms would not be doing its job. This I've already seen.

It will be interesting to see if the overlaps (multiple peaks) decrease with smaller ppm.

SiggiSmara commented 6 years ago

Yes agreed, I'm having some success with the chromatography peak detection. Caffeine (well behaved peak with mostly a single chrom peak) is reliably detected with reasonable start/stop regions as well. I'm now testing how it goes with all of them.

With the same analysis as you did with hard-limit retention times and using an old version of the known compounds I was able to reliably detecting the peak where it should be. Plus I got slightly better results from the same check you did (something like 12 ppm rather than the 18 ppm), but it definitely is dependent on the m/z window one is using. With 50 ppm window I get 12 ppm, with 100 ppm window I get closer to 20 ppm and using 25 ppm window I get 10 ppm.

I'm looking more closely at what this guy did with the centwave algorithm: https://github.com/nathaniel-mahieu/centWaveP I think this is something that will definitely help you when you start thinking about the chromatogram peak detection, might also benefit us in the broad / difficult chrom peaks in the untargeted.

And yes we are having some really good results out of this data set regarding known compounds. Of course I give all credit to the fantastic spectra averaging :stuck_out_tongue_winking_eye:

SiggiSmara commented 6 years ago

btw.. I'm getting a lot of these, but they are not repeatable (running the same script again complains in different places):

Error in x$.self$finalize() : attempt to apply non-function

jorainer commented 6 years ago

Re Error: I know - me too. there's something freaky going on with the Rcpp/cpp/pwiz bindings. No idea how we could get rid of that (without major performance decrease).

SiggiSmara commented 6 years ago

It's more annoying than anything else at this point since there is nothing to do except ignore it, are we the only ones seeing this? For a new user this would be a bit concerning.

jorainer commented 6 years ago

I'm now running the analysis with ppm = 50 and perform the comparisons to other ppm settings once I defined the workflow. With ppm = 50 I get (considering all identified peaks):

  all capillary plasma RBC venous
peak_count.0% 6244 5840 4470 2906 6786
peak_count.25% 6969 7368 5630 6820 7850
peak_count.50% 8060 8005 6564 7279 8752
peak_count.75% 8241 8476 7161 8114 9628
peak_count.100% 8690 9871 7681 9440 10409
median_mz_width.0% 4.801 4.576 3.978 3.461 5.107
median_mz_width.25% 5.176 5.333 4.317 5.31 5.513
median_mz_width.50% 5.525 5.637 4.729 5.63 5.76
median_mz_width.75% 5.95 6.098 5.209 6.248 6.64
median_mz_width.100% 6.278 7.394 6.215 8.987 7.401
median_rt_width.0% 3.627 3.627 3.627 3.627 3.627
median_rt_width.25% 3.627 3.627 3.627 3.627 3.627
median_rt_width.50% 3.627 3.627 3.627 3.627 3.627
median_rt_width.75% 3.627 3.627 3.627 3.627 3.627
median_rt_width.100% 3.627 3.906 3.627 3.627 3.627

This summarizes the number of identified peaks per source type and the distribution of per-sample median m/z and rt widths.

The per-sample median m/z width of the identified peaks is on average quite low and consistent across sample types, same as the retention time width. The only apparent differences are for the numbers of detected peaks. Here plasma samples show the lowest and venous samples the highest numbers.

jorainer commented 6 years ago

Re-opening issue, because we should/could improve here. I do for example not understand how centWave defines this peak: red indicates the peak identified on the full data, blue is the peak identified with the same settings, but running the detection on the chromatographic data only (i.e. using findChromPeaks,Chromatogram,CentWaveParam method). And green is using integrate = 2 and peakwidth = c(2, 14). The 2 in the peakwidth parameter is actually important, because any value below 2 will cause the descend peak function that defines the boundaries to not allow any outliers. UPDATE actually, the last comment was wrong. Even with peak sizes smaller 2 outliers are allowed.

rplot

jorainer commented 6 years ago

peakwidth = c(1.5, 14) and integrate = 2 tends to create too broad chrom peaks. The plots below shows the same compound with integrate = 1 and integrate = 2:

kc_122_chrom_plot kc_122_chrom_plot

With integrate = 1 (uses the mexican hat filtered data) peaks are mostly centered around the apex, while integrate = 2 (that walks down the actual signal) includes in my opinion too much data.

From the total numbers of peaks and correspondence results there is almost no difference between integrate = 1 and integrate = 2.

Nevertheless, the integrate = 2 seems to be an interesting alternative, since it captures better the real peak shape.

jorainer commented 6 years ago

Increasing the lower peakwidth to 2 (peakwidth = c(2, 14)) reduces the number of features with multiple peaks. Chromatographic peaks for all known compounds look still OK.

jorainer commented 6 years ago

Update: we will use peakwidth = c(2, 14) because of the fewer multi-peak features and a slightly lower maximal difference (median across all features) between feature's abundances in QC samples. integrate = 2 could even reduce the number of multi-peak features.