Nesvilab / FragPipe

A cross-platform proteomics data analysis suite
http://fragpipe.nesvilab.org

Improved Match between runs MS1 feature detection #516

Closed Peer2011 closed 2 years ago

Peer2011 commented 3 years ago

Dear Alexey, Fengchao, and the rest of the FragPipe team, I have to say that you have done a really great job on the new FragPipe version 17.0, which improves IDs in our datasets even further. Our data are all recorded in DDA mode, and even with IonQuant, which is much better than MaxQuant's LFQ, I am still puzzled about how to get rid of, or at least reduce, the missing data problem. I just wanted to hear your opinion about the IceR workflow from the Krijgsveld lab. We tried to implement it, but so far it has been very difficult to get it running. Have you thought about implementing an approach for MS1 feature extraction such as IceR?

Best, Peer

anesvi commented 3 years ago

I personally need to read the IceR paper. At a quick glance a while back, all the comparisons they did were with a very early version of IonQuant, before MBR etc. We should probably review that paper in more detail if you think it is doing something really well. What we do know is that when people compare with MaxQuant and get fewer missing values, it is because we apply a 1% FDR for MBR by default, and there is no such filter in MaxQuant. So one may think that there are fewer missing values with another tool, but are those quant values really reliable? Alexey

danielgeiszler commented 3 years ago

I'll get back to this once I have time to read the paper more thoroughly because their method is quite extensive, but it seems like IceR also does a smart noise imputation step by default for missing values (step 12 in their methods). Looking at Figure 2B from the manuscript, they only have 0.2% missing peptide intensities, which I don't think is possible if those are all real IDs and not just noise. We'll have to do some internal comparisons to see whether the discrepancy in missing values is due to their noise imputation and see where to go from there.
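For context on what a noise/low-value imputation step can look like in practice, here is a minimal sketch of the common down-shifted-normal imputation (Perseus-style). This is a generic illustration, not IceR's actual procedure, and the `shift`/`width` defaults are conventions rather than values from the IceR paper.

```python
import numpy as np

def impute_downshifted_normal(log_intensities, shift=1.8, width=0.3, seed=0):
    """Fill missing log-intensities with draws from a normal distribution
    down-shifted relative to the observed one (Perseus-style defaults:
    mean shifted down by 1.8 SD, width 0.3 SD)."""
    rng = np.random.default_rng(seed)
    x = np.array(log_intensities, dtype=float)  # copy, keep input intact
    observed = x[~np.isnan(x)]
    mu, sd = observed.mean(), observed.std()
    missing = np.isnan(x)
    x[missing] = rng.normal(mu - shift * sd, width * sd, missing.sum())
    return x
```

Any missing value filled this way looks like a low-abundance measurement, which is why a near-zero missing rate on its own does not tell you whether the underlying quant values are real.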

Peer2011 commented 3 years ago

Dear Alexey and Daniel,

We additionally had the feeling that using raw files instead of mzML files as FragPipe input gives fewer missing values when we raised the ion FDR incrementally from 0.01 to 0.05. Raising the ion FDR from 0.01 to 0.05 for MBR had only a minor effect for mzML files, whereas raising the FDR with raw files as input reduced missing values significantly. Could it be that there are simply more MS1 noise features in the raw files, which are filtered out of the mzML, and which would be taken into account by the FDR algorithm? I fully agree with you that it is very important to quantify true MS1 features.

Could it help to constrain the search space of MS1 features by taking into consideration that at a given time point only certain MS1 features can be present, or are you already doing this in your approach? Do you think that monoisotopic mass assignment might also play a role? Do you think that the ion intensity might play a role in the monoisotopic mass assignment of an ion with the same m/z?

Sorry, most likely you think these are quite naive questions. I know how to apply MS technology and have a lot of experience in the wet lab, but besides analyzing search engine output tables with R, only little with the real bioinformatic analysis of MS raw data. I am just searching for a way to squeeze as much as possible out of our datasets. I used to work in a mass spectrometry lab with a neurodegenerative disease research focus, but now I am a surgical pathologist who would really like to see mass spec in the clinics. Looking forward to hearing back.

Best, Peer

fcyu commented 3 years ago

Could it be that there are simply more MS1 noise features in the raw files, which are filtered out of the mzML, and which would be taken into account by the FDR algorithm?

I am not sure. The spectra from the raw file and the mzML file are slightly different due to differences in the Thermo library. But in my experience, the MS2 spectra from the raw file have higher quality. One thing to investigate is whether the MS1 spectra from the raw file are centroided. The raw file may or may not contain centroided spectra. If I remember correctly, we let the Thermo library centroid the spectra if there are no centroided spectra.
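For anyone who wants to check this on their own converted files, here is a small sketch using pyteomics (my assumption; any mzML reader works) that counts whether the MS1 scans carry the centroid or profile flag written by the converter. The file name is hypothetical.

```python
from pyteomics import mzml  # pip install pyteomics

def count_ms1_modes(path):
    """Count MS1 spectra flagged as centroided vs. profile in an mzML file.
    Relies on the converter having written the standard
    'centroid spectrum' / 'profile spectrum' cvParams."""
    counts = {"centroid": 0, "profile": 0, "unlabeled": 0}
    with mzml.read(path) as reader:
        for spectrum in reader:
            if spectrum.get("ms level") != 1:
                continue
            if "centroid spectrum" in spectrum:
                counts["centroid"] += 1
            elif "profile spectrum" in spectrum:
                counts["profile"] += 1
            else:
                counts["unlabeled"] += 1
    return counts

print(count_ms1_modes("run01.mzML"))  # "run01.mzML" is a hypothetical file
```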

Could it help to constrain the search space of MS1 features by taking into consideration that at a given time point only certain MS1 features can be present, or are you already doing this in your approach? Do you think that monoisotopic mass assignment might also play a role? Do you think that the ion intensity might play a role in the monoisotopic mass assignment of an ion with the same m/z?

IonQuant has a mechanism to detect MS1 features within a small region. We also consider the monoisotopic mass and mass calibration. All of these details can be found in https://www.sciencedirect.com/science/article/pii/S1535947621000505
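As a toy illustration of why monoisotopic assignment matters, here is a simplified isotope-envelope check of the kind a feature detector can apply (my own sketch, not IonQuant's code): a candidate peak is treated as monoisotopic only if +1 and +2 isotope peaks are found at the expected 13C spacing for its charge state.

```python
import numpy as np

C13_C12 = 1.0033548  # mass difference between 13C and 12C

def looks_monoisotopic(peak_mzs, candidate_mz, charge, ppm_tol=10.0):
    """Return True if +1 and +2 isotope peaks are present at the expected
    spacing for the given charge state -- a simplified envelope check."""
    peak_mzs = np.asarray(peak_mzs)
    for k in (1, 2):
        expected = candidate_mz + k * C13_C12 / charge
        tol = expected * ppm_tol * 1e-6
        if not np.any(np.abs(peak_mzs - expected) <= tol):
            return False
    return True

# Example: a 2+ peptide at m/z 500.0 with its +1 and +2 isotope peaks present
print(looks_monoisotopic([500.0, 500.5017, 501.0034], 500.0, charge=2))
```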

Best,

Fengchao

Peer2011 commented 3 years ago

Dear Fengchao,

I tried to understand what you are doing in the MBR algorithm for label-free quantification as well as possible. From what I understand, you also use the log10 intensity as part of your score S in equation (2) in your paper. If the intensity becomes small, this means that the score also becomes worse, even if the peptide has the correct isotopic distribution and mass deviation in ppm. Isn't this counterintuitive in DDA data? How did you come up with this score? How do you know which of the features in your Table 1 entering the score should contribute more or less to the score?

Do you perform a retention time alignment prior to MS1 feature extraction? I could not find this information.

Regarding the MBR window: does it affect the MBR FDR calculation for ions (meaning the bigger the window is, the more stringent the FDR will be afterwards)? If I understood correctly, the posterior error probability score of each transferred ion has to be better the more ions are transferred for a given MS file. Am I right with this assumption?

Excuse the many questions, but I would really like to understand the algorithm, even if it yields very good results.

Best, Peer

fcyu commented 3 years ago

Hi Peer,

Let me answer your questions point-by-point.

Isn't this counterintuitive in DDA data? How did you come up with this score?

I don't think it is. Please note that this is MBR using the MS1 signal. The higher the intensity, the higher the signal-to-noise ratio, and thus the higher the probability of being a true signal. Please also keep in mind that the log10(intensity) score is not the only score, and (if you check the log) it is not the highest-weighted score. Thus, low-intensity peptides still have a chance to score well if the other scores are good.

How do you know which of the features in your table 1 entering the score should contribute more or less to the score?

IonQuant doesn't know before analyzing the data. The weights are determined by the LDA: "Following the strategy we previously used for DIA data (20), we train a linear discriminant analysis (LDA) model using scores from type 2 and −2 ions. From the trained LDA, we calculate a final score for each type 1 and −1 ion"
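As a rough, self-contained illustration of that quoted procedure (with random stand-in data rather than real ion scores), scikit-learn's LDA can be trained on the two labeled ion types and then produce one discriminant score, a learned weighted sum of the individual scores, for each transferred ion:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Toy stand-ins for the per-ion scores (columns could be log10 intensity,
# RT difference, isotope-distribution similarity, ...). Type 2 ions act as
# the positive class and type -2 ions as the negative class, per the quote.
X_type2 = rng.normal(1.0, 1.0, size=(500, 3))
X_typeneg2 = rng.normal(-1.0, 1.0, size=(500, 3))
X_train = np.vstack([X_type2, X_typeneg2])
y_train = np.r_[np.ones(500), np.zeros(500)]

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

# The final score of a transferred (type 1 / -1) ion is the value of the
# learned discriminant: a weighted sum of its individual scores.
X_transferred = rng.normal(0.5, 1.0, size=(5, 3))
print(lda.coef_)                             # learned feature weights
print(lda.decision_function(X_transferred))  # final scores
```

This is why no feature weight has to be chosen by hand: the training data itself determines how much each score contributes.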

Do you perform a retention time alignment prior to MS1 feature extraction?

I think the manuscript has the description. We perform the alignment locally: "For each ion in every selected donor run, we locate the target region within the acceptor run using an approach similar to FlashLFQ (29). First, pairs of retention times from the corresponding ions are collected and sorted according to the value from the donor run. Using d_i and a_i to denote the retention times of the i-th pair of ions from the donor and acceptor runs, respectively, we have pairs from (d_1, a_1) to (d_N, a_N) sorted by d_i, where N is the number of overlapped ions. Given a donor ion with retention time t, we find its position in the sorted pairs satisfying d_i ≤ t < d_{i+1}. Then, we collect all pairs satisfying d_i − τ ≤ d_j ≤ d_i + τ, where τ is a predefined tolerance ("MBR RT window" parameter, 1 min by default). With those pairs, we generate a list whose elements are a_j − d_j and calculate the median (m) and median absolute deviation (σ) of that list. The possible target range in the retention time dimension is then:

[d_i + m − 2σ, d_i + m + 2σ] (1)

If ion mobility data are used, we take the same approach to locate the target range in the ion mobility dimension (controlled by the "MBR IM window" parameter, 0.05 by default). The transferred ion's m/z equals the donor ion's m/z adjusted by the mass calibration error (mass calibration is performed by MSFragger (30)). After locating the target region in m/z, retention time, and ion mobility if applicable, we trace all peaks within the region using our recently described algorithm (22). Two isotope peaks (+1 and +2) are also traced to check the charge state and the isotope distribution. Peak boundaries are allowed to extend beyond the target region's retention time and ion mobility bounds. Peak tracing is performed rapidly using the index, after which the donor ion's peptide information is assigned to the traced monoisotopic peak."
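The quoted window calculation is compact enough to restate in code. A minimal sketch, paraphrasing the quoted text rather than IonQuant's actual source:

```python
import bisect
import numpy as np

def target_rt_range(pairs, t, tau=1.0):
    """Acceptor-run RT range to search for a donor ion eluting at time t.

    pairs: list of (donor_rt, acceptor_rt) tuples sorted by donor RT.
    tau:   the "MBR RT window" tolerance in minutes (1 min by default).
    """
    donor_rts = [d for d, _ in pairs]
    # Find i such that d_i <= t < d_{i+1} (clamped at the ends).
    i = max(bisect.bisect_right(donor_rts, t) - 1, 0)
    d_i = donor_rts[i]
    # Collect all pairs with d_i - tau <= d_j <= d_i + tau.
    lo = bisect.bisect_left(donor_rts, d_i - tau)
    hi = bisect.bisect_right(donor_rts, d_i + tau)
    diffs = np.array([a - d for d, a in pairs[lo:hi]])
    m = np.median(diffs)
    sigma = np.median(np.abs(diffs - m))  # median absolute deviation
    # Equation (1): [d_i + m - 2*sigma, d_i + m + 2*sigma]
    return (d_i + m - 2 * sigma, d_i + m + 2 * sigma)
```

Note that τ only selects which neighboring ion pairs vote; the returned search range is set by the median and MAD of their RT differences, which is why the effective window adapts to the local data.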

Does it affect the MBR FDR calculation for ions (meaning the bigger the window is, the more stringent the FDR will be afterwards)?

A wider window results in more false positives, which results in fewer ions after filtering at the same FDR threshold.

If I understood correctly, the posterior error probability score of each transferred ion has to be better the more ions are transferred for a given MS file. Am I right with this assumption?

The posterior error probability has to be lower (i.e., better) the more ions are left after FDR filtering.
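To build intuition for that trade-off, here is a toy sketch using a simple target-decoy style running-FDR filter as a stand-in (the real tool estimates FDR via mixture modeling, as discussed below):

```python
import numpy as np

def filter_at_fdr(target_scores, decoy_scores, fdr=0.01):
    """Keep target ions down to the lowest score at which the running
    decoy/target ratio still satisfies the FDR threshold."""
    target_scores = np.asarray(target_scores, float)
    scores = np.r_[target_scores, decoy_scores]
    is_target = np.r_[np.ones(len(target_scores), bool),
                      np.zeros(len(decoy_scores), bool)]
    order = np.argsort(-scores)                 # best score first
    n_target = np.cumsum(is_target[order])
    n_decoy = np.cumsum(~is_target[order])
    running_fdr = n_decoy / np.maximum(n_target, 1)
    ok = np.where(running_fdr <= fdr)[0]
    if len(ok) == 0:
        return target_scores[:0]                # nothing passes
    cutoff = scores[order][ok[-1]]
    return target_scores[target_scores >= cutoff]
```

A wider window adds mostly low-scoring false candidates, so the running FDR rises sooner, the score cutoff moves up, and fewer ions survive, which is the behavior described above.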

Excuse the many questions but I would really like to understand the algorithm even if it yields very good results.

No worries. We would love to have interesting discussions with users. It helps to make our tools better.

Best,

Fengchao

Peer2011 commented 3 years ago

Dear Fengchao, thanks for your quick reply. I understood that the features in Table 1 are used to determine the score. That basically means that the linear discriminant model determines the cut-offs for every feature in Table 1 that contributes to the sum of equation (2), right?

Referring to my question about the RT window: would it make sense to dynamically adapt this window instead of having a fixed window, and thus always have the same number of ions as putative false positives to test against in the FDR algorithm, meaning that if you have a high number of ions in a certain RT window you reduce the window size, and if you have a low number of ions you increase the RT window size? Furthermore, could it help to do the retention time alignment on a global scale before MBR is done, using MS2-identified peptides detected in more than 80% of the runs as an internal calibration matrix (a sketch of this idea is below)? You should then have enough features that are definitely identified to align all other MS1 features and thus reduce the delta RT, which also contributes to equation (2).
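A minimal sketch of that global-alignment idea, using a LOWESS fit over shared anchor peptides (illustrative only, assuming statsmodels; not something FragPipe currently does):

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def global_rt_map(anchor_ref_rts, anchor_run_rts, frac=0.1):
    """Fit a smooth run->reference RT mapping from anchor peptides
    identified in both, then interpolate it for any RT in the run."""
    fit = lowess(anchor_ref_rts, anchor_run_rts, frac=frac, return_sorted=True)
    xs, ys = fit[:, 0], fit[:, 1]
    return lambda rt: np.interp(rt, xs, ys)

# Toy anchors: peptides seen in both runs, with a mild linear distortion.
rng = np.random.default_rng(1)
run_rts = np.sort(rng.uniform(0, 90, 200))
ref_rts = 1.02 * run_rts + 0.5 + rng.normal(0, 0.1, 200)
to_ref = global_rt_map(ref_rts, run_rts)
print(to_ref(45.0))  # a run RT of 45 min maps to ~46.4 min on the reference
```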

One final question: why, in the standard settings for LFQ, is the feature detection window smaller than the MBR RT tolerance window? The MBR tolerance window most likely contributes to the FDR, while the feature detection window is only the time range where the algorithm should search for a given ion, right?

Thanks for your patience.

Best, Peer

fcyu commented 3 years ago

Hi Peer,

That basically means that the linear discriminant model determines the cut-offs for every feature in Table 1 that contributes to the sum of equation (2), right?

LDA is used to get the final score for each transferred ion. The cut-off is determined by mixture modeling, which is used to estimate the FDR.
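As a generic sketch of that approach (a plain two-component Gaussian mixture as a stand-in; the actual model in the paper may use different component distributions): fit a mixture to the final scores, read the posterior of the lower-scoring component as a posterior error probability, and pick the lowest score whose running mean PEP still satisfies the FDR threshold.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def pep_from_mixture(final_scores):
    """Fit a two-component mixture to the final scores and return, per ion,
    the posterior of the lower-scoring ('false') component as a PEP."""
    s = np.asarray(final_scores, float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(s)
    false_comp = int(np.argmin(gmm.means_.ravel()))
    return gmm.predict_proba(s)[:, false_comp]

def cutoff_at_fdr(final_scores, fdr=0.01):
    """FDR of a kept set ~= mean PEP of its members; return the lowest
    score whose running mean PEP still satisfies the threshold."""
    s = np.asarray(final_scores, float)
    pep = pep_from_mixture(s)
    order = np.argsort(-s)
    running = np.cumsum(pep[order]) / np.arange(1, len(s) + 1)
    ok = np.where(running <= fdr)[0]
    return s[order][ok[-1]] if len(ok) else np.inf
```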

Would it make sense to dynamically adapt this window instead of having a fixed window, and thus always have the same number of ions as putative false positives to test against in the FDR algorithm, meaning that if you have a high number of ions in a certain RT window you reduce the window size, and if you have a low number of ions you increase the RT window size?

The MBR window is adaptive. See the description in the manuscript (https://www.sciencedirect.com/science/article/pii/S1535947621000505): "Given a donor ion with retention time t, we find its position in the sorted pairs satisfying d_i ≤ t < d_{i+1}. Then, we collect all pairs satisfying d_i − τ ≤ d_j ≤ d_i + τ, where τ is a predefined tolerance ("MBR RT window" parameter, 1 min by default). With those pairs, we generate a list whose elements are a_j − d_j and calculate the median (m) and median absolute deviation (σ) of that list. The possible target range in the retention time dimension is then: [d_i + m − 2σ, d_i + m + 2σ]".

However, the window size is determined by the deviation of the ions' RTs, not by the number of ions. I think the ion number cannot be used to adjust the window size because the window size measures the dissimilarity/tolerance between the donor and acceptor runs, and the ion number doesn't reflect that.

Furthermore, could it help to do the retention time alignment on a global scale before MBR is done, using MS2-identified peptides detected in more than 80% of the runs as an internal calibration matrix?

This is a potential alternative approach. We use local alignment rather than global alignment to make it robust to cases where, for example, the gradient lengths are different or the ions have different RT densities.

Why, in the standard settings for LFQ, is the feature detection window smaller than the MBR RT tolerance window?

The LFQ feature detection window is used to tolerate the deviation in RT between the MS1 apex and the MS2 scan, while the MBR window is used to tolerate the difference in RT between two runs. Thus, the MBR tolerance should be bigger.

Best,

Fengchao

Peer2011 commented 3 years ago

Does the gradient length between different samples really play a role if you perform a global alignment of all unidentified MS1 features based on identified MS1 features and their respective MS2 features, putting them on a new time scale? That having been said, I could imagine that if you have samples that are really different from each other, this approach could fail. However, I would think that there is a bunch of proteins/peptides/identified MS1 features that one will be able to find in every sample (peptides from trypsin autodigestion?).

fcyu commented 3 years ago

The alignment should be based on identified and overlapping MS1 features. And yes, you could have a good global alignment method that supports different gradient lengths (that's why I said "This is a potential alternative approach.").

Best,

Fengchao