Nesvilab / FragPipe

A cross-platform proteomics data analysis suite
http://fragpipe.nesvilab.org

How to interpret OpenSearch Global.Profile.tsv file? #332

Closed BenSamy2020 closed 3 years ago

BenSamy2020 commented 3 years ago

Greetings,

I would like to thank you for providing this amazing open-source tool. I am currently running an open search across multiple .RAW files, and I would like to incorporate the identified mass shifts (modifications) into my closed-search workflow. I am having difficulty understanding the Global.Profile.tsv file.

Do you have a guide or any other information on how to interpret the mass-shift results in Global.Profile.tsv so that I can incorporate them into my closed search? I have attached a sample Global.Profile file (.xlsx) below for your reference.

Regards, Ben

Global.Profile.xlsx

BenSamy2020 commented 3 years ago

Greetings,

Apologies for not being clear in my previous comment. I understand that an explanation of the output files is available at https://github.com/Nesvilab/PTM-Shepherd/wiki/3.-Output. For the sample file provided above (Global.Profile.xlsx): for the mass shift annotated as ubiquitinylation residue/Double Carbamidomethylation/Addition of N (row 5 of the .xlsx file), do we incorporate the 114.042927 modification on amino acid residue "K" only, because it has the highest enrichment score?

Also, the N-term rate is 39.11. Does this value denote that the 114.042927 modification should be incorporated at the N-terminus as well?

Based on the .xlsx file above, how many modifications should I incorporate into my closed-search workflow? Are there any filters that I could use, e.g. a threshold on Matched PSMs?

*I am not sure whether this thread belongs under FragPipe or PTM-Shepherd. If it has been posted in the wrong place, my apologies.

Regards, Ben

danielgeiszler commented 3 years ago

Hi Ben,

There are no rules for when you should incorporate additional modifications into your search. PTM-Shepherd is meant to guide your downstream analysis, not determine it. In this case, it looks like your sample is overalkylated, judging by the number of +57 and +114 (2 × +57) mass shifts, so what you should be doing is looking at how many +57 mass shifts to incorporate as variable modifications rather than incorporating a +114 modification. The only way to determine the best parameters is to try them, but having +57 on the peptide N-term and K looks like a good place to start.

For your data, the modification annotation results may be clearer if you rerun PTM-Shepherd with a +57 Carbamidomethylation mass shift instead of a -57 Failed_Carbamidomethylation mass shift.
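
The arithmetic behind the overalkylation reading can be checked with a short sketch. It assumes the standard monoisotopic carbamidomethyl mass of 57.021464 Da; the function name, the 0.002 Da tolerance, and the cap of four additions are illustrative choices, not PTM-Shepherd settings.

```python
# Monoisotopic mass added by one carbamidomethylation, in Da.
CARBAMIDOMETHYL = 57.021464


def alkylation_multiple(mass_shift, tol=0.002, max_n=4):
    """Return n if mass_shift is within tol Da of n carbamidomethylations.

    Returns None when no multiple up to max_n matches.
    The tolerance and max_n are illustrative assumptions.
    """
    for n in range(1, max_n + 1):
        if abs(mass_shift - n * CARBAMIDOMETHYL) <= tol:
            return n
    return None


if __name__ == "__main__":
    # 114.042927 is the mass shift discussed in this thread.
    for shift in (57.021464, 114.042927, 15.994915):
        print(shift, alkylation_multiple(shift))
```

A shift of 114.042927 matches two carbamidomethyl additions to within a microdalton, which is why the same row can also be annotated as a ubiquitin remnant: mass alone cannot separate the two, and the sample-wide pattern of +57 shifts is what points to overalkylation.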

BenSamy2020 commented 3 years ago

Greetings Daniel,

Thank you for your prompt reply. I am writing on behalf of my institute's core proteomics facility. I am in the process of implementing MSFragger as the main proteomics search pipeline in the facility. Furthermore, I am trying to set up Open Search followed by Closed Search as a standard pipeline for my customers.

Since we have project-specific time frames to meet, it will be difficult for the proteomics facility to optimize parameters for each job while serving multiple customers.

Based on your experience, could you offer a couple of suggestions on how to identify the most suitable modifications to incorporate into the Closed Search workflow? For example, we could select the top 5 modifications by Matched PSMs, and then, within those 5, decide where to place the variable modifications based on the enrichment scores in the Enriched AA1, 2, 3 and N-term rate columns. If you think this is a systematic and rational approach, please do let me know.

I understand that PTM-Shepherd is meant only to provide guidance on which modifications to select. If a brief guideline cannot be provided, I would have to implement only Closed Search as part of the pipeline. In that case, my customers would not be able to get the most out of their proteomics data.

Regards, Ben

BenSamy2020 commented 3 years ago

Greetings Daniel,

If it is not possible to offer a suggestion on the above, please do close this issue.

Regards, Ben

danielgeiszler commented 3 years ago

Ben,

After some discussion, I think it's safe to include any modification-site pair that represents >= 5% of your data. For N-termini, this would be (N-term localization rate) * (localized PSMs). For residue-specific modifications, it would be the number following the enrichment score.

This will disproportionately favor N-terminal sites due to how they're calculated, but those also require the least computational power to process. If you find that you have the computational resources, reducing the residue-specific threshold to less than 5% of total PSMs may be appropriate. I'll update the PTM-Shepherd GitHub page to reflect this recommendation for your reference.
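
The 5% rule of thumb above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function and parameter names are hypothetical, the N-term rate is assumed to be reported as a percentage (as the 39.11 value earlier in the thread suggests), "your data" is taken to mean total PSMs, and the per-residue PSM counts are assumed to have already been read out of the enrichment columns. The numbers in the usage lines are made up, not taken from the attached file.

```python
def site_psm_estimates(localized_psms, n_term_rate_pct, enriched_residues):
    """Estimate the PSMs supporting each candidate site for one mass shift.

    localized_psms:    number of localized PSMs for this mass shift
    n_term_rate_pct:   N-term localization rate, assumed to be a percentage
    enriched_residues: dict of residue -> PSM count (the number reported
                       after the enrichment score), assumed pre-parsed
    """
    estimates = {"N-term": localized_psms * n_term_rate_pct / 100.0}
    estimates.update(enriched_residues)
    return estimates


def select_sites(total_psms, localized_psms, n_term_rate_pct,
                 enriched_residues, threshold=0.05):
    """Return the sites whose estimated PSMs reach threshold * total_psms."""
    cutoff = threshold * total_psms
    estimates = site_psm_estimates(localized_psms, n_term_rate_pct,
                                   enriched_residues)
    return sorted(site for site, psms in estimates.items() if psms >= cutoff)


if __name__ == "__main__":
    # Illustrative numbers only.
    print(select_sites(total_psms=10000, localized_psms=1200,
                       n_term_rate_pct=39.11,
                       enriched_residues={"K": 650, "C": 120}))
```

Lowering `threshold` only for the residue-specific entries, as suggested above for users with spare compute, would need a second cutoff; it is left out here to keep the sketch short.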

BenSamy2020 commented 3 years ago

Greetings Daniel,

Thank you for your valuable advice.
Please let me know once you have updated the PTM-Shepherd GitHub page with the recommendations.

Regards, Ben