Searching each sample separately vs together in MSFragger

apelin20 commented 3 years ago

Hello, I wanted to ask for further clarification on using multiple cores for MSFragger searches. We have servers and clusters that are very RAM heavy and I am trying to speed up the MSFragger search. In my last post I was told that searching samples with MSFragger individually or together shouldn't make a difference since FDR on searches is calculated later with Philosopher.

I have 71 samples from an IP experiment and I searched them:

1. All at the same time in 1 instance of MSFragger with 32 cores
2. All at the same time in 1 instance of MSFragger with 1 core
3. All in separate instances of MSFragger, 1 core each, 16 at a time
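The third option (one MSFragger instance per sample, 16 running at a time) could be launched along the lines of the sketch below. It relies on the documented command form `java -Xmx<mem> -jar MSFragger.jar <params file> <input file>`. The jar path, the params file name, and the heap size are placeholder assumptions, and `num_threads = 1` is assumed to be set inside the params file. The sketch does not address whether concurrent instances can safely share a working directory or database index.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(sample, jar="MSFragger.jar", params="fragger.params", heap="8g"):
    """Build the command line for one single-sample MSFragger run.

    jar, params and heap are placeholders (assumptions), not recommended values.
    """
    return ["java", f"-Xmx{heap}", "-jar", jar, params, sample]


def run_per_sample(samples, max_parallel=16, runner=None):
    """Run one instance per sample, at most max_parallel at a time.

    Returns a dict mapping each sample path to its exit code.
    `runner` can be replaced (for example in a dry run) by any callable
    that takes the command list and returns an exit code.
    """
    if runner is None:
        runner = lambda cmd: subprocess.run(cmd, check=False).returncode
    # Threads are enough here: each one only waits on an external Java process.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        codes = list(pool.map(lambda s: runner(build_command(s)), samples))
    return dict(zip(samples, codes))
```

Passing a `runner` that only prints the command gives a dry run, which is a cheap way to check the 71 command lines before committing cluster time.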

What I find is that the SPC values of my baits are the same with the first 2 options but slightly different when searching one sample at a time (replicates separated by |):

The third method still has very similar values. I am just wondering what could be causing this slight variation?

Thanks, Adrian

fcyu commented 3 years ago

I guess you are using calibrate_mass = 2, which finds the optimal parameters based on the data being searched. Thus, there might be slight differences in the optimized parameters with different sets of input files.

And not surprisingly, the "1core_1-instance-per-sample" has the highest sensitivity since the parameters are optimized for each individual file. In the other two approaches, the parameters are optimized for all files together.

You can try with calibrate_mass = 1.

Best,

Fengchao

apelin20 commented 3 years ago

That's correct, calibrate_mass was set to 2. I took my config (https://pastebin.com/beykRgy5) from one of the example sections and don't quite understand all parameters yet.

I guess finding the optimal parameters with calibrate_mass=2 was being influenced by the number of samples. Do I need to calibrate mass?

fcyu commented 3 years ago

Yes, you do. You can set it to 1 if you don't want MSFragger to adjust the parameters. Then, MSFragger will only calibrate the mass.

FYI, here (https://github.com/Nesvilab/MSFragger/wiki/Setting-the-Parameters) is the page with the explanations for the parameters.
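For anyone scripting the switch between calibrate_mass = 2 and calibrate_mass = 1, here is a minimal sketch of editing the setting in the text of a params file. It assumes the usual `key = value  # comment` line layout. The helper name `set_param` is hypothetical and not part of MSFragger.

```python
import re


def set_param(params_text, key, value):
    """Return params_text with `key` set to `value`, keeping any trailing comment.

    Assumes one `key = value  # optional comment` entry per line.
    Raises KeyError if the key is not present.
    """
    pattern = re.compile(
        rf"^([ \t]*{re.escape(key)}[ \t]*=[ \t]*)([^#\n]*?)([ \t]*(?:#.*)?)$",
        re.MULTILINE,
    )
    # A function replacement avoids clashes between the new value and group references.
    new_text, count = pattern.subn(
        lambda m: f"{m.group(1)}{value}{m.group(3)}", params_text
    )
    if count == 0:
        raise KeyError(f"{key} not found in parameter text")
    return new_text
```

Reading the file, calling `set_param(text, "calibrate_mass", "1")`, and writing the result to a new file keeps the original config intact for comparison.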

Best,

Fengchao

apelin20 commented 3 years ago

Sorry to keep bugging you. I looked at the parameter explanation but unfortunately, to someone who just joined the world of MS, some of these parameters are not as straightforward. I am trying to understand the importance of how the search is done.

OK, so I set calibrate_mass=1, and ran the search separately (one process/core per sample) and together (32 cores for all samples in one instance). Of course you were right, this time the AvgSPC for the baits in the AP-MS experiment was the same, regardless of whether there were individual searches for every sample or one search for all samples.

What confuses me, is the fact that AvgSPC for the baits is lower when setting calibrate_mass=1:

Last column is calibrate_mass=1 (same for separate or simultaneous searches), first column is one search, 32 cores for all samples (calibrate_mass=2), and second column is separate searches, 1 core per search (calibrate_mass=2).

Could you tell me a bit more about this parameter. How come when you optimize it with option "2" you get higher SPC values while when you don't it's lower (I know it sounds like a silly question, optimizing should result in better outcome)? I am not against higher SPC, and it seems the highest are when calibrate_mass=2 and one sample per search. I am mostly concerned about lower end SPC values, which I have yet to analyze deeper with these various search thresholds.

fcyu commented 3 years ago

> Sorry to keep bugging you. I looked at the parameter explanation but unfortunately, to someone who just joined the world of MS, some of these parameters are not as straightforward. I am trying to understand the importance of how the search is done.

To make it easier to use, we built a GUI tool, FragPipe, for users who are not experts in computational mass spectrometry. It has built-in workflows to perform various analyses, including label-free quantification, TMT analysis, open modification analysis, DIA data analysis, etc. For most analyses, you just need to load the workflow and do not need to adjust individual parameters. You can find the details at https://fragpipe.nesvilab.org/.

> Could you tell me a bit more about this parameter. How come when you optimize it with option "2" you get higher SPC values while when you don't it's lower (I know it sounds like a silly question, optimizing should result in better outcome)? I am not against higher SPC, and it seems the highest are when calibrate_mass=2 and one sample per search. I am mostly concerned about lower end SPC values, which I have yet to analyze deeper with these various search thresholds.

That is what parameter optimization is for. With calibrate_mass = 1, MSFragger performs only the mass calibration, with no parameter optimization. With calibrate_mass = 2, after performing mass calibration, MSFragger tries to find the best fragment tolerance, top-N peaks, etc., using the data being analyzed. It might get different optimal values with different sets of data files. If there is only one file, MSFragger optimizes the parameters for that file. If there are multiple files, MSFragger optimizes the parameters using the one with the most data. Thus, "the highest are when calibrate_mass=2 and one sample per search".

Best,

Fengchao

sarah-haynes commented 3 years ago

Hi Adrian, you can also see the 'Mass calibration and parameter optimization' section of this paper for more detail: https://www.nature.com/articles/s41467-020-17921-y

Sarah

Nesvilab / MSFragger

Searching each sample separately vs together in MSFragger #192