Nesvilab / FragPipe

A cross-platform Graphical User Interface (GUI) for running the MSFragger- and Philosopher-powered pipeline for comprehensive analysis of shotgun proteomics data
http://fragpipe.nesvilab.org

Questions about diaTracer and endogenous peptides search #1644

Closed ako81818 closed 2 days ago

ako81818 commented 6 days ago

- Describe the issue or question: Hello, I'm attempting to use the new diaTracer function in FragPipe with diaPASEF data for no-enzyme peptidomics. Attaching the log file here. diaTracer seemed to finish and mzML files were generated. The run then progressed to the MSFragger step, where it threw an OutOfMemoryError. Since this is DIA data, I believe I cannot split the data to resolve this. Is there a way to assign more memory or otherwise resolve this issue?

How do I restart this workflow without rerunning the completed steps? Do I have to restart FragPipe and just load the created mzML files, or can I resume even further into the workflow somehow?

Thank you, Andrew

log_2024-06-22_03-52-47.txt

fcyu commented 6 days ago

Is there a way to assign more memory or otherwise resolve this issue?

For the current version, unfortunately, no.

How do I restart this workflow without rerunning the completed steps? Do I have to restart FragPipe and just load the created mzML files, or can I resume even further into the workflow somehow?

You could load the diaTracer's mzML files as DDA data type, and the original .d folder as DIA-Quant data type. Then, run FragPipe from scratch. It will skip the diaTracer spectra deconvolution.

Best,

Fengchao

ako81818 commented 6 days ago

So if I cannot give it more memory, I expect I'll keep running into the OutOfMemory issue... would you recommend trying different settings in MSFragger to reduce the memory load, or should I simply not run so many data files in my study (79)?

Thank you again, Andrew

fcyu commented 6 days ago

Your configuration has a large search space:

num_enzyme_termini = 0
variable_mod_01 = 15.9949 M 2
variable_mod_02 = 42.0106 [^ 1
variable_mod_04 = -17.0265 nQ 1
variable_mod_05 = -18.0106 nE 1
variable_mod_07 = 0.98402 N 1
digest_min_length = 8
digest_max_length = 45

Maybe you could relax some of these settings. I'm not sure what your sample is, so I can't tell what the best strategy would be.
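As a toy illustration (simple combinatorics only, not MSFragger's actual fragment index, and before multiplying in the variable-mod combinations), each protein of length L contributes L - k + 1 nonspecific candidate peptides of length k, so shrinking the digest length window shrinks the space roughly proportionally:

```python
# Back-of-the-envelope estimate of the nonspecific (num_enzyme_termini = 0)
# candidate count. Protein lengths below are hypothetical; a real FASTA
# has tens of thousands of entries.
def candidate_peptides(protein_len, min_len, max_len):
    """Substrings of each allowed length in one protein: L - k + 1 per length k."""
    return sum(max(protein_len - k + 1, 0) for k in range(min_len, max_len + 1))

proteins = [350, 500, 1200]  # hypothetical protein lengths
full = sum(candidate_peptides(L, 8, 45) for L in proteins)    # digest range 8-45
narrow = sum(candidate_peptides(L, 8, 27) for L in proteins)  # digest range 8-27
print(full, narrow, round(narrow / full, 2))
```

Under these assumptions the 8-27 window carries roughly half the candidates of the 8-45 window, before the variable mods multiply each candidate into several modified forms.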

Best,

Fengchao

ako81818 commented 6 days ago

Samples are of endogenous serum peptides, hence the need for a no-enzyme search in MSFragger. So it is running out of memory because of the search space size rather than the number of ions? I'm thinking of running the workflow twice, changing the min-max length range for each (e.g., run 1 with lengths 8-19 and run 2 with lengths 20-45), then combining the DIA-NN results. Do you think this would get me past the OutOfMemory issue here?

Many thanks, Andrew

fcyu commented 6 days ago

So it is running out of memory because of the search space size rather than the number of ions?

Yes, it is mostly due to the large search space, not your LC-MS data size.

I'm thinking of running the workflow twice, changing the min-max length range for each (e.g., run 1 with lengths 8-19 and run 2 with lengths 20-45), then combining the DIA-NN results. Do you think this would get me past the OutOfMemory issue here?

Yes, this should work. You could also remove

variable_mod_04 = -17.0265 nQ 1
variable_mod_05 = -18.0106 nE 1
variable_mod_07 = 0.98402 N 1

Best,

Fengchao

ako81818 commented 6 days ago

Hi Fengchao,

Looking at the end of the log (just before it crashed), I'm perplexed by the "Number of unique peptides" table shown... for a no-enzyme search, I would expect the largest number of unique peptides at a length of 8, with fewer and fewer as length increases, when parsing through a FASTA. But it reports around the same number (17 million) of peptides at lengths 8 through 40, and the counts only roll off after that. Am I hitting a limitation (some maximum) in how MSFragger parses the FASTA here? I want to make sure it is making it through the entire FASTA and not arbitrarily stopping when it reaches some maximum at each length. Is there some sort of max fasta file size that can be used in no-enzyme searches?

Thanks again, Andrew

fcyu commented 6 days ago

The trend you describe holds for enzymatic digestion, but with non-enzymatic digestion it is much weaker. For example, a protein with length > 9 yields only one fewer peptide of length 9 than of length 8. Also note that it is the "number of unique peptides": peptides with the same sequence are collapsed.
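A toy count makes this concrete (a random hypothetical sequence, not a real serum proteome): the number of nonspecific substrings of length k in a sequence of length L is L - k + 1, so for L much larger than k the counts per length are nearly flat.

```python
# Illustration: unique nonspecific peptide counts are nearly flat across
# lengths, unlike tryptic digestion. The sequence is random and hypothetical.
import random

random.seed(0)
amino_acids = "ACDEFGHIKLMNPQRSTVWY"
seq = "".join(random.choice(amino_acids) for _ in range(5000))

# Unique substrings of each length k; duplicates are collapsed by the set.
counts = {k: len({seq[i:i + k] for i in range(len(seq) - k + 1)})
          for k in (8, 20, 40)}
print(counts)  # every count is close to len(seq) - k + 1
```

Run on a real FASTA concatenated from many proteins, the same logic produces the nearly constant per-length counts seen in the log until the lengths approach the shorter proteins' sizes.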

Is there some sort of max fasta file size that can be used in no-enzyme searches?

No.

Best,

Fengchao

ako81818 commented 6 days ago

Hi Fengchao,

Sharing back good news: by limiting the search length to the 8-27 range, I was able to finish the nonspecific-HLA-diaPASEF workflow. I'm curious (it wasn't discussed in the paper) why ProteinProphet was selected for FDR (validation) with the flags --sequential --prot 1?

It seems that the data are not being filtered at the protein level anyways, so why have --prot 1 at all? If I want each output file to be filtered to the 5% level, how would I make that change since --sequential does not take a float value?

My project has endogenous peptides that generally result in a single peptide per protein ID. In this case, would PeptideProphet be a more effective tool, or is it really only good for low-res data (not Bruker diaPASEF data)?

Regards, Andrew

fcyu commented 6 days ago

Glad to hear that it finally works for your data.

why ProteinProphet was selected for FDR (validation) with the flags --sequential --prot 1?

That's because, for endogenous peptides, people normally don't care about the protein-level results, and the peptides have already been filtered at 1% peptide-level FDR. But yes, we might change it back to --prot 0.01 in the next release to make the result more conservative.

It seems that the data are not being filtered at the protein level anyways, so why have --prot 1 at all?

Because without --prot 1, Philosopher would use the default setting, --prot 0.01.

If I want each output file to be filtered to the 5% level, how would I make that change since --sequential does not take a float value?

There are several FDR levels: --psm, --ion, --pep, and --prot. Adjust them as you want: https://github.com/Nesvilab/philosopher/wiki/Filter
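For example (a sketch based on the flags named above; the input file paths are placeholders, and the exact command FragPipe generates for your run may differ), a 5% filter at the PSM, ion, and peptide levels while leaving the protein level open could look like:

```shell
philosopher filter --sequential --psm 0.05 --ion 0.05 --pep 0.05 --prot 1 \
    --pepxml interact.pep.xml --protxml interact.prot.xml
```

In FragPipe itself, these flags are edited in the filter command text box on the Validation tab rather than run by hand.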

My project has endogenous peptides that generally result in a single peptide per protein ID. In this case, would PeptideProphet be a more effective tool, or is it really only good for low-res data (not Bruker diaPASEF data)?

I am not sure I understand your question correctly, because "a single peptide per protein ID", "PeptideProphet be a more effective tool", and "is it really only good for low-res data" seem to have no causal relationship.

Best,

Fengchao

ako81818 commented 6 days ago

Thanks again for the guidance today.

In the diaTracer paper, it is suggested that, in addition to the nonspecific-HLA-diaPASEF workflow creating a spectral library from the diaPASEF data, you can also point to an existing spectral library (from prior DDA runs) to be used together for annotating the DIA-NN quant output. Where in FragPipe would I point to the existing library to append with the one being generated from diaPASEF runs? Is it that optional reference box for a library on the DIA-NN tab or somewhere else?

Best regards, Andrew

fcyu commented 6 days ago

You are welcome!

Where in FragPipe would I point to the existing library to append with the one being generated from diaPASEF runs? Is it that optional reference box for a library on the DIA-NN tab or somewhere else?

It is the Spectral library (optional) panel in the Quant (DIA) tab.

You could also load both your diaPASEF and ddaPASEF .d folders, specifying the diaPASEF as the DIA data type and the ddaPASEF as the DDA data type. Then FragPipe will search both data types and build a "hybrid" library, which will be used to perform the quantification during the DIA-NN step. This normally results in more quantified IDs than a DIA-only or DDA-only library.

Best,

Fengchao