Nesvilab / MSFragger

Ultrafast, comprehensive peptide identification for mass spectrometry–based proteomics
https://msfragger.nesvilab.org

Recalibration with sliced database #28

Closed JB91451 closed 5 years ago

JB91451 commented 5 years ago

Dear All,

I'm running an open mass window search with semi-specific cleavage on a ~15 MB database. On a VM with 500 GB of memory this worked fine with the previous release of MSFragger. However, since we had some lock mass errors, I would really like to use the automatic recalibration in the latest version. Unfortunately, database slicing does not seem to apply to this function. I chose almost every option that reduces the search space (including fully specific digestion, less peptide length variability, no variable modifications, ...), but a memory error always occurs. Judging from the console, the problem seems to be that MSFragger tries to do the "Firstsearch" in one slice. Once I deactivate the automatic recalibration, my normal parameter file works perfectly. Is there any way to set the parameters for the first search separately? Or does anybody have an idea how to avoid this error?

Best Juergen

fcyu commented 5 years ago

Hi Juergen,

Thanks for your interest in MSFragger.

MSFragger doesn't split the database in the First Search. However, based on your description, I don't think the First Search needs more than 500 GB, because it is a tryptic closed search. Could you please send me your fragger.params file so that I can take a look?

Thanks,

Fengchao

JB91451 commented 5 years ago

Dear Fengchao,

Thank you for your quick reply. I will send you the file as soon as I'm back at the institute on Monday.

Have a nice weekend, Juergen

JB91451 commented 5 years ago

Dear Fengchao,

Attached is the promised parameter file. It did not work even when I removed all variable modifications, changed the precursor mass window to -150/500, used two enzymatic termini, and reduced the digest length to 10-20 aa. However, once I loaded the file in FragPipe and removed the recalibration option, it worked properly (I did not test this in command-line mode, but I assume it would also work).

Best, Juergen


num_threads = 0 # Number of CPU threads to use.
database_name = X:\test.fasta # Path to the protein database file in FASTA format.

precursor_mass_lower = -500 # Lower bound of the precursor mass window.
precursor_mass_upper = 500 # Upper bound of the precursor mass window.
precursor_mass_units = 0 # Precursor mass tolerance units (0 for Da, 1 for ppm).
precursor_true_tolerance = 20 # True precursor mass tolerance (window is +/- this value).
precursor_true_units = 1 # True precursor mass tolerance units (0 for Da, 1 for ppm).
fragment_mass_tolerance = 20 # Fragment mass tolerance (window is +/- this value).
fragment_mass_units = 1 # Fragment mass tolerance units (0 for Da, 1 for ppm).
calibrate_mass = 2 # Perform mass calibration (0 for OFF, 1 for ON, 2 for ON and find optimal parameters).
decoyprefix = Reverse # Prefix added to the decoy protein ID.

isotope_error = 0 # Isotope correction for MS/MS events triggered on isotopic peaks.
mass_offsets = 0 # Creates multiple precursor tolerance windows with specified mass offsets.
precursor_mass_mode = selected # One of isolated/selected/recalculated.

localize_delta_mass = 1 # This allows shifted fragment ions - fragment ions with mass increased by the calculated mass difference, to be included in scoring.
delta_mass_exclude_ranges = (-1.5,3.5) # Exclude mass range for shifted ions searching.
fragment_ion_series = b,y # Ion series used in search.

search_enzyme_name = lysc-p # Name of enzyme to be written to the pepXML file.
search_enzyme_cutafter = K # Residues after which the enzyme cuts.
search_enzyme_butnotafter = # Residues that the enzyme will not cut before.

num_enzyme_termini = 1 # 0 for non-enzymatic, 1 for semi-enzymatic, and 2 for fully-enzymatic.
allowed_missed_cleavage = 2 # Allowed number of missed cleavages.

clip_nTerm_M = 1 # Specifies the trimming of a protein N-terminal methionine as a variable modification (0 or 1).

# maximum of 7 mods - amino acid codes, * for any amino acid,
# [ and ] specifies protein termini, n and c specifies
# peptide termini
variable_mod_01 = 0.984016 NQ
variable_mod_02 = 15.994915 M
variable_mod_03 = 43.005814 n*KRCM
variable_mod_04 = -17.026549 nQ
variable_mod_05 = -18.010565 nE
# variable_mod_03 = 79.96633 STY
# variable_mod_04 = -17.02650 nQnC
# variable_mod_05 = -18.01060 nE
# variable_mod_06 = 0.00000 site_06
# variable_mod_07 = 0.00000 site_07

allow_multiple_variable_mods_on_residue = 1 # Allow each amino acid to be modified by multiple variable modifications (0 or 1).
max_variable_mods_per_mod = 3 # Maximum number of residues that can be occupied by each variable modification (maximum of 5).
max_variable_mods_combinations = 5000 # Maximum allowed number of modified variably modified peptides from each peptide sequence, (maximum of 65534).

output_file_extension = pep.xml # File extension of output files.
output_format = pepXML # File format of output files (pepXML or tsv).
output_report_topN = 5 # Reports top N PSMs per input spectrum.
output_max_expect = 50 # Suppresses reporting of PSM if top hit has expectation greater than this threshold.
report_alternative_proteins = 1 # Report alternative proteins for peptides that are found in multiple proteins (0 for no, 1 for yes).

precursor_charge = 1 4 # Assume range of potential precursor charge states. Only relevant when override_charge is set to 1.
override_charge = 0 # Ignores precursor charge and uses charge state specified in precursor_charge range (0 or 1).

digest_min_length = 6 # Minimum length of peptides to be generated during in-silico digestion.
digest_max_length = 55 # Maximum length of peptides to be generated during in-silico digestion.
digest_mass_range = 500.0 12000.0 # Mass range of peptides to be generated during in-silico digestion in Daltons.
max_fragment_charge = 3 # Maximum charge state for theoretical fragments to match (1-4).

track_zero_topN = 0 # Track top N unmodified peptide results separately from main results internally for boosting features. Should be set to a number greater than output_report_topN if zero bin boosting is desired.
zero_bin_accept_expect = 0.00 # Ranks a zero-bin hit above all non-zero-bin hit if it has expectation less than this value.
zero_bin_mult_expect = 1.00 # Multiplies expect value of PSMs in the zero-bin during results ordering (set to less than 1 for boosting).
add_topN_complementary = 0 # Inserts complementary ions corresponding to the top N most intense fragments in each experimental spectra.

minimum_peaks = 15 # Minimum number of peaks in experimental spectrum for matching.
use_topN_peaks = 150 # Pre-process experimental spectrum to only use top N peaks.
min_fragments_modelling = 3 # Minimum number of matched peaks in PSM for inclusion in statistical modeling.
min_matched_fragments = 5 # Minimum number of matched peaks for PSM to be reported.
minimum_ratio = 0.001 # Filters out all peaks in experimental spectrum less intense than this multiple of the base peak intensity.
clear_mz_range = 0.0 0.0 # Removes peaks in this m/z range prior to matching.

# Fixed modifications
add_Cterm_peptide = 0.000000
add_Nterm_peptide = 0.000000
add_Cterm_protein = 0.000000
add_Nterm_protein = 0.000000
add_G_glycine = 0.000000
add_A_alanine = 0.000000
add_S_serine = 0.000000
add_P_proline = 0.000000
add_V_valine = 0.000000
add_T_threonine = 0.000000
add_C_cysteine = 57.021464
add_L_leucine = 0.000000
add_I_isoleucine = 0.000000
add_N_asparagine = 0.000000
add_D_aspartic_acid = 0.000000
add_Q_glutamine = 0.000000
add_K_lysine = 0.000000
add_E_glutamic_acid = 0.000000
add_M_methionine = 0.000000
add_H_histidine = 0.000000
add_F_phenylalanine = 0.000000
add_R_arginine = 0.000000
add_Y_tyrosine = 0.000000
add_W_tryptophan = 0.000000
add_B_user_amino_acid = 0.000000
add_J_user_amino_acid = 0.000000
add_O_user_amino_acid = 0.000000
add_U_user_amino_acid = 0.000000
add_X_user_amino_acid = 0.000000
add_Z_user_amino_acid = 0.000000
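For anyone who wants to inspect such a file programmatically, here is a minimal Python sketch. It assumes the on-disk layout is one `name = value # comment` setting per line and that `#` never appears inside a value. The function name `parse_fragger_params` is made up for this illustration and is not part of MSFragger or FragPipe.

```python
def parse_fragger_params(text):
    """Parse 'name = value  # comment' lines into a dict of raw string values."""
    params = {}
    for raw_line in text.splitlines():
        # Everything after the first '#' is treated as a comment (assumption).
        line = raw_line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue  # blank line, pure comment, or not a setting
        name, value = line.split("=", 1)
        params[name.strip()] = value.strip()
    return params


# A few lines taken from the parameter file in this thread.
example = """\
calibrate_mass = 2  # Perform mass calibration (0 for OFF, 1 for ON, 2 for ON and find optimal parameters).
num_enzyme_termini = 1  # 0 for non-enzymatic, 1 for semi-enzymatic, and 2 for fully-enzymatic.
digest_mass_range = 500.0 12000.0  # Mass range of peptides to be generated during in-silico digestion in Daltons.
# variable_mod_03 = 79.96633 STY
search_enzyme_butnotafter =  # Residues that the enzyme will not cut before.
"""

params = parse_fragger_params(example)
low, high = (float(x) for x in params["digest_mass_range"].split())
print(params["calibrate_mass"], params["num_enzyme_termini"], low, high)
```

Values are kept as strings on purpose, because settings such as `digest_mass_range` and `precursor_charge` hold two numbers while others hold text.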

fcyu commented 5 years ago

Hi Juergen,

Thanks for your fragger.params file. I noticed that you have digest_mass_range = 500.0 12000.0, which is beyond the allowed range. We have extended the range and released a fixed version at http://msfragger.arsci.com/upgrader/. Could you please download the latest version and try again?

Thanks,

Fengchao
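A pre-flight check along these lines could flag an out-of-range `digest_mass_range` before a long search is started. This is only a sketch: the thread does not say what the allowed bounds are, so `HYPOTHETICAL_MIN` and `HYPOTHETICAL_MAX` below are placeholders, not MSFragger's actual limits, and the helper name is invented for this illustration.

```python
def digest_mass_range_problems(value, allowed_min, allowed_max):
    """Return a list of problems found in a 'low high' digest_mass_range string."""
    parts = value.split()
    if len(parts) != 2:
        return [f"expected two numbers, got {value!r}"]
    low, high = float(parts[0]), float(parts[1])
    problems = []
    if low > high:
        problems.append(f"lower bound {low} exceeds upper bound {high}")
    if low < allowed_min:
        problems.append(f"lower bound {low} is below the allowed minimum {allowed_min}")
    if high > allowed_max:
        problems.append(f"upper bound {high} is above the allowed maximum {allowed_max}")
    return problems


# PLACEHOLDER limits: the real bounds are not stated in this thread.
HYPOTHETICAL_MIN = 0.0
HYPOTHETICAL_MAX = 10000.0

found = digest_mass_range_problems("500.0 12000.0", HYPOTHETICAL_MIN, HYPOTHETICAL_MAX)
for problem in found:
    print(problem)
```

With the real bounds for the installed MSFragger version substituted in, the same check would apply to any parameter file.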

JB91451 commented 5 years ago

Dear Fengchao,

Thank you for the fast reply. It seems that only the digest_mass_range caused the error. With the fixed version the search runs as expected. It is currently calculating the best parameters for recalibration, and I will keep you updated if anything else fails. Just out of curiosity: was the extended range simply ignored in previous versions? As far as I remember, I have already used this large range a few times without failure.

Thank you very much for your help, Juergen

fcyu commented 5 years ago

Hi Juergen,

Thanks for your feedback. Glad that it works for you.

The range limit was only applied to enzymatic searches. Your main search is semi-enzymatic, so it was not affected by the limit. You ran into the issue with the latest version because it adds a first search, which in your case is an enzymatic search.

Bests,

Fengchao
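The explanation above can be sketched as a small decision function. This is a rough model of the behaviour described in this thread for the version before the fix, not MSFragger's actual code. It assumes that "enzymatic" means fully enzymatic (`num_enzyme_termini = 2`) and that the first search added by calibration is always fully enzymatic, whereas the thread only confirms that for this particular case.

```python
def searches_subject_to_mass_limit(calibrate_mass, num_enzyme_termini):
    """List the search passes to which the digest_mass_range limit would apply."""
    passes = []
    if calibrate_mass != 0:
        # Calibration adds a first search; assumed here to be fully enzymatic.
        passes.append(("first search", 2))
    passes.append(("main search", num_enzyme_termini))
    # Per the thread, the limit applied only to enzymatic searches.
    return [name for name, termini in passes if termini == 2]


# Settings from the parameter file in this thread: calibration on, semi-enzymatic.
print(searches_subject_to_mass_limit(calibrate_mass=2, num_enzyme_termini=1))
# Same file with recalibration switched off.
print(searches_subject_to_mass_limit(calibrate_mass=0, num_enzyme_termini=1))
```

Under these assumptions, the semi-enzymatic main search never meets the limit, which matches the observation that the same mass range worked in earlier runs and with recalibration disabled.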