I've run my analyses on the exact same inputs twice (CrystalC results), once following the script for open search, and once attempting to convert the given parameters into the Philosopher pipeline yaml.
I'm confused as to what the default is for some of the parameters that are not specified, and why some of the names of parameters don't match up with options in the yaml.
Can you please have a look at both the script and the yaml and let me know if this is an accurate translation?
I also have a specific question: the results of my comparison already differ at the peptideprophet step. In the yaml, ppm and accmass are set to true by default, but it seems like they must be added explicitly as parameters on the command line. Should these two parameters be added for open search analysis? There's not much description on the GitHub manual page.
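To make the yaml-versus-command-line comparison concrete, here is a minimal Python sketch that turns a block of yaml options into command-line style switches. It assumes, purely for illustration, that each yaml key maps to a flag of the same name and that boolean options are bare switches. As noted above, the names do not always match, so this is not a description of how Philosopher itself builds its commands. The function name `yaml_block_to_flags` is hypothetical.

```python
def yaml_block_to_flags(options):
    """Turn a dict of pipeline options into a list of CLI-style flags.

    Assumption: yaml key "foo" corresponds to a flag "--foo".
    """
    flags = []
    for key, value in options.items():
        if isinstance(value, bool):
            # Boolean options become bare switches; false ones are omitted.
            if value:
                flags.append(f"--{key}")
        else:
            # Everything else is passed as "--key value".
            flags.extend([f"--{key}", str(value)])
    return flags


# Subset of the peptideprophet block from the yaml in this post.
peptideprophet = {
    "accmass": True,
    "decoyprobs": True,
    "expectscore": True,
    "nonparam": True,
    "ppm": True,
    "masswidth": 1000.0,
    "clevel": -2,
    "pi": False,
}

print(" ".join(yaml_block_to_flags(peptideprophet)))
```

Comparing the printed switches against the flags in the script shows which options, such as ppm and accmass, are enabled in the yaml but absent from the command line.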
(There was something wrong with the mass calibration for one of my batches of samples, so the mass tolerance is set to 30 ppm.)
yaml: (I've removed the steps that are not needed)
analytics: true # reports when a workspace is created for usage statistics
commands:
  workspace: no # manage the experiment workspace for the analysis
  database: no # target-decoy database formatting
  comet: no # peptide spectrum matching with Comet
  msfragger: no # peptide spectrum matching with MSFragger
  peptideprophet: no # peptide assignment validation
  ptmprophet: no # PTM site localization
  proteinprophet: no # protein identification validation
  filter: yes # statistical filtering, validation and False Discovery Rates assessment
  freequant: yes # label-free Quantification
  labelquant: no # isobaric Labeling-Based Relative Quantification
  bioquant: no # protein report based on Uniprot protein clusters
  report: yes # multi-level reporting for both narrow-searches and open-searches
  abacus: no # combined analysis of LC-MS/MS results
  tmtintegrator: no # integrates channel abundances from multiple TMT samples
database:
  protein_database: /reference/reference.fasta # path to the target-decoy protein database
  decoy_tag: rev_ # prefix tag added to decoy sequences
peptideprophet: # v5.2
  concurrent: yes # Concurrent execution of multiple instances
  extension: pepXML # pepXML file extension
  clevel: -2 # set Conservative Level in neg_stdev from the neg_mean, low numbers are less conservative, high numbers are more conservative
  accmass: true # use Accurate Mass model binning
  decoyprobs: true # compute possible non-zero probabilities for Decoy entries on the last iteration
  enzyme: trypsin # enzyme used in sample (optional)
  exclude: false # exclude deltaCn*, Mascot*, and Comet* results from results (default Penalize * results)
  expectscore: true # use expectation value as the only contributor to the f-value for modeling
  forcedistr: false # bypass quality control checks, report model despite bad modeling
  glyc: false # enable peptide Glyco motif model
  icat: false # apply ICAT model (default Autodetect ICAT)
  instrwarn: false # warn and continue if combined data was generated by different instrument models
  leave: false # leave alone deltaCn*, Mascot*, and Comet* results from results (default Penalize * results)
  maldi: false # enable MALDI mode
  masswidth: 1000.0 # model mass width (default 5)
  minpeplen: 7 # minimum peptide length not rejected (default 7)
  minpintt: 2 # minimum number of NTT in a peptide used for positive pI model (default 2)
  minpiprob: 0.9 # minimum probability after first pass of a peptide used for positive pI model (default 0.9)
  minprob: 0.05 # report results with minimum probability (default 0.05)
  minrtntt: 2 # minimum number of NTT in a peptide used for positive RT model (default 2)
  minrtprob: 0.9 # minimum probability after first pass of a peptide used for positive RT model (default 0.9)
  neggamma: false # use Gamma distribution to model the negative hits
  noicat: false # do not apply ICAT model (default Autodetect ICAT)
  nomass: false # disable mass model
  nonmc: false # disable NMC missed cleavage model
  nonparam: true # use semi-parametric modeling, must be used in conjunction with --decoy option
  nontt: false # disable NTT enzymatic termini model
  optimizefval: false # (SpectraST only) optimize f-value function f(dot,delta) using PCA
  phospho: false # enable peptide Phospho motif model
  pi: false # enable peptide pI model
  ppm: true # use PPM mass error instead of Dalton for mass modeling
  zero: false # report results with minimum probability 0
proteinprophet: # v5.2
  accuracy: false # equivalent to --minprob 0
  allpeps: false # consider all possible peptides in the database in the confidence model
  confem: false # use the EM to compute probability given the confidence
  delude: false # do NOT use peptide degeneracy information when assessing proteins
  excludezeros: false # exclude zero prob entries
  fpkm: false # model protein FPKM values
  glyc: false # highlight peptide N-glycosylation motif
  icat: false # highlight peptide cysteines
  instances: false # use Expected Number of Ion Instances to adjust the peptide probabilities prior to NSP adjustment
  iprophet: false # input is from iProphet
  logprobs: false # use the log of the probabilities in the Confidence calculations
  maxppmdiff: 2000000 # maximum peptide mass difference in PPM (default 20)
  minprob: 0.05 # peptideProphet probability threshold (default 0.05)
  mufactor: 1 # fudge factor to scale MU calculation (default 1)
  nogroupwts: false # check peptide's Protein weight against the threshold (default: check peptide's Protein Group weight against threshold)
  nonsp: false # do not use NSP model
  nooccam: false # non-conservative maximum protein list
  noprotlen: false # do not report protein length
  normprotlen: false # normalize NSP using Protein Length
  protmw: false # get protein mol weights
  softoccam: false # peptide weights are apportioned equally among proteins within each Protein Group (less conservative protein count estimate)
  unmapped: false # report results for UNMAPPED proteins
filter:
  psmFDR: 0.01 # psm FDR level (default 0.01)
  peptideFDR: 0.01 # peptide FDR level (default 0.01)
  ionFDR: 0.01 # peptide ion FDR level (default 0.01)
  proteinFDR: 0.01 # protein FDR level (default 0.01)
  peptideProbability: 0.7 # top peptide probability threshold for the FDR filtering (default 0.7)
  proteinProbability: 0.5 # protein probability threshold for the FDR filtering (not used with the razor algorithm) (default 0.5)
  peptideWeight: 1 # threshold for defining peptide uniqueness (default 1)
  razor: true # use razor peptides for protein FDR scoring
  picked: false # apply the picked FDR algorithm before the protein scoring
  mapMods: true # map modifications acquired by an open search
  models: false # print model distribution
  sequential: true # alternative algorithm that estimates FDR using both filtered PSM and Protein lists
freequant:
  peakTimeWindow: 0.4 # specify the time windows for the peak (minute) (default 0.4)
  retentionTimeWindow: 3 # specify the retention time window for xic (minute) (default 3)
  tolerance: 30 # m/z tolerance in ppm (default 10); precursor mass tolerance needs to be 30 for cpcgene
  isolated: false # use the isolated ion instead of the selected ion for quantification
report:
  msstats: true # create an output compatible to MSstats
  withDecoys: true # add decoy observations to reports
  mzID: true # create a mzID output
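As a side note on the freequant `tolerance` setting above, a ppm tolerance is relative, so the absolute m/z window grows with the precursor m/z. The short Python sketch below shows the arithmetic for the 30 ppm used here versus the default of 10 ppm. The m/z values 500 and 1000 are arbitrary illustrative choices, not values taken from my data, and `ppm_to_da` is a hypothetical helper name.

```python
def ppm_to_da(mz, ppm):
    """Absolute m/z tolerance corresponding to a relative tolerance in ppm."""
    return mz * ppm / 1_000_000


# Illustrative precursor m/z values only.
for mz in (500.0, 1000.0):
    wide = ppm_to_da(mz, 30)     # tolerance used in this yaml
    default = ppm_to_da(mz, 10)  # default listed in the yaml comment
    print(f"m/z {mz}: 30 ppm = +/-{wide:.4f}, 10 ppm = +/-{default:.4f}")
```

At any given m/z the 30 ppm window is three times as wide as the 10 ppm default.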
Describe the bug: I am following the instructions in "Running a FragPipe-equivalent workflow on Linux", here: https://msfragger.nesvilab.org/tutorial_linux.html
script: