ypriverol closed this issue 2 years ago.
Hi all, below are some thoughts on how to structure the workflow.
Step 1. Generate an in silico predicted spectral library from a FASTA sequence database (in Uniprot format). This step does not require any raw files. The slowdown observed was because it was carried out for each raw file separately, whereas it should only be carried out once per FASTA.
For most DIA experiments, it's better to use the default DIA-NN settings: max 1 missed cleavage, peptide length range 7-30, no variable modifications. Anything else will lead to a slowdown and, potentially, slightly worse performance. Yes, if there's a really huge amount of methionine oxidation, then including M(ox) as a variable mod (and setting max var mods to some reasonable number, e.g. 1 or 2) might ultimately be beneficial, but I am yet to see a dataset where it would provide a significant advantage. Protein N-terminal acetylation is typically fine to include, but it will also lead to a minor slowdown due to PTM scoring overhead.
Result: *.predicted.speclib file containing an in silico library.
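For orientation, a Step 1 call might look like the sketch below. This is only a sketch: the --fasta-search / --predictor / --gen-spec-lib flag combination is my assumption about DIA-NN's in silico digestion and prediction options (it is not spelled out in this thread), and the file names are illustrative.
diann --fasta uniprot_human.fasta \
      --fasta-search \
      --predictor \
      --gen-spec-lib \
      --cut K*,R*,!*P \
      --missed-cleavages 1 \
      --min-pep-len 7 \
      --max-pep-len 30 \
      --var-mods 0 \
      --out-lib lib \
      --threads 8
# the predicted library is saved as a *.predicted.speclib file (see Result above)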
Step 2. Preliminary analysis of individual raw files. Now, one could of course do everything automatically as part of MBR analysis, but, as I understand, the goal is to make things as parallelisable as possible, so let's discuss how to do this 'manually'. Thus, step 2 consists of analysing each run separately with the in silico library generated in step 1. The mass accuracies and the scan window settings in DIA-NN should be either fixed or left automatic. In the latter case, please use --individual-mass-acc and --individual-windows. If mass accuracies are automatic, please also supply the --quick-mass-acc command. Specify a folder with --temp where .quant files will be saved to.
Result: *.quant file for each raw file that contains IDs and quant info; log file.
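For concreteness, a per-run Step 2 call could look like the sketch below, modelled on the Step 2 command shown later in this thread (the run name is illustrative; automatic mass accuracies and scan window are assumed, hence --quick-mass-acc and the --individual-* flags):
diann --lib lib.predicted.speclib \
      --f run01.mzML \
      --threads 8 \
      --quick-mass-acc \
      --individual-mass-acc \
      --individual-windows \
      --temp ./quant_prelim/
# writes a .quant file for run01 into ./quant_prelim/, plus the log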
Step 3. Assemble an empirical spectral library from the .quant files. For this, all .quant files must be accessible to the given instance of DIA-NN. No access to the raw data is required if DIA-NN is supplied with the --rt-profiling command. This corresponds to the "IDs, RT & IM profiling" mode of library generation, which I strongly recommend for this workflow. To execute this step, run DIA-NN with the --use-quant command, listing all raw files with --f (they don't need to be accessible; these file names are only used to infer the .quant file names). The location of the .quant files can be specified with --temp, please see the docs. If the mass accuracies and the scan window settings in DIA-NN were automatic during Step 2, please use --individual-mass-acc and/or --individual-windows, respectively.
Result: *.tsv spectral library.
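A sketch of the assembly call, based on the Step 3 command discussed later in this thread (file names illustrative; --individual-mass-acc/--individual-windows included on the assumption that Step 2 used automatic settings):
diann --lib lib.predicted.speclib \
      --f run01.mzML --f run02.mzML \
      --use-quant \
      --temp ./quant_prelim/ \
      --individual-mass-acc \
      --individual-windows \
      --rt-profiling \
      --out-lib empirical_library.tsv \
      --threads 8
# the raw files themselves need not be present; their names are only used to locate the .quant files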
Step 4. Final analysis of individual raw files. One could now use the empirical spectral library to analyse all files in one go (much faster than the preliminary step), but let's consider again how to do things manually, with the analysis steps for each run being separate. As in Step 2, each run needs to be processed with the empirical spectral library, but the mass accuracies & scan window must now be fixed. In case they were not fixed for Step 2, the log file produced by Step 3 contains a line like "Averaged recommended settings for this experiment: Mass accuracy = 11ppm, MS1 accuracy = 15ppm, Scan window = 7": extract the settings from this line and pass them to DIA-NN during this search step. Specify a folder with --temp where the .quant files will be saved; make sure it's a different folder from the one used previously, so that the old .quant files (preliminary analysis) are not overwritten. For best protein inference during the next step, specify a FASTA database during this step and also make sure protein inference is enabled.
Result: *.quant file for each raw file that contains IDs and quant info.
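One way to pull these values out of the Step 3 log is the grep/cut/tr approach used by the pipeline code later in this thread; the field positions (11/15/19) assume the exact wording of the log line, including its leading timestamp token, so treat this as a sketch:
mass_acc=$(grep "Averaged recommended settings" report.log.txt | cut -d ' ' -f 11 | tr -cd "[0-9]")
ms1_acc=$(grep "Averaged recommended settings" report.log.txt | cut -d ' ' -f 15 | tr -cd "[0-9]")
scan_window=$(grep "Averaged recommended settings" report.log.txt | cut -d ' ' -f 19 | tr -cd "[0-9]")
# then pass to the Step 4 run as:
#   --mass-acc $mass_acc --mass-acc-ms1 $ms1_acc --window $scan_window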
Step 5. Summarise the info. For this, you need the .tsv empirical spectral library, the FASTA database to annotate it, and the .quant files newly created during Step 4. Run DIA-NN specifying the raw file names with --f (again, they don't need to be accessible; these file names are only used to infer the .quant file names) and with --use-quant, with --temp specifying the .quant files location. Here indicate --relaxed-prot-inf (for FragPipe-like protein inference) and --pg-level 2 (implicit grouping by gene names). --pg-level 0 is also OK (implicit grouping by sequence IDs). If using --pg-level 2 with multiple species in the database, you can add --species-genes (adds a suffix to gene names indicating the species; works provided the database is Uniprot-formatted).
Result: DIA-NN output files.
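Putting the Step 5 flags together, the final call could look like the sketch below (file names illustrative; --matrices anticipates a recommendation made later in this thread):
diann --lib empirical_library.tsv \
      --fasta uniprot_human.fasta \
      --f run01.mzML --f run02.mzML \
      --use-quant \
      --temp ./quant_final/ \
      --relaxed-prot-inf \
      --pg-level 2 \
      --species-genes \
      --matrices \
      --out diann_report.tsv \
      --threads 8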
Some notes:
@daichengxin Lets start to implement this approach for the next release of the pipeline. I will add some lines here to clarify the pipeline steps:
Hi all, below are some thoughts on how to structure the workflow.
Step 1. Generate an in silico predicted spectral library from a FASTA sequence database (in Uniprot format). This step does not require any raw files. The slowdown observed was because it was carried out for each raw file separately, whereas it should only be carried out once per FASTA.
For most DIA experiments, it's better to use the default DIA-NN settings: max 1 missed cleavage, peptide length range 7-30, no variable modifications. Anything else will lead to a slowdown and, potentially, slightly worse performance. Yes, if there's a really huge amount of methionine oxidation, then including M(ox) as a variable mod (and setting max var mods to some reasonable number, e.g. 1 or 2) might ultimately be beneficial, but I am yet to see a dataset where it would provide a significant advantage. Protein N-terminal acetylation is typically fine to include, but it will also lead to a minor slowdown due to PTM scoring overhead.
Result: *.predicted.speclib file containing an in silico library.
Step 1, library generation from FASTA: default parameters.
Step 2. Preliminary analysis of individual raw files. Now, one could of course do everything automatically as part of MBR analysis, but, as I understand, the goal is to make things as parallelisable as possible, so let's discuss how to do this 'manually'. Thus, step 2 consists of analysing each run separately with the in silico library generated in step 1. The mass accuracies and the scan window settings in DIA-NN should be either fixed or left automatic. If mass accuracies are automatic, please also supply the --quick-mass-acc command. Specify a folder with --temp where .quant files will be saved to.
Result: *.quant file for each raw file that contains IDs and quant info; log file.
Use --quick-mass-acc unless the following parameters are provided: --mass-acc 10, --mass-acc-ms1 20.
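In pipeline terms this could be a simple toggle, sketched below with an illustrative shell variable (MASS_ACC_FIXED is not an actual pipeline parameter):
# choose between fixed and automatic mass accuracies for Step 2
MASS_ACC_FIXED=false
if [ "$MASS_ACC_FIXED" = true ]; then
    ACC_FLAGS="--mass-acc 10 --mass-acc-ms1 20"
else
    ACC_FLAGS="--quick-mass-acc --individual-mass-acc --individual-windows"
fi
diann --lib lib.predicted.speclib --f run01.mzML --temp ./quant_prelim/ $ACC_FLAGS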
@vdemichev in this preliminary analysis, should we remove some parameters related to inference, etc., because this is done in the last step (5)? Is there any other parameter related to quantitation, inference, etc. that would speed up this step?
Step 3. Assemble an empirical spectral library from the .quant files. For this, all .quant files must be accessible to the given instance of DIA-NN. No access to the raw data is required if DIA-NN is supplied with the --rt-profiling command. This corresponds to the "IDs, RT & IM profiling" mode of library generation, which I strongly recommend for this workflow. To execute this step, run DIA-NN with the --use-quant command, listing all raw files with --f (they don't need to be accessible; these file names are only used to infer the .quant file names). The location of the .quant files can be specified with --temp, please see the docs.
Result: *.tsv spectral library.
Nothing to be clarified here 👍
Step 4. Final analysis of individual raw files. One could now use the empirical spectral library to analyse all files in one go (much faster than the preliminary step), but let's consider again how to do things manually, with the analysis steps for each run being separate. As in Step 2, each run needs to be processed with the empirical spectral library, but the mass accuracies & scan window must now be fixed. In the log file produced by Step 2, there's a line like "Averaged recommended settings for this experiment: Mass accuracy = 11ppm, MS1 accuracy = 15ppm, Scan window = 7". Extract the settings from this line and pass them to DIA-NN during this search step. Specify a folder with --temp where the .quant files will be saved; make sure it's a different folder from the one used previously, so that the old .quant files (preliminary analysis) are not overwritten.
Result: *.quant file for each raw file that contains IDs and quant info.
@vdemichev when we extract this from the log files ("Averaged recommended settings for this experiment: Mass accuracy = 11ppm, MS1 accuracy = 15ppm, Scan window = 7") in Step 2, do we then run each independent file against the "summarised" library with the corresponding settings extracted from the log?
Step 5. Summarise the info. For this, you need the .tsv empirical spectral library, the FASTA database to annotate it, and the .quant files newly created during Step 4. Run DIA-NN specifying the raw file names with --f (again, they don't need to be accessible; these file names are only used to infer the .quant file names) and with --use-quant, with --temp specifying the .quant files location. Here indicate --relaxed-prot-inf (for FragPipe-like protein inference) and --pg-level 2 (implicit grouping by gene names). --pg-level 0 is also OK (implicit grouping by sequence IDs). If using --pg-level 2 with multiple species in the database, you can add --species-genes (adds a suffix to gene names indicating the species; works provided the database is Uniprot-formatted).
Result: DIA-NN output files.
Nothing to clarify here 👍
@vdemichev in this preliminary analysis, should we remove some parameters related to inference, etc., because this is done in the last step (5)? Is there any other parameter related to quantitation, inference, etc. that would speed up this step?
Indeed, protein inference only happens at Step 5. Quantification - at Steps 4 and 5. These algorithms however don't take any CPU time.
@vdemichev when we extract this from the log files ("Averaged recommended settings for this experiment: Mass accuracy = 11ppm, MS1 accuracy = 15ppm, Scan window = 7") in Step 2, do we then run each independent file against the "summarised" library with the corresponding settings extracted from the log?
This log line will be displayed during Step 3 (sorry, I initially incorrectly wrote that it's Step 2; in fact it is Step 3, and I have now also corrected it above) and will provide recommended averages based on summarising the analysis carried out at Step 2. So these settings will be the same for all runs during Step 4.
max miss cleavages, min peptide length, max peptide length, max_mods: these parameters are shared with msgf and comet. Do I need to modify the default values in nextflow.config?
The current modification parameters come directly from the SDRF file. You say no variable modifications, while variable modifications like M(ox) and N-term(ac) should be provided by the SDRF; does that mean I should only keep these two as variable modifications and filter out the other variable modifications?
@daichengxin max_mods in DIA analysis: the Step 1 library generation is finished in #169. Also, in a regular closed search, I would not advise searching with more than 2-3 max_mods, right?
The decision should actually be made between 2 and 3. For DIA-NN the default should be 2, but in DDA the most commonly used is 3 😏
OK, but I think after generating the library only once, this might not be the bottleneck anymore... let's see.
The larger the search space (the library), the slower the main analysis. 3 variable modifications is quite a lot.
Because the pipeline shares the variable with the DDA branch (and DDA is still more popular 😝) we will set the default to 3, but we will make that clear in the documentation and probably also in the pipeline. @vdemichev just to be clear, the user can set that variable; we are talking here about the default value.
Yes, can specify any number here.
Hi @vdemichev, in step 3 (assemble an empirical spectral library from the .quant files), does the --lib parameter specify the library generated by the first step or the second step?
ping @vdemichev ☝️ https://github.com/bigbio/quantms/issues/164#issuecomment-1120653490
@daichengxin, --use-quant should normally be used with exactly the same settings as the initial analysis. Meaning that --lib specifies the initial output library. If there are any difficulties, please let me know & attach full logs, including the log of DIA-NN producing the .quant files.
@vdemichev Does that mean specifying the in silico library generated in the first step? The full command for step 3 is now as follows:
diann --f ${(mzMLs as List).join(' --f ')} \\
--lib ${lib} \\
${min_pr_mz} \\
${max_pr_mz} \\
${min_fr_mz} \\
${max_fr_mz} \\
--threads ${task.cpus} \\
--out-lib empirical_library.tsv \\
--missed-cleavages $params.allowed_missed_cleavages \\
--min-pep-len $params.min_peptide_length \\
--max-pep-len $params.max_peptide_length \\
--min-pr-charge $params.min_precursor_charge \\
--max-pr-charge $params.max_precursor_charge \\
--var-mods $params.max_mods \\
--cut {enzyme} \\
--var-mod {from sdrf} \\
--fixed-mod {from sdrf} \\
--verbose $params.diann_debug \\
--rt-profiling \\
--temp ./quant/ \\
--use-quant \\
--predictor \\
|& tee assemble_empirical_library.log
The log file does not contain the "Averaged recommended settings for this experiment: Mass accuracy = 11ppm, MS1 accuracy = 15ppm, Scan window = 7" info in Step 3. Full logs as follows:
@daichengxin
Does that mean specifying the in silico library generated in the first step?
Yes.
Thanks a lot @vdemichev. I just updated steps 2 and 3: added --individual-mass-acc and --individual-windows, then removed --predictor in step 3. But the averaged recommended settings are produced at step 2 and not at step 3. Finally, I uploaded step 4:
Step 4:
mass_acc = params.mass_acc_ms2
scan_window = params.scan_window
ms1_accuracy = params.mass_acc_ms1
time_corr_only = params.time_corr_only ? "--time-corr-only" : ""
// extract the averaged recommended settings from the upstream DIA-NN log
if (params.mass_acc_automatic || params.scan_window_automatic){
    mass_acc = "\$(cat ${diann_log} | grep \"Averaged recommended settings\" | cut -d ' ' -f 11 | tr -cd \"[0-9]\")"
    scan_window = "\$(cat ${diann_log} | grep \"Averaged recommended settings\" | cut -d ' ' -f 19 | tr -cd \"[0-9]\")"
    ms1_accuracy = "\$(cat ${diann_log} | grep \"Averaged recommended settings\" | cut -d ' ' -f 15 | tr -cd \"[0-9]\")"
}
"""
diann --lib ${library} \\
--f ${mzML} \\
${min_pr_mz} \\
${max_pr_mz} \\
${min_fr_mz} \\
${max_fr_mz} \\
--threads ${task.cpus} \\
--missed-cleavages $params.allowed_missed_cleavages \\
--min-pep-len $params.min_peptide_length \\
--max-pep-len $params.max_peptide_length \\
--min-pr-charge $params.min_precursor_charge \\
--max-pr-charge $params.max_precursor_charge \\
--var-mods $params.max_mods \\
--verbose $params.diann_debug \\
--temp ./ \\
--cut {enzyme} \\
--var-mod {variable modifications from sdrf input by users} \\
--fixed-mod {fixed modifications from sdrf input by users} \\
--min-corr $params.min_corr \\
--corr-diff $params.corr_diff \\
--mass-acc \$(echo ${mass_acc}) \\
--mass-acc-ms1 \$(echo ${ms1_accuracy}) \\
--window \$(echo ${scan_window}) \\
${time_corr_only} \\
@daichengxin there's something wrong, averaged recommended settings cannot be produced during Step 2, only Step 3. Can you please share the logs?
@vdemichev These are the logs for steps 2 and 3.
Log for step 2: report.log.txt
Log for step 3: report.log.txt
Step 3 needs one of the two commands that indicate that the mass accuracies or scan window need to be determined separately for different runs, please see above.
So is my current second-step command correct? It also produced averaged recommended settings.
Step 2 command:
diann --cut K*,R*,!*P \
      --fixed-mod Carbamidomethyl,57.021464,C \
      --var-mod Oxidation,15.994915,M \
      --lib lib.predicted.speclib \
      --f RD139_Narrow_UPS1_0_1fmol_inj1.mzML \
      --min-pr-mz 350 --max-pr-mz 950 \
      --min-fr-mz 500 --max-fr-mz 1500 \
      --threads 2 \
      --missed-cleavages 1 \
      --min-pep-len 15 --max-pep-len 30 \
      --min-pr-charge 2 --max-pr-charge 3 \
      --var-mods 2 \
      --verbose 3 \
      --individual-windows \
      --temp ./ \
      --min-corr 2.0 --corr-diff 1.0 \
      --quick-mass-acc --individual-mass-acc \
      --time-corr-only
Indicated this in the main description now.
Nice. Thanks a lot @vdemichev. I am working on it.
Yes, Step 2 is correct, except it just produces recommended settings for a single run, which is not something of interest, so please just ignore this message in Step 2.
Hi @vdemichev. This is the full command in step 4. Could you check it? The recommended settings are extracted from the log in step 3 and passed to step 4. Thanks a lot in advance!
// extract the averaged recommended settings from the Step 3 log
if (params.mass_acc_automatic || params.scan_window_automatic){
    mass_acc = "\$(cat ${diann_log} | grep \"Averaged recommended settings\" | cut -d ' ' -f 11 | tr -cd \"[0-9]\")"
    scan_window = "\$(cat ${diann_log} | grep \"Averaged recommended settings\" | cut -d ' ' -f 19 | tr -cd \"[0-9]\")"
    ms1_accuracy = "\$(cat ${diann_log} | grep \"Averaged recommended settings\" | cut -d ' ' -f 15 | tr -cd \"[0-9]\")"
}
"""
diann --lib ${library} \\
--f ${mzML} \\
${min_pr_mz} \\
${max_pr_mz} \\
${min_fr_mz} \\
${max_fr_mz} \\
--threads ${task.cpus} \\
--missed-cleavages $params.allowed_missed_cleavages \\
--min-pep-len $params.min_peptide_length \\
--max-pep-len $params.max_peptide_length \\
--min-pr-charge $params.min_precursor_charge \\
--max-pr-charge $params.max_precursor_charge \\
--var-mods $params.max_mods \\
--verbose $params.diann_debug \\
--temp ./ \\
--min-corr $params.min_corr \\
--corr-diff $params.corr_diff \\
--mass-acc \$(echo ${mass_acc}) \\
--mass-acc-ms1 \$(echo ${ms1_accuracy}) \\
--window \$(echo ${scan_window}) \\
--no-ifs-removal \\
--cut {enzyme from sdrf input by user} \\
--var-mod {var_mods from sdrf input by user} \\
--fixed-mod {fixed_mods from sdrf input by user} \\
${time_corr_only} \\
Full command in step 5:
diann --lib ${empirical_library} \\
--fasta ${fasta} \\
--f ${(mzMLs as List).join(' --f ')} \\
${min_pr_mz} \\
${max_pr_mz} \\
${min_fr_mz} \\
${max_fr_mz} \\
--threads ${task.cpus} \\
--missed-cleavages $params.allowed_missed_cleavages \\
--min-pep-len $params.min_peptide_length \\
--max-pep-len $params.max_peptide_length \\
--min-pr-charge $params.min_precursor_charge \\
--max-pr-charge $params.max_precursor_charge \\
--var-mods $params.max_mods \\
--verbose $params.diann_debug \\
${scan_window} \\
--min-corr $params.min_corr \\
--corr-diff $params.corr_diff \\
${mass_acc} \\
${time_corr_only} \\
--cut {enzyme from sdrf input by user} \\
--var-mod {var mods from sdrf input by user} \\
--fixed-mod {fixed mods from sdrf input by user} \\
--temp ./quant/ \\
--relaxed-prot-inf \\
--pg-level $params.pg_level \\
${species_genes} \\
--use-quant \\
--out diann_report.tsv \\
Are the mass-acc, mass-acc-ms1 and window parameters still needed here, and should they come from the recommended settings?
Hi @daichengxin, Step 4 I would recommend running without --min-corr and --corr-diff and without --time-corr-only. Also, during Step 5 these have no effect. --cut, --var-mod and --fixed-mod, as well as all other digest settings, have no effect during Steps 4 and 5. During Step 5 I would add --matrices. Otherwise looks good. Please let me see the full DIA-NN logs from all Steps 1-5 once the pipeline is set up.
OK. Should --min-pr-mz, --max-pr-mz, --min-fr-mz, --max-fr-mz, --min-pep-len, --max-pep-len, --min-pr-charge and --max-pr-charge also have no effect during steps 4 and 5? @vdemichev
Yes
Hi @vdemichev These are the full DIA-NN logs from all Steps 1-5.
step1 log: lib.log.txt
step2 log: report.log.txt
step3 log: report.log.txt
step4 log: report.log.txt
step5 log: diann_report.log.txt
Seems good! Some comments:
Step 1: the minimum peptide length is set to 15 - is that intentional?
Steps 2 to 5: digest settings have no effect.
Step 4: can add --no-main-report, as we don't need a report; can also turn off protein inference, as it's not necessary at this stage.
Yes, min 15 is because we need CI/CD to run fast; it is the same case for the min fragment m/z set to 500. Can you check the other comments, @daichengxin?
Thanks a lot @vdemichev for your support
Thanks a lot @vdemichev. Here are the updated logs from steps 2 to 5. Could you check again?
step 2: report.log.txt
step 3: report.log.txt
step 4: report.log.txt
step 5: diann_report.log.txt
Looks good!
Thanks to @vdemichev and @daichengxin for working on the first iteration of DIA-NN with quantms. Here is the first full dataset reanalyzed with DIA-NN: http://ftp.pride.ebi.ac.uk/pride/data/proteomes/differential-expression/PXD004684-disease/
The volcano plot result of MSstats can be found here.
Some comments:
Thanks @vdemichev for the feedback.
Hi @vdemichev. This is the log for step 4 with FASTA, could you check again? report.log.txt
With or without? It looks like he suggested without, if I am not mistaken.
@vdemichev How do we combine the FDR parameters (q-value thresholds in each step)? I guess we MUST pass the FDR threshold in the last step?
--matrix-ch-qvalue [x] sets the 'channel q-value' used to filter the output matrices
--matrix-qvalue [x] sets the q-value used to filter the output matrices
--matrix-tr-qvalue [x] sets the 'translated q-value' used to filter the output matrices
--matrix-spec-q run-specific protein q-value filtering will be used, in addition to the global q-value filtering, when saving protein matrices. The ability to filter based on run-specific protein q-values, which allows to generate highly reliable data, is one of the advantages of DIA-NN
For DDA data, we use the PSM and protein FDR at the project level (meaning the last step). We also filter at run-level PSM FDR with 0.10% per ms_run to be more stringent.
@daichengxin Step 4 log seems good! @ypriverol FDR thresholds during steps 2-4 should be set to 1%. During Step 5, can use any threshold. Filtering stricter than 1% might significantly reduce IDs - it's always better to analyse at 1%, and then the user can filter the output for their specific workflow. --matrix-spec-q and --matrix-qvalue affect matrices only, not the main report. I suggest not using these, as default settings for matrices seem the best for most datasets, and those users who want more stringent can filter the main report. But good to include matrices too, for those who don't use R/Python.
@vdemichev correct me if I'm wrong:
--matrix-qvalue = 0.01: the user will be able to set different values in case they want to get more IDs, but by default it will be 0.01.
--matrix-spec-q: we will pass the default value of 1% (0.01) for the run-specific protein filter; if users want to define their own value and relax this parameter to get less strictly filtered data, they can. In reanalysis, because we can't check each dataset in detail, we may use this parameter at the default 0.01 for highly reliable results.
--matrix-qvalue: also configurable in Step 5; if users want to get more IDs, they may use a less stringent FDR, let's say 0.05.
We will not use --matrix-tr-qvalue or --matrix-ch-qvalue in any step.
Please correct me if I have understood the parameter combination well.
Hi @ypriverol, matrix q-value setting only controls the q-value filtering of matrix output, nothing else. So of no relevance for Steps 1-4. For Steps 2-4, please only use --qvalue 0.01 (but 0.01 is the default value, so can also just omit this).
--matrix-spec-q: configures run-specific protein FDR for matrix output. Other software simply does not have this setting at all, filtering on run-specific protein FDR is questionable (many advise against it) and reduces data completeness. So for most applications it's better not to filter, and DIA-NN by default does not filter. Note that global protein FDR is always applied to matrices regardless of this setting.
--matrix-qvalue - again, better to keep 0.01 (default) for most applications. Whoever wants stricter filtering can always filter the main report manually. 0.05 is too relaxed for most applications. As matrices cannot be filtered further, better to stick to a reasonable value which is 0.01.
Hi @vdemichev. Is it reasonable to manually filter on the Q.Value column in the main report for downstream analysis?
Yes, if the --qvalue was set to a value higher than 0.01 (this is OK only for Step 5), then the output report must be filtered at Q.Value <= 0.01. Applying stricter filtering than 0.01 will result in more missing values & likely worse downstream analysis. I suggest not doing that, as this strict filtering will inevitably result in many people looking at the data and thinking, falsely, that the ID numbers are low and the missing value rate is high...
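For completeness, a sketch of such a manual filter on the main report (it assumes a tab-separated diann_report.tsv with a Q.Value column, as in the Step 5 command above):
# keep the header plus all rows with Q.Value <= 0.01
awk -F'\t' 'NR==1 { for (i=1; i<=NF; i++) if ($i == "Q.Value") q = i; print; next }
            $q <= 0.01' diann_report.tsv > diann_report.filtered.tsv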
Description of feature
Dear @daichengxin @vdemichev @jpfeuffer:
I have started testing the DIA-NN pipeline in quantms. The main problem I found is that the pipeline is still really slow. However, after talking to @vdemichev, some other issues should be solved. First, two ideas about what we want to solve between DIA-NN and quantms:
quantms + diann should be able to:
The current pipeline/parallelization does the following:
This actually generates a file (diann_config.cfg) containing the following information:
Each raw file in the SDRF is converted to mzML; or the mzML that was converted outside the pipeline (the AB Sciex case) is used for the analysis.
For each mzML, we generate in parallel the corresponding theoretical spectral library using the following command:
Note: this step is extremely slow. In addition, @vdemichev has commented to me that generating a spectral library per raw file will give wrong results in the next step (step 4) when searching all the data together and performing the quantification (wrong q-values).
Discussing performance with @vdemichev, he mentioned that this approach will give wrong statistical results.
@vdemichev Let's discuss the ideas in this thread in detail.