ypriverol closed this issue 2 years ago.
Hi all, below are some thoughts on how to structure the workflow.
Step 1. Generate an in silico predicted spectral library from a FASTA sequence database (in Uniprot format). This step does not require any raw files. The slowdown observed was because it was carried out for each raw file separately, whereas it should only be carried out once per FASTA.
For most DIA experiments, it's better to use the default DIA-NN settings: max 1 missed cleavage, peptide length range 7-30, no variable modifications. Anything else will lead to a slowdown and, potentially, slightly worse performance. Yes, if there's a really huge amount of methionine oxidation, then including M(ox) as a variable mod (and setting max var mods to some reasonable number, e.g. 1 or 2) might ultimately be beneficial, but I am yet to see a dataset where it would provide a significant advantage. Protein N-terminal acetylation is typically fine to include, but it will also lead to a minor slowdown due to PTM scoring overhead.
Result: *.predicted.speclib file containing an in silico library.
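For orientation, a Step 1 call might look like the sketch below. This is only a sketch: the --fasta-search / --predictor / --gen-spec-lib flag combination is my assumption about DIA-NN's in silico digestion and prediction options (it is not spelled out in this thread), and the file names are illustrative.
diann --fasta uniprot_human.fasta \
      --fasta-search \
      --predictor \
      --gen-spec-lib \
      --cut K*,R*,!*P \
      --missed-cleavages 1 \
      --min-pep-len 7 \
      --max-pep-len 30 \
      --var-mods 0 \
      --out-lib lib \
      --threads 8
# the predicted library is saved as a *.predicted.speclib file (see Result above)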
Step 2. Preliminary analysis of individual raw files. Now, one could of course do everything automatically as part of MBR analysis, but, as I understand, the goal is to make things as parallelisable as possible, so let's discuss how to do this 'manually'. Thus, step 2 consists of analysing each run separately with the in silico library generated in step 1. The mass accuracies and the scan window settings in DIA-NN should be either fixed or left automatic. In the latter case, please use --individual-mass-acc and --individual-windows. If mass accuracies are automatic, please also supply the --quick-mass-acc command. Specify a folder with --temp where .quant files will be saved to.
Result: *.quant file for each raw file that contains IDs and quant info; log file.
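For concreteness, a per-run Step 2 call could look like the sketch below, modelled on the Step 2 command shown later in this thread (the run name is illustrative; automatic mass accuracies and scan window are assumed, hence --quick-mass-acc and the --individual-* flags):
diann --lib lib.predicted.speclib \
      --f run01.mzML \
      --threads 8 \
      --quick-mass-acc \
      --individual-mass-acc \
      --individual-windows \
      --temp ./quant_prelim/
# writes a .quant file for run01 into ./quant_prelim/, plus the log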
Step 3. Assemble an empirical spectral library from the .quant files. For this, all .quant files must be accessible to the given instance of DIA-NN. No access to the raw data is required if DIA-NN is supplied with the --rt-profiling command. This corresponds to the "IDs, RT & IM profiling" mode of library generation, which I strongly recommend for this workflow. To execute this step, run DIA-NN with the --use-quant command, listing all raw files with --f (they don't need to be accessible; these file names are only used to infer the .quant file names). The location of the .quant files can be specified with --temp, please see the docs. If the mass accuracies and the scan window settings in DIA-NN were automatic during Step 2, please use --individual-mass-acc and/or --individual-windows, respectively.
Result: *.tsv spectral library.
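A sketch of the assembly call, based on the Step 3 command discussed later in this thread (file names illustrative; --individual-mass-acc/--individual-windows included on the assumption that Step 2 used automatic settings):
diann --lib lib.predicted.speclib \
      --f run01.mzML --f run02.mzML \
      --use-quant \
      --temp ./quant_prelim/ \
      --individual-mass-acc \
      --individual-windows \
      --rt-profiling \
      --out-lib empirical_library.tsv \
      --threads 8
# the raw files themselves need not be present; their names are only used to locate the .quant files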
Step 4. Final analysis of individual raw files. One could now use the empirical spectral library to analyse all files in one go (much faster than the preliminary step), but let's consider again how to do things manually, with the analysis steps for each run being separate. As in Step 2, each run needs to be processed with the empirical spectral library, but the mass accuracies & scan window must now be fixed. In case they were not fixed for Step 2, the log file produced by Step 3 contains a line like "Averaged recommended settings for this experiment: Mass accuracy = 11ppm, MS1 accuracy = 15ppm, Scan window = 7": extract the settings from this line and pass them to DIA-NN during this search step. Specify a folder with --temp where the .quant files will be saved; make sure it's a different folder from the one used previously, so that the old .quant files (preliminary analysis) are not overwritten. For best protein inference during the next step, specify a FASTA database during this step and also make sure protein inference is enabled.
Result: *.quant file for each raw file that contains IDs and quant info.
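One way to pull these values out of the Step 3 log is the grep/cut/tr approach used by the pipeline code later in this thread; the field positions (11/15/19) assume the exact wording of the log line, including its leading timestamp token, so treat this as a sketch:
mass_acc=$(grep "Averaged recommended settings" report.log.txt | cut -d ' ' -f 11 | tr -cd "[0-9]")
ms1_acc=$(grep "Averaged recommended settings" report.log.txt | cut -d ' ' -f 15 | tr -cd "[0-9]")
scan_window=$(grep "Averaged recommended settings" report.log.txt | cut -d ' ' -f 19 | tr -cd "[0-9]")
# then pass to the Step 4 run as:
#   --mass-acc $mass_acc --mass-acc-ms1 $ms1_acc --window $scan_window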
Step 5. Summarise the info. For this, you need the .tsv empirical spectral library, the FASTA database to annotate it, and the .quant files newly created during Step 4. Run DIA-NN specifying the raw file names with --f (again, they don't need to be accessible; these file names are only used to infer the .quant file names) and with --use-quant, with --temp specifying the .quant files location. Here indicate --relaxed-prot-inf (for FragPipe-like protein inference) and --pg-level 2 (implicit grouping by gene names). --pg-level 0 is also OK (implicit grouping by sequence IDs). If using --pg-level 2 with multiple species in the database, you can add --species-genes (adds a suffix to gene names indicating the species; works provided the database is Uniprot-formatted).
Result: DIA-NN output files.
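Putting the Step 5 flags together, the final call could look like the sketch below (file names illustrative; --matrices anticipates a recommendation made later in this thread):
diann --lib empirical_library.tsv \
      --fasta uniprot_human.fasta \
      --f run01.mzML --f run02.mzML \
      --use-quant \
      --temp ./quant_final/ \
      --relaxed-prot-inf \
      --pg-level 2 \
      --species-genes \
      --matrices \
      --out diann_report.tsv \
      --threads 8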
Some notes:
@daichengxin Lets start to implement this approach for the next release of the pipeline. I will add some lines here to clarify the pipeline steps:
Hi all, below are some thoughts on how to structure the workflow.
Step 1. Generate an in silico predicted spectral library from a FASTA sequence database (in Uniprot format). This step does not require any raw files. The slowdown observed was because it was carried out for each raw file separately, whereas it should only be carried out once per FASTA.
For most DIA experiments, it's better to use the default DIA-NN settings: max 1 missed cleavage, peptide length range 7-30, no variable modifications. Anything else will lead to a slowdown and, potentially, slightly worse performance. Yes, if there's a really huge amount of methionine oxidation, then including M(ox) as a variable mod (and setting max var mods to some reasonable number, e.g. 1 or 2) might ultimately be beneficial, but I am yet to see a dataset where it would provide a significant advantage. Protein N-terminal acetylation is typically fine to include, but it will also lead to a minor slowdown due to PTM scoring overhead.
Result: *.predicted.speclib file containing an in silico library.
Step 1, library generation from FASTA: default parameters.
Step 2. Preliminary analysis of individual raw files. Now, one could of course do everything automatically as part of MBR analysis, but, as I understand, the goal is to make things as parallelisable as possible, so let's discuss how to do this 'manually'. Thus, step 2 consists of analysing each run separately with the in silico library generated in step 1. The mass accuracies and the scan window settings in DIA-NN should be either fixed or left automatic. If mass accuracies are automatic, please also supply the --quick-mass-acc command. Specify a folder with --temp where .quant files will be saved to.
Result: *.quant file for each raw file that contains IDs and quant info; log file.
Use --quick-mass-acc unless the following parameters are provided: --mass-acc 10, --mass-acc-ms1 20.
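In pipeline terms this could be a simple toggle, sketched below with an illustrative shell variable (MASS_ACC_FIXED is not an actual pipeline parameter):
# choose between fixed and automatic mass accuracies for Step 2
MASS_ACC_FIXED=false
if [ "$MASS_ACC_FIXED" = true ]; then
    ACC_FLAGS="--mass-acc 10 --mass-acc-ms1 20"
else
    ACC_FLAGS="--quick-mass-acc --individual-mass-acc --individual-windows"
fi
diann --lib lib.predicted.speclib --f run01.mzML --temp ./quant_prelim/ $ACC_FLAGS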
@vdemichev in this preliminary analysis, should we remove some parameters related to inference, etc., because this is done in the last step (5)? Is there any other parameter related to quantitation, inference, etc. that would speed up this step?
Step 3. Assemble an empirical spectral library from the .quant files. For this, all .quant files must be accessible to the given instance of DIA-NN. No access to the raw data is required if DIA-NN is supplied with the --rt-profiling command. This corresponds to the "IDs, RT & IM profiling" mode of library generation, which I strongly recommend for this workflow. To execute this step, run DIA-NN with the --use-quant command, listing all raw files with --f (they don't need to be accessible; these file names are only used to infer the .quant file names). The location of the .quant files can be specified with --temp, please see the docs.
Result: *.tsv spectral library.
Nothing to be clarified here 👍
Step 4. Final analysis of individual raw files. One could now use the empirical spectral library to analyse all files in one go (much faster than the preliminary step), but let's consider again how to do things manually, with the analysis steps for each run being separate. As in Step 2, each run needs to be processed with the empirical spectral library, but the mass accuracies & scan window must now be fixed. In the log file produced by Step 2, there's a line like "Averaged recommended settings for this experiment: Mass accuracy = 11ppm, MS1 accuracy = 15ppm, Scan window = 7". Extract the settings from this line and pass them to DIA-NN during this search step. Specify a folder with --temp where the .quant files will be saved; make sure it's a different folder from the one used previously, so that the old .quant files (preliminary analysis) are not overwritten.
Result: *.quant file for each raw file that contains IDs and quant info.
@vdemichev when we extract this from the log files ("Averaged recommended settings for this experiment: Mass accuracy = 11ppm, MS1 accuracy = 15ppm, Scan window = 7") in Step 2, do we then run each independent file against the "summarised" library with the corresponding settings extracted from the log?
Step 5. Summarise the info. For this, you need the .tsv empirical spectral library, the FASTA database to annotate it, and the .quant files newly created during Step 4. Run DIA-NN specifying the raw file names with --f (again, they don't need to be accessible; these file names are only used to infer the .quant file names) and with --use-quant, with --temp specifying the .quant files location. Here indicate --relaxed-prot-inf (for FragPipe-like protein inference) and --pg-level 2 (implicit grouping by gene names). --pg-level 0 is also OK (implicit grouping by sequence IDs). If using --pg-level 2 with multiple species in the database, you can add --species-genes (adds a suffix to gene names indicating the species; works provided the database is Uniprot-formatted).
Result: DIA-NN output files.
Nothing to clarify here 👍
@vdemichev in this preliminary analysis, should we remove some parameters related to inference, etc., because this is done in the last step (5)? Is there any other parameter related to quantitation, inference, etc. that would speed up this step?
Indeed, protein inference only happens at Step 5. Quantification - at Steps 4 and 5. These algorithms however don't take any CPU time.
@vdemichev when we extract this from the log files ("Averaged recommended settings for this experiment: Mass accuracy = 11ppm, MS1 accuracy = 15ppm, Scan window = 7") in Step 2, do we then run each independent file against the "summarised" library with the corresponding settings extracted from the log?
This log line will be displayed during Step 3 (sorry, I initially incorrectly wrote that it's Step 2; in fact it is Step 3, and I have now also corrected it above) and will provide recommended averages based on summarising the analysis carried out at Step 2. So these settings will be the same for all runs during Step 4.
max miss cleavages, min peptide length, max peptide length, max_mods: these parameters are shared with msgf and comet. Do I need to modify the default values in nextflow.config?
The current modification parameters come directly from the SDRF file. You say no variable modifications, while variable modifications like M(ox) and N-term(ac) should be provided by the SDRF; does that mean I should only keep these two as variable modifications and filter out the other variable modifications?
@daichengxin max_mods in DIA analysis: the Step 1 library generation is finished in #169. Also, in a regular closed search, I would not advise searching with more than 2-3 max_mods, right?
The decision should actually be made between 2 and 3. For DIA-NN the default should be 2, but in DDA the most commonly used is 3 😏
OK, but I think after generating the library only once, this might not be the bottleneck anymore... let's see.
The larger the search space (the library), the slower the main analysis. 3 variable modifications is quite a lot.
Because the pipeline shares the variable with the DDA branch (and DDA is still more popular 😝) we will set the default to 3, but we will make that clear in the documentation and probably also in the pipeline. @vdemichev just to be clear, the user can set that variable; we are talking here about the default value.
Yes, can specify any number here.
Hi @vdemichev, in step 3 (assemble an empirical spectral library from the .quant files), does the --lib parameter specify the library generated by the first step or the second step?
ping @vdemichev ☝️ https://github.com/bigbio/quantms/issues/164#issuecomment-1120653490
@daichengxin, --use-quant should normally be used with exactly the same settings as the initial analysis. Meaning that --lib specifies the initial output library. If there are any difficulties, please let me know & attach full logs, including the log of DIA-NN producing the .quant files.
@vdemichev Does that mean specifying the in silico library generated in the first step? The full command for step 3 is now as follows:
diann --f ${(mzMLs as List).join(' --f ')} \\
--lib ${lib} \\
${min_pr_mz} \\
${max_pr_mz} \\
${min_fr_mz} \\
${max_fr_mz} \\
--threads ${task.cpus} \\
--out-lib empirical_library.tsv \\
--missed-cleavages $params.allowed_missed_cleavages \\
--min-pep-len $params.min_peptide_length \\
--max-pep-len $params.max_peptide_length \\
--min-pr-charge $params.min_precursor_charge \\
--max-pr-charge $params.max_precursor_charge \\
--var-mods $params.max_mods \\
--cut {enzyme} \\
--var-mod {from sdrf} \\
--fixed-mod {from sdrf} \\
--verbose $params.diann_debug \\
--rt-profiling \\
--temp ./quant/ \\
--use-quant \\
--predictor \\
|& tee assemble_empirical_library.log
The log file does not contain the "Averaged recommended settings for this experiment: Mass accuracy = 11ppm, MS1 accuracy = 15ppm, Scan window = 7" info in Step 3. Full logs as follows:
@daichengxin
Does that mean specifying the in silico library generated in the first step?
Yes.
Thanks a lot @vdemichev. I just updated steps 2 and 3: added --individual-mass-acc and --individual-windows, then removed --predictor in step 3. But the averaged recommended settings are produced at step 2 and not at step 3. Finally, I uploaded step 4:
Step 4:
mass_acc = params.mass_acc_ms2
scan_window = params.scan_window
ms1_accuracy = params.mass_acc_ms1
time_corr_only = params.time_corr_only ? "--time-corr-only" : ""
// extract the averaged recommended settings from the upstream DIA-NN log
if (params.mass_acc_automatic || params.scan_window_automatic){
    mass_acc = "\$(cat ${diann_log} | grep \"Averaged recommended settings\" | cut -d ' ' -f 11 | tr -cd \"[0-9]\")"
    scan_window = "\$(cat ${diann_log} | grep \"Averaged recommended settings\" | cut -d ' ' -f 19 | tr -cd \"[0-9]\")"
    ms1_accuracy = "\$(cat ${diann_log} | grep \"Averaged recommended settings\" | cut -d ' ' -f 15 | tr -cd \"[0-9]\")"
}
"""
diann --lib ${library} \\
--f ${mzML} \\
${min_pr_mz} \\
${max_pr_mz} \\
${min_fr_mz} \\
${max_fr_mz} \\
--threads ${task.cpus} \\
--missed-cleavages $params.allowed_missed_cleavages \\
--min-pep-len $params.min_peptide_length \\
--max-pep-len $params.max_peptide_length \\
--min-pr-charge $params.min_precursor_charge \\
--max-pr-charge $params.max_precursor_charge \\
--var-mods $params.max_mods \\
--verbose $params.diann_debug \\
--temp ./ \\
--cut {enzyme} \\
--var-mod {variable modifications from sdrf input by users} \\
--fixed-mod {fixed modifications from sdrf input by users} \\
--min-corr $params.min_corr \\
--corr-diff $params.corr_diff \\
--mass-acc \$(echo ${mass_acc}) \\
--mass-acc-ms1 \$(echo ${ms1_accuracy}) \\
--window \$(echo ${scan_window}) \\
${time_corr_only} \\
@daichengxin there's something wrong, averaged recommended settings cannot be produced during Step 2, only Step 3. Can you please share the logs?
@vdemichev These are the logs for steps 2 and 3.
Log for step 2: report.log.txt
Log for step 3: report.log.txt
Step 3 needs one of the two commands that indicate that the mass accuracies or scan window need to be determined separately for different runs, please see above.
So is my current second-step command correct? It also produced averaged recommended settings.
Step 2 command:
diann --cut K*,R*,!*P \
      --fixed-mod Carbamidomethyl,57.021464,C \
      --var-mod Oxidation,15.994915,M \
      --lib lib.predicted.speclib \
      --f RD139_Narrow_UPS1_0_1fmol_inj1.mzML \
      --min-pr-mz 350 --max-pr-mz 950 \
      --min-fr-mz 500 --max-fr-mz 1500 \
      --threads 2 \
      --missed-cleavages 1 \
      --min-pep-len 15 --max-pep-len 30 \
      --min-pr-charge 2 --max-pr-charge 3 \
      --var-mods 2 \
      --verbose 3 \
      --individual-windows \
      --temp ./ \
      --min-corr 2.0 --corr-diff 1.0 \
      --quick-mass-acc --individual-mass-acc \
      --time-corr-only
Indicated this in the main description now.
Nice. Thanks a lot @vdemichev. I am working on it.
Yes, Step 2 is correct, except it just produces recommended settings for a single run, which is not something of interest, so please just ignore this message in Step 2.
Hi @vdemichev. This is the full command in step 4. Could you check it? The recommended settings are extracted from the log in step 3 and passed to step 4. Thanks a lot in advance!
// extract the averaged recommended settings from the Step 3 log
if (params.mass_acc_automatic || params.scan_window_automatic){
    mass_acc = "\$(cat ${diann_log} | grep \"Averaged recommended settings\" | cut -d ' ' -f 11 | tr -cd \"[0-9]\")"
    scan_window = "\$(cat ${diann_log} | grep \"Averaged recommended settings\" | cut -d ' ' -f 19 | tr -cd \"[0-9]\")"
    ms1_accuracy = "\$(cat ${diann_log} | grep \"Averaged recommended settings\" | cut -d ' ' -f 15 | tr -cd \"[0-9]\")"
}
"""
diann --lib ${library} \\
--f ${mzML} \\
${min_pr_mz} \\
${max_pr_mz} \\
${min_fr_mz} \\
${max_fr_mz} \\
--threads ${task.cpus} \\
--missed-cleavages $params.allowed_missed_cleavages \\
--min-pep-len $params.min_peptide_length \\
--max-pep-len $params.max_peptide_length \\
--min-pr-charge $params.min_precursor_charge \\
--max-pr-charge $params.max_precursor_charge \\
--var-mods $params.max_mods \\
--verbose $params.diann_debug \\
--temp ./ \\
--min-corr $params.min_corr \\
--corr-diff $params.corr_diff \\
--mass-acc \$(echo ${mass_acc}) \\
--mass-acc-ms1 \$(echo ${ms1_accuracy}) \\
--window \$(echo ${scan_window}) \\
--no-ifs-removal \\
--cut {enzyme from sdrf input by user} \\
--var-mod {var_mods from sdrf input by user} \\
--fixed-mod {fixed_mods from sdrf input by user} \\
${time_corr_only} \\
Full command in step 5:
diann --lib ${empirical_library} \\
--fasta ${fasta} \\
--f ${(mzMLs as List).join(' --f ')} \\
${min_pr_mz} \\
${max_pr_mz} \\
${min_fr_mz} \\
${max_fr_mz} \\
--threads ${task.cpus} \\
--missed-cleavages $params.allowed_missed_cleavages \\
--min-pep-len $params.min_peptide_length \\
--max-pep-len $params.max_peptide_length \\
--min-pr-charge $params.min_precursor_charge \\
--max-pr-charge $params.max_precursor_charge \\
--var-mods $params.max_mods \\
--verbose $params.diann_debug \\
${scan_window} \\
--min-corr $params.min_corr \\
--corr-diff $params.corr_diff \\
${mass_acc} \\
${time_corr_only} \\
--cut {enzyme from sdrf input by user} \\
--var-mod {var mods from sdrf input by user} \\
--fixed-mod {fixed mods from sdrf input by user} \\
--temp ./quant/ \\
--relaxed-prot-inf \\
--pg-level $params.pg_level \\
${species_genes} \\
--use-quant \\
--out diann_report.tsv \\
Are the mass-acc, mass-acc-ms1 and window parameters still needed here, and should they come from the recommended settings?
Hi @daichengxin, Step 4 I would recommend running without --min-corr and --corr-diff and without --time-corr-only. Also, during Step 5 these have no effect. --cut, --var-mod and --fixed-mod, as well as all other digest settings, have no effect during Steps 4 and 5. During Step 5 I would add --matrices. Otherwise looks good. Please let me see the full DIA-NN logs from all Steps 1-5 once the pipeline is set up.
OK. Should --min-pr-mz, --max-pr-mz, --min-fr-mz, --max-fr-mz, --min-pep-len, --max-pep-len, --min-pr-charge and --max-pr-charge also have no effect during steps 4 and 5? @vdemichev
Yes
Hi @vdemichev These are the full DIA-NN logs from all Steps 1-5.
step1 log: lib.log.txt
step2 log: report.log.txt
step3 log: report.log.txt
step4 log: report.log.txt
step5 log: diann_report.log.txt
Seems good! Some comments:
Step 1: the minimum peptide length is set to 15 - is that intentional?
Steps 2 to 5: digest settings have no effect.
Step 4: can add --no-main-report, as we don't need a report; can also turn off protein inference, as it's not necessary at this stage.
Yes, min 15 is because we need CI/CD to run fast; it is the same case for the min fragment m/z set to 500. Can you check the other comments, @daichengxin?
Thanks a lot @vdemichev for your support
Thanks a lot @vdemichev. Here are the updated logs from steps 2 to 5. Could you check again?
step 2: report.log.txt
step 3: report.log.txt
step 4: report.log.txt
step 5: diann_report.log.txt
Looks good!
Thanks to @vdemichev and @daichengxin for working on the first iteration of DIA-NN with quantms. Here is the first full dataset reanalyzed with DIA-NN: http://ftp.pride.ebi.ac.uk/pride/data/proteomes/differential-expression/PXD004684-disease/
The volcano plot result of MSstats can be found here.
Some comments:
Thanks @vdemichev for the feedback.
Hi @vdemichev. This is the log for step 4 with FASTA, could you check again? report.log.txt
With or without? It looks like he suggested without, if I am not mistaken.
@vdemichev How do we combine the FDR parameters (q-value thresholds in each step)? I guess we MUST pass the FDR threshold in the last step?
--matrix-ch-qvalue [x] sets the 'channel q-value' used to filter the output matrices
--matrix-qvalue [x] sets the q-value used to filter the output matrices
--matrix-tr-qvalue [x] sets the 'translated q-value' used to filter the output matrices
--matrix-spec-q run-specific protein q-value filtering will be used, in addition to the global q-value filtering, when saving protein matrices. The ability to filter based on run-specific protein q-values, which allows to generate highly reliable data, is one of the advantages of DIA-NN
For DDA data, we use the PSM and protein FDR at the project level (meaning the last step). We also filter at run-level PSM FDR with 0.10% per ms_run to be more stringent.
@daichengxin Step 4 log seems good! @ypriverol FDR thresholds during steps 2-4 should be set to 1%. During Step 5, can use any threshold. Filtering stricter than 1% might significantly reduce IDs - it's always better to analyse at 1%, and then the user can filter the output for their specific workflow. --matrix-spec-q and --matrix-qvalue affect matrices only, not the main report. I suggest not using these, as default settings for matrices seem the best for most datasets, and those users who want more stringent can filter the main report. But good to include matrices too, for those who don't use R/Python.
@vdemichev correct me if I'm wrong:
--matrix-qvalue = 0.01: the user will be able to set different values in case they want to get more IDs, but by default it will be 0.01.
--matrix-spec-q: we will pass the default value of 1% (0.01) for the run-specific protein filter; if users want to define their own value and relax this parameter to get less strictly filtered data, they can. In reanalysis, because we can't check each dataset in detail, we may use this parameter at the default 0.01 for highly reliable results.
--matrix-qvalue: also configurable in Step 5; if users want to get more IDs, they may use a less stringent FDR, let's say 0.05.
We will not use --matrix-tr-qvalue or --matrix-ch-qvalue in any step.
Please correct me if I have understood the parameter combination well.
Hi @ypriverol, matrix q-value setting only controls the q-value filtering of matrix output, nothing else. So of no relevance for Steps 1-4. For Steps 2-4, please only use --qvalue 0.01 (but 0.01 is the default value, so can also just omit this).
--matrix-spec-q: configures run-specific protein FDR for matrix output. Other software simply does not have this setting at all, filtering on run-specific protein FDR is questionable (many advise against it) and reduces data completeness. So for most applications it's better not to filter, and DIA-NN by default does not filter. Note that global protein FDR is always applied to matrices regardless of this setting.
--matrix-qvalue - again, better to keep 0.01 (default) for most applications. Whoever wants stricter filtering can always filter the main report manually. 0.05 is too relaxed for most applications. As matrices cannot be filtered further, better to stick to a reasonable value which is 0.01.
Hi @vdemichev. Is it reasonable to manually filter on the Q.Value column in the main report for downstream analysis?
Yes, if the --qvalue was set to a value higher than 0.01 (this is OK only for Step 5), then the output report must be filtered at Q.Value <= 0.01. Applying stricter filtering than 0.01 will result in more missing values & likely worse downstream analysis. I suggest not doing that, as this strict filtering will inevitably result in many people looking at the data and thinking, falsely, that the ID numbers are low and the missing value rate is high...
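For completeness, a sketch of such a manual filter on the main report (it assumes a tab-separated diann_report.tsv with a Q.Value column, as in the Step 5 command above):
# keep the header plus all rows with Q.Value <= 0.01
awk -F'\t' 'NR==1 { for (i=1; i<=NF; i++) if ($i == "Q.Value") q = i; print; next }
            $q <= 0.01' diann_report.tsv > diann_report.filtered.tsv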
Description of feature
Dear @daichengxin @vdemichev @jpfeuffer:
I have started testing the DIA-NN pipeline in quantms. The main problem I found is that the pipeline is still really slow. However, after talking to @vdemichev, some other issues should be solved. First, two ideas about what we want to solve between DIA-NN and quantms:
quantms + diann should be able to:
The current pipeline/parallelization does the following:
This actually generates a file (diann_config.cfg) containing the following information:
Each raw file in the SDRF is converted to mzML; or the mzML that was converted outside the pipeline (the AB Sciex case) is used for the analysis.
For each mzML, we generate in parallel the corresponding theoretical spectral library using the following command:
Note: this step is extremely slow. In addition, @vdemichev has commented to me that generating a spectral library per raw file will give wrong results in the next step (step 4) when searching all the data together and performing the quantification (wrong q-values).
Discussing performance with @vdemichev, he mentioned that this approach will give wrong statistical results.
@vdemichev Let's discuss the ideas in this thread in detail.