MannLabs / alphadia

modular & open DIA search
https://alphadia.readthedocs.io
Apache License 2.0

stat.tsv counts differ strongly from DIA-NN results #240

Closed: tobiasko closed this issue 2 months ago

tobiasko commented 3 months ago

Describe the bug
The number of precursors reported by alphaDIA in stat.tsv deviates strongly from the precursor counts reported by DIA-NN (library-free mode) on the same files.

run     channel precursors      proteins        ms1_accuracy    fwhm_rt fwhm_mobility
20240507_022_HeLa_100ng_40SPD   0       75157   7863    0.000000        3.257659        0.000000
20240507_023_HeLa_100ng_40SPD   0       66053   7826    0.000000        3.342673        0.000000
20240507_024_HeLa_100ng_40SPD   0       66883   7829    0.000000        3.309790        0.000000

vs.

File.Name   Precursors.Identified   Proteins.Identified Total.Quantity  MS1.Signal  MS2.Signal  FWHM.Scans  FWHM.RT Median.Mass.Acc.MS1 Median.Mass.Acc.MS1.Corrected   Median.Mass.Acc.MS2 Median.Mass.Acc.MS2.Corrected   MS2.Mass.Instability    Normalisation.Instability   Median.RT.Prediction.Acc    Average.Peptide.Length  Average.Peptide.Charge  Average.Missed.Tryptic.Cleavages
/scratch/DIANN_A314/WU304711/20240507_024_HeLa_100ng_40SPD.mzML 139825  8600    3.06597e+10 3603428016128   323304587264    2.277   0.061   0.455805    0.349079    0.98658 0.862623    0.2 0.0544855   0.0192902   12.111  2.26    0.115
/scratch/DIANN_A314/WU304711/20240507_023_HeLa_100ng_40SPD.mzML 140577  8663    3.29458e+10 3894134177792   348319154176    2.257   0.061   0.494769    0.373496    1.10151 0.757309    0.2 0.0636678   0.0177708   12.13   2.26    0.115
/scratch/DIANN_A314/WU304711/20240507_022_HeLa_100ng_40SPD.mzML 145582  8699    3.78455e+10 4481101070336   378195836928    2.277   0.061   0.474137    0.378867    1.04699 0.689239    0.2 0.0834113   0.0207038   12.248  2.265   0.114

According to Fig. 5 of your manuscript, one would expect similar precursor and protein counts on ASTRAL data (in-house generated using 2 Da fixed-window DIA). How do you filter the precursor data for the stat.tsv file?

Logs attached: log.txt

Version (please complete the following information):

tobiasko commented 3 months ago

We analyzed 100 ng of Pierce HeLa digest using the Evosep One (40 SPD method), so very similar to your manuscript.

GeorgWa commented 3 months ago

Hi, the two big caveats are that we don't have automatic parameter optimization and multi-step searches yet, so both need to be done manually in the current version (1.7.0).

To get the same results you will need to set the following parameters in the first search:

target_num_candidates = 3
target_ms1_tolerance = 4 ppm
target_ms2_tolerance = 7 ppm
target_rt_tolerance = 300 s

Then perform a second search without library prediction, selecting the speclib.mbr.hdf from the first pass:

target_num_candidates = 5
target_ms1_tolerance = 4 ppm
target_ms2_tolerance = 7 ppm
target_rt_tolerance = 100 s

This should give you very similar performance :)
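
Written out as config files, the two passes might look roughly like this (just a sketch; it assumes these four parameters sit under the search section, as in the default config, with tolerances in ppm and seconds):

# config_1.yaml -- first pass, library predicted from the FASTA
search:
  target_num_candidates: 3
  target_ms1_tolerance: 4    # ppm
  target_ms2_tolerance: 7    # ppm
  target_rt_tolerance: 300   # seconds

# config_2.yaml -- second pass against speclib.mbr.hdf from the first pass
search:
  target_num_candidates: 5
  target_ms1_tolerance: 4    # ppm
  target_ms2_tolerance: 7    # ppm
  target_rt_tolerance: 100   # seconds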

tobiasko commented 3 months ago

😟 Now I am really surprised. Currently there is no parameter optimization and no 2nd search with the optimized library? Automating the 2nd search using a script is easy to do, but how should I guess the optimal parameters without having seen the data a priori?

GeorgWa commented 3 months ago

Hi, it's not as bad as it sounds. 😄

Lots of parameters are already being optimized automatically, but there are still three major ones to set: MS1 tolerance, MS2 tolerance and RT tolerance. MS1 and MS2 are self-explanatory, and RT tolerance is the largest expected error after calibration. We usually set this to 200-300 s for 60 and 40 SPD methods.

The reason is that we want to offer the best performance across methods and gradients. Therefore we are currently compiling a test suite with different methods and setups.

tobiasko commented 3 months ago

Ok. Let's get practical. I have a set of DIA raw files from the ASTRAL, acquired incl. lock-mass correction. So we basically know it is high-mass-accuracy data (because you can check the lock-mass correction easily without doing a DB search first). You are then suggesting to use a default value for all OT scans (MS1); here you used 4 ppm. Why exactly 4? And a default value for ASTRAL scans (MS2); again, why exactly 7? And your iRT matching tolerance is simply based on historical knowledge (because you know your column setup and the Evosep performance, so you go for +- 2 min)? Why 2?

GeorgWa commented 3 months ago

Yes, this is similar to how 5 ppm and 10 ppm have been the defaults for 70k and 35k resolution on Orbitrap instruments, or 15 ppm for default timsTOF DIA settings. We have seen that Astral data works well with 6-7 ppm.

Regarding the RT tolerance, you should aim for 30% of the total gradient (21 min * 0.3 = 6.3 min, i.e. +- 189 s) for the first search and 15% for the second search. Otherwise I would recommend looking at the methods section of the manuscript for inspiration.

That being said, I absolutely agree that this is not practical, and we are working on a prototype to solve this. The reason is really that we have to set priorities, and we have a very well controlled and standardized instrument setup with Evosep, IonOpticks and the Astrals. Manual settings were therefore fine for getting good performance, and we could focus on establishing confident FDR, speed, quantification, etc.

I'm curious to hear how alphaDIA performs on your data with the updated parameters! I can update you when we have the first release to test. As you are using the developer version anyway, any feedback on the automated optimization would be appreciated once it's part of a release candidate.

tobiasko commented 3 months ago

As you predicted, using the 2-pass search with the above parameters gives:

run     channel precursors      proteins        ms1_accuracy    fwhm_rt fwhm_mobility
20240507_022_HeLa_100ng_40SPD   0       120283  8611    0.000000        2.986437        0.000000
20240507_023_HeLa_100ng_40SPD   0       117096  8612    0.000000        2.937871        0.000000
20240507_024_HeLa_100ng_40SPD   0       116863  8612    0.000000        2.934775        0.000000

I implemented this using a very basic Slurm batch script:

#!/bin/bash

#SBATCH --partition=prx
#SBATCH --nodelist=fgcz-c-073
#SBATCH --export=ALL
#SBATCH -o /scratch/tobiasko/slurmlog/slurm-%j.out
#SBATCH --job-name=alphaDIA
#SBATCH --mail-user="tobias.kockmann@fgcz.ethz.ch"
#SBATCH --workdir=/scratch/tobiasko/slurmjobs
#SBATCH --mem=256G
#SBATCH --time=24:0:0

OUTPUT1="out-first-pass"
OUTPUT2="out-second-pass"
RAWFILES="rawfiles"
FASTA="DB.fasta"
CONFIG1="config_1.yaml"
CONFIG2="config_2.yaml"
REGEX="100ng_40SPD.raw"
SPECLIB="/scratch/tobiasko/slurmjobs/$OUTPUT1/speclib.mbr.hdf"

echo "hostname: "
hostname
echo "working directory: "
pwd
echo "activate alphadia env"
source activate alphadia
echo "using alphadia version: "
alphadia --version
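
# first pass: library-free search predicted from the FASTA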
mkdir $OUTPUT1
alphadia --output $OUTPUT1 --directory $RAWFILES --regex $REGEX --config $CONFIG1 --fasta $FASTA
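
# second pass: search against the MBR library (speclib.mbr.hdf) produced by the first pass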
mkdir $OUTPUT2
alphadia --output $OUTPUT2 --directory $RAWFILES --regex $REGEX --config $CONFIG2 --library $SPECLIB
echo "slurm job done!"
exit 0

But I think you should really change the manuscript in this respect: "With these state-of-the art predicted libraries, we devised a two-step search workflow in alphaDIA consisting of library refinement and quantification (Fig. 5 a)." This sounds as if alphaDIA actively manages both steps as part of an integrated workflow.

Are the precursors now counted at 1% FDR? Why is the MS1 accuracy zero? Is it possible to add the MS2 accuracy?

Best, Tobi

GeorgWa commented 3 months ago

Ah, that's good to see! I would assume there is some additional performance to be gained with parameter optimization.

I use a very similar SLURM script for my two-step searches and hope to include it in the docs soon. We wrote it like this in the manuscript because alphaDIA was designed for multi-step searches and therefore builds the MBR library etc. This should allow testing and benchmarking, which was the priority from a scientific point of view. The FDR handling is also not arbitrary but was made to work in this multi-step context. It's actually surprisingly hard to get all of this somewhat right :D

Of course it's important for adoption to make it easily accessible in the GUI and allow multi-step searches. At the same time, we don't just want a single checkbox that gives you some multi-step search; we want to have this in a configurable, transparent fashion, designed with good software engineering in mind.

It's really cool to see that you have already explored alphaDIA a bit. If you have time I would be happy to schedule a Zoom call and hear about your experience from the perspective of a very technical user. This is meant to be an open-source community project, so all contributions are welcome. You can reach me at (lastname)@biochem.mpg.de 😀.

Regarding the FDR, it's always controlled on a local precursor and global protein level. Therefore it will be controlled for the first search, the MBR library and the second search results. Something I would recommend is to set the inference_strategy to library for the second search and only use the fasta in the first search.
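
For the second pass you can either set this in config_2.yaml or override it on the command line, for example appended to your second-pass call (a sketch using your variable names; it assumes inference_strategy sits under the fdr section of the config):

alphadia --output $OUTPUT2 --directory $RAWFILES --regex $REGEX --config $CONFIG2 --library $SPECLIB \
    --config-dict '{"fdr": {"inference_strategy": "library"}}'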

tobiasko commented 3 months ago

Ok! Will try. I was already guessing that you run alphaDIA on Slurm. I think this is generally something of interest for the community. Are there things one should be aware of when running alphaDIA via sbatch or srun? Are you managing CPUs or memory for alphaDIA jobs?

GeorgWa commented 3 months ago

Yes, we have some pipelines and do most processing on Slurm. I have some templates for transfer learning and two-step searches that I will share.

We use sbatch with conda, similar to your script. I would generally recommend using only a single socket at a time: use only as many threads as are available on a single socket, and allow two tasks per node if you have two-socket machines. For simple searches 64 GB to 128 GB should be sufficient; for more complicated searches 256 GB is better. We haven't really optimized for memory yet.

Another trick is to use the --config-dict argument in the CLI (https://alphadia.readthedocs.io/en/latest/methods/command-line.html). This allows you to override parameters from the config file directly in the Slurm script for method optimization. For example, you could call it with 200 s, 250 s and 300 s RT tolerance and observe the effect. Something else to test would be 1, 2 and 3 target candidates.
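
A quick RT-tolerance sweep in your batch script could then look something like this (a sketch reusing your variable names; it assumes target_rt_tolerance sits under the search section of the config and that the override is passed as JSON):

# try three RT tolerances and compare the resulting stat.tsv outputs
for RT in 200 250 300; do
    OUT="out-rt-${RT}s"
    mkdir -p "$OUT"
    alphadia --output "$OUT" --directory $RAWFILES --regex $REGEX --config $CONFIG1 --fasta $FASTA \
        --config-dict "{\"search\": {\"target_rt_tolerance\": ${RT}}}"
done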

tobiasko commented 3 months ago

I am currently testing on a single cluster node that looks like this:

scontrol show node fgcz-c-072
NodeName=fgcz-c-072 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUTot=128 CPULoad=0.52
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=fgcz-c-072 NodeHostName=fgcz-c-072 Version=18.08
   OS=Linux 5.10.0-28-amd64 #1 SMP Debian 5.10.209-2 (2024-01-31)
   RealMemory=1030000 AllocMem=0 FreeMem=227312 Sockets=128 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=5000000 Weight=10 Owner=N/A MCS_label=N/A
   Partitions=prx
   BootTime=2024-03-04T11:38:41 SlurmdStartTime=2024-03-04T11:39:09
   CfgTRES=cpu=128,mem=1030000M,billing=128
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

If I read the above correctly, we have 128 CPUs, each sitting in a separate socket with a single core, and each core runs a single thread. So your recommendation would mean using a single thread? 😄

So far I have been running alphadia with 32 or 64 threads without setting anything in sbatch or srun. I am not a computer scientist, so these things

https://slurm.schedmd.com/cpu_management.html

are not all perfectly clear to me.

tobiasko commented 3 months ago

I am constantly modifying the Slurm batch script, but currently the 2-pass search job looks like this:

scontrol show jobid -dd 328740
JobId=328740 JobName=alphaDIA
   UserId=tobiasko(42847) GroupId=SG_Employees(10147) MCS_label=N/A
   Priority=4294623242 Nice=0 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:02:24 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2024-06-21T14:27:49 EligibleTime=2024-06-21T14:27:49
   AccrueTime=Unknown
   StartTime=2024-06-21T14:27:49 EndTime=2024-06-22T14:27:49 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2024-06-21T14:27:49
   Partition=prx AllocNode:Sid=fgcz-c-073:1273317
   ReqNodeList=fgcz-c-073 ExcNodeList=(null)
   NodeList=fgcz-c-073
   BatchHost=fgcz-c-073
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=256G,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
     Nodes=fgcz-c-073 CPU_IDs=0 Mem=262144 GRES_IDX=
   MinCPUsNode=1 MinMemoryNode=256G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/scratch/tobiasko/slurmjobs/run2PassSlurmJob.sh
   WorkDir=/scratch/tobiasko/slurmjobs
   StdErr=/scratch/tobiasko/slurmlog/slurm-328740.out
   StdIn=/dev/null
   StdOut=/scratch/tobiasko/slurmlog/slurm-328740.out
   Power=

tobiasko commented 3 months ago

The Slurm batch script for the 2-pass search:

#!/bin/bash

#SBATCH --partition=prx
#SBATCH --nodelist=fgcz-c-073
#SBATCH --export=ALL
#SBATCH -o /scratch/tobiasko/slurmlog/slurm-%j.out
#SBATCH --job-name=alphaDIA
#SBATCH --mail-user="tobias.kockmann@fgcz.ethz.ch"
#SBATCH --workdir=/scratch/tobiasko/slurmjobs
#SBATCH --mem=256G
#SBATCH --time=24:0:0

OUTPUT1="out-first-pass"
OUTPUT2="out-second-pass"
RAWFILES="rawfiles"
FASTA="DB.fasta"
CONFIG1="config_1.yaml"
CONFIG2="config_2.yaml"
REGEX="100ng_40SPD.raw"
SPECLIB="speclib.mbr.hdf"
CONFDICT="--config-dict "{'thread_count':{32}}""

echo "hostname: "
hostname
echo "working directory: "
pwd
mkdir slurmjob-$SLURM_JOB_ID

#activate conda env
echo "activate alphadia env"
source activate alphadia
echo "using alphadia version: "
alphadia --version

#first-pass search
mkdir slurmjob-$SLURM_JOB_ID/$OUTPUT1
srun alphadia --output slurmjob-$SLURM_JOB_ID/$OUTPUT1 --directory $RAWFILES --regex $REGEX --config $CONFIG1 --fasta $FASTA
if [ $? -eq 0 ]
  then
    echo "First-pass search done!"
  else
    echo "First-pass search exited with non-zero exit status."
    echo "Stopping slurm batch script"
    exit 1
fi

#second-pass search
mkdir slurmjob-$SLURM_JOB_ID/$OUTPUT2
srun alphadia --output slurmjob-$SLURM_JOB_ID/$OUTPUT2 --directory $RAWFILES --regex $REGEX --config $CONFIG2 --library slurmjob-$SLURM_JOB_ID/$OUTPUT1/$SPECLIB
if [ $? -eq 0 ]
  then
    echo "Second-pass search done!"
    exit 0
  else
    echo "Second-pass search exited with non-zero exit status."
    echo "Stopping slurm batch script"
    exit 1
fi

I am not quite sure whether the srun commands around alphadia are a good idea for such a linear job on a single cluster node. For the --config-dict argument, is it YAML inline block notation?

GeorgWa commented 2 months ago

Hi, this looks good, although your script is already much more sophisticated than mine :D I think the single core per socket is not real and is due to how the cluster is configured.

In your case, with ThreadsPerCore=1, I would go for 32 to 64 CPUs and the same number of threads in alphaDIA. Our nodes have ThreadsPerCore=2, so I set double the number of threads in alphaDIA compared to --cpus-per-task.

The config dict is JSON. Here is the template I currently use for two-step searches:

#!/usr/bin/env bash
#SBATCH --job-name=alphadia
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=250G
#SBATCH --time=21-00:00:00
#SBATCH --partition=p.hpcl92

RAW_FOLDER="/raw_data/"
LIBRARY_FOLDER="/fasta/"

OUTPUT_FOLDER_FIRST="/first_pass/"
OUTPUT_FOLDER_SECOND="/second_pass/"

echo "First pass saved to: ${OUTPUT_FOLDER_FIRST}"
mkdir -p ${OUTPUT_FOLDER_FIRST}

alphadia \
    -o ${OUTPUT_FOLDER_FIRST} \
    --directory ${RAW_FOLDER} \
    --fasta "${LIBRARY_FOLDER}2024_01_12_human.fasta" \
    --config "config_astral_first_pass.yaml"

echo "Second pass saved to: ${OUTPUT_FOLDER_SECOND}"
mkdir -p ${OUTPUT_FOLDER_SECOND}

alphadia \
    -o "${OUTPUT_FOLDER_SECOND}" \
    --directory ${RAW_FOLDER} \
    --library "${OUTPUT_FOLDER_FIRST}speclib.mbr.hdf" \
    --config "config_astral_second_pass.yaml" \
    --config-dict '{"fdr": {"inference_strategy": "library"}}'