bigbio / quantms

Quantitative mass spectrometry workflow. Currently supports proteomics experiments with complex experimental designs for DDA-LFQ, DDA-Isobaric and DIA-LFQ quantification.
https://quantms.org
MIT License
34 stars 35 forks source link

SAGE bug with Percolator #286

Closed ypriverol closed 1 year ago

ypriverol commented 1 year ago

Description of the bug

ERROR ~ Error executing process > 'NFCORE_QUANTMS:QUANTMS:TMT:ID:PSMRESCORING:PERCOLATOR (20150820_Haura-Pilot-TMT1-bRPLC11-1)'

Caused by:
  Process `NFCORE_QUANTMS:QUANTMS:TMT:ID:PSMRESCORING:PERCOLATOR (20150820_Haura-Pilot-TMT1-bRPLC11-1)` terminated with an error exit status (8)

Command executed:

  OMP_NUM_THREADS=6 PercolatorAdapter \
      -in 20150820_Haura-Pilot-TMT1-bRPLC11-1_sage.idXML \
      -out 20150820_Haura-Pilot-TMT1-bRPLC11-1_sage_perc.idXML \
      -threads 6 \
      -subset_max_train 300000 \
      -decoy_pattern DECOY_ \
      -post_processing_tdc \
      -score_type pep \
      -debug 0 \
      2>&1 | tee 20150820_Haura-Pilot-TMT1-bRPLC11-1_sage_percolator.log

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_QUANTMS:QUANTMS:TMT:ID:PSMRESCORING:PERCOLATOR":
      PercolatorAdapter: $(PercolatorAdapter 2>&1 | grep -E '^Version(.*)' | sed 's/Version: //g')
      percolator: $(percolator -h 2>&1 | grep -E '^Percolator version(.*)' | sed 's/Percolator version //g')
  END_VERSIONS

Command exit status:
  8

Command output:
  Loading input file: 20150820_Haura-Pilot-TMT1-bRPLC11-1_sage.idXML
  Merging peptide ids.
  Merging protein ids.
  Error: Unexpected internal error (the value '.(TMT6plex)MLLQQLDALELASEGSLK(TMT6plex)PVVLSSDPQLLFKGQFPQLSLK' was used but is not valid; Can't calculate mass-to-charge ratio for charge=0.)

Command wrapper:
  Loading input file: 20150820_Haura-Pilot-TMT1-bRPLC11-1_sage.idXML
  Merging peptide ids.
  Merging protein ids.
  Error: Unexpected internal error (the value '.(TMT6plex)MLLQQLDALELASEGSLK(TMT6plex)PVVLSSDPQLLFKGQFPQLSLK' was used but is not valid; Can't calculate mass-to-charge ratio for charge=0.)

Work dir:
  /hps/nobackup/juan/pride/reanalysis/differential-expression/tmt/PXD004683/work/b4/62488c0990549a193960f6f1b13eed

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

Command used and terminal output

No response

Relevant files

No response

System information

No response

timosachsenberg commented 1 year ago

@ypriverol do you have a link to the 20150820_Haura-Pilot-TMT1-bRPLC11-1_sage.idXML file?

ypriverol commented 1 year ago

http://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/62488c0990549a193960f6f1b13eed/

timosachsenberg commented 1 year ago

hit is annotated with charge 0. will check why this is the case

    <PeptideIdentification score_type="hyperscore" higher_score_better="true" significance_threshold="0.0" RT="64.922569999999993" spectrum_reference="controllerType=0 controllerNumber=1 scan=20819" >
      <PeptideHit score="4.141100637487535" sequence=".(TMT6plex)MLLQQLDALELASEGSLK(TMT6plex)PVVLSSDPQLLFKGQFPQLSLK" charge="0" aa_before="K" aa_after="H" start="1212" end="1251" protein_refs="PH_3410" >
        <UserParam type="string" name="target_decoy" value="decoy"/>
        <UserParam type="string" name="ln(-poisson)" value="2.720636655734263"/>
        <UserParam type="string" name="ln(delta_best)" value="0.0"/>
        <UserParam type="string" name="ln(delta_next)" value="2.562488632470743"/>
        <UserParam type="string" name="ln(matched_intensity_pct)" value="3.0516565"/>
        <UserParam type="string" name="longest_b" value="3"/>
        <UserParam type="string" name="longest_y" value="5"/>
        <UserParam type="string" name="longest_y_pct" value="0.125"/>
        <UserParam type="string" name="matched_peaks" value="22"/>
        <UserParam type="string" name="scored_candidates" value="259"/>
        <UserParam type="string" name="protein_references" value="unique"/>
      </PeptideHit>
      <UserParam type="string" name="PinSpecId" value="273895"/>
    </PeptideIdentification>
ypriverol commented 1 year ago

Interesting because the charge range do not go from 0.

timosachsenberg commented 1 year ago

The others have a charge assigned. Hmm can we get the sage output somehow "results.sage.pin" to check if it is already written as zero in by sage?

ypriverol commented 1 year ago

Here http://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/351da60edc24fd37a1faf9db7aeaf9/

ypriverol commented 1 year ago

The error is quite common:

ERROR ~ Error executing process > 'NFCORE_QUANTMS:QUANTMS:LFQ:ID:PSMRESCORING:PERCOLATOR (Blank)'

Caused by:
  Process `NFCORE_QUANTMS:QUANTMS:LFQ:ID:PSMRESCORING:PERCOLATOR (Blank)` terminated with an error exit status (8)

Command executed:

  OMP_NUM_THREADS=6 PercolatorAdapter \
      -in Blank_sage.idXML \
      -out Blank_sage_perc.idXML \
      -threads 6 \
      -subset_max_train 300000 \
      -decoy_pattern DECOY_ \
      -post_processing_tdc \
      -score_type pep \
      -debug 0 \
      2>&1 | tee Blank_sage_percolator.log

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_QUANTMS:QUANTMS:LFQ:ID:PSMRESCORING:PERCOLATOR":
      PercolatorAdapter: $(PercolatorAdapter 2>&1 | grep -E '^Version(.*)' | sed 's/Version: //g')
      percolator: $(percolator -h 2>&1 | grep -E '^Percolator version(.*)' | sed 's/Percolator version //g')
  END_VERSIONS

Command exit status:
  8

Command output:
  Loading input file: Blank_sage.idXML
  Merging peptide ids.
  Merging protein ids.
  Error: Unexpected internal error (the value 'LALLFYILTFVGAIFNGLTLLILGVIGLFTIPLLYR' was used but is not valid; Can't calculate mass-to-charge ratio for charge=0.)

Command wrapper:
  Loading input file: Blank_sage.idXML
  Merging peptide ids.
  Merging protein ids.
  Error: Unexpected internal error (the value 'LALLFYILTFVGAIFNGLTLLILGVIGLFTIPLLYR' was used but is not valid; Can't calculate mass-to-charge ratio for charge=0.)

Work dir:
  /hps/nobackup/juan/pride/reanalysis/single-cell/PXD016921/work/f6/d7eb6fe239a22d425ad2597e861217

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details
timosachsenberg commented 1 year ago

hmm the peptide is written as charge 1 by sage.

SpecId  Label   ScanNr  ExpMass CalcMass        FileName        retentiontime   rank    z=2     z=3     z=4     z=5     z=6     z=other peptide_len     missed_cleavages        isotope_error   ln(precursor_ppm)       fragment_ppm    ln(hyperscore)  ln(delta_next)  ln(delta_best)  aligned_rt    predicted_rt    sqrt(delta_rt_model)    matched_peaks   longest_b       longest_y       longest_y_pct   ln(matched_intensity_pct)       scored_candidates       ln(-poisson)    Peptide Proteins
273895  -1      20819   4838.7183       4838.7266       20150820_Haura-Pilot-TMT1-bRPLC11-1.mzML        64.92257        1       0       0       0       0       0       1       40      1       0.0     0.99897254      22.651972       4.141100637487535       2.562488632470743       0.0  64.92257 0.0     0.99949986      22      3       5       0.125   3.0516565       259     2.720636655734263       [+229.16293]-MLLQQLDALELASEGSLK[+229.16293]PVVLSSDPQLLFKGQFPQLSLK       DECOY_sp|Q92766|RREB1_HUMAN
jpfeuffer commented 1 year ago

For me it looks like it's annotated as "other" and SageAdapter makes a 0 out of it: https://github.com/OpenMS/OpenMS/blob/5e4854c9212847baf683981fe114f08f3534041f/src/openms/source/FORMAT/PercolatorInfile.cpp#L117

jpfeuffer commented 1 year ago

Is z=other documented? If not we can ask on sage's GitHub?

jpfeuffer commented 1 year ago

Interesting thing is that afaik the hardcoded charge range in sage is 2-5 in our version. How can there be 1 or 6?

ypriverol commented 1 year ago

Testing now.

ypriverol commented 1 year ago

Still the error remains

jpfeuffer commented 1 year ago

Where can we see the updated logs?

ypriverol commented 1 year ago

@jpfeuffer @timosachsenberg this error now is in CONSENSUSID:

ERROR ~ Error executing process > 'NFCORE_QUANTMS:QUANTMS:TMT:ID:CONSENSUSID (20150820_Haura-Pilot-TMT1-bRPLC01-2)'

Caused by:
  Process `NFCORE_QUANTMS:QUANTMS:TMT:ID:CONSENSUSID (20150820_Haura-Pilot-TMT1-bRPLC01-2)` terminated with an error exit status (8)

Command executed:

  ConsensusID \
      -in 20150820_Haura-Pilot-TMT1-bRPLC01-2_comet_feat_perc.idXML 20150820_Haura-Pilot-TMT1-bRPLC01-2_msgf_feat_perc.idXML 20150820_Haura-Pilot-TMT1-bRPLC01-2_sage_perc.idXML \
      -out 20150820_Haura-Pilot-TMT1-bRPLC01-2_consensus.idXML \
      -per_spectrum \
      -threads 1 \
      -algorithm best \
      -filter:min_support 0 \
      -filter:considered_hits 0 \
      -debug 0 \
       \
      2>&1 | tee null_consensusID.log

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_QUANTMS:QUANTMS:TMT:ID:CONSENSUSID":
      ConsensusID: $(ConsensusID 2>&1  | grep -E '^Version(.*)' | sed 's/Version: //g' | cut -d ' ' -f 1)
  END_VERSIONS

Command exit status:
  8

Command output:
  Error: Unexpected internal error (the value '1' was used but is not valid; Conflicting charge states found for peptide 'VTGISGATAENVPVEQIVENNDIIILTPQILVNNLK(TMT6plex)': 7, 1)

Command wrapper:
  Error: Unexpected internal error (the value '1' was used but is not valid; Conflicting charge states found for peptide 'VTGISGATAENVPVEQIVENNDIIILTPQILVNNLK(TMT6plex)': 7, 1)

Work dir:
  /hps/nobackup/juan/pride/reanalysis/differential-expression/tmt/PXD004683/work/3d/cd8e3f8ef03b30f3e01203b8ce664c

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

 -- Check '.nextflow.log' file for details
timosachsenberg commented 1 year ago

Ok this error is triggered in ConsensusID now: https://github.com/OpenMS/OpenMS/blob/f958a6bc954de76f373ea5bf659c04c11b0fce38/src/openms/source/ANALYSIS/ID/ConsensusIDAlgorithm.cpp#L135

Ah ok, I think the reason is that the spectrum is annotated as charge 7 (!) and sage sets the "z=other" column to 1 (see https://github.com/lazear/sage/blob/675d507bea942a792a87408ddf4f8c38195b59cf/crates/sage-cli/src/output.rs#L196C27-L200 ). I misinterpreted how this is handled by sage and thought that the z=other column will get the value 7 in this case.

Because I can't reconstruct the proper charge from the pin file I think the best approach currently would be to just filter those PSMs.

@lazear would it make sense to encode charges outside the expected range in the "z=other" column instead of using a one-hot encoding? That way I could set a proper charge for the hit without going back to the spectra. For the SVM this should be no problem.

lazear commented 1 year ago

That is correct, and I can encode the charges outside of the range as an integer value (e.g. 2 or 7) instead of one-hot-encoding them.

Are you guys actually running percolator directly off of the .pin files? Could also pull the same features (not all log-transformed though), like the actual integer value of charge, from the results.sage.tsv file - up to you if that's an easier way.

timosachsenberg commented 1 year ago

For consistency reasons, we currently only read the pin file. Reason is that this way we have full control (e.g., we can still annotate some metadata needed to do consensus identification etc.).

lazear commented 1 year ago

Gotcha - I will update the pin file

ypriverol commented 1 year ago

This is solved.