PyProphet / pyprophet

PyProphet: Semi-supervised learning and scoring of OpenSWATH results.
http://www.openswath.org
BSD 3-Clause "New" or "Revised" License

Request: Ability to keep decoys during scoring->export #22

Closed abelew closed 3 years ago

abelew commented 6 years ago

Hello, I am experimenting with a pipeline that passes the osw outputs of OpenSwathWorkflow (OpenMS) to pyprophet (the main branch on GitHub), then to feature_alignment.py (msproteomicstools main branch), then to SWATH2stats (slightly modified main branch), and finally to MSstats (significantly modified main branch). I am comparing the result to what happens when I cast the intensity matrix to an expressionSet and pass it to limma/DESeq2/edgeR.

In a previous iteration of the same process, I used the tsv outputs from openMS etc and was able to explicitly look at my decoy scores from the beginning to the end.

All the pieces are mostly working as expected; but I am noticing that when I get to the export stage in pyprophet I am losing all the decoys in the resulting tsv. Looking more closely at the git repository, I am seeing that there are explicit exclusions of the decoy rows in runner.py, ipf.py, and export.py. Therefore I am able to see the decoy entries if I tail the data at line 146 of export.py but they are gone after the merge on line 189.

In my own exploration of the score/export process, I have played with removing the portions of the where clauses which explicitly remove the decoys (line 167 of export.py, line 101 of runner.py, and a few places in ipf.py). I quickly realized this is a bad idea, as it messes up the enumeration at lines 165-168 of data_handling.py.

My primary question: Is there a specific reason to remove the decoys when scoring/exporting? If so, what is it, and why then keep the decoy columns in the data? My secondary question: Assuming I can work through the existing logic and parameterize the inclusion/exclusion of the decoys, would that be worth submitting as a pull request?
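The parameterization asked about here might look roughly like the following minimal sketch. The table and column names (PRECURSOR.DECOY, FEATURE.PRECURSOR_ID, SCORE_MS2.FEATURE_ID), the function `fetch_scored_features`, and its `include_decoys` switch are assumptions made for illustration on a toy in-memory database; this is not PyProphet's actual export code.

```python
import sqlite3


def fetch_scored_features(con, include_decoys=False):
    """Return (feature_id, decoy, score) rows, optionally keeping decoys.

    Hypothetical helper: the decoy filter is appended only when requested,
    instead of being hard-coded into the WHERE clause.
    """
    query = (
        "SELECT FEATURE.ID, PRECURSOR.DECOY, SCORE_MS2.SCORE "
        "FROM FEATURE "
        "INNER JOIN PRECURSOR ON FEATURE.PRECURSOR_ID = PRECURSOR.ID "
        "INNER JOIN SCORE_MS2 ON SCORE_MS2.FEATURE_ID = FEATURE.ID"
    )
    if not include_decoys:
        query += " WHERE PRECURSOR.DECOY = 0"
    query += " ORDER BY FEATURE.ID"
    return con.execute(query).fetchall()


# Toy database with an assumed, heavily simplified schema.
con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE PRECURSOR (ID INTEGER PRIMARY KEY, DECOY INTEGER);
    CREATE TABLE FEATURE (ID INTEGER PRIMARY KEY, PRECURSOR_ID INTEGER);
    CREATE TABLE SCORE_MS2 (FEATURE_ID INTEGER, SCORE REAL);
    INSERT INTO PRECURSOR VALUES (1, 0), (2, 1);
    INSERT INTO FEATURE VALUES (10, 1), (20, 2);
    INSERT INTO SCORE_MS2 VALUES (10, 3.5), (20, -1.2);
    """
)

targets_only = fetch_scored_features(con)
with_decoys = fetch_scored_features(con, include_decoys=True)
```

Keeping the filter behind a single switch leaves the default behaviour unchanged, which matters if downstream code relies on target-only rows.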

Thank you for your time.

grosenberger commented 6 years ago

Hi @abelew, thanks for trying the development version of PyProphet! Regarding the decoys: If you used IPF scoring (e.g. "pyprophet ipf" for PTM site-localization) during the analysis, unfortunately you can't export decoys at the moment, as the model does not support this.

If you do not require IPF inference, but did use transition-level scoring, we could implement a switch to also export decoy transitions, but we need to flag this in the legacy format, as all transitions are concatenated.

If you did not do transition-level scoring (i.e. standard OpenSWATH-mode), decoys should be exported automatically. There was a bug a few weeks ago, but if you recently updated from the GitHub master branch, this should work now. If it still does not, I need to look at it again and it would be great if you could provide all commands and parameters that you used.

Best regards, George

abelew commented 6 years ago

Greetings, I have tried a few things: both excluding and including ipf, and both including and excluding transition scoring. In all cases I can see the decoys when I step through in ipython at line 149 of export.py and lose them (depending on which scorings I used) by line ~206 when data is merged to data_protein. It does not seem to matter which branch I take at the if statements on lines 149 and 150. I am using the current master branch of pyprophet.

To answer your question, I am using a single OpenSwathWorkflow result file as my testing ground. This file was generated (among a bunch of others) via the following command line:

for input in ${swath_inputs}
do
 in_mzxml="mzXML/dia_${version}/${input}"
 name=$(basename "${input}" .mzXML)
 echo "Starting openswath run of ${name} using ${mz_windows} windows at $(date)."
 swath_output_prefix="${swath_outdir}/${name}_vs_${version}_${type}_${dda_method}_dia"
 OpenSwathWorkflow \
   -in "${input}" \
   -swath_windows_file "${windows}/openswath_${name}.txt" \
   -tr "${transition_prefix}.pqp" \
   -tr_irt iRT/iRTassays.TraML \
   -use_ms1_traces \
   -sort_swath_maps \
   -ppm \
   (a bunch of options the window size, RTNormalization, and Scoring because some runs were weird)
   -out_osw "${swath_output_prefix}.osw" \
   2>"${swath_outdir}/${name}.log" 1>&2
 echo "Invoking pyprophet now to look for problematic files."
 pyprophet \
  score \
  --in "${swath_output_prefix}.osw" \
  --level ms1 \
  --out "${swath_output_prefix}_m1.osw"
 echo "MS1 scoring for ${swath_output_prefix} finished with $?"
 pyprophet \
  score \
  --in "${swath_output_prefix}_m1.osw" \
  --level ms2 \
  --out "${swath_output_prefix}_m1m2.osw"
 echo "MS2 scoring for ${swath_output_prefix} finished with $?"
 pyprophet \
  score \
  --in "${swath_output_prefix}_m1m2.osw" \
  --out "${swath_output_prefix}_m1m2tr.osw"
 echo "Transition scoring for ${swath_output_prefix} finished with $?"
 pyprophet \
  protein \
  --in "${swath_output_prefix}_m1m2tr.osw" \
  --context run-specific
 pyprophet \
  export \
  --no-ipf \
  --in "${swath_output_prefix}_m1m2tr.osw" \
  --out "${swath_output_prefix}_scored.tsv"
 echo "pyprophet export finished with $?"
done

I arbitrarily chose one file from openswath, copied it to my testing directory as 'start.osw', changed directory to testing, and invoked the following:

pyprophet \
 score \
 --level ms1 \
 --in "start.osw" \
 --out "ms1.osw"

pyprophet \
 score \
 --level ms2 \
 --in "ms1.osw" \
 --out "ms2.osw"

pyprophet \
 export \
 --in "ms2.osw" \
 --out "ms2.tsv"

pyprophet \
 score \
 --level transition \
 --in "ms2.osw" \
 --out "transition.osw"

pyprophet \
 export \
 --in "transition.osw" \
 --out "transition.tsv"

As I suspect you can see, I did this so that I can open each intermediate in sqlite in an attempt to understand more fully what is happening. In neither case do I find the decoys in the tsv intermediates.
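For this kind of inspection, a small script can report how many target and decoy rows each score table holds in a given intermediate. The sketch below is only an illustration on a toy in-memory database: the schema (SCORE_* tables keyed by FEATURE_ID, FEATURE.PRECURSOR_ID, PRECURSOR.DECOY) is an assumption, and `decoy_counts` is a hypothetical helper. For a real file, the connection would be opened with `sqlite3.connect("ms2.osw")` instead.

```python
import sqlite3


def decoy_counts(con):
    """Map each SCORE_* table present to (n_target_rows, n_decoy_rows)."""
    tables = [
        row[0]
        for row in con.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name LIKE 'SCORE!_%' ESCAPE '!' "
            "ORDER BY name"
        )
    ]
    counts = {}
    for table in tables:
        # Table names come from sqlite_master, so interpolating them is safe here.
        rows = con.execute(
            f"SELECT PRECURSOR.DECOY, COUNT(*) FROM {table} "
            f"INNER JOIN FEATURE ON FEATURE.ID = {table}.FEATURE_ID "
            "INNER JOIN PRECURSOR ON PRECURSOR.ID = FEATURE.PRECURSOR_ID "
            "GROUP BY PRECURSOR.DECOY"
        ).fetchall()
        by_flag = dict(rows)
        counts[table] = (by_flag.get(0, 0), by_flag.get(1, 0))
    return counts


# Toy stand-in for an .osw intermediate (assumed, simplified schema).
con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE PRECURSOR (ID INTEGER PRIMARY KEY, DECOY INTEGER);
    CREATE TABLE FEATURE (ID INTEGER PRIMARY KEY, PRECURSOR_ID INTEGER);
    CREATE TABLE SCORE_MS1 (FEATURE_ID INTEGER, SCORE REAL);
    CREATE TABLE SCORE_MS2 (FEATURE_ID INTEGER, SCORE REAL);
    INSERT INTO PRECURSOR VALUES (1, 0), (2, 1);
    INSERT INTO FEATURE VALUES (10, 1), (11, 1), (20, 2);
    INSERT INTO SCORE_MS1 VALUES (10, 1.0), (20, -0.5);
    INSERT INTO SCORE_MS2 VALUES (10, 3.5), (11, 2.0), (20, -1.2);
    """
)

counts = decoy_counts(con)
```

If the decoy count is non-zero in every score table but the exported tsv contains none, the loss is happening in the export step rather than in scoring.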

In my earlier testing I also performed various iterations of pyprophet protein, but I do not think those are relevant. In addition, at one point, in a fit of pique, I updated the run ID to the filename for my transition.osw.

Thank you again for your time, this has been a most confusing little puzzle. I suppose if worst comes to worst, I do not need the decoys in my final matrix, but I rather like plotting the distribution of some of the scores for the decoys vs. the targets on a per-sample basis before passing them on to tric/SWATH2stats/MSstats.
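As a numeric stand-in for that per-sample comparison, the following sketch groups scores by run and decoy flag and reports the median of each group. The row layout `(run, decoy, score)`, the run names, and the helper `median_scores` are assumptions for illustration; the values are made up and do not come from any real export.

```python
from collections import defaultdict
from statistics import median


def median_scores(rows):
    """Return {(run, 'target' | 'decoy'): median score} for (run, decoy, score) rows."""
    grouped = defaultdict(list)
    for run, decoy, score in rows:
        label = "decoy" if decoy else "target"
        grouped[(run, label)].append(score)
    return {key: median(values) for key, values in grouped.items()}


# Made-up scores for two hypothetical runs.
rows = [
    ("run_a", 0, 2.0),
    ("run_a", 0, 4.0),
    ("run_a", 0, 3.0),
    ("run_a", 1, -1.0),
    ("run_a", 1, 1.0),
    ("run_b", 0, 5.0),
    ("run_b", 1, -2.0),
]

summary = median_scores(rows)
```

The same grouping keys could feed a histogram or density plot per run, which is only possible if the decoy rows survive the export.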

atb

grosenberger commented 6 years ago

Ok, could it be that you used the OpenSwathWorkflow setting -enable_uis_scoring? If so, could you please try running pyprophet export --no-transition_quantification and see whether this reports the decoys?

If this is the case and resolves your issue, then this is unfortunately a current limitation of the implementation: When OpenSWATH is run in IPF mode, transition features are only stored for targets, not decoys, although all other features are stored. If you don't necessarily need transition features, you can either run OpenSWATH in non-IPF mode (if you don't need to assess PTMs) or skip the decoys. I'm planning to add a feature to also store transition-level features for decoys in OpenSwathWorkflow, but this might take a few more weeks.
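The mechanism described above can be reproduced on a toy database: when transition-level rows exist only for targets, an inner join against them silently drops every decoy feature, while a left join keeps them with no transitions attached. The schema below (FEATURE_TRANSITION keyed by FEATURE_ID, PRECURSOR.DECOY) is an assumption chosen for illustration, and the queries are a sketch of the effect rather than the ones PyProphet runs.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE PRECURSOR (ID INTEGER PRIMARY KEY, DECOY INTEGER);
    CREATE TABLE FEATURE (ID INTEGER PRIMARY KEY, PRECURSOR_ID INTEGER);
    CREATE TABLE FEATURE_TRANSITION (FEATURE_ID INTEGER, TRANSITION_ID INTEGER);
    INSERT INTO PRECURSOR VALUES (1, 0), (2, 1);
    INSERT INTO FEATURE VALUES (10, 1), (20, 2);
    -- Transition rows exist for the target feature only.
    INSERT INTO FEATURE_TRANSITION VALUES (10, 100), (10, 101);
    """
)

# Inner join: features without transition rows (here, the decoy) vanish.
inner = con.execute(
    "SELECT DISTINCT F.ID, P.DECOY FROM FEATURE F "
    "INNER JOIN PRECURSOR P ON F.PRECURSOR_ID = P.ID "
    "INNER JOIN FEATURE_TRANSITION T ON T.FEATURE_ID = F.ID "
    "ORDER BY F.ID"
).fetchall()

# Left join: every feature is kept; the decoy simply has zero transitions.
left = con.execute(
    "SELECT F.ID, P.DECOY, COUNT(T.TRANSITION_ID) FROM FEATURE F "
    "INNER JOIN PRECURSOR P ON F.PRECURSOR_ID = P.ID "
    "LEFT JOIN FEATURE_TRANSITION T ON T.FEATURE_ID = F.ID "
    "GROUP BY F.ID, P.DECOY ORDER BY F.ID"
).fetchall()
```

Skipping the transition-level join altogether, as the suggested export option does, avoids the problem in the same way the left join does here.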