Nesvilab / philosopher

PeptideProphet, PTMProphet, ProteinProphet, iProphet, Abacus, and FDR filtering
https://philosopher.nesvilab.org
GNU General Public License v3.0

problem running Philosopher & PeptideProphet on MSFragger output #138

Closed mo-khaife-bot closed 4 years ago

mo-khaife-bot commented 4 years ago

I'm currently running Philosopher on my university's Linux HPC cluster.

I'm having problems getting the rest of the Philosopher workflow to work on my MSFragger output (I used the script from http://msfragger.nesvilab.org/tutorial_linux.html).

The version of Philosopher I am using is v3.2.3

I have moved the mzML file to the same location as the pepXML. Because I'm running this on an HPC cluster, the job produced an output file and an error file. This is the output file:

file 1: /data/home/bt19655/01524_E01_P015424_S00_N05_R1.pepXML
 file 2: /data/home/bt19655/01524_E02_P015424_S00_N13_R1.pepXML
 processed altogether 91845 results

  - /data/home/bt19655/interact.pep.xml

  - Searching the tree...
  - Linking duplicate entries...
  - Printing results...

Using Decoy Label "rev_".
Decoy Probabilities will be reported.
Using non-parametric distributions
 (X! Tandem) (using Tandem's expectation score for modeling)
using search_offsets in mass mixture distr: 0
init with X! Tandem trypsin
MS Instrument info: Manufacturer: UNKNOWN, Model: UNKNOWN, Ionization: UNKNOWN, Analyzer: UNKNOWN, Detector: UNKNOWN

INFO: Processing standard MixtureModel ...
Initialising statistical models ...
ProteinProphet (C++) by Insilicos LLC and LabKey Software, after the original Perl by A. Keller (TPP v5.2.1-dev Flammagenitus, Build 201906251008-exported (Linux-x86_64))
 (no FPKM) (using degen pep info)
Reading in /data/home/bt19655/interact.pep.xml...
did not find any PeptideProphet results in input data!  Did you forget to run PeptideProphet?
...read in 0 1+, 0 2+, 0 3+, 0 4+, 0 5+, 0 6+, 0 7+ spectra with min prob 0.05

WARNING: no data - output file will be empty

This is the error file:

time="19:41:46" level=info msg="Executing Workspace  v3.2.3"
time="19:41:47" level=info msg="Removing workspace"
time="19:41:47" level=warning msg="Cannot read file. open .meta/meta.bin: no such file or directory"
time="19:41:47" level=info msg=Done
time="19:41:47" level=info msg="Executing Workspace  v3.2.3"
time="19:41:47" level=info msg="Creating workspace"
time="19:41:47" level=info msg=Done
time="19:41:47" level=info msg="Executing Database  v3.2.3"
time="19:41:47" level=info msg="Processing database"
time="19:41:59" level=info msg=Done
time="19:41:59" level=info msg="Executing Report  v3.2.3"
time="19:41:59" level=fatal msg="Cannot read file:open .meta/ev.param.bin: no such file or directory"
time="19:41:59" level=info msg="Executing PeptideProphet  v3.2.3"
No index list offset found. File will not be read.
WARNING: cannot open data file /data/home/bt19655/01524_E01_P015424_S00_N05_R1.mzML in msms_run_summary tag... trying .mzXML ...
WARNING: CANNOT correct data file /data/home/bt19655/01524_E01_P015424_S00_N05_R1.mzXML in msms_run_summary tag...
No index list offset found. File will not be read.
WARNING: cannot open data file /data/home/bt19655/01524_E01_P015424_S00_N05_R1.mzML in msms_run_summary tag... trying .mzXML ...
WARNING: CANNOT correct data file /data/home/bt19655/01524_E01_P015424_S00_N05_R1.mzXML in msms_run_summary tag...
No index list offset found. File will not be read.
WARNING: cannot open data file /data/home/bt19655/01524_E02_P015424_S00_N13_R1.mzML in msms_run_summary tag... trying .mzXML ...
WARNING: CANNOT correct data file /data/home/bt19655/01524_E02_P015424_S00_N13_R1.mzXML in msms_run_summary tag...
No index list offset found. File will not be read.
WARNING: cannot open data file /data/home/bt19655/01524_E02_P015424_S00_N13_R1.mzML in msms_run_summary tag... trying .mzXML ...
WARNING: CANNOT correct data file /data/home/bt19655/01524_E02_P015424_S00_N13_R1.mzXML in msms_run_summary tag...
INFO: Results written to file: /data/home/bt19655/interact.pep.xml
  - Building Commentz-Walter keyword tree... PeptideProphet  (TPP v5.2.1-dev Flammagenitus, Build 201906251008-exported (Linux-x86_64)) AKeller@ISB
 read in 0 1+, 42466 2+, 32600 3+, 11760 4+, 3719 5+, 1300 6+, and 0 7+ spectra.
Found 0 Decoys, and 91845 Non-Decoys
WARNING: No decoys with label rev_ were found in this dataset. reverting to fully unsupervised method.
negmean = 0.0533258
time="19:42:34" level=info msg=Done
time="19:42:34" level=info msg="Executing ProteinProphet  v3.2.3"
time="19:42:36" level=fatal msg="Cannot execute program. There was an error with ProteinProphet, please check your parameters and input files"
time="19:42:36" level=info msg="Executing Filter  v3.2.3"
time="19:42:36" level=info msg="Processing peptide identification files"
time="19:42:46" level=info msg="1+ Charge profile" decoy=0 target=0
time="19:42:46" level=info msg="2+ Charge profile" decoy=0 target=42466
time="19:42:46" level=info msg="3+ Charge profile" decoy=0 target=32600
time="19:42:46" level=info msg="4+ Charge profile" decoy=0 target=11760
time="19:42:46" level=info msg="5+ Charge profile" decoy=0 target=3719
time="19:42:46" level=info msg="6+ Charge profile" decoy=0 target=1300
time="19:42:46" level=info msg="Database search results" ions=72508 peptides=66097 psms=91845
time="19:42:47" level=info msg="Converged to 0.00 % FDR with 91845 PSMs" decoy=0 threshold=0 total=91845
time="19:42:48" level=info msg="Converged to 0.00 % FDR with 66097 Peptides" decoy=0 threshold=0 total=66097
time="19:42:49" level=info msg="Converged to 0.00 % FDR with 72508 Ions" decoy=0 threshold=0 total=72508
time="19:42:50" level=fatal msg="Cannot read file. open ./interact.prot.xml: no such file or directory"
time="19:42:50" level=info msg="Executing Label-free quantification  v3.2.3"
time="19:42:50" level=fatal msg="Cannot read file:open .meta/ev.param.bin: no such file or directory"
time="19:42:50" level=info msg="Executing Report  v3.2.3"
time="19:42:50" level=fatal msg="Cannot read file:open .meta/ev.param.bin: no such file or directory"
time="19:42:50" level=info msg="Executing Workspace  v3.2.3"
time="19:42:51" level=info msg="Removing workspace"
time="19:42:51" level=info msg=Done

Please advise how best to proceed.

prvst commented 4 years ago

I see different errors in there that indicate that your database search might have failed and that your script might not have been configured appropriately. Personally, I don't recommend the use of scripts, and since this might be the first time you are trying the tools, I suggest that you try following the tutorials we have on one or two data sets. If you have more, or if you want to automate things in a better way, then I suggest trying the pipeline command.

Here is our wiki: https://github.com/Nesvilab/philosopher/wiki
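One symptom in the first log is `Found 0 Decoys, and 91845 Non-Decoys`, which PeptideProphet reports when no hit carries the `rev_` label. A quick way to narrow this down is to count the decoy entries in the FASTA file that the search used. The sketch below assumes decoy headers begin with `>rev_` (the label shown in the log); `count_decoys` is a hypothetical helper name, not part of Philosopher.

```shell
#!/bin/sh
# Count FASTA entries whose header starts with the decoy label.
# Assumption: decoys are marked by a header prefix such as ">rev_".
count_decoys() {
    fasta="$1"
    label="${2:-rev_}"
    # grep -c prints the number of matching lines; it exits non-zero
    # when the count is 0, so "|| true" keeps the function's status clean.
    grep -c "^>${label}" "$fasta" || true
}

# Example usage:
# count_decoys 2020-06-17-decoys-contam-UP000005640.fas
```

A count of 0 would suggest that the database given to the search engine contained no decoys under that label, which is worth ruling out before looking at the later steps.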

mo-khaife-bot commented 4 years ago

So I followed the "Simple Data Analysis" tutorial that was on the wiki

(apologies for this long pedantic message but I want to be very thorough as this is a big stumbling block for me)

I used the data set: 20190202_QExHFX2_RSLC8_PST_HeLa_10ng_1ulLoop_muPAC_1hr_15k_7.raw

I found 2 stages in the tutorial where I kept getting errors making it difficult to progress to the end

below I go into more details and provide the relevant outputs

The first problem came at Step 3: Performing a database search with MSFragger.

I ran the command below, but it did not generate a .pepXML file:

[bt19655@dn19 lab_ideas]$ java -Xmx8g -jar /data/home/bt19655/Protein_Identification/Tools/MSFragger-20171106/MSFragger-3.0.jar closed_fragger.params 20190202_QExHFX2_RSLC8_PST_HeLa_10ng_1ulLoop_muPAC_1hr_15k_7.raw
MSFragger version MSFragger-3.0
Batmass-IO version 1.17.4
(c) University of Michigan
RawFileReader reading tool. Copyright (c) 2016 by Thermo Fisher Scientific, Inc. All rights reserved.
System OS: Linux, Architecture: amd64
Java Info: 1.8.0_242, OpenJDK 64-Bit Server VM, Oracle Corporation
JVM started with 7 GB memory
Checking database...
Checking /data/home/bt19655/lab_ideas/20190202_QExHFX2_RSLC8_PST_HeLa_10ng_1ulLoop_muPAC_1hr_15k_7.raw...
Failed in checking /data/home/bt19655/lab_ideas/20190202_QExHFX2_RSLC8_PST_HeLa_10ng_1ulLoop_muPAC_1hr_15k_7.raw
Batmass-IO binaries for Thermo support and/or Thermo native libraries not found found

here is the first part of my closed_fragger.params file where you can see the database I generated in Step 2 of this tutorial

num_threads = 0                             # Number of CPU threads to use.
database_name = /data/home/bt19655/lab_ideas/2020-06-17-decoys-contam-UP000005640.fas    # Path to the protein database file in FASTA format.

I had seen in some of the MSFragger resources that converting the .raw file to .mzML is advised, so I did this (using the ThermoRawFileParser GUI) and ran the .mzML file through MSFragger in its place. I was then able to generate the .pepXML file and go on to the next step. Was this the correct thing to do?

This seems to be where my biggest obstacle is, as it affects the later stage of filtering and estimating FDR. Step 4: PeptideProphet seems to work, as I'm able to generate the pep.xml file, but the output from this step is confusing because it contains several warnings.

The command and output of this step are below:

[bt19655@dn19 lab_ideas]$ ./philosopher peptideprophet --database 2020-06-17-decoys-contam-UP000005640.fas --ppm --accmass --expectscore --decoyprobs --nonparam 20190202_QExHFX2_RSLC8_PST_HeLa_10ng_1ulLoop_muPAC_1hr_15k_7.pepXML 
INFO[00:14:33] Executing PeptideProphet  v3.2.3             
 file 1: /data/home/bt19655/lab_ideas/20190202_QExHFX2_RSLC8_PST_HeLa_10ng_1ulLoop_muPAC_1hr_15k_7.pepXML
No index list offset found. File will not be read.
WARNING: cannot open data file /data/home/bt19655/lab_ideas/20190202_QExHFX2_RSLC8_PST_HeLa_10ng_1ulLoop_muPAC_1hr_15k_7.mzML in msms_run_summary tag... trying .mzXML ...
WARNING: CANNOT correct data file /data/home/bt19655/lab_ideas/20190202_QExHFX2_RSLC8_PST_HeLa_10ng_1ulLoop_muPAC_1hr_15k_7.mzXML in msms_run_summary tag...
No index list offset found. File will not be read.
WARNING: cannot open data file /data/home/bt19655/lab_ideas/20190202_QExHFX2_RSLC8_PST_HeLa_10ng_1ulLoop_muPAC_1hr_15k_7.mzML in msms_run_summary tag... trying .mzXML ...
WARNING: CANNOT correct data file /data/home/bt19655/lab_ideas/20190202_QExHFX2_RSLC8_PST_HeLa_10ng_1ulLoop_muPAC_1hr_15k_7.mzXML in msms_run_summary tag...
 processed altogether 21115 results
INFO: Results written to file: /data/home/bt19655/lab_ideas/interact-20190202_QExHFX2_RSLC8_PST_HeLa_10ng_1ulLoop_muPAC_1hr_15k_7.pep.xml

  - /data/home/bt19655/lab_ideas/interact-20190202_QExHFX2_RSLC8_PST_HeLa_10ng_1ulLoop_muPAC_1hr_15k_7.pep.xml
  - Building Commentz-Walter keyword tree...
  - Searching the tree...
  - Linking duplicate entries...
  - Printing results...

using Accurate Mass Bins
using PPM mass difference
Using Decoy Label "rev_".
Decoy Probabilities will be reported.
Using non-parametric distributions
 (X! Tandem) (using Tandem's expectation score for modeling)
adding ACCMASS mixture distribution
using search_offsets in ACCMASS mixture distr: 0
init with X! Tandem trypsin 
MS Instrument info: Manufacturer: UNKNOWN, Model: UNKNOWN, Ionization: UNKNOWN, Analyzer: UNKNOWN, Detector: UNKNOWN

INFO: Processing standard MixtureModel ... 
 PeptideProphet  (TPP v5.2.1-dev Flammagenitus, Build 201906251008-exported (Linux-x86_64)) AKeller@ISB
 read in 0 1+, 14880 2+, 5714 3+, 490 4+, 29 5+, 2 6+, and 0 7+ spectra.
Initialising statistical models ...
Found 2677 Decoys, and 18438 Non-Decoys
Iterations: .........10.........20......
WARNING: Mixture model quality test failed for charge (1+).
WARNING: Mixture model quality test failed for charge (6+).
WARNING: Mixture model quality test failed for charge (7+).
model complete after 27 iterations
INFO[00:15:49] Done               

Step 5: ProteinProphet worked fine, and I was able to generate the interact.prot.xml file with no problems. It was when I got to Step 6: Filter & Estimate FDR that I had problems.

It seems unable to find the database data, and it cannot open .meta/db.bin.

The output is below:

[bt19655@dn19 lab_ideas]$ ./philosopher filter --razor --pepxml interact-20190202_QExHFX2_RSLC8_PST_HeLa_10ng_1ulLoop_muPAC_1hr_15k_7.pep.xml --protxml interact.prot.xml 
INFO[00:36:42] Executing Filter  v3.2.3                     
INFO[00:36:42] Processing peptide identification files      
INFO[00:36:45] 1+ Charge profile                             decoy=0 target=0
INFO[00:36:45] 2+ Charge profile                             decoy=150 target=10314
INFO[00:36:45] 3+ Charge profile                             decoy=38 target=4126
INFO[00:36:45] 4+ Charge profile                             decoy=3 target=320
INFO[00:36:45] 5+ Charge profile                             decoy=0 target=0
INFO[00:36:45] 6+ Charge profile                             decoy=0 target=0
INFO[00:36:45] Database search results                       ions=14221 peptides=13061 psms=14951
INFO[00:36:45] Converged to 1.00 % FDR with 14636 PSMs       decoy=147 threshold=0.1928 total=14783
INFO[00:36:46] Converged to 1.00 % FDR with 12757 Peptides   decoy=128 threshold=0.2657 total=12885
INFO[00:36:46] Converged to 1.00 % FDR with 13935 Ions       decoy=140 threshold=0.2104 total=14075
INFO[00:36:50] Protein inference results                     decoy=270 target=5647
INFO[00:36:50] Converged to 1.03 % FDR with 1849 Proteins    decoy=19 threshold=0.9844 total=1868
INFO[00:36:51] 2D FDR estimation: Protein mirror image       decoy=1849 target=1849
INFO[00:36:52] Second filtering results                      ions=13707 peptides=12548 psms=14431
INFO[00:36:52] Converged to 0.14 % FDR with 14410 PSMs       decoy=21 threshold=0.0509 total=14431
INFO[00:36:52] Converged to 0.16 % FDR with 12527 Peptides   decoy=21 threshold=0.051 total=12548
INFO[00:36:52] Converged to 0.15 % FDR with 13686 Ions       decoy=21 threshold=0.051 total=13707
WARN[00:36:53] Cannot marshal file. open .meta/db.bin: no such file or directory 
WARN[00:36:53] Cannot serialize file. EOF                   
FATA[00:36:53] Database data not available, interrupting processing 

Please advise how I can best overcome this.

sarah-haynes commented 4 years ago

Thanks for giving the step-by-step tutorial a try. To read raw spectral files, MSFragger accesses libraries stored in the ext folder, which needs to be in the same directory as the MSFragger .jar file (so we don't recommend moving the MSFragger .jar file around separately). The warnings from PeptideProphet can be ignored. For the filtering step, the Philosopher workspace needs information about the sequence database (so either the database needs to be created by Philosopher in that same workspace, or an existing database must be annotated). Try running the following before re-running the filter step:

philosopher database --annotate 2020-06-17-decoys-contam-UP000005640.fas
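The two prerequisites described here can be checked before launching a job. The following is a minimal sketch, not part of MSFragger or Philosopher: the helper names are hypothetical, and it assumes that the database data lives in `.meta/db.bin` (the path named in the filter error above).

```shell
#!/bin/sh
# Pre-flight checks before running MSFragger on .raw files and before
# running "philosopher filter". Helper names are hypothetical.

# MSFragger reads .raw files through libraries in an "ext" folder that
# must sit in the same directory as the .jar file.
has_ext_folder() {
    jar="$1"
    [ -d "$(dirname "$jar")/ext" ]
}

# The filter step failed on ".meta/db.bin", so check that the workspace
# holds a non-empty database file before filtering (assumed location).
has_workspace_db() {
    workdir="${1:-.}"
    [ -s "$workdir/.meta/db.bin" ]
}

# Example usage:
# has_ext_folder /path/to/MSFragger-3.0.jar || echo "ext folder missing next to the jar"
# has_workspace_db . || echo "annotate the database in this workspace first"
```

Both helpers only test for the presence of files; they do not verify that the contents are valid.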

mo-khaife-bot commented 4 years ago

Thank you for your suggestion @hayse1. I went ahead with MSFragger reading mzML files, since my real analyses will need to use mzML files.

Everything apart from MSFragger is within the working directory that I ran Philosopher from.

I made sure to run the following commands before going through the rest of the tutorial

./philosopher workspace --clean
./philosopher workspace --init

I then annotated my database as you suggested.

Then I checked whether it had altered the database in any way with the command below:

[bt19655@dn35 lab_ideas]$ ls -lart
total 1152018
-rw-r--r--  1 bt19655 qmul      34494 Nov 15  2019 License
-rw-rw-r--  1 bt19655 qmul      26406 Feb  7 19:54 philosopher.yml
-rw-r--r--  1 bt19655 qmul        110 Mar 11 20:30 Changelog
-rwxrwxr-x  1 bt19655 qmul  137179136 Mar 11 20:39 philosopher
-rw-r--r--  1 bt19655 qmul   68067958 Jun 17 20:04 2020-06-17-decoys-contam-UP000005640.fas
drwxr-xr-x 10 bt19655 users      4096 Jun 17 20:29 ..
-rw-r--r--  1 bt19655 qmul  672695519 Jun 17 22:21 20190202_QExHFX2_RSLC8_PST_HeLa_10ng_1ulLoop_muPAC_1hr_15k_7.raw
-rw-r--r--  1 bt19655 qmul  300998222 Jun 17 22:43 20190202_QExHFX2_RSLC8_PST_HeLa_10ng_1ulLoop_muPAC_1hr_15k_7.mzML
-rw-r--r--  1 bt19655 qmul       8088 Jun 17 23:59 closed_fragger.params
drwxr-xr-x  3 bt19655 qmul       4096 Jun 19 12:45 .
drwxr-xr-x  2 bt19655 qmul       1024 Jun 19 12:45 .meta

My database (2020-06-17-decoys-contam-UP000005640.fas) seems unchanged since the day I made it (Jun 17), as the listing above shows.

Regardless, I went through the other steps as you suggested.

When I got to the filtering step, it seemed unable to quantify the data set: "The PSM list is enpty" (exactly how it's spelt).

The output is below:

[bt19655@dn35 lab_ideas]$ ./philosopher filter --razor --pepxml interact-20190202_QExHFX2_RSLC8_PST_HeLa_10ng_1ulLoop_muPAC_1hr_15k_7.pep.xml --protxml interact.prot.xml
INFO[13:00:54] Executing Filter  v3.2.3                     
INFO[13:00:54] Processing peptide identification files      
INFO[13:00:58] 1+ Charge profile                             decoy=0 target=0
INFO[13:00:58] 2+ Charge profile                             decoy=161 target=10251
INFO[13:00:58] 3+ Charge profile                             decoy=38 target=4128
INFO[13:00:58] 4+ Charge profile                             decoy=3 target=320
INFO[13:00:58] 5+ Charge profile                             decoy=0 target=0
INFO[13:00:58] 6+ Charge profile                             decoy=0 target=0
INFO[13:00:58] Database search results                       ions=14175 peptides=13016 psms=14901
INFO[13:00:58] Converged to 1.00 % FDR with 14533 PSMs       decoy=146 threshold=0.2692 total=14679
INFO[13:00:59] Converged to 1.00 % FDR with 12659 Peptides   decoy=127 threshold=0.4089 total=12786
INFO[13:00:59] Converged to 1.00 % FDR with 13848 Ions       decoy=139 threshold=0.2984 total=13987
INFO[13:01:01] Protein inference results                     decoy=0 target=2492
INFO[13:01:01] Converged to 0.00 % FDR with 2492 Proteins    decoy=0 threshold=0.05 total=2492
INFO[13:01:02] 2D FDR estimation: Protein mirror image       decoy=2492 target=2492
INFO[13:01:02] Second filtering results                      ions=0 peptides=0 psms=0
INFO[13:01:02] Converged to 0.00 % FDR with 0 PSMs           decoy=0 threshold=10 total=0
INFO[13:01:03] Converged to 0.00 % FDR with 0 Peptides       decoy=0 threshold=10 total=0
INFO[13:01:03] Converged to 0.00 % FDR with 0 Ions           decoy=0 threshold=10 total=0
INFO[13:01:04] Post processing identifications              
INFO[13:01:06] Processing protein inference                 
INFO[13:01:11] Assigning protein identifications to layers  
INFO[13:01:12] Updating razor PSM assignment to proteins    
INFO[13:01:12] Calculating spectral counts                  
WARN[13:01:12] Cannot quantify data set. The PSM list is enpty
INFO[13:01:12] Saving                                       
INFO[13:01:13] Done     

I also noticed that some of the values differ from those in the initial output I sent you. Is this important?

In particular, the lines that report a converged FDR now show 0.00 %, with 0 PSMs, 0 Peptides and 0 Ions after the second filtering.

For comparison, please see the filtering output in my previous message (the one @hayse1 responded to) alongside the filtering output pasted directly above.

Please suggest how I can best move forward. Thank you for any suggestions.
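The two filter runs differ most visibly in the protein-level lines: the earlier run reports `Protein inference results decoy=270 target=5647`, the later one `decoy=0 target=2492`, and the second filtering drops to 0 PSMs. To compare runs like these side by side, the relevant lines can be pulled out of saved logs. This is a rough sketch with a hypothetical helper name; it assumes the log lines have the `INFO[hh:mm:ss]` form pasted above (lines in the `time="..." level=info` form are kept but not stripped).

```shell
#!/bin/sh
# Print the FDR-related lines of a saved "philosopher filter" log without
# the "INFO[hh:mm:ss] " prefix, so two runs can be compared with diff.
# fdr_summary is a hypothetical helper name.
fdr_summary() {
    grep -E 'Converged to|Protein inference results|Second filtering results' "$1" |
        sed -E 's/^[A-Z]+\[[0-9:]+\] //' |
        tr -s ' '
}

# Example usage, assuming each run's output was saved to a file:
# fdr_summary filter_run1.log > run1.txt
# fdr_summary filter_run2.log > run2.txt
# diff run1.txt run2.txt
```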

sarah-haynes commented 4 years ago

For Philosopher analyses, every step needs to be run sequentially, so things can get tricky when attempting to re-run individual steps. I recommend trying out the FragPipe GUI (works great on Windows and Linux desktops), which is designed to make sure entire workflows run smoothly.
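For anyone who stays on the command line, the sequential requirement can be made explicit in a wrapper that stops at the first failing step. In the first log of this thread, later steps kept running after earlier ones had already reported fatal errors. The sketch below chains the commands quoted in this thread with `&&`. The `proteinprophet` argument form is an assumption, the interact file name follows the PeptideProphet log above, and `run` and `run_workflow` are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: run the Philosopher steps from this thread in order and stop at
# the first step that fails. DRY_RUN=1 (the default) only prints commands.
PHILOSOPHER="${PHILOSOPHER:-./philosopher}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "1" ]; then
        return 0
    fi
    "$@"
}

run_workflow() {
    fasta="$1"
    pepxml="$2"
    # PeptideProphet wrote "interact-<name>.pep.xml" in the log above.
    pep="interact-$(basename "$pepxml" .pepXML).pep.xml"

    # Assumption: proteinprophet takes the pep.xml file as its argument.
    # Each "&&" means the next step runs only if the previous one succeeded.
    run "$PHILOSOPHER" workspace --clean &&
    run "$PHILOSOPHER" workspace --init &&
    run "$PHILOSOPHER" database --annotate "$fasta" &&
    run "$PHILOSOPHER" peptideprophet --database "$fasta" --ppm --accmass \
        --expectscore --decoyprobs --nonparam "$pepxml" &&
    run "$PHILOSOPHER" proteinprophet "$pep" &&
    run "$PHILOSOPHER" filter --razor --pepxml "$pep" --protxml interact.prot.xml
}

# Example usage (prints the commands; set DRY_RUN=0 to execute them):
# run_workflow 2020-06-17-decoys-contam-UP000005640.fas sample.pepXML
```

The dry-run default is a deliberate choice: it lets the command sequence be reviewed before anything touches the workspace.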