Closed mo-khaife-bot closed 4 years ago
I see different errors in there that indicate that your database search might have failed and that your script might not have been configured appropriately. Personally, I don't recommend the use of scripts, and since this might be the first time you are trying the tools, I suggest that you try following the tutorials we have on one or two data sets. If you have more, or if you want to automate things in a better way, then I suggest trying the pipeline command.
Here is our wiki: https://github.com/Nesvilab/philosopher/wiki
So I followed the "Simple Data Analysis" tutorial on the wiki.
(Apologies for this long, pedantic message, but I want to be very thorough as this is a big stumbling block for me.)
I used the data set: 20190202_QExHFX2_RSLC8_PST_HeLa_10ng_1ulLoop_muPAC_1hr_15k_7.raw
I found 2 stages in the tutorial where I kept getting errors, making it difficult to progress to the end.
Below I go into more detail and provide the relevant outputs.
The first problem came when I got to Step 3: Performing a database search with MSFragger.
I ran the command below, but it did not generate a .pepXML file:
[bt19655@dn19 lab_ideas]$ java -Xmx8g -jar /data/home/bt19655/Protein_Identification/Tools/MSFragger-20171106/MSFragger-3.0.jar closed_fragger.params 20190202_QExHFX2_RSLC8_PST_HeLa_10ng_1ulLoop_muPAC_1hr_15k_7.raw
MSFragger version MSFragger-3.0
Batmass-IO version 1.17.4
(c) University of Michigan
RawFileReader reading tool. Copyright (c) 2016 by Thermo Fisher Scientific, Inc. All rights reserved.
System OS: Linux, Architecture: amd64
Java Info: 1.8.0_242, OpenJDK 64-Bit Server VM, Oracle Corporation
JVM started with 7 GB memory
Checking database...
Checking /data/home/bt19655/lab_ideas/20190202_QExHFX2_RSLC8_PST_HeLa_10ng_1ulLoop_muPAC_1hr_15k_7.raw...
Failed in checking /data/home/bt19655/lab_ideas/20190202_QExHFX2_RSLC8_PST_HeLa_10ng_1ulLoop_muPAC_1hr_15k_7.raw
Batmass-IO binaries for Thermo support and/or Thermo native libraries not found found
here is the first part of my closed_fragger.params file where you can see the database I generated in Step 2 of this tutorial
num_threads = 0 # Number of CPU threads to use.
database_name = /data/home/bt19655/lab_ideas/2020-06-17-decoys-contam-UP000005640.fas
# Path to the protein database file in FASTA format.
I had seen in some of the MSFragger resources that they advise converting the .raw file into .mzML, so I did this (I used the ThermoRawFileParser GUI) and ran the .mzML file through MSFragger in its place. I was then able to generate the .pepXML file and go on to the next step. Was this the correct thing to do?
This seems to be where my biggest obstacle is, as it affects the later stage of filtering and estimating FDR. Although Step 4: PeptideProphet seems to work, in that I'm able to generate the pep.xml file, the output from this step is confusing because it contains several warnings.
The command and output of this step are below:
[bt19655@dn19 lab_ideas]$ ./philosopher peptideprophet --database 2020-06-17-decoys-contam-UP000005640.fas --ppm --accmass --expectscore --decoyprobs --nonparam 20190202_QExHFX2_RSLC8_PST_HeLa_10ng_1ulLoop_muPAC_1hr_15k_7.pepXML
INFO[00:14:33] Executing PeptideProphet v3.2.3
file 1: /data/home/bt19655/lab_ideas/20190202_QExHFX2_RSLC8_PST_HeLa_10ng_1ulLoop_muPAC_1hr_15k_7.pepXML
No index list offset found. File will not be read.
WARNING: cannot open data file /data/home/bt19655/lab_ideas/20190202_QExHFX2_RSLC8_PST_HeLa_10ng_1ulLoop_muPAC_1hr_15k_7.mzML in msms_run_summary tag... trying .mzXML ...
WARNING: CANNOT correct data file /data/home/bt19655/lab_ideas/20190202_QExHFX2_RSLC8_PST_HeLa_10ng_1ulLoop_muPAC_1hr_15k_7.mzXML in msms_run_summary tag...
No index list offset found. File will not be read.
WARNING: cannot open data file /data/home/bt19655/lab_ideas/20190202_QExHFX2_RSLC8_PST_HeLa_10ng_1ulLoop_muPAC_1hr_15k_7.mzML in msms_run_summary tag... trying .mzXML ...
WARNING: CANNOT correct data file /data/home/bt19655/lab_ideas/20190202_QExHFX2_RSLC8_PST_HeLa_10ng_1ulLoop_muPAC_1hr_15k_7.mzXML in msms_run_summary tag...
processed altogether 21115 results
INFO: Results written to file: /data/home/bt19655/lab_ideas/interact-20190202_QExHFX2_RSLC8_PST_HeLa_10ng_1ulLoop_muPAC_1hr_15k_7.pep.xml
- /data/home/bt19655/lab_ideas/interact-20190202_QExHFX2_RSLC8_PST_HeLa_10ng_1ulLoop_muPAC_1hr_15k_7.pep.xml
- Building Commentz-Walter keyword tree...
- Searching the tree...
- Linking duplicate entries...
- Printing results...
using Accurate Mass Bins
using PPM mass difference
Using Decoy Label "rev_".
Decoy Probabilities will be reported.
Using non-parametric distributions
(X! Tandem) (using Tandem's expectation score for modeling)
adding ACCMASS mixture distribution
using search_offsets in ACCMASS mixture distr: 0
init with X! Tandem trypsin
MS Instrument info: Manufacturer: UNKNOWN, Model: UNKNOWN, Ionization: UNKNOWN, Analyzer: UNKNOWN, Detector: UNKNOWN
INFO: Processing standard MixtureModel ...
PeptideProphet (TPP v5.2.1-dev Flammagenitus, Build 201906251008-exported (Linux-x86_64)) AKeller@ISB
read in 0 1+, 14880 2+, 5714 3+, 490 4+, 29 5+, 2 6+, and 0 7+ spectra.
Initialising statistical models ...
Found 2677 Decoys, and 18438 Non-Decoys
Iterations: .........10.........20......
WARNING: Mixture model quality test failed for charge (1+).
WARNING: Mixture model quality test failed for charge (6+).
WARNING: Mixture model quality test failed for charge (7+).
model complete after 27 iterations
INFO[00:15:49] Done
Step 5: ProteinProphet worked fine; I was able to generate the interact.prot.xml file with no problems. It was when I got to Step 6: Filter & Estimate FDR that I had problems.
It seems it's unable to find the database data and unable to marshal a file (open .meta/db.bin).
The output is below:
[bt19655@dn19 lab_ideas]$ ./philosopher filter --razor --pepxml interact-20190202_QExHFX2_RSLC8_PST_HeLa_10ng_1ulLoop_muPAC_1hr_15k_7.pep.xml --protxml interact.prot.xml
INFO[00:36:42] Executing Filter v3.2.3
INFO[00:36:42] Processing peptide identification files
INFO[00:36:45] 1+ Charge profile decoy=0 target=0
INFO[00:36:45] 2+ Charge profile decoy=150 target=10314
INFO[00:36:45] 3+ Charge profile decoy=38 target=4126
INFO[00:36:45] 4+ Charge profile decoy=3 target=320
INFO[00:36:45] 5+ Charge profile decoy=0 target=0
INFO[00:36:45] 6+ Charge profile decoy=0 target=0
INFO[00:36:45] Database search results ions=14221 peptides=13061 psms=14951
INFO[00:36:45] Converged to 1.00 % FDR with 14636 PSMs decoy=147 threshold=0.1928 total=14783
INFO[00:36:46] Converged to 1.00 % FDR with 12757 Peptides decoy=128 threshold=0.2657 total=12885
INFO[00:36:46] Converged to 1.00 % FDR with 13935 Ions decoy=140 threshold=0.2104 total=14075
INFO[00:36:50] Protein inference results decoy=270 target=5647
INFO[00:36:50] Converged to 1.03 % FDR with 1849 Proteins decoy=19 threshold=0.9844 total=1868
INFO[00:36:51] 2D FDR estimation: Protein mirror image decoy=1849 target=1849
INFO[00:36:52] Second filtering results ions=13707 peptides=12548 psms=14431
INFO[00:36:52] Converged to 0.14 % FDR with 14410 PSMs decoy=21 threshold=0.0509 total=14431
INFO[00:36:52] Converged to 0.16 % FDR with 12527 Peptides decoy=21 threshold=0.051 total=12548
INFO[00:36:52] Converged to 0.15 % FDR with 13686 Ions decoy=21 threshold=0.051 total=13707
WARN[00:36:53] Cannot marshal file. open .meta/db.bin: no such file or directory
WARN[00:36:53] Cannot serialize file. EOF
FATA[00:36:53] Database data not available, interrupting processing
Please advise how I can best overcome this.
Thanks for giving the step-by-step tutorial a try. To read raw spectral files, MSFragger accesses libraries stored in the ext folder, which needs to be in the same directory as the MSFragger .jar file (so we don't recommend moving the MSFragger .jar file around separately). The warnings from PeptideProphet can be ignored.
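A quick way to confirm that layout before launching a search is to check for an ext directory beside the .jar. This is only a minimal sketch: the function name check_msfragger_ext is hypothetical, and it tests only that the directory exists, not that the Thermo libraries inside it are complete.

```shell
# Hypothetical helper (not part of MSFragger): report whether an "ext"
# directory sits in the same directory as the given MSFragger .jar file.
# Returns 0 if found, 1 if ext is missing, 2 if the .jar itself is missing.
check_msfragger_ext() {
    jar_path="$1"
    jar_dir=$(dirname -- "$jar_path")
    if [ ! -f "$jar_path" ]; then
        echo "jar not found: $jar_path"
        return 2
    fi
    if [ -d "$jar_dir/ext" ]; then
        echo "ok: $jar_dir/ext"
        return 0
    fi
    echo "missing: $jar_dir/ext"
    return 1
}
```

For the command in this thread it would be called as check_msfragger_ext /data/home/bt19655/Protein_Identification/Tools/MSFragger-20171106/MSFragger-3.0.jar.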
For the filtering step, the Philosopher workspace needs information about the sequence database (so either the database needs to be created by Philosopher in that same workspace, or an existing database must be annotated). Try running the following before re-running the filter step:
philosopher database --annotate 2020-06-17-decoys-contam-UP000005640.fas
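Since the fatal error above came from a missing .meta/db.bin, it can help to check for that file before re-running the filter step. The sketch below is an assumption-laden convenience: the function name workspace_has_database is hypothetical, and it treats a non-empty .meta/db.bin as "database available", which is inferred from the error message rather than from Philosopher's documentation.

```shell
# Hypothetical helper (not a Philosopher command): succeed only if the
# workspace in the given directory (default: current directory) contains a
# non-empty .meta/db.bin, the file the filter step failed to open.
workspace_has_database() {
    ws_dir="${1:-.}"
    if [ -s "$ws_dir/.meta/db.bin" ]; then
        echo "database found: $ws_dir/.meta/db.bin"
        return 0
    fi
    echo "no database in $ws_dir/.meta; run 'philosopher database --annotate <fasta>' first"
    return 1
}
```

Run from the workspace directory, workspace_has_database && ./philosopher filter ... would then skip the filter step when the database has not been annotated yet.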
Thank you for your suggestion @hayse1. I went ahead with MSFragger reading mzML files, since when I actually run MSFragger it will need to use mzML files.
Everything apart from MSFragger is within the working directory that I ran philosopher from.
I made sure to run the following commands before going through the rest of the tutorial:
./philosopher workspace --clean
./philosopher workspace --init
I then annotated my database as you suggested.
Then I checked whether it had altered the database in any way via the command below:
[bt19655@dn35 lab_ideas]$ ls -lart
total 1152018
-rw-r--r-- 1 bt19655 qmul 34494 Nov 15 2019 License
-rw-rw-r-- 1 bt19655 qmul 26406 Feb 7 19:54 philosopher.yml
-rw-r--r-- 1 bt19655 qmul 110 Mar 11 20:30 Changelog
-rwxrwxr-x 1 bt19655 qmul 137179136 Mar 11 20:39 philosopher
-rw-r--r-- 1 bt19655 qmul 68067958 Jun 17 20:04 2020-06-17-decoys-contam-UP000005640.fas
drwxr-xr-x 10 bt19655 users 4096 Jun 17 20:29 ..
-rw-r--r-- 1 bt19655 qmul 672695519 Jun 17 22:21 20190202_QExHFX2_RSLC8_PST_HeLa_10ng_1ulLoop_muPAC_1hr_15k_7.raw
-rw-r--r-- 1 bt19655 qmul 300998222 Jun 17 22:43 20190202_QExHFX2_RSLC8_PST_HeLa_10ng_1ulLoop_muPAC_1hr_15k_7.mzML
-rw-r--r-- 1 bt19655 qmul 8088 Jun 17 23:59 closed_fragger.params
drwxr-xr-x 3 bt19655 qmul 4096 Jun 19 12:45 .
drwxr-xr-x 2 bt19655 qmul 1024 Jun 19 12:45 .meta
My database (2020-06-17-decoys-contam-UP000005640.fas) seems unchanged since the day I made it, i.e. Jun 17th, as the listing above shows.
Regardless, I went through the other steps as you suggested.
When I got to the filtering step, it seems it's unable to quantify the data set: "The PSM list is enpty" (exactly how it's spelt).
The output is below:
[bt19655@dn35 lab_ideas]$ ./philosopher filter --razor --pepxml interact-20190202_QExHFX2_RSLC8_PST_HeLa_10ng_1ulLoop_muPAC_1hr_15k_7.pep.xml --protxml interact.prot.xml
INFO[13:00:54] Executing Filter v3.2.3
INFO[13:00:54] Processing peptide identification files
INFO[13:00:58] 1+ Charge profile decoy=0 target=0
INFO[13:00:58] 2+ Charge profile decoy=161 target=10251
INFO[13:00:58] 3+ Charge profile decoy=38 target=4128
INFO[13:00:58] 4+ Charge profile decoy=3 target=320
INFO[13:00:58] 5+ Charge profile decoy=0 target=0
INFO[13:00:58] 6+ Charge profile decoy=0 target=0
INFO[13:00:58] Database search results ions=14175 peptides=13016 psms=14901
INFO[13:00:58] Converged to 1.00 % FDR with 14533 PSMs decoy=146 threshold=0.2692 total=14679
INFO[13:00:59] Converged to 1.00 % FDR with 12659 Peptides decoy=127 threshold=0.4089 total=12786
INFO[13:00:59] Converged to 1.00 % FDR with 13848 Ions decoy=139 threshold=0.2984 total=13987
INFO[13:01:01] Protein inference results decoy=0 target=2492
INFO[13:01:01] Converged to 0.00 % FDR with 2492 Proteins decoy=0 threshold=0.05 total=2492
INFO[13:01:02] 2D FDR estimation: Protein mirror image decoy=2492 target=2492
INFO[13:01:02] Second filtering results ions=0 peptides=0 psms=0
INFO[13:01:02] Converged to 0.00 % FDR with 0 PSMs decoy=0 threshold=10 total=0
INFO[13:01:03] Converged to 0.00 % FDR with 0 Peptides decoy=0 threshold=10 total=0
INFO[13:01:03] Converged to 0.00 % FDR with 0 Ions decoy=0 threshold=10 total=0
INFO[13:01:04] Post processing identifications
INFO[13:01:06] Processing protein inference
INFO[13:01:11] Assigning protein identifications to layers
INFO[13:01:12] Updating razor PSM assignment to proteins
INFO[13:01:12] Calculating spectral counts
WARN[13:01:12] Cannot quantify data set. The PSM list is enpty
INFO[13:01:12] Saving
INFO[13:01:13] Done
I also noticed that some of the values I got are different from the values in the initial output I sent you. Is this important?
In particular, the lines reporting "Converged to ... % FDR" after the second filtering now show 0.00 %, with 0 PSMs, 0 Peptides and 0 Ions.
Please compare the Filtering step output in the message @hayse1 responded to with the Filtering step output pasted in this message (directly above).
Please suggest how I can best move forward. Thank you for any suggestions.
For Philosopher analyses, every step needs to be run sequentially, so things can get tricky when attempting to re-run individual steps. I recommend trying out the FragPipe GUI (works great on Windows and Linux desktops), which is designed to make sure entire workflows run smoothly.
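For anyone staying on the command line, the sequential order can be made explicit in a small wrapper. This is a sketch under stated assumptions, not an official pipeline: run_step and philosopher_workflow are hypothetical names, the proteinprophet invocation is assumed from the tutorial rather than shown in this thread, the MSFragger search is assumed to have already produced the .pepXML file in the current directory, and the interact- file name is derived by simple suffix replacement. By default it only prints the commands; setting RUN=1 executes them and stops at the first failure.

```shell
# Print one step, and execute it only when RUN=1 (dry run by default).
run_step() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@" || return 1
    fi
}

# Hypothetical wrapper: runs the steps from this thread in a fixed order.
# $1 = FASTA database, $2 = MSFragger .pepXML file in the current directory.
philosopher_workflow() {
    fasta="$1"
    pepxml="$2"
    # PeptideProphet wrote interact-<name>.pep.xml in the logs above.
    interact="interact-${pepxml%.pepXML}.pep.xml"

    run_step ./philosopher workspace --clean &&
    run_step ./philosopher workspace --init &&
    run_step ./philosopher database --annotate "$fasta" &&
    run_step ./philosopher peptideprophet --database "$fasta" --ppm --accmass --expectscore --decoyprobs --nonparam "$pepxml" &&
    # Assumption: proteinprophet takes the interact pep.xml as its argument.
    run_step ./philosopher proteinprophet "$interact" &&
    run_step ./philosopher filter --razor --pepxml "$interact" --protxml interact.prot.xml
}
```

Calling philosopher_workflow 2020-06-17-decoys-contam-UP000005640.fas 20190202_QExHFX2_RSLC8_PST_HeLa_10ng_1ulLoop_muPAC_1hr_15k_7.pepXML without RUN=1 prints the six commands for review before anything touches the workspace.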
I'm currently running Philosopher on Linux via my university's Linux HPC cluster.
I'm having problems getting the rest of the Philosopher workflow to work on my MSFragger output (I used the script from here: http://msfragger.nesvilab.org/tutorial_linux.html).
The version of Philosopher I am using is v3.2.3.
I have moved the mzML file to the same location as the pepXML. As I'm running this on an HPC cluster, it has provided the following error and output files.
This is the error file:
Please advise how best to progress.