Using PEP statistical analysis, and how to see q- and e-values

CarlaCristinaUranga commented 8 years ago

Hi, Peptide Shaker is crashing when I ask to use the PEP statistical analysis instead of FDR. I read a useful article on interpreting data and the PEP would be useful for me, since I just want to be sure I am identifying proteins correctly. Is there any way to get q-values and E-values from corresponding search algorithms such as MSGFplus and OMSSA? Thank you kindly. I am a big fan of this pipeline! It is much more user-friendly than TPP. Thanks and be well.

mvaudel commented 8 years ago

Hi,

Sorry to hear that you are encountering issues with the software. Can you detail what happened and how to reproduce it? Also, can you send us the log of the software? You can find it under the Help → Bug Report menu.

You can display the scores on the interface using the View → Scores menu. The scores given by every search engine can be inspected in the Spectrum Identification tab. Note that they might be log transformed (-10log(e-value)). You can also export all scores via the Export → Identification Features menu.

Hope this helps, and many thanks for the kind words on our work :)

Marc
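To make the log transformation mentioned above concrete, here is a minimal Python sketch for converting between an e-value and the transformed score. It assumes the logarithm is base 10, which the comment does not state explicitly; the function names are hypothetical and are not part of PeptideShaker.

```python
import math


def transformed_from_evalue(e_value: float) -> float:
    """Return -10 * log10(e_value). Base 10 is an assumption."""
    return -10.0 * math.log10(e_value)


def evalue_from_transformed(score: float) -> float:
    """Invert the transformation above to recover the e-value."""
    return 10.0 ** (-score / 10.0)
```

Under this assumption a smaller e-value gives a larger transformed score, so an exported column of transformed scores can be mapped back to e-values before comparing it with the raw search engine output.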

CarlaCristinaUranga commented 8 years ago

Hi Marc, so I am using a SLURM batch file to execute SearchGUI on 4 nodes at a server here at CICESE (where I am getting my doctorate). However, the .err file is giving this readout:

    [curanga@omica SearchGUI-3.1.1]$ tail -f proteomica.err
        at javax.swing.SwingWorker.doneEDT(SwingWorker.java:740)
        at javax.swing.SwingWorker.access$100(SwingWorker.java:225)
        at javax.swing.SwingWorker$2.done(SwingWorker.java:302)
        at java.util.concurrent.FutureTask.finishCompletion(FutureTask.java:384)
        at java.util.concurrent.FutureTask.setException(FutureTask.java:251)
        at java.util.concurrent.FutureTask.run(FutureTask.java:271)
        at javax.swing.SwingWorker.run(SwingWorker.java:334)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

and the .log file is giving this readout:

    [curanga@omica SearchGUI-3.1.1]$ tail -f proteomica.log
    Fri Oct 14 17:58:36 PDT 2016 Validating MGF file: /LUSTRE/curanga/proteomica/SearchGUI-3.1.1/carla_lasiodip_1.mgf
    Fri Oct 14 17:58:37 PDT 2016 Validating MGF file: /LUSTRE/curanga/proteomica/SearchGUI-3.1.1/carla_lasiodip_2.mgf

And the .log file seems to be stuck. If you could help me I would appreciate it. All of my issues are due to lack of memory/RAM, and lack of knowledge on my part. I love bioinformatics, though, computers are amazing.

Best wishes,

Carla Uranga

mvaudel commented 8 years ago

Hi Carla,

Many thanks for the additional information. Would it be possible for you to share the command lines executed with us?

When working on multiple nodes, we recommend using one clean copy of SearchGUI per node. The search engines use temporary files that we have no control over, so running multiple command lines on the same instance of SearchGUI might create conflicts. Similarly, you might want to redirect all logs and temporary files using the -log and -temp_folder options. See https://github.com/compomics/searchgui/wiki/SearchCLI#generic-temporary-folder.

That said, the original report was about PeptideShaker usage? If you are using the tool from the command line, you will find the log in the resources folder of the tool. Here again you might want to use the -log and -temp_folder options :)
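As a rough illustration of the "one clean copy per node" advice, the Python sketch below copies a SearchGUI installation into a node-specific folder and prepares separate log and temporary folders. The folder layout, the function name and the node numbering are assumptions made for this example, and the option spelling `-temp_folder` should be checked against the wiki page linked above before use.

```python
import shutil
from pathlib import Path


def prepare_node_copy(searchgui_dir, work_root, node_id):
    """Give one node its own SearchGUI copy plus log and temp folders.

    Returns the path of the private copy and the extra command line
    options pointing at the node-specific folders (hypothetical layout).
    """
    node_root = Path(work_root) / f"node_{node_id}"
    install = node_root / Path(searchgui_dir).name
    if not install.exists():
        # A fresh copy, so search engines never share temporary files.
        shutil.copytree(searchgui_dir, install)

    log_dir = node_root / "log"
    temp_dir = node_root / "temp"
    log_dir.mkdir(parents=True, exist_ok=True)
    temp_dir.mkdir(parents=True, exist_ok=True)

    # Option names as mentioned in this thread; verify them in the wiki.
    extra_options = ["-log", str(log_dir), "-temp_folder", str(temp_dir)]
    return install, extra_options
```

The returned options would be appended to whatever command line each node runs, so that no two nodes write logs or temporary files to the same location.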

Hope this helps!

Marc

CarlaCristinaUranga commented 8 years ago

Hi Marc,

Thank you so much for your attention. I got it to work on the server, a computer expert showed me how to call up java in SBATCH mode, so again, it is my computer programming ignorance that was the problem. The SearchGUI graphical interface PEP function mysteriously stopped getting stuck... it's as if the computer is learning how to run your code. Is this what they call "machine learning"? If so, it is quite amazing.

So my issue now is how to report my results. I found some novel proteins in fungi homologous to human structural proteins, which is a very, very novel concept for fungi, and I am excited to try to publish. However, different peptides were assigned by SearchGUI with high confidence to the same accession number three times, but not considered "valid". In other words, the same protein accession number was detected three times with 94-100% confidence, but not considered valid individually. I am wondering if it would be possible to have the algorithm call it "valid" by putting all the significant peptides together, since each on its own is a valid peptide identified with many, many PSMs.

Technically, the two unique peptide rule is being followed, but there is something about the algorithm that is not putting the results together in one, valid protein.

This search was run on MSGFplus, using an all-human Uniprot database. This same protein was identified with MASCOT, so it is worth trying to publish. However, philosophically speaking, I am wondering about why the algorithm reported the same protein three times instead of once.

If your team is interested in fungal proteomics, and would like to collaborate with bioinformatics issues like this, I wouldn't mind putting you on as a co-author. I am mostly pondering the significance of finding novel proteins, but obviously the bioinformatics part is very important, and fungal proteomics is stuck in the 2D SDS-PAGE era. A publication like this would help this important field grow.

Again, thank you and this computer learning is never-ending and wonderful.

Sincerely,

Carla Uranga

mvaudel commented 8 years ago

Hi Carla,

Good to hear that the search went through! I am afraid SearchGUI would not fall in the machine learning category, but having it complete its job is enjoyable.

Finding new proteins is a big challenge and you should be careful with this. If you have been searching your fungi data against human only, please be very cautious: the algorithm might map spectra from fungi peptides to human sequences by mistake. In general, we recommend searching fungi+human together to keep a fair competition between human and fungi sequences. There have been similar problems in the past when people tried to identify fungi in bees. I recommend reading the following publication: https://www.ncbi.nlm.nih.gov/pubmed/21695130

Did you try to load your SearchGUI results in PeptideShaker? This should allow you to inspect the results in detail. If yes, can you export the project as a zip file (from the user interface, see the Export menu) and share it with us? Note that you can add Mascot results as well.

Good luck with your data!

Marc

CarlaCristinaUranga commented 8 years ago

Hi Dr. Vaudel!

Thank you for this reference. I am definitely using a 1% FDR and a variety of different databases, including all-UniProt, all-eukaryote and all-fungi databases (all from the UniProt reviewed database) and an all-Botryosphaeriaceae database from NCBI. I started a Dropbox folder and added you so you may access the current "test" .mgf file. Because of memory issues on my laptop, I obtained my current results by splitting the files. I don't know whether this may have an effect on the PeptideShaker results; I did add them together, of course.
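For readers facing the same memory limits, splitting an .mgf file as described above can be sketched in a few lines of Python. This is only an illustration under the assumption that every spectrum ends with an `END IONS` line; the function name and the output file naming are hypothetical.

```python
from pathlib import Path


def split_mgf(mgf_path, spectra_per_file, out_dir):
    """Split an MGF file into parts holding at most spectra_per_file spectra."""
    mgf_path = Path(mgf_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    parts = []    # paths of the part files written so far
    buffer = []   # lines waiting to be written
    n_spectra = 0  # complete spectra currently in the buffer

    def flush():
        nonlocal buffer, n_spectra
        if n_spectra > 0:
            part = out_dir / f"{mgf_path.stem}_part{len(parts) + 1}.mgf"
            part.write_text("".join(buffer))
            parts.append(part)
        # Lines that do not belong to a complete spectrum are dropped.
        buffer = []
        n_spectra = 0

    with mgf_path.open() as handle:
        for line in handle:
            buffer.append(line)
            if line.strip() == "END IONS":
                n_spectra += 1
                if n_spectra == spectra_per_file:
                    flush()
    flush()
    return parts
```

Each part can then be searched separately and the results loaded together, as described in the comment above.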

When zeroing in on a certain family of proteins, what do you think of a database that includes all proteins of that family, from every species thus far sequenced? Would this aid in confirming the presence of a variant of some sort? I look forward to hearing from you.

Sincerely,

Carla

mvaudel commented 8 years ago

Hi Carla,

Many thanks for sharing the files. I will give them a look when I get time. I am not an expert in metaproteomics, so I am not sure what the best method is to narrow down to a specific set of species. Maybe you will find more competent people at qa.proteomics-academy.org? It is a Q&A site on proteomics that some colleagues recently started.

In any case, I would recommend always including contaminants in your database in order to avoid the incorrect identification of keratin/trypsin spectra. For this we generally use the cRAP sequences http://www.thegpm.org/crap/.
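Combining several sequence sets (for example fungi, human and contaminants) into one search database amounts to concatenating FASTA files. The Python sketch below shows one way to do this while skipping entries whose identifier was already written; the function name and file names are hypothetical, and taking the first header token as the identifier is an assumption.

```python
def merge_fasta(input_paths, output_path):
    """Concatenate FASTA files, keeping the first copy of each identifier.

    Returns the number of sequences written.
    """
    seen = set()
    written = 0
    with open(output_path, "w") as out:
        for path in input_paths:
            keep = False  # whether the current entry is being copied
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        tokens = line[1:].split()
                        identifier = tokens[0] if tokens else line.strip()
                        keep = identifier not in seen
                        if keep:
                            seen.add(identifier)
                            written += 1
                    if keep and line.strip():
                        out.write(line if line.endswith("\n") else line + "\n")
    return written
```

A call such as `merge_fasta(["fungi.fasta", "human.fasta", "crap.fasta"], "combined.fasta")` would produce a single file in which all three sets compete for the same spectra.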

Hope this helps,

Marc

hbarsnes commented 7 years ago

(As the original question has been answered, we will now close this issue. Feel free to open a new one if you have further questions though.)

CarlaCristinaUranga commented 7 years ago

Hi Marc,

No worries, I finally got it to work with the entire UniProt database, so I am sending you the default protein report from this awesome program. I think I could tell a VERY good story with the validated proteins, and would like to include some of the "doubtful" proteins because they are so interesting. However, they are identified with only one unique peptide, although many have 4 or more PSMs. In your expert opinion, should I stick to reporting only the validated proteins, or is it OK to discuss the "doubtful" proteins as potential annotation targets?

Although I did use Peaks 7 and 8 (30-day free trials) to identify more because of the Spider functionality (it greatly speeds up PTM identification), I like the confidence SearchGUI adds to the identifications, which allows one to focus on the validated proteins in terms of biological significance. However, the "doubtful" ones, which are also identified with great confidence, lack another peptide to make them "valid", am I correct? In any case, thank you so much. I am going to try to publish this little metaproteomic study focusing on the results from SearchGUI, due to the great user-friendly pipeline it establishes.

Best wishes,

Carla Uranga

mvaudel commented 7 years ago

Hi Carla and thanks for the kind words.

The validation in PeptideShaker makes three categories: not validated (in red in the interface), validated and confident (in green), and validated and doubtful (in yellow). Validation is set using statistical thresholds, e.g. 1% FDR, and the doubtful/confident distinction is made using quality filters. In your results, you can consider all validated hits (green + yellow) as identified. However, you need to be careful when discussing the yellow ones: they are more likely than the green ones to be false positives. If you have some very interesting results in yellow, it is promising, but further experiments are required to confidently assess their presence. In general we don't discuss the hits in red, but that does not mean that all of them should be discarded. You can inspect the false negative rate in the validation tab to get an idea of how many correct identifications are in red.
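The relation between a statistical threshold and the three categories can be illustrated with a generic target-decoy calculation. The Python sketch below is not PeptideShaker's implementation: it assumes that higher scores are better, estimates the FDR at each score threshold as decoys divided by targets, and takes a simple boolean to stand in for the quality filters. All names are hypothetical.

```python
def q_values(hits):
    """Compute q-values for (score, is_decoy) pairs, in input order.

    FDR at a threshold = decoys / targets scoring at or above it; the
    q-value is the lowest FDR at which a hit would still be accepted.
    """
    order = sorted(range(len(hits)), key=lambda i: hits[i][0], reverse=True)
    fdrs = []
    targets = decoys = 0
    for i in order:
        if hits[i][1]:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    q = [0.0] * len(hits)
    running_min = float("inf")
    # Walk from the worst score upwards, keeping the smallest FDR seen.
    for rank in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdrs[rank])
        q[order[rank]] = running_min
    return q


def classify(q_value, passes_quality_filters, threshold=0.01):
    """Map a hit to one of the three categories described above."""
    if q_value > threshold:
        return "not validated"  # red
    return "confident" if passes_quality_filters else "doubtful"  # green / yellow
```

With a 1% threshold, a hit with a q-value of 0.005 that fails a quality filter would be labelled "doubtful": statistically validated, but flagged for closer inspection.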

Hope this helps,

Marc

compomics / peptide-shaker

Using PEP statistical analysis, and how to see q- and e-values #212