fraction of peptides mapped to Gigaton

shellywanamaker commented 4 years ago

@emmats I just got reviewer comments back for the oyster proteomics paper and a reviewer asked "what fraction of peptides could be mapped to Gigaton in the different oyster seed samples." I'm not exactly sure what numbers I need to calculate this.

I found the number of spectra loaded by Comet listed for each sample in the Comet log file (http://owl.fish.washington.edu/halfshell/working-directory/17-02-14b/output.error.comet.log).

And I know that the Abacus output file contains the columns "NUMSPECSTOT", "NUMSPECSUNIQ", "NUMPEPSTOT", and "NUMPEPSUNIQ", and I could get the sum of each of those columns for each sample.

How would you recommend calculating this?

emmats commented 4 years ago

Is that the entirety of the comment/question? The Abacus output file will be the peptides "mapped" to Gigaton. I assume they want to know, out of all the spectra detected on the mass spec, how many actually could be identified with your database (you can include identified as a contaminant)? I think you can get this information from your peptide prophet files. You could also get total # spectra by importing your .raw files into Raw Meat.

Under filtering options, make sure that your minimum probability is set to what you used in your Abacus parameters file. In the example attached, I'm pretty confident that the mass spec measured 33757 spectra. With the probability cut-off I used, 4968 were assigned to a protein in the database. That doesn't seem great, but remember that the mass spec produces spectra for pretty much anything that goes into it (this can include compounds it picks up from the personal care products that whoever is running it is wearing). Of course there will be gigas peptides that aren't identified by your database for whatever reason. If you generate the percentages and want me to check some comparable projects to see how similar our numbers are, I'd be happy to do that.

shellywanamaker commented 4 years ago

That is the entirety of their comment. That would be great if you could compare the percentages I'll generate with those of your similar projects.

Could it be just as simple as taking the sum of the 'NUMSPECSTOT' column in the Abacus output for each sample and dividing that by the number from the loaded spectra section in the Comet log file (example below)?

Excerpt from Comet log file:

Search start:  02/12/2017, 12:58:44 PM
 - Input file: 20161205_Sample_11A.mzXML
   - Load spectra: 94326
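
For anyone repeating this, here is a minimal Python sketch of pulling the per-file "Load spectra" counts out of a log shaped like the excerpt above. The function name `parse_load_spectra` is hypothetical, and the sketch assumes each "Input file" line is followed by exactly one "Load spectra" line, as in the excerpt; a real log may contain other lines or layouts that need extra handling.

```python
import re


def parse_load_spectra(log_text):
    """Map each input file name to its 'Load spectra' count.

    Assumes the layout shown in the excerpt: an 'Input file:' line
    followed by a single 'Load spectra:' line for that file.
    """
    counts = {}
    current_file = None
    for line in log_text.splitlines():
        file_match = re.search(r"Input file:\s*(\S+)", line)
        if file_match:
            current_file = file_match.group(1)
            continue
        count_match = re.search(r"Load spectra:\s*(\d+)", line)
        if count_match and current_file is not None:
            counts[current_file] = int(count_match.group(1))
    return counts


# The excerpt from this thread, used as sample input.
excerpt = """Search start:  02/12/2017, 12:58:44 PM
 - Input file: 20161205_Sample_11A.mzXML
   - Load spectra: 94326
"""

loaded = parse_load_spectra(excerpt)
print(loaded)
```

The resulting dictionary can then serve as the per-sample denominator.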

emmats commented 4 years ago

Do you have documentation of what "Load spectra" is? It could be what we are looking for, but I really don't know. If you are unsure, email Jimmy Eng to ask (engj@uw.edu). If it does represent all the spectra detected by the mass spec, then your method should be good. Note, though, that NUMSPECSTOT will include redundancies, since peptides are shared among proteins. I think I would use the sum of the ALL_NUMPEPSUNIQ column instead.
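
The redundancy concern can be illustrated with a toy Python sketch. The spectrum IDs, protein names, and assignments below are entirely hypothetical, and the sketch is not a description of how Abacus actually computes its columns; it only shows why summing per-protein totals can exceed the number of distinct spectra when a peptide is shared.

```python
# Hypothetical toy assignments: spectrum ID -> proteins its peptide matches.
spectrum_to_proteins = {
    "scan_1": ["protA"],
    "scan_2": ["protA", "protB"],  # peptide shared by two proteins
    "scan_3": ["protB"],
}

# Build per-protein totals, analogous to a per-protein "total spectra" column.
per_protein_total = {}
for proteins in spectrum_to_proteins.values():
    for protein in proteins:
        per_protein_total[protein] = per_protein_total.get(protein, 0) + 1

# Summing the column counts the shared spectrum once per protein.
summed_totals = sum(per_protein_total.values())
distinct_spectra = len(spectrum_to_proteins)
print(summed_totals, distinct_spectra)
```

Here the summed per-protein column is larger than the number of distinct spectra because scan_2 is counted under both proteins.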

shellywanamaker commented 4 years ago

Here's what Jimmy said: "Shelly, Yes to your question. The "Load spectra" number refers to the number of tandem mass spectra that were actually read into memory and analyzed. This number will likely differ from the number of actual tandem mass spectra in the input file due to various filters (such as mass range, minimum number of peaks, etc.)."

I'm not sure it makes sense to do (sum NUMPEPSUNIQ)/total spectra, because total spectra would also include redundancies, right?

I think the reviewer may be satisfied if the manuscript reports the % of spectra aligned and the fraction of Gigaton proteins with spectral alignments (proteins with alignments/total Gigaton proteins). What do you think?

emmats commented 4 years ago

I don't think the total spectra will have redundancies. I think it will just be the number of raw spectra before they are assigned to a peptide/protein. The reviewer really should have asked a more specific question :) I say, go with what you think is the best answer. I honestly don't think it is a very important statistic.

shellywanamaker commented 4 years ago

Ah! I get what you're saying about the redundancies. I'll go with the (sum NUMPEPSUNIQ)/total spectra, I think that should be fine. I'll post the fractions once I calculate them. Thanks for your help with this Emma!

shellywanamaker commented 4 years ago

@emmats so the average fraction of unique peptides/total spectra is 0.255 +/- 0.023 (s.d.) across the 26 samples (13 biological samples with technical duplicates).

I calculated it as (sum of NUMPEPSUNIQ)/(total spectra from the Comet log file). R code here: https://github.com/shellytrigg/paper-OysterSeed-TimeXTemp/blob/master/Analyses/Proteomics_Data_Processing/Calculate_Fraction_Peptides.Rmd
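
As a rough illustration of the calculation described above (the linked R code is the actual analysis), here is a small Python sketch. The function name `fraction_mapped`, the sample names, and all numbers are hypothetical placeholders, not the study's data. `statistics.stdev` uses the n-1 sample formula; whether that matches the linked analysis is an assumption.

```python
import statistics


def fraction_mapped(mapped_sums, loaded_spectra):
    """Per-sample fraction: summed Abacus column / Comet 'Load spectra'."""
    return {s: mapped_sums[s] / loaded_spectra[s] for s in mapped_sums}


# Hypothetical values for illustration only; not the study's data.
mapped_sums = {"sample_A": 250, "sample_B": 300, "sample_C": 200}
loaded_spectra = {"sample_A": 1000, "sample_B": 1000, "sample_C": 1000}

fractions = fraction_mapped(mapped_sums, loaded_spectra)
mean_fraction = statistics.mean(fractions.values())
sd_fraction = statistics.stdev(fractions.values())  # sample s.d. (n-1)
print(f"{mean_fraction:.3f} +/- {sd_fraction:.3f} (s.d.)")
```

Swapping in the per-sample column sums and the per-sample "Load spectra" counts would give the same kind of mean and s.d. summary reported above.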

Is this similar to what you have seen for similar samples?

emmats commented 4 years ago

Please remind me which mass spec was used.

shellywanamaker commented 4 years ago

Orbitrap Fusion Lumos Mass Spectrometer

emmats commented 4 years ago

Where did you get that comet log file? I can't find anything quite like it.

shellywanamaker commented 4 years ago

The comet log file (output.error.comet.log) is from step 2 in the DDA-data-Analyses pipeline

emmats commented 4 years ago

I have done many many Comet searches and never generated a file called that. In my qsublogs folder I have output and error files for each individual comet search, but nothing that looks like your file.

kubu4 commented 4 years ago

> I have done many many Comet searches and never generated a file called that. In my qsublogs folder I have output and error files for each individual comet search, but nothing that looks like your file.

It's going to be dependent upon how you executed your command. In the log file mentioned above, the command from step 2 in the DDA-data-Analyses pipeline is sending stderr and stdout to a log file.

Without that, I'm fairly certain the info is normally printed to the screen and is not captured.
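
To make the redirection idea concrete, here is a hedged Python sketch of sending both stdout and stderr of a command to one log file. The command below is a stand-in that simply prints to both streams; it is not the Comet command from the pipeline, and the log file name is made up for the example.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Stand-in command that writes to both streams; it is NOT the Comet command.
command = [
    sys.executable,
    "-c",
    "import sys; print('to stdout'); print('to stderr', file=sys.stderr)",
]

# Hypothetical log file name inside a fresh temporary directory.
log_path = Path(tempfile.mkdtemp()) / "output.error.example.log"

with open(log_path, "w") as log_file:
    # stderr=subprocess.STDOUT merges stderr into the same file as stdout.
    subprocess.run(command, stdout=log_file, stderr=subprocess.STDOUT, check=True)

log_text = log_path.read_text()
print(log_text)
```

Without the redirection arguments, both messages would go to the terminal and nothing would be saved.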

emmats commented 4 years ago

Thanks @kubu4. I only ever run runCometQ without additional commands. I'll do some calculations on a handful of files for the comparison.

emmats commented 4 years ago

For a geoduck study on the Lumos it looks like I got similar results of ~25% of spectra mapping to the transcriptome.

shellywanamaker commented 4 years ago

awesome! thanks for checking Emma!

shellywanamaker commented 4 years ago

@emmats sorry to bug you again, but can you make sense of this statement I'm going to add to the manuscript? Or can it be made clearer somehow?

"Each sample showed on average 32.22 +/- 2.87% of spectra (30753 +/- 2624 out of 95,506 +/- 2,685 acquired spectra that passed Comet alignment parameter thresholds), corresponding to 24,355 +/- 2,027 unique peptides, could be mapped uniquely to the Gigaton database, with all samples collectively covering 19.6% of Gigaton (7978/40637 proteins)."

NOTE: I ended up using NUMSPECSUNIQ instead of NUMPEPSUNIQ to calculate % spectra mapped and then calculated the average total number of unique peptides the mapped spectra correspond to
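
A quick arithmetic check of the figures in the draft statement, as a Python sketch. It uses only the averages and counts quoted above. Note that dividing the average mapped spectra by the average acquired spectra only approximates the reported 32.22%, which is an average of per-sample percentages; the variable names are just labels for this example.

```python
# Values quoted in the draft manuscript statement above.
mean_mapped_spectra = 30753
mean_acquired_spectra = 95506
proteins_detected = 7978
gigaton_proteins = 40637

# Ratio of averages; an approximation of the mean per-sample percentage.
pct_spectra = 100 * mean_mapped_spectra / mean_acquired_spectra

# Share of Gigaton proteins detected across all samples.
pct_coverage = 100 * proteins_detected / gigaton_proteins

print(f"{pct_spectra:.1f}% of spectra, {pct_coverage:.1f}% of Gigaton")
```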

emmats commented 4 years ago

It's a little awkward. Maybe try this:

The Comet search resulted in an average of 24,355 +/- 2,027 unique peptides per MS experiment, corresponding to 32.22 +/- 2.87% of the total acquired spectra, with all samples collectively covering 19.6% of Gigaton (7,978 out of 40,637 predicted proteins).

shellywanamaker commented 4 years ago

thanks Emma!

RobertsLab / resources

fraction of peptides mapped to Gigaton #1005