compomics / searchgui

Highly adaptable common interface for proteomics search and de novo engines
http://compomics.github.io/projects/searchgui.html

Workflow for large database search #308

Closed umutcakir closed 3 years ago

umutcakir commented 3 years ago

Dear community, I need to perform a search with a large database that contains about 1 million sequences. Is it appropriate to use SearchGUI and PeptideShaker for a database of this size? Because the number of identifications decreases as the database size inflates, I developed the following strategy: I split the database into 10 parts of 100,000 sequences each, and I run SearchGUI and PeptideShaker on each part (the first searches). The proteins identified in these searches are then concatenated into a new database, which is used for a second search. The second search is needed because the same spectrum may be matched to different proteins in different first searches. For instance, a spectrum may be matched to protein X in the 4th part and to protein Y in the 7th part, and I think a second search is needed to resolve such multiple matches. Because the database in the second search is smaller (<100,000 sequences), the number of identified proteins should be maximized. Note that the same mgf files and parameters are used in every search (10 first searches and 1 second search).
I searched the literature, but I could not find a comprehensive workflow for large database searches. Do you think my strategy is appropriate? I look forward to hearing your opinion and feedback.
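
For concreteness, the splitting step described above could be sketched in Python roughly as follows. This is only an illustration of the idea, not part of SearchGUI or PeptideShaker; the helper names `read_fasta` and `split_records` are made up for this example, and the input is assumed to be plain FASTA text.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(text: str) -> List[Record]:
    """Parse plain FASTA text into (header, sequence) records."""
    records: List[Record] = []
    header = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_records(records: List[Record], n_parts: int) -> List[List[Record]]:
    """Split records into at most n_parts consecutive, near-equal parts."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    if not records:
        return []
    size = -(-len(records) // n_parts)  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


if __name__ == "__main__":
    demo = ">a\nMK\nLV\n>b\nAAA\n>c\nGG\n"
    for i, part in enumerate(split_records(read_fasta(demo), 2), start=1):
        print(f"part {i}: {[header for header, _ in part]}")
```

With 1,000,000 records and `n_parts=10`, each part would hold 100,000 sequences, matching the setup described above.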

lnnrt commented 3 years ago

Hello, the problem you are facing is a common one, and the strategy you describe is widely used to boost the identification rate. However, it is not a reliable strategy, because you will likely lose control of your FDR. Two things can go wrong. First, you might eliminate close but not exact matches, which can affect the estimation of the e-value for each PSM. Depending on the search engine, this estimate can be based on the score distribution of all matches for that spectrum across all candidate peptides; by eliminating many of the incorrect (but potentially still relatively high-scoring) peptides in your second pass, the significance of the top peptide will be artificially inflated. Second, you may be eliminating decoy sequences much more effectively than forward (target) sequences, which distorts the decoy score distribution, again with the very real risk of artificially (and fundamentally incorrectly) increasing the confidence of your identifications.
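
To make the decoy point concrete, here is a minimal Python sketch. It assumes one common target-decoy estimator (decoy PSMs divided by target PSMs above a score threshold); the scores are invented toy numbers, and the function name `estimated_fdr` is hypothetical rather than anything from SearchGUI or PeptideShaker.

```python
from typing import List, Tuple

PSM = Tuple[float, bool]  # (score, is_decoy)


def estimated_fdr(psms: List[PSM], threshold: float) -> float:
    """Estimate FDR as decoy hits / target hits among PSMs at or above threshold."""
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


# Toy PSMs from a search against the full target + decoy database.
full_search: List[PSM] = [
    (90, False), (85, False), (80, True), (75, False),
    (70, False), (65, True), (60, False),
]

# Toy second pass: the reduced database kept every target hit but lost
# one of the two decoys, because decoys were rarely "identified" in pass one.
second_pass: List[PSM] = [psm for psm in full_search if psm != (80, True)]

if __name__ == "__main__":
    print("full search FDR estimate:", estimated_fdr(full_search, 60))
    print("second pass FDR estimate:", estimated_fdr(second_pass, 60))
```

The target PSMs are identical in both lists, yet the estimate drops in the second pass purely because a decoy was removed, which is the kind of artificial confidence gain described above.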

Instead, you could consider the Search All, Assess Subset approach (documented here: https://pubmed.ncbi.nlm.nih.gov/28661493/; the online tool is here: http://iomics.ugent.be/saas/).

Hope this helps!

Cheers,

lnnrt.

umutcakir commented 3 years ago

Thank you for your very informative reply. It helps me a lot.