Nesvilab / FragPipe

A cross-platform proteomics data analysis suite
http://fragpipe.nesvilab.org
Other
205 stars 38 forks source link

Number of TMT-integrator-summarized peptides is only 10% of identifications from proteogenomics database #553

Closed MiguelCos closed 1 year ago

MiguelCos commented 2 years ago

Dear FragPipe team,

I am working with a phenotype-specific proteogenomic database created using a customized Galaxy pipeline for fetching SRA RNAseq data, genome mapping, variant calling, and database creation with CustomProDB.

I modified the headers of the custom DB for it to be consistent with the format required by Philosopher (as described in: https://github.com/Nesvilab/philosopher/wiki/How-to-Prepare-a-Protein-Database#header-formatting).

This is a TMT dataset, and I am having the problem that the number of summarized peptides present in the TMT-report represent only ~10% of the peptide identifications.

I assume this is a problem related to header formatting because when using an EBI canonical DB, the amount of summarized peptides in the TMT report is consistent with the identifications but I am having trouble understanding this issue because I have modified the headers to what I believe is following Philosopher requirements.

Suggested header format for Philosopher: >sp|P02489|CRYAA_HUMAN Alpha-crystallin A chain OS=Homo sapiens OX=9606 GN=CRYAA PE=1 SV=2

Header of one of the SAAV-containing sequences in my fasta: >sp|ENSP00000339521_G200V|ENST00000345264 ENSG00000148484 OS=Homo sapiens OX=9606 GN=RSU1 PE=1 SV=2

I have tried different modifications for the position of the different identifiers with this similar format, and it does have an impact on the number of peptides in the tmt-report, but it is always less than ~10K out of 48K identifications.

Do you have a suggestion on how to approach this issue with custom databases?

I am sharing here with you the output of one of my searches and including the custom fasta that I used for it (https://1drv.ms/u/s!ArJgSkDejwROp6tjIxHxgWpbJnXOQg?e=9r2XyI).

As always, I would tremendously appreciate your feedback on this issue.

Let me know if you need any additional information.

Best regards, Miguel

prvst commented 2 years ago

@MiguelCos Even before digging into the issue, why would you want to use the SRA for the search, and what kind of precautions you are taking to avoid the redundancy?

I did a brief look at your database and I found this:

>rev_sp|ENST00000556658|ENSG00000182175 ENSP00000456290 OS=Homo sapiens OX=9606 GN=RGMA PE=1 SV=2
CFVPLLALLPVLAGLLPRPALPLGAAARGPLDRTREYLHLKDKNSHLMKVDELAYYAALTFNVDGTTLLDFVCAQYYLDEVPLKEKCKAVATEYPFTEPATPAPSAAALRRAGTGEANTHFAQFDIQQNLPCGRLCLYLGQSDWDEVANVVEEPMRVAFTLYRGVQRVVITTGIYKAQIEVHQGSVKETIKLSNAGHKDGGNKSGDVFAAPLEDMEAQYVKQDVCEQFNKFIITLKSTATAASGPLVPTNTVQVNLYNNDILPWAGQVKCTQFRDTFTRLHPDGFLGCHTYNPTASHKHFSKEYHCIEPSDSREQSDGAPPLTRLRPQSTPGDKSCNHQSM
>rev_sp|ENST00000556658|ENSG00000182175 ENSP00000456290_A322V OS=Homo sapiens OX=9606 GN=RGMA PE=1 SV=2
CFVPLLALLPVLAGLLPRPVLPLGAAARGPLDRTREYLHLKDKNSHLMKVDELAYYAALTFNVDGTTLLDFVCAQYYLDEVPLKEKCKAVATEYPFTEPATPAPSAAALRRAGTGEANTHFAQFDIQQNLPCGRLCLYLGQSDWDEVANVVEEPMRVAFTLYRGVQRVVITTGIYKAQIEVHQGSVKETIKLSNAGHKDGGNKSGDVFAAPLEDMEAQYVKQDVCEQFNKFIITLKSTATAASGPLVPTNTVQVNLYNNDILPWAGQVKCTQFRDTFTRLHPDGFLGCHTYNPTASHKHFSKEYHCIEPSDSREQSDGAPPLTRLRPQSTPGDKSCNHQSM

The way how the prophets parse the protein headers is by splitting the entire header at the first empty space, and using the first part as an ID they call shortheader. The example above contains two different sequences, however, they will be represented by the same ID:

rev_sp|ENST00000556658|ENSG00000182175
rev_sp|ENST00000556658|ENSG00000182175

I suggest you fix this before moving on

MiguelCos commented 2 years ago

Hello @prvst , thanks for your observations.

We are using SRA sequencing data from a specific cancer tissue (glioblastoma) to generate a tissue-specific proteogenomic database including SAAVs and indels to explore the identification of variants and potential phenotype-specific variants mutations from our MS proteomics data. Would you use a different approach than using SRA RNAseq data for this?

I am removing sequences containing duplicated headers from the original fasta I get from CustomProDB. CustomProDB generates an individual header for each 'variant' sequence by appending the canonical ENSEMBL protein ID with the variant location. The 'v2' fasta in the shared fasta folder should contain non-redundant headers but it is true that the 'v3' fasta that you checked would contain redundant headers according to your observation. The search that I shared was performed with the 'v2' fasta, which does not contain redundant sequences, as far as I can see. I'm very sorry for making it confusing by keeping both fastas in there.

The duplicated sequences/headers that you observed in the 'v3' fasta correspond to the following sequences in the 'v2' fasta used for the search:

>sp|ENSP00000456290_A322V|ENST00000556658 ENSG00000182175 OS=Homo sapiens OX=9606 GN=RGMA PE=1 SV=2
>sp|ENSP00000456290|ENST00000556658 ENSG00000182175 OS=Homo sapiens OX=9606 GN=RGMA PE=1 SV=2

Which are indeed different sequences containing or not a SAAV (https://www.ebi.ac.uk/Tools/services/web/toolresult.ebi?jobId=clustalo-I20211203-222910-0893-36354387-p1m).

I am eliminating the 'v3' fasta from the shared folder to avoid confusion.

prvst commented 2 years ago

I'm asking to be certain you are aware of the complexity in analyzing and interpreting database search results when you include sequences that are either repeated, identical or subsumed. I still see repeated headers in the ProteinProphet output, my guess is that one of the prophets are parsing your proteins in different ways, since you have several patterns in your headers, see the case below, for example:

image

Note how there are pairs of proteins with the same name.

Miguel, I suggest you format your database following the rules suggested here. By applying these rules to all your sequences, you'll be sure that no protein will be parsed in a different way. If you want to follow a simpler approach, just use on of these:

>Name Description >Name This_is_a_longer_Description

Avoid using special characters like |, :, ;

MiguelCos commented 2 years ago

Hello Felipe @prvst ,

Many thanks again for taking a look, and for your comments and suggestions. They are really appreciated.

I will restructure the headers in a way that I don't use special characters for the protein names to avoid 'confusing' the Prophets.

I realize the complexity of analyzing search results from custom databases and I do appreciate any constructive criticism because we are indeed starting to venture more frequently into proteogenomics for our (increasingly bigger) cancer cohorts and we want to do it as properly as possible.

Our current approach to process these results is to map the identified peptides into their location within the protein sequence and check if the peptides identified for 'variant' sequences are actually catching the variant region of the protein. Right now I am only confident of this approach for single amino acid variants because their location within the protein sequence is properly annotated by the DB creation tool.

So, although the DB creation tool also includes indels and probably splice variants, we are still thinking about what would be the best way to confirm their presence at peptide/protein level so we are not making any assumptions about their identification so far.

I would appreciate any comments on the matter of processing these search results as well, but regarding this GitHub issue, I would try to simply assign non-redundant and consistent IDs to the headers and create an annotation table for data analysis of SAAVs to move forward with our current approach.

Many thanks again and best wishes, Miguel

prvst commented 2 years ago

Meta proteomics, and variant searches, are probably the most challenging projects you can work on using standard DDA shotgun proteomics. It's imperative to know how to deal with the sequence similarity, and possible homology.

I've seen people using this approach you mentioned before; you can create a map between the real headers, and a sequential unique ID made by you. You search against the custom IDs, and after getting your results, you just convert them back using the map.

Good luck with your project, and please don't hesitate to ask for help again.

MiguelCos commented 2 years ago

Hello Felipe @prvst and all,

I tested annotating the headers with the simple approach >Name Description and now TMT integrator generates empty reports.

I avoided using special characters and I only appended decoys but no contaminants to have consistent headers.

I am assuming, due to this observation, that the TMT integrator parser is not able to use these kinds of headers to perform the summarization. The combined_peptides.tsv is roughly the same (in terms of ID number) for all the tests that I have performed with different headers.

I will now be corroborating that I am properly creating 'complete' headers as described here but I wanted to report on the behavior of TMT integrator when using generic/simple headers.

If you would like to take a look at the new fasta and the search results, they are at the same link shared before (https://1drv.ms/u/s!ArJgSkDejwROp6tjIxHxgWpbJnXOQg?e=9r2XyI) now labelled as v10.

I am looking forward to know your opinion.

Best, Miguel

prvst commented 2 years ago

@huiyinc, Would you be able to take a look?

MiguelCos commented 2 years ago

Hello,

After carefully testing several approaches to modify the headers, I noticed that the problem of the small percentage of summarized peptides by TMT integrator wasn't the fault of bad header formatting. The prophets and TMT integrator don't have a problem with my first customized headers; although TMT integrator indeed wasn't able to parse the simple headers.

The problem is that I was setting TMT integrator to summarize only unique peptides, and there is very likely a high amount of repetition in the database, therefore 90% of the identified peptides were kicked out.

Allowing TMT integrator to use razor peptides solved the issue.

I was looking at the problem from the wrong angle. Thankfully @prvst helped me think about how the redundancy in the database can be affecting the results.

To avoid these kinds of issues, we have been thinking about ways on how to avoid having such redundant databases after creating them from RNAseq data... For example: shrinking all the identified variants from a particular protein into a single sequence that appends only variant regions of the sequence; therefore having only 1 fasta sequence for the complete 'canonical' protein and 1 fasta sequence for only the variant regions of the same protein... This would avoid having 2 or more nearly identical sequences only varying by 1 or 2 amino acids. I am guessing that this would help avoid this 'non-uniqueness' issue and make the analysis more straightforward.

Have you seen this kind of approach described above before? Is there already a tool that can do that that I am not aware of?

huiyinc commented 2 years ago

Hi Miguel

Regarding the empty TMT reports, the issue is caused by the customized protein header in the fasta file. I assume the header format is: >proteinID gene-name (e.g., >C000010 GRAPL), is it? If yes, I would suggest to change the format to >proteinID| GN=gene-name (use '|' as the separator). @Felipe Leprevost @.***> Will Philosopher identify the gene name using this format? Thanks.

Huiyin

Miguel Cosenza-Contreras @.***> 於 2021年12月9日 週四 下午4:49寫道:

Hello,

After carefully testing several approaches to modify the headers, I noticed that the problem of the small percentage of summarized peptides by TMT integrator wasn't the fault of bad header formatting. The prophets and TMT integrator don't have a problem with my first customized headers; although TMT integrator indeed wasn't able to parse the simple headers.

The problem is that I was setting TMT integrator to summarize only unique peptides, and there is very likely a high amount of repetition in the database, therefore 90% of the identified peptides were kicked out.

Allowing TMT integrator to use razor peptides solved the issue.

I was looking at the problem from the wrong angle. Thankfully @prvst https://github.com/prvst helped me think about how the redundancy in the database can be affecting the results.

To avoid these kinds of issues, we have been thinking about ways on how to avoid having such redundant databases after creating them from RNAseq data... For example: shrinking all the identified variants from a particular protein into a single sequence that appends only variant regions of the sequence; therefore having only 1 fasta sequence for the complete 'canonical' protein and 1 fasta sequence for only the variant regions of the same protein... This would avoid having 2 or more nearly identical sequences only varying by 1 or 2 amino acids. I am guessing that this would help avoid this 'non-uniqueness' issue and make the analysis more straightforward.

Have you seen this kind of approach described above before? Is there already a tool that can do that that I am not aware of?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Nesvilab/FragPipe/issues/553#issuecomment-989637132, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALAWWA4TQKHH4HRXLSGVJILUQBURFANCNFSM5JKLTGRA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

-- Hui-Yin Chang, 張彙音 Assistant Professor Department of Biomedical Sciences and Engineering National Central University, Taiwan

anesvi commented 2 years ago

Hi Huiyin Is it because we require that there is a gene name? I see more people using TMT/integrator with databases that do not have it. We should add a way to generate protein and peptide reports even if no gene name (so only Gene table would be empty) What do you think? Thanks Alexey

Get Outlook for iOShttps://aka.ms/o0ukef


From: chuiyin @.> Sent: Thursday, December 9, 2021 4:44:02 AM To: Nesvilab/FragPipe @.> Cc: Nesvizhskii, Alexey @.>; Assign @.> Subject: Re: [Nesvilab/FragPipe] Number of TMT-integrator-summarized peptides is only 10% of identifications from proteogenomics database (Issue #553)

External Email - Use Caution

Hi Miguel

Regarding the empty TMT reports, the issue is caused by the customized protein header in the fasta file. I assume the header format is: >proteinID gene-name (e.g., >C000010 GRAPL), is it? If yes, I would suggest to change the format to >proteinID| GN=gene-name (use '|' as the separator). @Felipe Leprevost @.***> Will Philosopher identify the gene name using this format? Thanks.

Huiyin

Miguel Cosenza-Contreras @.***> 於 2021年12月9日 週四 下午4:49寫道:

Hello,

After carefully testing several approaches to modify the headers, I noticed that the problem of the small percentage of summarized peptides by TMT integrator wasn't the fault of bad header formatting. The prophets and TMT integrator don't have a problem with my first customized headers; although TMT integrator indeed wasn't able to parse the simple headers.

The problem is that I was setting TMT integrator to summarize only unique peptides, and there is very likely a high amount of repetition in the database, therefore 90% of the identified peptides were kicked out.

Allowing TMT integrator to use razor peptides solved the issue.

I was looking at the problem from the wrong angle. Thankfully @prvst https://github.com/prvst helped me think about how the redundancy in the database can be affecting the results.

To avoid these kinds of issues, we have been thinking about ways on how to avoid having such redundant databases after creating them from RNAseq data... For example: shrinking all the identified variants from a particular protein into a single sequence that appends only variant regions of the sequence; therefore having only 1 fasta sequence for the complete 'canonical' protein and 1 fasta sequence for only the variant regions of the same protein... This would avoid having 2 or more nearly identical sequences only varying by 1 or 2 amino acids. I am guessing that this would help avoid this 'non-uniqueness' issue and make the analysis more straightforward.

Have you seen this kind of approach described above before? Is there already a tool that can do that that I am not aware of?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Nesvilab/FragPipe/issues/553#issuecomment-989637132, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALAWWA4TQKHH4HRXLSGVJILUQBURFANCNFSM5JKLTGRA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

-- Hui-Yin Chang, 張彙音 Assistant Professor Department of Biomedical Sciences and Engineering National Central University, Taiwan

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHubhttps://github.com/Nesvilab/FragPipe/issues/553#issuecomment-989679194, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AIIMM6YFYJNOFGQJCYNVENDUQB26FANCNFSM5JKLTGRA. Triage notifications on the go with GitHub Mobile for iOShttps://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Androidhttps://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.


Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues

MiguelCos commented 2 years ago

Hi @huiyinc ,

many thanks for taking a look. I will be aware of the GN= requirement in the database headers for TMT integrator usage.

Best, Miguel

huiyinc commented 2 years ago

@ Alexey, sure. I can work on it.

@ Miguel, the current version of TMT-Integrator expects gene names, but it retrieves the information from the psm tables, not from the fasta file (so "GN=" is not a requirement usage in TMT-Integrator). It depends on how Philosopher parses gene information from the fasta files and passes it to the psm tables. @.***, can you please help explain the parsing rule in Philosopher?) Also, I just noticed that, in your PSM tables, the protein ID is the full protein header (e.g., C000010 GRAPL) while the protein is the partial protein header (e.g., C000010). I feel that the two columns are swapped, and this is the main reason why TMT-Integrator can't find the protein sequence using the protein ID. @Felipe, can you please take a look at the PSM tables? Thanks.

Huiyin

Miguel Cosenza-Contreras @.***> 於 2021年12月9日 週四 下午6:53寫道:

Hi @huiyinc https://github.com/huiyinc ,

many thanks for taking a look. I will be aware of the GN= requirement in the database headers for TMT integrator usage.

Best, Miguel

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Nesvilab/FragPipe/issues/553#issuecomment-989739377, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALAWWAY6BWYTS6BDP6EP5MDUQCDEJANCNFSM5JKLTGRA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

-- Hui-Yin Chang, 張彙音 Assistant Professor Department of Biomedical Sciences and Engineering National Central University, Taiwan

MiguelCos commented 2 years ago

Many thanks for the explanation @huiyinc, for now, it is nice to know how it works so I can include the proper fasta header formatting into my workflow while having into account the potential problems of non-unique peptides.