Nesvilab / FragPipe

A cross-platform proteomics data analysis suite
http://fragpipe.nesvilab.org

FragPipe and Heterozygosity #1862

Closed: chad-hyer closed this issue 1 week ago

chad-hyer commented 2 weeks ago

Hey, I am attempting to use FragPipe to analyze some proteogenomic data for structural comparisons, and I am running into issues when the individual is a heterozygote with two different sequences for a given protein. Currently, I handle this by including both sequences in my FASTA under suffixed headers. As most of the sequence is the same, the enzyme cut sites are usually very similar, and the peptides differ only slightly where the mutations fall. Some mutations create or remove enzyme cut sites, and I assume MSFragger handles that without issue. The issue comes with assigning peptides to proteins. I am only really interested in peptide-level quantification for my use case, as I am trying to probe structural differences caused by mutations.

From my first experiments with different sequences, FragPipe assigns all peptides to the first variant and then lists the second as a secondary protein in the mapped protein column. This is perfectly fine, except there seems to be issues with identifying and quantifying peptides from the second variant with the alternative header. I've noticed that my quant has also been less useful to me since introducing the second sequences (attached csvs). I haven't tested this much yet because I assume part of my problem is just that there is a better way of handling these separate variants. Do you have any suggestions on how to handle heterozygosity in FragPipe?

Thanks, Chad

heterozygous_sequences_combined_modified_peptide.tsv.csv
no_heterozygous_sequences_combined_modified_peptide.tsv.csv

fcyu commented 2 weeks ago

Hi Chad,

I am not sure if I fully understand your question. But if you append the suffix to the UniProt format: sp|P19823|ITIH2_HUMAN-Var1 vs sp|P19823|ITIH2_HUMAN, the protein ID (e.g., P19823) is still the same, which might cause problems.

Could you append the suffix to both places, for example sp|P19823-Var1|ITIH2_HUMAN-Var1, and try again? Or could you elaborate a little more on this part: "there seems to be issues with identifying and quantifying peptides from the second variant with the alternative header. I've noticed that my quant has also been less useful to me since introducing the second sequences (attached csvs)."
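A minimal sketch of that header change, assuming UniProt-style headers of the form `>db|ACCESSION|ENTRY_NAME description`. The helper names (`add_variant_suffix`, `suffix_fasta`) and the placeholder sequence are made up for illustration; this is not a FragPipe utility.

```python
import re

# UniProt-style header: >db|ACCESSION|ENTRY_NAME optional description
HEADER_RE = re.compile(r"^>(\w+)\|([^|\s]+)\|(\S+)(.*)$")


def add_variant_suffix(header, suffix):
    """Append `suffix` to both the accession and the entry name of a header."""
    match = HEADER_RE.match(header)
    if match is None:
        return header  # not UniProt-style: leave the header untouched
    db, accession, entry_name, rest = match.groups()
    return f">{db}|{accession}{suffix}|{entry_name}{suffix}{rest}"


def suffix_fasta(lines, suffix):
    """Yield FASTA lines with headers suffixed; sequence lines pass through."""
    for line in lines:
        line = line.rstrip("\n")
        yield add_variant_suffix(line, suffix) if line.startswith(">") else line


# Hypothetical input: the sequence line is a placeholder, not the real protein.
example = [
    ">sp|P19823|ITIH2_HUMAN placeholder description",
    "PLACEHOLDERSEQ",
]
variant_lines = list(suffix_fasta(example, "-Var1"))
print("\n".join(variant_lines))
```

Suffixing both fields keeps the accession and the entry name unique per variant, so the two entries no longer share the protein ID P19823.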

Best,

Fengchao

chad-hyer commented 2 weeks ago

Thanks for getting back to me! I tried appending the suffix in both places, and it still treats it the same. As for the quant being less useful to me: that comes from downstream processing, where I use a curve-fitting tool to calculate folding stability from peptide quant, so it's more of a niche issue that I am dealing with. The dataset I used for this first test isn't the greatest, so I am going to run a different set that should give me a better reference for what may be going wrong.

I suppose my question revolves around finding the best way to achieve a robust quant with multiple sequences for the same protein. Does having multiple similar sequences reduce confidence in quantitation, or are there other factors I should consider? For the most part, I am only interested in differentiating between peptides in cases where heterozygosity matters. With this better test set, I should get a better idea of whether I am actually running into a problem, as I have a few specific peptides of interest that I will look for. I'll update you when I know more, but if you have any suggestions for me, I'd appreciate them.

Thanks, Chad

P.S. I apologize if I am not explaining my problem very well. Part of the underlying problem is that much of how quant is done is a bit of a black box to me.

fcyu commented 2 weeks ago

Hi Chad,

"I suppose my question revolves around finding the best way to achieve a robust quant with multiple sequences for the same protein. Does having multiple similar sequences reduce confidence in quantitation, or are there other factors I should consider?"

If it is just about the peptide-level quant, there should be no issue most of the time. You could probably turn off MBR (match-between-runs) in case the highly similar peptides have similar retention time, mass, and charge.
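To see which peptides can actually distinguish two variants at the peptide level, a rough in-silico digest can help. The sketch below assumes a simplified trypsin rule (cut after K or R unless the next residue is P), no missed cleavages, and no modifications; it is not how MSFragger digests the database, and the two toy sequences are invented for illustration.

```python
import re


def tryptic_peptides(sequence, min_length=7):
    """Fully tryptic peptides with no missed cleavages.

    Simplified rule: cut after K or R unless the next residue is P.
    """
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    return {p for p in pieces if len(p) >= min_length}


def compare_variants(reference, variant, min_length=7):
    """Split the peptides of two sequences into shared and variant-specific sets."""
    ref_peps = tryptic_peptides(reference, min_length)
    var_peps = tryptic_peptides(variant, min_length)
    return {
        "shared": ref_peps & var_peps,
        "reference_only": ref_peps - var_peps,
        "variant_only": var_peps - ref_peps,
    }


# Toy sequences (hypothetical): the variant carries one L -> E substitution.
reference_seq = "AAAAGGGKLLLLVVVRTTTTSSSK"
variant_seq = "AAAAGGGKLLLEVVVRTTTTSSSK"

result = compare_variants(reference_seq, variant_seq)
for group, peptides in result.items():
    print(group, sorted(peptides))
```

Only the peptides in `reference_only` and `variant_only` carry allele-specific information. Those are the near-identical pairs for which similar retention time, mass, and charge could matter when MBR is on.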

Best,

Fengchao