bigbio / py-pgatk

Python tools for proteogenomics analysis toolkit
Apache License 2.0
10 stars 11 forks source link

Implementation of Class FDR #56

Closed ypriverol closed 2 years ago

ypriverol commented 2 years ago

Class FDR is used to filter peptide identifications / PSMs using the target/decoy approach of the class they they represent, e.g pseudo_X. The method should allow filtering by specific peptide class like pseudo_ or COSMIC_ or by group of classes like mutations which group a set of classes [COSMIC, cbio].

The class-FDR should be implemented for the following file formats:

ypriverol commented 2 years ago

Comment from @yafeng :

 Hi Yasset,

Regarding the bayesian class FDR,  my GitHub script was based on the following paper:

Fan Y. et al Transferred subgroup false discovery rate for rare post-translational modifications detected by mass spectrometry. MCP PMID: 24200586

At that time, I tested a few novel searches, which I found several issues:

1. it has the tail fitting issue, which was shown in their original paper Fig 1.   it does significantly disturb the fitting result. Therefore I set the score filtering 6-10 to exclude the decoy hits the the tail region. However, this is only applicable for MSGF-plus SpecEval score.

2. in general, it is less stringent than the old fashion way (novel only decoy / novel target + decoy), though it is supposed to be more accurate especially when novel only target is rare. 
My current opinion is not to widely implement this bayesian FDR before in-depth evaluation.  
husensofteng commented 2 years ago

Thanks @Yasset for the implementation.

Regarding triqler file, since the score there is 1-PEP, I don't think it is suitable to use it for computing FDR therefore we should not support triqler output unless we can get an actual score column there. I think SVM score from percolator would be great for this purpose. https://github.com/bigbio/py-pgatk/blob/47daace0009ac814d2280ab294254f5de6f33209/pypgatk/proteomics/openms.py#L478

husensofteng commented 2 years ago

For the class definitions: https://github.com/bigbio/py-pgatk/blob/47daace0009ac814d2280ab294254f5de6f33209/pypgatk/toolbox/general.py#L330

I would propose to have two parts for each class: class:acceptable_prefixes![rejected_prefixes] where ! is the separator between the two parts and the second part is optional.

the peptide can be assigned to the class if any of the acceptable prefixes are found BUT it is not assigned to the class if any of the rejected prefixes are found.

This scenario allows us to differentiate canonical and other peptides easily.

An input example could be as follows:

canonical:ENSP
non_canonical:altorf_|lncRNA|pseudogene!ENSP
variant:COSMIC_|cbiomut_|var_!ENSP|altorf_|lncRNA|pseudogene

In this example the priority is first given to canonical ENSEMBLE proteins. secondly, the priority is given to non_canonical proteins and finally the variant proteins.

I think this implantation would guaranty that we don't throughout any identified peptide due to class conflicts.

husensofteng commented 2 years ago

on the Bayseian FDR https://github.com/bigbio/py-pgatk/blob/47daace0009ac814d2280ab294254f5de6f33209/pypgatk/proteomics/openms.py#L155 The results shown in the paper look very sound. However, as @yafeng also suggested, we need to investigate further how to filter the PSMs.

Actually, we also need to implement a procedure to at least inform the user when very few peptides are available to use for computing the ordinary class specific FDR, and even better is to give the option to transform the FDRs using the Bayesian approach. Let's wait with this until we investigate the impact of filtering.

ypriverol commented 2 years ago

Thanks @Yasset for the implementation.

Regarding triqler file, since the score there is 1-PEP, I don't think it is suitable to use it for computing FDR therefore we should not support triqler output unless we can get an actual score column there. I think SVM score from percolator would be great for this purpose.

https://github.com/bigbio/py-pgatk/blob/47daace0009ac814d2280ab294254f5de6f33209/pypgatk/proteomics/openms.py#L478

@husensofteng I will disable for now the possibility to perform the class FDR on top of the triqler output. @timosachsenberg perhaps we can export the ConsensusID score to Triqler.

timosachsenberg commented 2 years ago

Another way to transfer FDR to small groups has been described in: https://www.mcponline.org/article/S1535-9476(20)33109-1/fulltext and which looks rather simple to implement at first glance. Does anyone have experience with that approach?

ypriverol commented 2 years ago

Another way to transfer FDR to small groups has been described in: https://www.mcponline.org/article/S1535-9476(20)33109-1/fulltext and which looks rather simple to implement at first glance. Does anyone have experience with that approach?

@timosachsenberg this is actually the Bayesian implementation that we have. 😂

husensofteng commented 2 years ago

Another way to transfer FDR to small groups has been described in: https://www.mcponline.org/article/S1535-9476(20)33109-1/fulltext and which looks rather simple to implement at first glance. Does anyone have experience with that approach?

@timosachsenberg this is actually the Bayesian implementation that we have. 😂

yes it is the equation 7 and 8 from that paper that is implemented here. I rather meant a procedure to inform that not enough decoy peptides are available to compute the FDR.

timosachsenberg commented 2 years ago

hahaha great

husensofteng commented 2 years ago

Filtering thresholds have to be decided in bayesian_class_fdr, therefore this function is disabled currently: https://github.com/bigbio/py-pgatk/blob/fb1a2bc109dc272900decb2d8b8d1226be9501e5/pypgatk/proteomics/openms.py#L221.