Closed ypriverol closed 2 years ago
Comment from @yafeng :
Hi Yasset,
Regarding the bayesian class FDR, my GitHub script was based on the following paper:
Fan Y. et al Transferred subgroup false discovery rate for rare post-translational modifications detected by mass spectrometry. MCP PMID: 24200586
At that time, I tested a few novel searches, which I found several issues:
1. it has the tail fitting issue, which was shown in their original paper Fig 1. it does significantly disturb the fitting result. Therefore I set the score filtering 6-10 to exclude the decoy hits the the tail region. However, this is only applicable for MSGF-plus SpecEval score.
2. in general, it is less stringent than the old fashion way (novel only decoy / novel target + decoy), though it is supposed to be more accurate especially when novel only target is rare.
My current opinion is not to widely implement this bayesian FDR before in-depth evaluation.
Thanks @Yasset for the implementation.
Regarding triqler file, since the score there is 1-PEP, I don't think it is suitable to use it for computing FDR therefore we should not support triqler output unless we can get an actual score column there. I think SVM score from percolator would be great for this purpose. https://github.com/bigbio/py-pgatk/blob/47daace0009ac814d2280ab294254f5de6f33209/pypgatk/proteomics/openms.py#L478
For the class definitions: https://github.com/bigbio/py-pgatk/blob/47daace0009ac814d2280ab294254f5de6f33209/pypgatk/toolbox/general.py#L330
I would propose to have two parts for each class:
class:acceptable_prefixes![rejected_prefixes]
where !
is the separator between the two parts and the second part is optional.
the peptide can be assigned to the class if any of the acceptable prefixes are found BUT it is not assigned to the class if any of the rejected prefixes are found.
This scenario allows us to differentiate canonical and other peptides easily.
An input example could be as follows:
canonical:ENSP
non_canonical:altorf_|lncRNA|pseudogene!ENSP
variant:COSMIC_|cbiomut_|var_!ENSP|altorf_|lncRNA|pseudogene
In this example the priority is first given to canonical ENSEMBLE proteins. secondly, the priority is given to non_canonical proteins and finally the variant proteins.
I think this implantation would guaranty that we don't throughout any identified peptide due to class conflicts.
on the Bayseian FDR https://github.com/bigbio/py-pgatk/blob/47daace0009ac814d2280ab294254f5de6f33209/pypgatk/proteomics/openms.py#L155 The results shown in the paper look very sound. However, as @yafeng also suggested, we need to investigate further how to filter the PSMs.
Actually, we also need to implement a procedure to at least inform the user when very few peptides are available to use for computing the ordinary class specific FDR, and even better is to give the option to transform the FDRs using the Bayesian approach. Let's wait with this until we investigate the impact of filtering.
Thanks @Yasset for the implementation.
Regarding triqler file, since the score there is 1-PEP, I don't think it is suitable to use it for computing FDR therefore we should not support triqler output unless we can get an actual score column there. I think SVM score from percolator would be great for this purpose.
@husensofteng I will disable for now the possibility to perform the class FDR on top of the triqler output. @timosachsenberg perhaps we can export the ConsensusID score to Triqler.
Another way to transfer FDR to small groups has been described in: https://www.mcponline.org/article/S1535-9476(20)33109-1/fulltext and which looks rather simple to implement at first glance. Does anyone have experience with that approach?
Another way to transfer FDR to small groups has been described in: https://www.mcponline.org/article/S1535-9476(20)33109-1/fulltext and which looks rather simple to implement at first glance. Does anyone have experience with that approach?
@timosachsenberg this is actually the Bayesian implementation that we have. 😂
Another way to transfer FDR to small groups has been described in: https://www.mcponline.org/article/S1535-9476(20)33109-1/fulltext and which looks rather simple to implement at first glance. Does anyone have experience with that approach?
@timosachsenberg this is actually the Bayesian implementation that we have. 😂
yes it is the equation 7 and 8 from that paper that is implemented here. I rather meant a procedure to inform that not enough decoy peptides are available to compute the FDR.
hahaha great
Filtering thresholds have to be decided in bayesian_class_fdr, therefore this function is disabled currently: https://github.com/bigbio/py-pgatk/blob/fb1a2bc109dc272900decb2d8b8d1226be9501e5/pypgatk/proteomics/openms.py#L221.
Class FDR is used to filter peptide identifications / PSMs using the target/decoy approach of the class they they represent, e.g
pseudo_X
. The method should allow filtering by specific peptide class likepseudo_
orCOSMIC_
or by group of classes like mutations which group a set of classes [COSMIC
,cbio
].The class-FDR should be implemented for the following file formats: