annotation issues - Githubissues

bzhanglab / PepQuery

PepQuery: a targeted peptide search engine

http://pepquery.org

GNU General Public License v3.0

annotation issues #19

Closed summerghw closed 4 years ago

summerghw commented 4 years ago

Hi, I am working with some WES (WXS) data and mzML data. Basically, I followed the Variant Peptide Identification procedure in the paper "Integrated Proteogenomic Characterization of Clear Cell Renal Cell Carcinoma": I used customprodbj to create a patient-specific protein sequence database, MS-GF+ to search, PepQuery to validate novel peptides, and PDV for visualization.

Could you tell me how you annotate the peptides in psm_rank.txt? I tried to map the PSMs in psm_rank.txt back to the annotated VCF file, but one peptide sequence can match more than one protein sequence in the patient-specific protein sequence database created by customprodbj, and one protein sequence matched in hg19_refGene.txt can match more than one mutated position in the VCF file. I could not figure out how to remove these false positives.

wenbostar commented 4 years ago

It's possible that multiple isoforms of the same gene cover the same exon that contains a non-synonymous variant. In that case, the variant peptide covering this variant is likely to map to multiple proteins. In addition, it's also possible that multiple different variants generate the same variant peptide.
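To see how one peptide ends up with several protein hits, here is a minimal sketch that lists every protein (and 1-based start position) containing a given peptide. The FASTA records, protein names, and the helper names `parse_fasta` and `map_peptides` are hypothetical and only illustrate the isoform case described above; this is not how PepQuery or customprodbj annotate peptides internally.

```python
from collections import defaultdict


def parse_fasta(text):
    """Parse FASTA text into a dict mapping record ID to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the record ID.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def map_peptides(peptides, proteins):
    """Return {peptide: [(protein_id, 1-based start), ...]} for all exact matches."""
    hits = defaultdict(list)
    for pep in peptides:
        for name, seq in proteins.items():
            start = seq.find(pep)
            while start != -1:
                hits[pep].append((name, start + 1))
                start = seq.find(pep, start + 1)
    return dict(hits)


# Hypothetical toy database: two isoforms of one gene share the same region.
TOY_FASTA = """>GENE1_iso1
MKTAYIAKQRQISFVK
>GENE1_iso2
MSSAYIAKQRQISFVKLL
>GENE2
MPEPTIDEK
"""

proteins = parse_fasta(TOY_FASTA)
result = map_peptides(["QISFVK", "PEPTIDEK"], proteins)
for peptide, matches in result.items():
    print(peptide, matches)
```

A peptide that maps to several isoforms of the same gene at the same genomic variant is one identification reported several times, not several false positives, so grouping hits by gene and variant position before counting is a reasonable way to collapse them.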

summerghw commented 4 years ago

@wenbostar Thank you for the reply. I have a new problem: when I map the PSMs in psm_rank.txt back to the annotated VCF file, should I use a single placeholder in place of "I" and "L" to get more results, since they have the same mass? Also, what research or analysis can be done with the novel peptides I have found? Thank you again.

wenbostar commented 4 years ago

What is the input for PepQuery search?

summerghw commented 4 years ago

/lustre/user/lixue/tools/jdk1.8.0_221/bin/java -jar /lustre/user/lixue/tools/PepQuery_v1.3.0/pepquery-1.3.jar -pep /lustre/user/lixue/output/WES/tmp_output/mzml/FUSCCTNBC113.pep -db /lustre/user/lixue/output/WES/tmp_output/mzml/uniprot_sprot.fasta -ms /lustre/user/lixue/output/WES/tmp_output/mzml/FUSCCTNBC113.mgf -o /lustre/user/lixue/output/WES/tmp_output/mzml/FUSCCTNBC113-tolu 10ppm -cpu 4 -prefix FUSCCTNBC113

The -pep input came from the MS-GF+ result (*.tsv): I took the peptide column and deleted the (+57.021) annotations.
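The clean-up step described here (turning annotated MS-GF+ peptide strings into plain sequences for the -pep file) can be sketched as follows. The example strings, the handling of flanking residues such as `K.` and `.R`, and the one-sequence-per-line output are assumptions for illustration; check them against your actual *.tsv and the PepQuery documentation.

```python
import re

# A signed mass delta such as +57.021 or -17.027, with or without parentheses.
MASS_DELTA = re.compile(r"\(?[+-]\d+(?:\.\d+)?\)?")
# Flanking residues written as "K." at the start or ".R" at the end.
FLANK = re.compile(r"^[A-Z_-]\.|\.[A-Z_-]$")


def clean_peptide(annotated):
    """Strip mass-delta annotations first, then any flanking residues."""
    without_mass = MASS_DELTA.sub("", annotated.strip())
    return FLANK.sub("", without_mass)


def unique_peptides(annotated_peptides):
    """Clean each peptide and drop duplicates while keeping first-seen order."""
    seen = set()
    ordered = []
    for item in annotated_peptides:
        seq = clean_peptide(item)
        if seq and seq not in seen:
            seen.add(seq)
            ordered.append(seq)
    return ordered


# Hypothetical values as they might appear in a peptide column.
raw = ["K.AC+57.021DEK.R", "AC(+57.021)DEK", "+42.011MKTAYIAK", "R.LLSPEK.A"]
pep_lines = unique_peptides(raw)
print("\n".join(pep_lines))
```

Mass deltas are removed before the flanking residues so that the decimal point inside a delta is never mistaken for a flank separator.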

wenbostar commented 4 years ago

It looks like you're using MS-GF+ as the search engine and variant calls from WES data, so I would suggest using our new pipeline, NeoFlow, for variant peptide identification. It takes VCF and MS/MS data as input, and PepQuery is implemented in NeoFlow for validation.

wenbostar commented 4 years ago

Please feel free to let me know if you have any questions when you use NeoFlow.

summerghw commented 4 years ago

Thank you, I will try NeoFlow; it seems much easier. In fact, I already have the PepQuery results, including psm_rank.txt, psm_rank.mgf, psm_annotation.txt, etc., but I want to map them back to merge-varInfo.txt (created by customprodbj) to learn the function of the novel peptides. And I don't know whether it is correct to use a single placeholder for "I" and "L" to get more results, since they have the same mass.

wenbostar commented 4 years ago

As you said, a variant peptide derived from an I2L or L2I change cannot be confidently distinguished from its wild-type peptide, since amino acids I and L are isomers with identical masses. So I would suggest removing those variants.

wenbostar commented 4 years ago

Actually, a variant peptide derived from an I2L or L2I change will not pass PepQuery validation if the reference database provided to PepQuery contains the wild-type sequence of the variant peptide.
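As a rough illustration of that point, the sketch below flags variant peptides that become identical to a reference sequence once I and L are treated as the same residue. The sequences and function names are hypothetical, and this is only a pre-filter one could run on a candidate list; it does not reproduce PepQuery's own validation.

```python
def il_normalize(seq):
    """Collapse isoleucine and leucine into one symbol, since they share a mass."""
    return seq.replace("I", "L")


def indistinguishable_from_reference(variant_peptide, reference_proteins):
    """True if the peptide matches any reference protein after I/L normalization."""
    key = il_normalize(variant_peptide)
    return any(key in il_normalize(protein) for protein in reference_proteins)


# Hypothetical reference protein and candidate variant peptides.
reference = ["MKTAYIAKQRQISFVK"]
candidates = ["TAYLAK", "TAYVAK"]  # an I-to-L change and an I-to-V change

removable = [p for p in candidates if indistinguishable_from_reference(p, reference)]
print(removable)
```

Normalizing both sides in the same direction is what makes the comparison symmetric: an I2L and an L2I variant are both caught by the same check.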

summerghw commented 4 years ago

Thank you. After filtering at p-value <= 0.01, every sample has fewer than 100 novel peptides. Is that normal?

wenbostar commented 4 years ago

You should use pvalue <= 0.01 and n_ptm == 0.

summerghw commented 4 years ago

Does n_ptm represent the number of peptide modifications?

wenbostar commented 4 years ago

Please find the explanation in the PepQuery online documentation.

wenbostar commented 4 years ago

It looks like you're using PepQuery 1.3. I recommend using the latest version, 1.4.1. In the new version, there is a new column "confident" in the psm_rank.txt file. You can use it directly to filter the identification results.

Column "confident": Yes means the PSM is confident, i.e. the identification passed all the validation steps in PepQuery; No means the PSM is not confident.
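The two filtering rules mentioned in this thread can be sketched in one small function: use the "confident" column when the file has it, and otherwise fall back to pvalue <= 0.01 and n_ptm == 0. The column names `pvalue`, `n_ptm`, and `confident` come from the comments above; the tab-delimited layout with a header row, the `peptide` column, and the toy rows are assumptions, and a real psm_rank.txt has more columns.

```python
import csv
import io


def filter_psms(handle, max_pvalue=0.01):
    """Keep confident PSMs from a tab-delimited table with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    use_confident = "confident" in (reader.fieldnames or [])
    kept = []
    for row in reader:
        if use_confident:
            # Newer output: rely on the "confident" column (Yes/No).
            ok = (row["confident"] or "").strip().lower() == "yes"
        else:
            # Older output: apply the p-value and modification-count rule.
            ok = float(row["pvalue"]) <= max_pvalue and int(row["n_ptm"]) == 0
        if ok:
            kept.append(row)
    return kept


# Hypothetical toy tables standing in for psm_rank.txt from the two versions.
OLD_STYLE = "peptide\tpvalue\tn_ptm\nAAAK\t0.005\t0\nCCCK\t0.005\t2\nDDDK\t0.2\t0\n"
NEW_STYLE = "peptide\tpvalue\tn_ptm\tconfident\nAAAK\t0.005\t0\tYes\nCCCK\t0.005\t2\tNo\n"

old_hits = [row["peptide"] for row in filter_psms(io.StringIO(OLD_STYLE))]
new_hits = [row["peptide"] for row in filter_psms(io.StringIO(NEW_STYLE))]
print(old_hits, new_hits)
```

With a real file, pass `open("psm_rank.txt", newline="")` instead of the `io.StringIO` objects.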

summerghw commented 4 years ago

Thank you.

wenbostar commented 4 years ago

How did you filter your identification results?

summerghw notifications@github.com wrote on Monday, March 2, 2020 at 12:11 AM:

@wenbostar https://github.com/wenbostar Hello, I have done all the analysis steps following the method in "Integrated Proteogenomic Characterization of Clear Cell Renal Cell Carcinoma", but I found only a few novel peptides. So I tried to repeat the analysis using the data from this paper, but unlike the proteomic data on CPTAC, the genomic data (VCF) is under controlled access on the GDC Data Portal. Could you point me to an open data set or paper that I can use to validate my work? I would appreciate it.

summerghw commented 4 years ago

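@wenbostar May I write in Chinese? I filtered the PepQuery results by p-value, but there were already very few results even before that filter. I think the possible reasons are: 1. For the reference database parameter, I merged UniProt and RefSeq. 2. I only called variants with Mutect2, whereas the paper used four different tools and merged their results. 3. My proteomics data is not deep enough. Thank you very much for your reply.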

wenbostar commented 4 years ago

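Somatic variant peptides are indeed hard to identify. In our experience, the number usually ranges from a few to a few dozen. Some samples yield only a few, and some yield none at all. It also depends heavily on the number of somatic variants detected from the sample's DNA data. Some samples have a very low mutation load to begin with, and detecting few or no corresponding peptides in those samples is fairly normal.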

wenbostar commented 4 years ago

If you like, I can check whether your MS-GF+ and PepQuery parameters are appropriate.

summerghw commented 4 years ago

@wenbostar Thank you very much for your patient reply.

MS-GF+: java -Xmx3500M -jar /lustre/user/lixue/tools/MSGFPlus/MSGFPlus.jar -s /lustre/user/lixue/output/WES/tmp_output/mzml/sample.mzML -d /lustre/user/lixue/output/WES/tmp_output/7-MUTECT2/sample/merge-var.fasta -inst 1 -t 10ppm -ntt 2 -tda 1 -o /lustre/user/lixue/output/WES/tmp_output/mzml/sample.mzid

PepQuery: /lustre/user/lixue/tools/jdk1.8.0_221/bin/java -jar /lustre/user/lixue/tools/pepquery-1.4.1/pepquery-1.4.1.jar -pep /lustre/user/lixue/output/WES/tmp_output/mzml/sample.pep -db /lustre/user/lixue/output/WES/tmp_output/mzml/ref.fasta -ms /lustre/user/lixue/output/WES/tmp_output/mzml/sample.mgf -o /lustre/user/lixue/output/WES/tmp_output/mzml/sample -tol 10 -tolu ppm -cpu 4 -prefix sample

wenbostar commented 4 years ago

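What type of instrument generated this mass spectrometry data? Is it label-free or labeled data? Did you not set any modifications for MS-GF+? You did not set modifications for PepQuery either, and the default modifications may not match your data.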

summerghw commented 4 years ago

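My mass spectrometry data is label-free. Most of the experimental methods are profiling, and a few are QE. This only affects the -inst parameter in MS-GF+, right? As for modifications, should I set the following: Oxidation (M), Acetyl (Protein N-term)?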

wenbostar commented 4 years ago

What does "profiling" mean?

summerghw commented 4 years ago

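@wenbostar I looked it up: most of the instruments are Fusion, and a few are QE. "Profiling" appears to be the experimental method; no modifications are added during a profiling experiment, so the modification parameters should not need to be set specially. TFER is for transcription factors and adds Acetyl (Protein N-term), DeStreak (C), and Oxidation (M), while TiO2 adds Acetyl (Protein N-term), Oxidation (M), Phospho (ST), and Phospho (Y).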

wenbostar commented 4 years ago

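For a routine global proteome experiment, you usually need to add Carbamidomethylation of C as a fixed modification (depending on how the samples were processed, it may be a different modification). If the experiment involved no special treatment, Oxidation of M, Acetyl of Protein N-term, and so on are generally added as well. These modifications need to be set for both MS-GF+ and PepQuery, and they can have a considerable effect on the identification results. If phosphopeptide enrichment was performed, also add Phospho (ST) and Phospho (Y) as variable modifications.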
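A small worked calculation shows why a missing fixed modification matters at the 10 ppm precursor tolerance used in the commands above. The sketch computes the monoisotopic mass of a hypothetical cysteine-containing peptide with and without carbamidomethylation (+57.021464 Da per cysteine) and compares the difference with the tolerance. The peptide and function names are illustrative assumptions; the residue masses are standard monoisotopic values.

```python
# Standard monoisotopic residue masses in daltons.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
CARBAMIDOMETHYL = 57.021464  # fixed modification on C after alkylation


def peptide_mass(seq, fixed_mods=None):
    """Neutral monoisotopic peptide mass with optional per-residue fixed mods."""
    fixed_mods = fixed_mods or {}
    return WATER + sum(MONO[aa] + fixed_mods.get(aa, 0.0) for aa in seq)


def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


peptide = "ACDEK"  # hypothetical peptide containing one cysteine
alkylated = peptide_mass(peptide, {"C": CARBAMIDOMETHYL})
unmodified = peptide_mass(peptide)

shift_da = alkylated - unmodified
shift_ppm = abs(ppm_error(alkylated, unmodified))
print(f"shift: {shift_da:.6f} Da, {shift_ppm:.0f} ppm, outside 10 ppm: {shift_ppm > 10}")
```

If the sample was alkylated but the search assumes unmodified cysteine, every cysteine-containing peptide falls far outside the precursor window and cannot be matched, which is one way a modification mismatch reduces the number of identifications.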