bigbio / py-pgatk

Python tools for proteogenomics analysis toolkit
Apache License 2.0
11 stars 11 forks source link

wrong cosmic proteins output #47

Closed GeorgesBed closed 3 years ago

GeorgesBed commented 3 years ago

https://github.com/bigbio/py-pgatk/blob/03288b710f5f28b2da0db0731c34ba58b86316e8/pypgatk/cgenomes/cgenomes_proteindb.py#L71

Hey, I've used py-pgatk and i find it very useful tool. I've took a deep look at some cosmic proteins and I believe there's an undesired effect from the code

file: cgenomes_proteindb.py, line: 71

 @staticmethod
  def get_mut_pro_seq(snp, seq):
    nucleotide = ["A", "T", "C", "G"]
    mut_pro_seq = ""
    # problem with the line below
    if "?" not in snp.dna_mut:  # unambiguous DNA change known in CDS sequence

The problem

considering snp.dna_mut is the cds mutation reported by cosmic, it could contains 5'UTR, intronic and 3'UTR references

image doi: 10.2353/jmoldx.2007.060081

and since the fasta file contains CDS sequences which lacks 5'UTRs introns and 3'UTRs, line 71 should be modified to

if "?" not in snp.dna_mut and "?" not in snp.aa_mut :

Example

The comic mutation COSV52350905 for transcript RNASEKENST00000549393, ENST00000549393.2 with a cds mutation: c.*185*209del and an unkown protein change p.?:Unknown

current behavior

original CDS

RNASEK_ENST00000549393 ENST00000549393.2 17:7012648-7014519(+) atgaagaagtgccggttctccctcccctcttccgcactgtcccgtgatgatgacgcctccagagaggacgataatctgggttcctgggagagatggcttggtcactattcccacccttgcctcgaccacttgtctcaatgtcaccacctcacgccctgttccaggtggctgagtccgaatccaa taatgctcggaatatttttcaatgt ccattccgctgtgttgattgaggacgttcccttcacggagaaagattttgagaatggcccccagaacatatacaacctttacgagcaagtcagctacaactgtttcatcgctgcaggcctttacctcctcctcggaggcttctctttctgccaagttcggctcaataagcgcaaggaatacatggtgcgctag

wrongly mutated CDS since the mutation is in the UTR region

RNASEK_ENST00000549393 ENST00000549393.2 17:7012648-7014519(+) atgaagaagtgccggttctccctcccctcttccgcactgtcccgtgatgatgacgcctccagagaggacgataatctgggttcctgggagagatggcttggtcactattcccacccttgcctcgaccacttgtctcaatgtcaccacctcacgccctgttccaggtggctgagtccgaatccaataatgctcggaatatttttcaatgtccattccgctgtgttgattgaggacgttcccttcacggagaaagattttgagaatggcccccagaacatatacaacctttacgagcaagtcagctacaactgtttcatcgctgcaggcctttacctcctcctcggaggcttctctttctgccaagttcggctcaataagcgcaaggaatacatggtgcgctag

wrong protein

COSV52350905:RNASEKENST00000549393:ENST00000549393.2:c.*185*209del:p.?:Unknown MKKCRFSLPSSALSRDDDASREDDNLGSWERWLGHYSHPCLDHLSQCHHLTPCSRWLSPNPTIPLC*LRTFPSRRKILRMAPRTYTTFTSKSATTVSSLQAFTSSSEASLSAKFGSISARNTWCA

Thank you.

ypriverol commented 3 years ago

Hi @GeorgesBed We are interested to have more developers contributing to the tool. Would you be willing to perform a PR to the code fixing the issue.

GeorgesBed commented 3 years ago

sure !

ypriverol commented 3 years ago

@husensofteng can you give also your input here.

husensofteng commented 3 years ago

Hi @GeorgesBed,

Firstly, thank you for pointing out this major issue and sorry for the very long delay in replying.

Indeed, that is undesired and it has been introduced due to the different CDS definitions as you have correctly addressed.

Since, ? is not only used for mutations outside CDSs, I would propose to be more inclusive and set the condition as follows: if "?" not in snp.dna_mut and snp.aa_mut!='p.?'

This solution would skip all non-CDS mutations but it would still process mutations like (COSM5857105) the following:

Transcript: ENST00000394161.9 (CDS of exon6) Mutation CDS (dna_mut): c.497del Mutation AA (aa_mut): p.P166Lfs*?
Type: Deletion - Frameshift Genome coord (g38): 19:7743054-7743054

CDS

CD209_ENST00000394161 ENST00000394161.9 19:7743039-7747511(-) atgagtgactccaaggaaccaagactgcagcagctgggcctcctggaggaggaacagctgagaggccttggattccgaca gactcgaggatacaagagcttagcagggtgtcttggccatggtcccctggtgctgcaactcctctccttcacgctcttgg ctgggctccttgtccaagtgtccaaggtccccagctccataagtcaggaacaatccagaagtaaccgcttcacctggatg ggactttcagatctaaatcaggaaggcacgtggcaatgggtggacggctcacctctgttgcccagcttcaagcagtattg gaacagaggagagcccaacaacgttggggaggaagactgcgcggaatttagtggcaatggctggaacgacgacaaatgta atcttgccaaattctggatctgcaaaaagtccgcagcctcctgctccagggatgaagaacagtttctttctccagcccct gccaccccaaaccccc[C]tcctgcgtag

Protein sequence

ENSP00000377716.4 pep chromosome:GRCh38:19:7743039:7747511:-1 gene:ENSG00000090659.18 transcript:ENST00000394161.9 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:CD209 description:CD209 molecule [Source:HGNC Symbol;Acc:HGNC:1641] MSDSKEPRLQQLGLLEEEQLRGLGFRQTRGYKSLAGCLGHGPLVLQLLSFTLLAGLLVQVSKVPSSISQEQSRSNRFTWMGLSDLNQEGTWQWVDGSPLLPSFKQYWNRGEPNNVGEEDCAEFSGNGWNDDKCNLAKFWICKKSAASCSRDEEQFLSPAPATPNP[PPA]

Mutated protein sequence

MSDSKEPRLQQLGLLEEEQLRGLGFRQTRGYKSLAGCLGHGPLVLQLLSFTLLAGLLVQVSKVPSSISQEQSRSNRFTWMGLSDLNQEGTWQWVDGSPLLPSFKQYWNRGEPNNVGEEDCAEFSGNGWNDDKCNLAKFWICKKSAASCSRDEEQFLSPAPATPNP[LLR]

husensofteng commented 3 years ago

pushed commit 79b399a

ypriverol commented 3 years ago

This is implemented in the dev version already. The issue will be close when the following PR #48 #50 get merged.