Closed GeorgesBed closed 3 years ago
Hi @GeorgesBed We are interested to have more developers contributing to the tool. Would you be willing to perform a PR to the code fixing the issue.
sure !
@husensofteng can you give also your input here.
Hi @GeorgesBed,
Firstly, thank you for pointing out this major issue and sorry for the very long delay in replying.
Indeed, that is undesired and it has been introduced due to the different CDS definitions as you have correctly addressed.
Since, ?
is not only used for mutations outside CDSs, I would propose to be more inclusive and set the condition as follows:
if "?" not in snp.dna_mut and snp.aa_mut!='p.?'
This solution would skip all non-CDS mutations but it would still process mutations like (COSM5857105) the following:
Transcript: ENST00000394161.9 (CDS of exon6)
Mutation CDS (dna_mut): c.497del
Mutation AA (aa_mut): p.P166Lfs*?
Type: Deletion - Frameshift
Genome coord (g38): 19:7743054-7743054
CDS
CD209_ENST00000394161 ENST00000394161.9 19:7743039-7747511(-) atgagtgactccaaggaaccaagactgcagcagctgggcctcctggaggaggaacagctgagaggccttggattccgaca gactcgaggatacaagagcttagcagggtgtcttggccatggtcccctggtgctgcaactcctctccttcacgctcttgg ctgggctccttgtccaagtgtccaaggtccccagctccataagtcaggaacaatccagaagtaaccgcttcacctggatg ggactttcagatctaaatcaggaaggcacgtggcaatgggtggacggctcacctctgttgcccagcttcaagcagtattg gaacagaggagagcccaacaacgttggggaggaagactgcgcggaatttagtggcaatggctggaacgacgacaaatgta atcttgccaaattctggatctgcaaaaagtccgcagcctcctgctccagggatgaagaacagtttctttctccagcccct gccaccccaaaccccc[C]tcctgcgtag
Protein sequence
ENSP00000377716.4 pep chromosome:GRCh38:19:7743039:7747511:-1 gene:ENSG00000090659.18 transcript:ENST00000394161.9 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:CD209 description:CD209 molecule [Source:HGNC Symbol;Acc:HGNC:1641] MSDSKEPRLQQLGLLEEEQLRGLGFRQTRGYKSLAGCLGHGPLVLQLLSFTLLAGLLVQVSKVPSSISQEQSRSNRFTWMGLSDLNQEGTWQWVDGSPLLPSFKQYWNRGEPNNVGEEDCAEFSGNGWNDDKCNLAKFWICKKSAASCSRDEEQFLSPAPATPNP[PPA]
Mutated protein sequence
MSDSKEPRLQQLGLLEEEQLRGLGFRQTRGYKSLAGCLGHGPLVLQLLSFTLLAGLLVQVSKVPSSISQEQSRSNRFTWMGLSDLNQEGTWQWVDGSPLLPSFKQYWNRGEPNNVGEEDCAEFSGNGWNDDKCNLAKFWICKKSAASCSRDEEQFLSPAPATPNP[LLR]
pushed commit 79b399a
This is implemented in the dev version already. The issue will be close when the following PR #48 #50 get merged.
https://github.com/bigbio/py-pgatk/blob/03288b710f5f28b2da0db0731c34ba58b86316e8/pypgatk/cgenomes/cgenomes_proteindb.py#L71
Hey, I've used py-pgatk and i find it very useful tool. I've took a deep look at some cosmic proteins and I believe there's an undesired effect from the code
file: cgenomes_proteindb.py, line: 71
The problem
considering snp.dna_mut is the cds mutation reported by cosmic, it could contains 5'UTR, intronic and 3'UTR references
doi: 10.2353/jmoldx.2007.060081
and since the fasta file contains CDS sequences which lacks 5'UTRs introns and 3'UTRs, line 71 should be modified to
Example
The comic mutation COSV52350905 for transcript RNASEKENST00000549393, ENST00000549393.2 with a cds mutation: c.*185*209del and an unkown protein change p.?:Unknown
current behavior
original CDS
wrongly mutated CDS since the mutation is in the UTR region
wrong protein
Thank you.