frankligy / SNAF

Splicing Neo Antigen Finder (SNAF) is an easy-to-use Python package to identify splicing-derived tumor neoantigens from RNA sequencing data. It further leverages both deep learning and hierarchical Bayesian models to prioritize candidates for experimental validation.
MIT License

Retrieving nucleotide coding sequence and genomic origin for peptides #45

Open spvensko opened 3 months ago

spvensko commented 3 months ago

Hello,

My understanding is that the coord column contains the coordinates of the splicing event (e.g. the coordinates may be that of a skipped exon) and not the genomic coordinates of the sequence encoding the peptide. With that in mind, is it possible to retrieve the coding sequence and genomic origin of each peptide (as in, the actual coordinates where that peptide is encoded) out of the current outputs? It appears the coding sequence is utilized at https://github.com/spvensko/SNAF/blob/v0.7.0/snaf/snaf.py#L1190, but I wanted to check with you before I try to develop my own solution.

Thanks, Steven V.

frankligy commented 3 months ago

Hi @spvensko,

I believe I have a solution for this. See the example below:

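import snaf

# Load the saved query object produced after the prediction step
jcmq = snaf.JunctionCountMatrixQuery.deserialize('result/after_prediction.p')
uid = 'ENSG00000065609:E45.1-E47.1'  # the junction of interest
nj_list = jcmq.results[0]  # per-junction objects; entries can be None
for nj in nj_list:
    if nj is not None and nj.uid == uid:
        print(nj.junction)  # junction sequence as "first_half,second_half"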

This prints the junction sequence below, with a comma (,) delimiting the two sides of the splice junction:

CCTCCTGCTGGGACAGGCATGCCCATGATGCCTCAGCAGCCGGTCATGTTTGCACAGCCCATGATGAGGCCCCCCTTTGGAGCTGCCGCTGTACCTGGCACGCAG,CTGCAATATTTGTGACTGAATAGGAAAATAAATGAGTTTGGAGACTTCAAATAAGATTGATGCTGAGTTTC

Let's BLAT the first and second halves in the UCSC Genome Browser to see where each one maps in the genome:

First half: [screenshot of the BLAT result]

Second half: [screenshot of the BLAT result]

You can also derive this with code instead of BLAT; I shared one solution in this issue (https://github.com/frankligy/SNAF/issues/31).
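As a rough illustration of the code route (a sketch only, not necessarily what the linked issue does and not part of the SNAF API), the snippet below takes a junction string in the "first_half,second_half" form printed above, translates it in the three forward frames, and reports which nucleotides encode a given peptide and how many of them fall on each side of the junction. The helper names `translate` and `locate_peptide` are hypothetical, and it assumes the junction string is already in the coding orientation and uses the standard genetic code. Mapping the resulting offsets to genomic coordinates would still need the exon coordinates and strand (for example from the BLAT hits), which this sketch does not cover.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(seq):
    """Translate complete codons of seq; unknown codons (e.g. with N) become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def locate_peptide(junction, peptide):
    """Find where peptide is encoded in a 'left,right' junction string.

    Hypothetical helper: searches the three forward frames only and
    returns one dict per match, with 0-based offsets into left + right.
    """
    left, right = junction.upper().split(",")
    full = left + right
    hits = []
    for frame in range(3):
        protein = translate(full[frame:])
        pos = protein.find(peptide)
        while pos != -1:
            start = frame + 3 * pos           # first coding nucleotide
            end = start + 3 * len(peptide)    # one past the last coding nucleotide
            cds = full[start:end]
            n_left = min(len(cds), max(0, len(left) - start))
            hits.append({
                "frame": frame,
                "start": start,
                "end": end,
                "cds": cds,
                "left_nt": cds[:n_left],      # encoded upstream of the junction
                "right_nt": cds[n_left:],     # encoded downstream of the junction
                "spans_junction": start < len(left) < end,
            })
            pos = protein.find(peptide, pos + 1)
    return hits


# Toy example (made-up sequence, not real SNAF output)
for hit in locate_peptide("ATGGCC,AAATTT", "AK"):
    print(hit["frame"], hit["left_nt"], hit["right_nt"], hit["spans_junction"])
```

With a real `nj.junction` string and a peptide from the SNAF output, `left_nt` and `right_nt` show how the coding sequence is split between the two halves, so each part can be placed relative to the corresponding half's genomic location.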

Hopefully this helps a bit, Frank