frankligy / SNAF

Splicing Neo Antigen Finder (SNAF) is an easy-to-use Python package to identify splicing-derived tumor neoantigens from RNA sequencing data. It further leverages both deep learning and hierarchical Bayesian models to prioritize candidates for experimental validation.
MIT License

implement new function to get full length isoform for NeoJunction #22

Open frankligy opened 10 months ago

frankligy commented 10 months ago

Right now, the SNAF-B pipeline only looks for membrane proteins. But people may be interested in knowing the potential full-length isoforms for all NeoJunctions, so please implement this feature.

frankligy commented 9 months ago

I have released a new version implementing this function. Please install it with pip as below:

pip install git+https://github.com/frankligy/SNAF.git@4f7d76321c32625c1909ad059b81d646a0cd9ef5

Right after your T antigen workflow, assuming the outdir is set to result, you can use the find_full_length mode to generate all possible full-length isoforms associated with each NeoJunction:

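# initiate B pipeline
# (db_dir, df and add_control are assumed to be already defined by the T antigen workflow)
import os
import snaf
from snaf import surface
surface.initialize(db_dir=db_dir)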

# get fake membrane tuples, not membrane in this case but all NeoJunctions
membrane_tuples = snaf.JunctionCountMatrixQuery.get_fake_membrane_tuples(df,add_control=add_control,outdir='result/surface_fake')

# run the B pipeline using find_full_length mode
surface.run(uids=membrane_tuples,outdir='result/surface_fake',prediction_mode='find_full_length',
            gtf=None,
            tmhmm=False,software_path=None) 

# generate result using find_full_length mode
surface.generate_full_results(outdir='result/surface_fake',mode='find_full_length',
                              freq_path='result/frequency_stage0_verbosity1_uid_gene_symbol_coord_mean_mle.txt',
                              validation_gtf=os.path.join(db_dir,'2021UHRRIsoSeq_SQANTI3_filtered.gtf'))
Screenshot 2024-01-30 at 2 07 20 PM

Look for a file named sr_ffl_str3_report_None_False.txt; it looks like the following:

Screenshot 2024-01-30 at 2 09 27 PM
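
If you would rather inspect the report in Python than in a spreadsheet, here is a minimal sketch using only the standard library. It assumes the report is a tab-delimited text file with a header row, which is not stated above, so adjust `delimiter` if your file differs. The helper names `find_reports` and `read_report` are hypothetical and are not part of SNAF.

```python
import csv
from pathlib import Path


def find_reports(outdir):
    """List the find_full_length report files present in outdir, sorted by name."""
    return sorted(Path(outdir).glob("sr_ffl_str*_report_*.txt"))


def read_report(path, delimiter="\t"):
    """Return the report rows as a list of dicts keyed by the header line.

    Assumption: the file is delimited text with one header row.
    """
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))
```

For example, `find_reports('result/surface_fake')` should list whichever report files the run produced, and `read_report(...)` on one of them gives rows whose keys are the column names in the header.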

The mRNA_sequence can be readily validated using the BLAT tool on the UCSC Genome Browser, taking the first one as an example:

Screenshot 2024-01-30 at 2 11 02 PM
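
To paste a sequence into the BLAT form, it can help to wrap it as a FASTA record first. The sketch below is plain string formatting and does not call any SNAF or UCSC API. The function name `to_fasta` and the record name you pass in are hypothetical choices for illustration.

```python
def to_fasta(name, sequence, width=60):
    """Format one sequence as a FASTA record with lines wrapped to `width`.

    Whitespace inside the sequence is removed and letters are upper-cased.
    """
    cleaned = "".join(sequence.split()).upper()
    # Slice the cleaned sequence into fixed-width chunks, one per line.
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"
```

Calling `to_fasta('candidate_1', seq)` with `seq` taken from the mRNA_sequence column gives a block of text that can be pasted into BLAT directly.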

In addition, the sr_ffl_str5_report_None_False.txt file lists the isoforms with long-read validation, based on long-read data from 10 cancer cell lines.
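
One way to see which str3 entries also have long-read support is to compare the two reports on the mRNA_sequence column. This is only a sketch under several assumptions: that both files are tab-delimited with a header row, that both contain an mRNA_sequence column, and that a validated isoform carries the identical sequence string in both files. None of this is confirmed above, and the function names are hypothetical.

```python
import csv


def sequences_in(path, column="mRNA_sequence", delimiter="\t"):
    """Collect the distinct values of `column` from a delimited report file."""
    with open(path, newline="") as handle:
        return {row[column] for row in csv.DictReader(handle, delimiter=delimiter)}


def split_by_long_read_support(str3_path, str5_path):
    """Return (validated, not_validated) sets of sequences from the str3 report.

    Assumption: a sequence counts as validated if the same string
    also appears in the str5 report.
    """
    all_seqs = sequences_in(str3_path)
    validated = sequences_in(str5_path)
    return all_seqs & validated, all_seqs - validated
```

Matching on the full sequence string is the simplest key available here; if the reports share an identifier column, matching on that would be more robust.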

Please share your feedback on this new function. Once it has been tested by users, I'll make it official in the tutorial.

Thank you, Frank