frankligy / SNAF

Splicing Neo Antigen Finder (SNAF) is an easy-to-use Python package to identify splicing-derived tumor neoantigens from RNA sequencing data, it further leverages both deep learning and hierarchical Bayesian models to prioritize certain candidates for experimental validation
MIT License
44 stars 9 forks source link

How to turn off GTEx control and use my own custom control samples? #37

Open frankligy opened 8 months ago

frankligy commented 8 months ago

In certain case, you'd like to not use our GTEx control but use your own custom control database, the recommended way is like below:

We recommend first put both your tumor bam files and control bam files into a /bam folder, run AltAnalyze, so that you will get a combined counts.original.pruned.txt count matrix for both your tumor and control samples.

Now you can run the differential alternative splicing (DAS) mode for AltAnalyze, please follow this section in the tutorial, and remember, the groups.txt and comps.txt need to be in the same level of where altanalyze_output folder sit, and it's advisable to make sure you are in that folder when you are running the singularity command: https://snaf.readthedocs.io/en/latest/tutorial.html#differential-gene-splicing-analysis-and-gene-enrichment-analysis

..
  altanalyze_output
  bam
  bed
  groups.txt
  comps.txt

After running that, please open this altanalyze_output/AlternativeOutput/Events-dPSI_0.0_rawp/PSI.tumor_vs_control.txt within which you can select the ones that are specific to your tumor samples and are absent to your own control, you can see ave-tumor, ave-control, dPSI, adjp to customize the selection criteria. The first three values are shown as Percent Spliced In (PSI) ranging from 0-1, 0 is not present, 1 is fully present. adjp indicate adjusted pvalue (the significance level).

Once you have a list of junctions from UID columns, you just need to take the first junction, see below:

CASK:ENSG00000147044:E21.1-E23.1|ENSG00000147044:E21.1-E22.1

gene id: CASK
junction: ENSG00000147044:E21.1-E23.1 (what you need)
background junction: ENSG00000147044:E21.1-E22.1 (the competing junction, you don't need)

These junctions can be used to subset your combined counts.orginal.pruned.txt dataframe based on row. Then you should also only include the tumor columns as the final df for the SNAF run.

Finally, how to turn off the GTEx filter? The key is to use a very large number like 10000 to name your n_max and name your t_min=-10000, then it basically will retain all your junctions

df = pd.read_csv('counts.original.pruned.txt',sep='\t',index_col=0)
snaf.initialize(df=df,db_dir=db_dir,binding_method='netMHCpan',software_path=netMHCpan_path,add_control=None,
                        n_max=10000,t_min=-10000)
jcmq = snaf.JunctionCountMatrixQuery(junction_count_matrix=df,cores=8,add_control=None,outdir='result_new')

As a test:

2024-03-26 20:55:28 starting initialization
Current loaded gtex cohort with shape (12827, 2629)
2024-03-26 20:55:50 finishing initialization
reduce valid NeoJunction from 12927 to 12927 because they are present in GTEx

Hoping this help and feel free to reach out if you need further clarification, Frank