Splicing Neo Antigen Finder (SNAF) is an easy-to-use Python package to identify splicing-derived tumor neoantigens from RNA sequencing data, it further leverages both deep learning and hierarchical Bayesian models to prioritize certain candidates for experimental validation
MIT License
44
stars
9
forks
source link
How to turn off GTEx control and use my own custom control samples? #37
In certain case, you'd like to not use our GTEx control but use your own custom control database, the recommended way is like below:
We recommend first put both your tumor bam files and control bam files into a /bam folder, run AltAnalyze, so that you will get a combined counts.original.pruned.txt count matrix for both your tumor and control samples.
After running that, please open this altanalyze_output/AlternativeOutput/Events-dPSI_0.0_rawp/PSI.tumor_vs_control.txt within which you can select the ones that are specific to your tumor samples and are absent to your own control, you can see ave-tumor, ave-control, dPSI, adjp to customize the selection criteria. The first three values are shown as Percent Spliced In (PSI) ranging from 0-1, 0 is not present, 1 is fully present. adjp indicate adjusted pvalue (the significance level).
Once you have a list of junctions from UID columns, you just need to take the first junction, see below:
CASK:ENSG00000147044:E21.1-E23.1|ENSG00000147044:E21.1-E22.1
gene id: CASK
junction: ENSG00000147044:E21.1-E23.1 (what you need)
background junction: ENSG00000147044:E21.1-E22.1 (the competing junction, you don't need)
These junctions can be used to subset your combined counts.orginal.pruned.txt dataframe based on row. Then you should also only include the tumor columns as the final df for the SNAF run.
Finally, how to turn off the GTEx filter? The key is to use a very large number like 10000 to name your n_max and name your t_min=-10000, then it basically will retain all your junctions
2024-03-26 20:55:28 starting initialization
Current loaded gtex cohort with shape (12827, 2629)
2024-03-26 20:55:50 finishing initialization
reduce valid NeoJunction from 12927 to 12927 because they are present in GTEx
Hoping this help and feel free to reach out if you need further clarification,
Frank
In certain case, you'd like to not use our GTEx control but use your own custom control database, the recommended way is like below:
We recommend first put both your
tumor
bam files andcontrol
bam files into a/bam
folder, runAltAnalyze
, so that you will get a combinedcounts.original.pruned.txt
count matrix for both your tumor and control samples.Now you can run the differential alternative splicing (DAS) mode for AltAnalyze, please follow this section in the tutorial, and remember, the
groups.txt
andcomps.txt
need to be in the same level of wherealtanalyze_output
folder sit, and it's advisable to make sure you are in that folder when you are running the singularity command: https://snaf.readthedocs.io/en/latest/tutorial.html#differential-gene-splicing-analysis-and-gene-enrichment-analysisAfter running that, please open this
altanalyze_output/AlternativeOutput/Events-dPSI_0.0_rawp/PSI.tumor_vs_control.txt
within which you can select the ones that are specific to your tumor samples and are absent to your own control, you can seeave-tumor
,ave-control
,dPSI
,adjp
to customize the selection criteria. The first three values are shown as Percent Spliced In (PSI) ranging from 0-1, 0 is not present, 1 is fully present.adjp
indicate adjusted pvalue (the significance level).Once you have a list of junctions from
UID
columns, you just need to take the first junction, see below:These junctions can be used to subset your combined
counts.orginal.pruned.txt
dataframe based on row. Then you should also only include the tumor columns as the finaldf
for the SNAF run.Finally, how to turn off the GTEx filter? The key is to use a very large number like
10000
to name your n_max and name yourt_min=-10000
, then it basically will retain all your junctionsAs a test:
Hoping this help and feel free to reach out if you need further clarification, Frank