frankligy / SNAF

Splicing Neo Antigen Finder (SNAF) is an easy-to-use Python package to identify splicing-derived tumor neoantigens from RNA sequencing data. It further leverages both deep learning and hierarchical Bayesian models to prioritize candidates for experimental validation.
MIT License

Making customized control #34

Open thestarocean opened 4 months ago

thestarocean commented 4 months ago

SNAF is amazing in how it integrates the entire neoantigen identification workflow. However, I wonder if there is a way to create a custom control dataset? It would be super helpful for researchers who have their own normal-sample sequencing datasets.

frankligy commented 4 months ago

Hi @thestarocean,

Thanks for bringing this up. Yes, that's totally doable, and it is a highlight of our tool.

Let's imagine you have 50 samples as a healthy control, and the BAM files associated with each sample are located in a /bam folder. You just need to run AltAnalyze as you normally do to get the counts.pruned.txt matrix.

Now you can use the add_control argument of snaf.initialize to include as many additional controls as you like. There are two acceptable formats. If these 50 samples are from the same tissue type, the count matrix itself should suffice:

import pandas as pd
import snaf

# healthy-control junction count matrix produced by AltAnalyze
df_healthy = pd.read_csv('path/to/counts.pruned.txt',sep='\t',index_col=0)
add_control = {'additional_healthy':df_healthy}
snaf.initialize(df=your_tumor_df,db_dir=db_dir,binding_method='netMHCpan',software_path=netMHCpan_path,add_control=add_control)
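
Before passing your own matrix in, it may be worth checking how many junctions it shares with the tumor matrix. Below is a minimal sketch of such a check, assuming both matrices are junctions x samples with junction IDs as the index. The helper check_control_overlap, the toy junction IDs and the toy counts are hypothetical and not part of SNAF:

```python
import pandas as pd

def check_control_overlap(df_tumor, df_healthy):
    """Hypothetical helper (not part of SNAF): summarize junction overlap."""
    shared = df_tumor.index.intersection(df_healthy.index)
    n_tumor = df_tumor.shape[0]
    return {
        'n_tumor': n_tumor,
        'n_healthy': df_healthy.shape[0],
        'n_shared': len(shared),
        # fraction of tumor junctions that also appear in the healthy control
        'frac_tumor_covered': len(shared) / n_tumor if n_tumor else 0.0,
    }

# toy matrices standing in for your_tumor_df and df_healthy
df_tumor = pd.DataFrame({'T1': [10, 0, 5], 'T2': [3, 7, 0]},
                        index=['junction_a', 'junction_b', 'junction_c'])
df_healthy = pd.DataFrame({'H1': [1, 2], 'H2': [0, 4]},
                          index=['junction_b', 'junction_d'])

report = check_control_overlap(df_tumor, df_healthy)
print(report)
```

A very low overlap would usually suggest that the two matrices use different junction ID conventions, which is worth sorting out before running the analysis.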

If you are working with a large-scale normal reference with mixed tissue types, we accept an AnnData object (https://anndata.readthedocs.io/en/latest/). Let's imagine you have two tissue types, with count matrices df1_liver.txt and df2_lung.txt obtained using AltAnalyze. The benefit of using AnnData in this case is that the tumor specificity inference will take the tissue distribution into account, so a more accurate tumor specificity MLE score can be calculated:

import anndata as ad
from scipy.sparse import csr_matrix
import pandas as pd

# junction count matrices (junctions x samples) from AltAnalyze, one per tissue
df_liver = pd.read_csv('df1_liver.txt',sep='\t',index_col=0)
df_lung = pd.read_csv('df2_lung.txt',sep='\t',index_col=0)
# keys adds an outer column level holding the tissue label; junctions absent from one tissue are filled with 0
df_combine = pd.concat([df_liver,df_lung],axis=1,join='outer',keys=['liver_add','lung_add']).fillna(value=0)
list_tissue_type = df_combine.columns.get_level_values(-2)  # [liver_add, liver_add, liver_add,...lung_add,lung_add]
list_sample_name = df_combine.columns.get_level_values(-1)  # [sample1,sample2...samplen]
# obs are junctions, var are samples
adata = ad.AnnData(X=csr_matrix(df_combine.values),obs=pd.DataFrame(index=df_combine.index),var=pd.DataFrame(index=list_sample_name))
adata.var['tissue'] = list_tissue_type.values

# then pass it to snaf.initialize in the same way as the count matrix
add_control = {'additional_healthy':adata}
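
To make the concatenation step easier to follow, here is a minimal pandas-only sketch on toy data. It shows how passing a list of tissue labels as keys to pd.concat produces a two-level column index from which the tissue and sample lists are read. The toy junction IDs, sample names and counts are made up for illustration, and the level names 'tissue' and 'sample' are my own addition:

```python
import pandas as pd

# toy junction x sample count matrices, one per tissue
df_liver = pd.DataFrame({'liver_s1': [5, 0], 'liver_s2': [2, 1]},
                        index=['junction_a', 'junction_b'])
df_lung = pd.DataFrame({'lung_s1': [4, 3]},
                       index=['junction_b', 'junction_c'])

# outer join keeps every junction; keys labels each block with its tissue
df_combine = pd.concat(
    [df_liver, df_lung],
    axis=1,
    join='outer',
    keys=['liver_add', 'lung_add'],
    names=['tissue', 'sample'],
).fillna(0)

# one tissue label and one sample name per column, in matching order
tissues = df_combine.columns.get_level_values('tissue').tolist()
samples = df_combine.columns.get_level_values('sample').tolist()
print(tissues)
print(samples)
print(df_combine)
```

Junctions that are only detected in one tissue end up with a count of 0 in the other, which is what the fillna call above is for.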

Happy to further clarify, Frank

thestarocean commented 3 months ago

Thank you very much for your answer!