frankligy / SNAF

Splicing Neo Antigen Finder (SNAF) is an easy-to-use Python package to identify splicing-derived tumor neoantigens from RNA sequencing data. It further leverages both deep learning and hierarchical Bayesian models to prioritize candidates for experimental validation.
MIT License

How to subset a normal control from GTEx or TCGA? #40

Open renyuan001 opened 3 months ago

renyuan001 commented 3 months ago

For example:

batch1 = gtex_ctrl_db[:,gtex_ctrl_db.var["tissue"] == "Pituitary"] batch1 View of AnnData object with n_obs × n_vars = 2476734 × 24 obs: 'mean', 'std' var: 'tissue', 'total_count' list(batch1.var["tissue"]) ['Pituitary', 'Pituitary', 'Pituitary', 'Pituitary', 'Pituitary', 'Pituitary', 'Pituitary', 'Pituitary', 'Pituitary', 'Pituitary', 'Pituitary', 'Pituitary', 'Pituitary', 'Pituitary', 'Pituitary', 'Pituitary', 'Pituitary', 'Pituitary', 'Pituitary', 'Pituitary', 'Pituitary', 'Pituitary', 'Pituitary', 'Pituitary'] batch1 View of AnnData object with n_obs × n_vars = 2476734 × 24 obs: 'mean', 'std' var: 'tissue', 'total_count'

add_control = {'gtex_ctrl':batch1}
snaf.initialize(df=df,db_dir=db_dir,binding_method='netMHCpan',software_path=netMHCpan_path,add_control=add_control)
2024-04-21 18:24:36 starting initialization
Current loaded gtex cohort with shape (59696, 2629)
Traceback (most recent call last):
  File "", line 1, in
  File "/home/ry-03/miniconda3/envs/SNAF/lib/python3.7/site-packages/snaf/init.py", line 52, in initialize
    adata = gtex_configuration(df,gtex_db,t_min,n_max,normal_cutoff, tumor_cutoff, normal_prevalance_cutoff, tumor_prevalance_cutoff, add_control)
  File "/home/ry-03/miniconda3/envs/SNAF/lib/python3.7/site-packages/snaf/gtex.py", line 65, in gtex_configuration
    assert len(set(control.var_names).intersection(tissue_dict.keys())) == 0
AssertionError

add_control = {'additional_healthy':batch1}
snaf.initialize(df=df,db_dir=db_dir,binding_method='netMHCpan',software_path=netMHCpan_path,add_control=add_control)
2024-04-21 18:43:00 starting initialization
Current loaded gtex cohort with shape (59696, 2629)
Traceback (most recent call last):
  File "", line 1, in
  File "/home/ry-03/miniconda3/envs/SNAF/lib/python3.7/site-packages/snaf/init.py", line 52, in initialize
    adata = gtex_configuration(df,gtex_db,t_min,n_max,normal_cutoff, tumor_cutoff, normal_prevalance_cutoff, tumor_prevalance_cutoff, add_control)
  File "/home/ry-03/miniconda3/envs/SNAF/lib/python3.7/site-packages/snaf/gtex.py", line 65, in gtex_configuration
    assert len(set(control.var_names).intersection(tissue_dict.keys())) == 0
AssertionError

What is the problem?

frankligy commented 3 months ago

Hi @renyuan001,

Your subsetting step is correct. The error arises because the complete GTEx database is used by default, and when users add an additional cohort, I implemented a rule that its tissue types cannot duplicate those already present, so that the later tumor_specificity calculation using MLE, which considers the tissue distribution, can function properly. Since the GTEx database already contains Pituitary and your subsetted database is also Pituitary, the assertion error is thrown.

It is easy to work around this by adding a suffix to the tissue labels of your subsetted (or any other) control cohort. In your case, I would do:

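# .copy() turns the view into a standalone AnnData object so that .var can be modified safely
batch1 = gtex_ctrl_db[:,gtex_ctrl_db.var["tissue"] == "Pituitary"].copy()
# relabel the tissue so it no longer matches the default GTEx "Pituitary"
batch1.var['tissue'] = [item+'_customized' for item in batch1.var['tissue']]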

Then the AssertionError will go away.
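To make the overlap rule and the suffix workaround concrete, here is a minimal, library-free sketch. It is not SNAF's code: `find_label_overlap` and `add_suffix` are hypothetical helpers, and plain lists stand in for the tissue column. Note also that the assertion in the traceback compares `control.var_names` with `tissue_dict.keys()`, so if the error persists after relabelling the tissues, applying the same kind of suffix to `var_names` may be worth trying; that is an untested assumption, not something confirmed in this thread.

```python
def find_label_overlap(existing_labels, new_labels):
    """Return the sorted labels that appear in both collections."""
    return sorted(set(existing_labels) & set(new_labels))


def add_suffix(labels, suffix="_customized"):
    """Return a new list with the suffix appended to every label."""
    return [label + suffix for label in labels]


# Stand-ins (made up): labels already in the default database
# versus labels of the subsetted control cohort.
default_labels = ["Pituitary", "Liver", "Lung"]
control_labels = ["Pituitary", "Pituitary"]

# Before relabelling, the two sets share a label, which is the
# situation that a "no overlap" assertion rejects.
overlap_before = find_label_overlap(default_labels, control_labels)

# After relabelling, nothing is shared any more.
renamed_labels = add_suffix(control_labels)
overlap_after = find_label_overlap(default_labels, renamed_labels)

print(overlap_before)  # ['Pituitary']
print(overlap_after)   # []
```

Running a check like this before calling `snaf.initialize` shows up front whether an added cohort would collide with the default database.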

If you'd like to completely turn off the gtex database, you can refer to this issue as well (https://github.com/frankligy/SNAF/issues/37).

If there's more customized usage you'd like to achieve, feel free to reach out!

Best, Frank

renyuan001 commented 3 months ago

Thank you for your explanation. This is indeed a good tool. When we analyze tumor RNA-seq FASTQ data, filtering the candidate neoantigens against the whole set of normal tissues (such as gtex_ctrl_db and tcga_ctrl_db) may be better than filtering against only our own custom control samples, because the resulting neoantigens may be safer once the related TCR-T cells are infused back into the patient. Is that right?

frankligy commented 3 months ago

That's correct. For DNA mutations, all tissues share the same DNA, so we only need to filter against germline WGS to confirm that a target will be safe for the patient.

But for gene expression or RNA splicing antigens, we need to make sure the splicing junction is not highly present in normal tissues, as RNA is expressed in a more tissue-specific manner.

That's why we compiled a large compendium of normal samples as a database to filter out such splicing events, and why we allow users to append as many additional normal cohorts as they like to enhance it.
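As a rough illustration of this filtering idea, the sketch below flags a junction as unsafe when too large a fraction of normal samples express it above a read-count cutoff. This is not SNAF's actual scoring (which, as noted earlier in the thread, also involves an MLE-based tumor specificity); the function names, cutoffs, and counts are all made up for illustration.

```python
def normal_prevalence(counts, count_cutoff):
    """Fraction of normal samples with a read count of at least count_cutoff."""
    if not counts:
        raise ValueError("need at least one normal sample")
    expressed = sum(1 for c in counts if c >= count_cutoff)
    return expressed / len(counts)


def passes_normal_filter(counts, count_cutoff=5, prevalence_cutoff=0.1):
    """True if the junction is rare enough in normal samples to keep."""
    return normal_prevalence(counts, count_cutoff) <= prevalence_cutoff


# Hypothetical junction read counts across ten normal samples.
rare_in_normal = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
common_in_normal = [0, 12, 30, 0, 8, 0, 0, 25, 0, 0]

print(passes_normal_filter(rare_in_normal))    # True
print(passes_normal_filter(common_in_normal))  # False

# Appending an extra (hypothetical) normal cohort in which the junction
# is expressed can overturn a decision made on the smaller database.
extended_normal = rare_in_normal + [20, 15, 9]
print(passes_normal_filter(extended_normal))   # False
```

The last case is the point of appending more normal cohorts: a junction that looks tumor-specific against a small normal set can turn out to be expressed in tissues that set did not cover.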

Best, Frank