frankligy / SNAF

Splicing Neo Antigen Finder (SNAF) is an easy-to-use Python package to identify splicing-derived tumor neoantigens from RNA sequencing data. It further leverages both deep learning and hierarchical Bayesian models to prioritize certain candidates for experimental validation.
MIT License

KeyError: "None of ['query'] are in the columns" #42

Closed renyuan001 closed 3 months ago

renyuan001 commented 3 months ago

import os,sys
import pandas as pd
import numpy as np
import anndata as ad
import snaf

df = pd.read_csv('/home/ry-03/data/SNAF/altanalyze_output/ExpressionInput/counts.original.pruned.txt',index_col=0,sep='\t')
db_dir = '/home/ry-03/data/SNAF/data'
netMHCpan_path = '/home/ry-03/data/SNAF/netMHCpan-4.1/netMHCpan'
tcga_ctrl_db = ad.read_h5ad(os.path.join(db_dir,'controls','tcga_matched_control_junction_count.h5ad'))
gtex_ctrl_db = ad.read_h5ad(os.path.join(db_dir,'controls','GTEx_junction_counts.h5ad'))
add_control = {'tcga_control':tcga_ctrl_db}

snaf.initialize(df=df,db_dir=db_dir,binding_method='netMHCpan',software_path=netMHCpan_path,add_control=add_control)
2024-05-12 20:09:18 starting initialization
Current loaded gtex cohort with shape (56692, 2629)
Adding cohort tcga_control with shape (54813, 705) to the database
now the shape of control db is (56999, 3334)
2024-05-12 20:10:10 finishing initialization

jcmq = snaf.JunctionCountMatrixQuery(junction_count_matrix=df,cores=40,add_control=add_control,outdir='result')
reduce valid NeoJunction from 57300 to 9158 because they are present in GTEx
reduce valid Neojunction from 9158 to 7200 because they are present in added control tcga_control

sample_to_hla = pd.read_csv('sample_hla.txt',sep='\t',index_col=0)['hla'].to_dict()
hlas = [hla_string.split(',') for hla_string in df.columns.map(sample_to_hla)]

jcmq.run(hlas=hlas,outdir='./result')
junction_count_matrix: (57300, 66)
cores: 30
valid: 7200
invalid: 50100
cond_df: (57300, 66)
subset: (7200, 66)
translated: list of 7200 nj objects
cond_subset_df: (7200, 66)
results: list of length 2

snaf.JunctionCountMatrixQuery.generate_results(path='./result/after_prediction.p',outdir='./result')
adding gene symbol
Traceback (most recent call last):
  File "", line 1, in
  File "/home/ry-03/miniconda3/envs/SNAF/lib/python3.7/site-packages/snaf/snaf.py", line 475, in generate_results
    enhance_frequency_table(df,True,True,outdir,'frequency_stage{}_verbosity1_uid_gene_symbol_coord_mean_mle.txt'.format(stage))
  File "/home/ry-03/miniconda3/envs/SNAF/lib/python3.7/site-packages/snaf/snaf.py", line 1422, in enhance_frequency_table
    df = add_gene_symbol_frequency_table(df=df,remove_quote=remove_quote)
  File "/home/ry-03/miniconda3/envs/SNAF/lib/python3.7/site-packages/snaf/downstream.py", line 894, in add_gene_symbol_frequency_table
    symbol_list = ensemblgene_to_symbol(ensg_list,'human')
  File "/home/ry-03/miniconda3/envs/SNAF/lib/python3.7/site-packages/snaf/downstream.py", line 922, in ensemblgene_to_symbol
    out = mg.querymany(query,scopes='ensemblgene',fileds='symbol',species=species,returnall=True,as_dataframe=True,df_index=True)
  File "/home/ry-03/miniconda3/envs/SNAF/lib/python3.7/site-packages/biothings_client/base.py", line 599, in _querymany
    out = self._dataframe(out, dataframe, df_index=df_index)
  File "/home/ry-03/miniconda3/envs/SNAF/lib/python3.7/site-packages/biothings_client/base.py", line 172, in _dataframe
    df = df.set_index("query")
  File "/home/ry-03/miniconda3/envs/SNAF/lib/python3.7/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/ry-03/miniconda3/envs/SNAF/lib/python3.7/site-packages/pandas/core/frame.py", line 5451, in set_index
    raise KeyError(f"None of {missing} are in the columns")
KeyError: "None of ['query'] are in the columns"

[1] Made sure the netMHCpan path is set correctly:
netMHCpan_path = '/home/ry-03/data/SNAF/netMHCpan-4.1/netMHCpan'

[2] Made sure the HLA allele format is correct:
sample     hla
LYB.bed    HLA-A24:02,HLA-A11:01,HLA-B40:01,HLA-B15:11,HLA-C03:03,HLA-C15:02

The column names of df are in the same order as the row names of sample_hla.txt.

The error still occurs, and these output files are empty: burden_stage3.txt, frequency_stage3.txt, x_neoantigen_frequency_stage3.pdf, x_occurence_frequency_stage3.pdf.
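A note on the traceback for anyone landing here: the last frame fails in df.set_index("query"), which is the error pandas raises when the DataFrame has no "query" column at all. One plausible reading, given that the stage 3 files are empty, is that the list of gene IDs sent to querymany was empty, so there were no result rows to build that column from. This is an assumption and is not confirmed anywhere in the thread. A minimal sketch of the pandas behavior, with a hypothetical guard (safe_set_query_index is not part of SNAF or biothings_client, and the record below is made up for illustration):

```python
import pandas as pd

def safe_set_query_index(records):
    """Hypothetical helper: build a DataFrame from query results and index it
    by 'query' only when that column exists; otherwise return an empty frame."""
    df = pd.DataFrame(records)
    if "query" not in df.columns:
        # No results at all: skip set_index instead of raising KeyError.
        return pd.DataFrame()
    return df.set_index("query")

# An empty result list triggers the same KeyError as in the traceback.
try:
    pd.DataFrame([]).set_index("query")
except KeyError as err:
    print(err)

# Made-up record, for illustration only.
print(safe_set_query_index([{"query": "ENSG00000000001", "symbol": "GENE1"}]))
print(safe_set_query_index([]).empty)
```

If this reading is right, the KeyError is a symptom rather than the cause, and the question becomes why no neoantigens reached the frequency table.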

frankligy commented 3 months ago

Hi @renyuan001,

Based on the information you provided, it seems that the HLA alleles are missing a "*"; see this section in the tutorial (https://snaf.readthedocs.io/en/latest/tutorial.html#identify-mhc-bound-neoantigens-t-antigen). Let me know if that solves the problem. If not, we can dig further.
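A quick way to check this before calling jcmq.run is to scan the sample-to-HLA mapping for alleles without the "*" and for samples with no entry. The sketch below is only illustrative: find_bad_hla and HLA_PATTERN are hypothetical names, the regular expression is an assumed approximation of the expected format (gene A, B or C, an asterisk, then two colon-separated numeric fields), and the shortened allele strings are made up.

```python
import re

# Assumed pattern, e.g. HLA-A*24:02; not taken from the SNAF source code.
HLA_PATTERN = re.compile(r"^HLA-[ABC]\*\d{2,3}:\d{2,3}$")

def find_bad_hla(sample_to_hla, samples):
    """Return {sample: problem} for samples that are missing from the mapping
    or that carry alleles not matching HLA_PATTERN."""
    problems = {}
    for sample in samples:
        hla_string = sample_to_hla.get(sample)
        if not isinstance(hla_string, str):
            # Covers both absent keys and NaN values from a failed lookup.
            problems[sample] = "no HLA entry"
            continue
        bad = [a for a in hla_string.split(",") if not HLA_PATTERN.match(a.strip())]
        if bad:
            problems[sample] = "unexpected format: " + ",".join(bad)
    return problems

# Illustrative mapping; in the script above it would come from sample_hla.txt.
mapping = {"LYB.bed": "HLA-A*24:02,HLA-A*11:01", "FF.bed": "HLA-A24:02,HLA-A*02:07"}
print(find_bad_hla(mapping, ["LYB.bed", "FF.bed", "ZCZ.bed"]))
```

In the original script this could be called as find_bad_hla(sample_to_hla, df.columns), and an empty dictionary would mean every column has a well-formed entry.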

Best, Frank

renyuan001 commented 3 months ago

I checked sample_hla.txt:
sample         hla
LYB.bed        HLA-A24:02,HLA-A11:01,HLA-B40:01,HLA-B15:11,HLA-C03:03,HLA-C15:02
FF.bed         HLA-A24:02,HLA-A02:07,HLA-B46:01,HLA-B40:01,HLA-C03:04,HLA-C01:02
ZCZ.bed        HLA-A26:01,HLA-A24:02,HLA-B35:01,HLA-B59:01,HLA-C03:03,HLA-C01:02
FYW.bed        HLA-A02:01,HLA-A30:01,HLA-B58:01,HLA-B51:01,HLA-C07:04,HLA-C03:02
LLM.bed        HLA-A33:03,HLA-A30:01,HLA-B44:03,HLA-B58:01,HLA-C14:03,HLA-C03:02
WXM.bed        HLA-A33:03,HLA-A02:03,HLA-B38:02,HLA-B13:01,HLA-C07:02,HLA-C03:04
tumor03.bed    HLA-A30:01,HLA-A03:01,HLA-B35:01,HLA-B13:02,HLA-C06:02,HLA-C04:01

renyuan001 commented 3 months ago

When I submit the content here, the "*" characters are not displayed, but they are included in sample_hla.txt.

frankligy commented 3 months ago

Hi @renyuan001

Are you saying that the asterisk is in the sample file?

If so, is there any chance that netMHCpan was not set up properly? I don't know whether you can access the YouTube video I recorded on setting up netMHCpan (https://www.youtube.com/watch?v=KrAzbR5mRIQ). Basically, make sure you download the /data folder and modify your netMHCpan script accordingly.
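For a first sanity check from Python, something like the sketch below can confirm that the path passed as software_path points to an executable file and that a data folder is present. check_netmhcpan is a hypothetical helper, and it assumes the downloaded data folder sits in the same directory as the netMHCpan script; adjust that if your layout differs. It does not verify the edits inside the script itself.

```python
import os

def check_netmhcpan(netmhcpan_path):
    """Return a list of human-readable problems found with the given path."""
    problems = []
    if not os.path.isfile(netmhcpan_path):
        problems.append("script not found: " + netmhcpan_path)
        return problems
    if not os.access(netmhcpan_path, os.X_OK):
        problems.append("script is not executable")
    # Assumption: the downloaded data folder sits next to the script.
    data_dir = os.path.join(os.path.dirname(netmhcpan_path), "data")
    if not os.path.isdir(data_dir):
        problems.append("data folder not found: " + data_dir)
    return problems

# Path taken from the report above; an empty list means no problem was detected.
print(check_netmhcpan("/home/ry-03/data/SNAF/netMHCpan-4.1/netMHCpan"))
```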

Your netMHCpan 4.1 folder:

Screenshot 2024-05-16 at 1 58 36 PM

The netMHCpan script:

Screenshot 2024-05-16 at 2 02 06 PM

Let me know if that solves the problem.

Best, Frank

renyuan001 commented 3 months ago

Annotation 2024-05-17 023018, Annotation 2024-05-17 023129, Annotation 2024-05-17 023209, Annotation 2024-05-17 023422

frankligy commented 3 months ago

Hi @renyuan001,

It does seem that you did everything correctly. Would you be comfortable sharing the counts file and the sample_hla file (I guess it's the same as the one you showed here) with me, so that I can test it on my end?

You can email me at guangyuan.li@nyulangone.org if the data is meant to be private.

Best, Frank