frankligy / SNAF

Splicing Neo Antigen Finder (SNAF) is an easy-to-use Python package to identify splicing-derived tumor neoantigens from RNA sequencing data. It further leverages both deep learning and hierarchical Bayesian models to prioritize certain candidates for experimental validation.
MIT License

snaf.JunctionCountMatrixQuery.generate_results: KeyError: "None of ['query'] are in the columns" #25

Open infWang opened 6 months ago

infWang commented 6 months ago

Thank you very much for your excellent work. I encountered some errors while running my own data following the tutorial. Here is the error log:

snaf.JunctionCountMatrixQuery.generate_results(path='./zhangjiang_data/sanf_res/after_prediction.p',outdir='./zhangjiang_data/sanf_res/')

adding gene symbol
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_73240/1515722242.py in <module>
----> 1 snaf.JunctionCountMatrixQuery.generate_results(path='./zhangjiang_data/sanf_res/after_prediction.p',outdir='./zhangjiang_data/sanf_res/')

~/anaconda3/envs/snaf/lib/python3.7/site-packages/snaf/snaf.py in generate_results(path, outdir, criterion)
    454             # add additional attributes
    455             df = pd.read_csv(os.path.join(outdir,'frequency_stage{}_verbosity1_uid.txt'.format(stage)),sep='\t',index_col=0)
--> 456             enhance_frequency_table(df,True,True,outdir,'frequency_stage{}_verbosity1_uid_gene_symbol_coord_mean_mle.txt'.format(stage))
    457             # report candidates
    458             if stage == 3:

~/anaconda3/envs/snaf/lib/python3.7/site-packages/snaf/snaf.py in enhance_frequency_table(df, remove_quote, save, outdir, name)
   1395     '''
   1396     print('adding gene symbol')
-> 1397     df = add_gene_symbol_frequency_table(df=df,remove_quote=remove_quote)
   1398     print('adding chromosome coordinates')
   1399     df = add_coord_frequency_table(df,remove_quote=False)

~/anaconda3/envs/snaf/lib/python3.7/site-packages/snaf/downstream.py in add_gene_symbol_frequency_table(df, remove_quote)
    891         df['samples'] = [literal_eval(item) for item in df['samples']]
    892     ensg_list = [item.split(',')[1].split(':')[0] for item in df.index]
--> 893     symbol_list = ensemblgene_to_symbol(ensg_list,'human')
    894     df['symbol'] = symbol_list
    895     return df

~/anaconda3/envs/snaf/lib/python3.7/site-packages/snaf/downstream.py in ensemblgene_to_symbol(query, species)
    919     import mygene
    920     mg = mygene.MyGeneInfo()
--> 921     out = mg.querymany(query,scopes='ensemblgene',fileds='symbol',species=species,returnall=True,as_dataframe=True,df_index=True)
    922 
    923     df = out['out']

~/anaconda3/envs/snaf/lib/python3.7/site-packages/biothings_client/base.py in _querymany(self, qterms, scopes, **kwargs)
    597 
    598         if dataframe:
--> 599             out = self._dataframe(out, dataframe, df_index=df_index)
    600             li_dup_df = DataFrame.from_records(li_dup, columns=["query", "duplicate hits"])
    601             li_missing_df = DataFrame(li_missing, columns=["query"])

~/anaconda3/envs/snaf/lib/python3.7/site-packages/biothings_client/base.py in _dataframe(obj, dataframe, df_index)
    170                 df = DataFrame.from_dict(obj)
    171         if df_index:
--> 172             df = df.set_index("query")
    173         return df
    174 

~/anaconda3/envs/snaf/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)
    312 
    313         return wrapper

~/anaconda3/envs/snaf/lib/python3.7/site-packages/pandas/core/frame.py in set_index(self, keys, drop, append, inplace, verify_integrity)
   5449 
   5450         if missing:
-> 5451             raise KeyError(f"None of {missing} are in the columns")
   5452 
   5453         if inplace:

KeyError: "None of ['query'] are in the columns"

I would greatly appreciate your guidance whenever it is convenient for you. Thank you for your kind assistance.

frankligy commented 6 months ago

Hi @infWang,

Sorry for the inconvenience you are encountering on your end. Another user reported the exact same issue, and in her case the netMHCpan path was wrong, so netMHCpan never ran properly and the output contained no neoantigens. That is why pandas raises an error: the data frame is empty.
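The empty-frame explanation can be checked without SNAF. The sketch below is not SNAF or mygene code; `index_by_query` is a hypothetical helper that only mimics the `set_index("query")` step from the traceback, and the single record is illustrative data.

```python
import pandas as pd


def index_by_query(records):
    """Build a frame from records, then index it by the 'query' column."""
    df = pd.DataFrame(records)
    return df.set_index("query")


# With at least one record, the 'query' column exists and indexing works.
ok = index_by_query([{"query": "ENSG00000141510", "symbol": "TP53"}])

# With no records, the frame has no columns at all, so set_index raises KeyError.
try:
    index_by_query([])
    error_message = None
except KeyError as exc:
    error_message = str(exc)

print(error_message)
```

If the list of Ensembl gene IDs passed to the lookup is empty, the same thing would happen inside the client, which is consistent with the KeyError shown in the log.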

So if my guess is right, you should now have a few output files, but the burden3 file is actually empty (all zeros in the text file). In that case, would you check whether your netMHCpan path is set properly? See below for an example of an incorrect path.

# incorrect
netMHCpan_path = '/user/ligk2e/netMHCpan-4.1'

# correct
netMHCpan_path = '/user/ligk2e/netMHCpan-4.1/netMHCpan'
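
A quick way to catch this before running the prediction is to test the path yourself. This is only a sketch: `check_netmhcpan_path` is a hypothetical helper, not part of SNAF, and it checks nothing beyond whether the path is an existing, executable file rather than the installation directory.

```python
import os


def check_netmhcpan_path(path):
    """Return a short diagnosis for a candidate netMHCpan_path value."""
    if not os.path.exists(path):
        return "missing"          # nothing exists at this location
    if os.path.isdir(path):
        return "directory"        # points at the install folder, not the program
    if not os.access(path, os.X_OK):
        return "not executable"   # file exists but cannot be run
    return "ok"


# Example with the two paths above (the result depends on your own machine).
for candidate in ("/user/ligk2e/netMHCpan-4.1",
                  "/user/ligk2e/netMHCpan-4.1/netMHCpan"):
    print(candidate, "->", check_netmhcpan_path(candidate))
```

Anything other than "ok" suggests netMHCpan cannot be launched from that path.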

If this is not the issue, would you mind providing me with your code, stdout, stderr and what your current folder looks like, so I can help you debug? If you don't feel like sharing here, you can also email me directly (guangyuan.li@nyulangone.org).

Just let me know, Frank

infWang commented 6 months ago

@frankligy Thank you for your prompt response and helpful guidance.

I have checked the netMHCpan path and realized that it was indeed set incorrectly. After correcting the path as you suggested, the issue has been resolved. I appreciate your assistance in identifying the root cause of the problem.

Best regards

spvensko commented 6 months ago

Are there any other scenarios where this error may be encountered? I have gotten the same error when running both NetMHCpan and MHCflurry.

I am running on two melanoma patients from Hugo et al. 2016. SNAF reports 838 candidate neojunctions before the error:

WARNING: DEPRECATED USAGE: Forwarding SINGULARITYENV_TMPDIR as environment variable will not be supported in the future, use APPTAINERENV_TMPDIR instead
WARNING: DEPRECATED USAGE: Forwarding SINGULARITYENV_NXF_DEBUG as environment variable will not be supported in the future, use APPTAINERENV_NXF_DEBUG instead
/bin/bash: line 0: cd: /home/spvensko/dev-raft/projects/ots-splice-test/work/b7/d3c0914bde31b4c85a08b1bc440ecb: No such file or directory
Matplotlib created a temporary cache directory at /tmp/matplotlib-9qhtgorn because the default path (/home/spvensko/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
2024-01-29 16:58:36 starting initialization
Current loaded gtex cohort with shape (13908, 2629)
Adding cohort tcga_control with shape (13398, 705) to the database
now the shape of control db is (14027, 3334)
Adding cohort gtex_skin with shape (12891, 313) to the database
now the shape of control db is (14027, 3647)
2024-01-29 17:00:20 finishing initialization
reduce valid NeoJunction from 14046 to 1319 because they are present in GTEx
reduce valid Neojunction from 1319 to 877 because they are present in added control tcga_control
reduce valid Neojunction from 877 to 838 because they are present in added control gtex_skin

frankligy commented 5 months ago

Hi @spvensko,

Based on my interactions with users so far, this error seems to occur when the whole neoantigen prediction step fails, leaving an empty dataframe at the end. Since you mention that you got the error right after the filtering step, it looks like the neoantigen prediction did not work at all.

As I mentioned, one cause is the path. If you have confirmed that this is not the issue, another scenario I recently helped a user debug is the HLA format: the user wrote alleles like A*02:01 instead of HLA-A*02:01, which can cause the prediction to fail as well.
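
If your HLA calls come without the prefix, something like the following could normalize them before they are passed on. This is a minimal sketch: `normalize_hla` is a hypothetical helper, not a SNAF function, and it only adds the missing "HLA-" prefix without validating the allele itself.

```python
def normalize_hla(allele):
    """Return the allele with an 'HLA-' prefix, e.g. A*02:01 -> HLA-A*02:01."""
    allele = allele.strip()
    if allele.startswith("HLA-"):
        return allele             # already in the expected format
    return "HLA-" + allele


hlas = ["A*02:01", "HLA-B*07:02", " C*07:01 "]
normalized = [normalize_hla(h) for h in hlas]
print(normalized)
```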

If this is still not the case, would you mind sharing your code, stdout, stderr and what the result folder looks like before it errors out, so I can look into it further?

Best, Frank