frankligy / SNAF

Splicing Neo Antigen Finder (SNAF) is an easy-to-use Python package to identify splicing-derived tumor neoantigens from RNA sequencing data, it further leverages both deep learning and hierarchical Bayesian models to prioritize certain candidates for experimental validation
MIT License
38 stars 8 forks source link

None of ['query'] are in the columns #32

Closed haoqing12 closed 6 months ago

haoqing12 commented 6 months ago

hi, your tool is wonderful and your tutorials are very detailed. However, I encountered this error when running the T antigen workflow.

In [14]: snaf.JunctionCountMatrixQuery.generate_results(path='./result/after_prediction.p',outdir='./result')
...:

adding gene symbol


KeyError Traceback (most recent call last)

in ----> 1 snaf.JunctionCountMatrixQuery.generate_results(path='./result/after_prediction.p',outdir='./result') ~/anaconda3/envs/SNAF/lib/python3.7/site-packages/snaf/snaf.py in generate_results(path, outdir, criterion) 454 # add additional attributes 455 df = pd.read_csv(os.path.join(outdir,'frequency_stage{}_verbosity1_uid.txt'.format(stage)),sep='\t',index_col=0) --> 456 enhance_frequency_table(df,True,True,outdir,'frequency_stage{}_verbosity1_uid_gene_symbol_coord_mean_mle.txt'.format(stage)) 457 # report candidates 458 if stage == 3: ~/anaconda3/envs/SNAF/lib/python3.7/site-packages/snaf/snaf.py in enhance_frequency_table(df, remove_quote, save, outdir, name) 1401 ''' 1402 print('adding gene symbol') -> 1403 df = add_gene_symbol_frequency_table(df=df,remove_quote=remove_quote) 1404 print('adding chromosome coordinates') [0/1885] 1405 df = add_coord_frequency_table(df,remove_quote=False) ~/anaconda3/envs/SNAF/lib/python3.7/site-packages/snaf/downstream.py in add_gene_symbol_frequency_table(df, remove_quote) 892 df['samples'] = [literal_eval(item) for item in df['samples']] 893 ensg_list = [item.split(',')[1].split(':')[0] for item in df.index] --> 894 symbol_list = ensemblgene_to_symbol(ensg_list,'human') 895 df['symbol'] = symbol_list 896 return df ~/anaconda3/envs/SNAF/lib/python3.7/site-packages/snaf/downstream.py in ensemblgene_to_symbol(query, species) 920 import mygene 921 mg = mygene.MyGeneInfo() --> 922 out = mg.querymany(query,scopes='ensemblgene',fileds='symbol',species=species,returnall=True,as_dataframe=True,df_index=True) 923 924 df = out['out'] ~/anaconda3/envs/SNAF/lib/python3.7/site-packages/biothings_client/base.py in _querymany(self, qterms, scopes, **kwargs) 597 598 if dataframe: --> 599 out = self._dataframe(out, dataframe, df_index=df_index) 600 li_dup_df = DataFrame.from_records(li_dup, columns=["query", "duplicate hits"]) 601 li_missing_df = DataFrame(li_missing, columns=["query"]) ~/anaconda3/envs/SNAF/lib/python3.7/site-packages/biothings_client/base.py in _dataframe(obj, dataframe, df_index) 170 df = DataFrame.from_dict(obj) 171 if df_index: --> 172 df = df.set_index("query") 173 return df 174 ~/anaconda3/envs/SNAF/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs) 309 stacklevel=stacklevel, 310 ) --> 311 return func(*args, **kwargs) 312 313 return wrapper ~/anaconda3/envs/SNAF/lib/python3.7/site-packages/pandas/core/frame.py in set_index(self, keys, drop, append, inplace, verify_integrity) 5449 5450 if missing: -> 5451 raise KeyError(f"None of {missing} are in the columns") 5452 5453 if inplace: KeyError: "None of ['query'] are in the columns" ================================ now, the result dir have these files: - after_prediction.p - burden_stage0.txt - burden_stage3.txt - frequency_stage0.txt - frequency_stage3.txt - frequency_stage3_verbosity1_uid.txt - NeoJunction_statistics_maxmin.txt - x_neoantigen_frequency_stage3.pdf - x_occurence_frequency_stage3.pdf Any help you can provide would be greatly appreciated.
frankligy commented 6 months ago

Hi @haoqing12,

Thanks for reaching out! This issue has been brought up before (https://github.com/frankligy/SNAF/issues/25), and as I illustrated in that thread, this error basically says the whole neoantigen prediction step fails, and all the stage3 files are actually empty dataframe, that's why there's a pandas error being throwing out.

In terms of why the prediction would fail, currently two things I observed are:

[1] make sure netMHCpan path is set correctly [2] make sure HLA allele format is correct

I have more detailed explanation in issue 25 (the link I shared above).

Would you mind checking if either of these apply to your case? And let me know if not, then we will dig further into that.

Best, Frank

haoqing12 commented 6 months ago

Thank you very much, I ignored the ‘HLA-’ prefix. Now, it‘s working.

But, I met a new error.

Traceback (most recent call last): File "", line 1, in File "/home/haoq/anaconda3/envs/SNAF/lib/python3.7/site-packages/snaf/snaf.py", line 431, in run self.parallelize_run(kind=3,hlas=hlas) File "/home/haoq/anaconda3/envs/SNAF/lib/python3.7/site-packages/snaf/snaf.py", line 499, in parallelize_run result = collect.get() File "/home/haoq/anaconda3/envs/SNAF/lib/python3.7/multiprocessing/pool.py", line 657, in get raise self._value Exception: immunogenic prediction error: CUDA runtime implicit initialization on GPU:0 failed. Status: out of memory

Have you experienced this problem?

frankligy commented 6 months ago

Hi @haoqing12,

I never ran SNAF or DeepImmuno on GPU node, because DeepImmuno is quite lightweight and can finish on CPU in seconds. I think the issue here is either since SNAF is memory-intensive (I am not proud of that), the GPU you were running doesn't have that much cpu memory, or you run DeepImmuno on GPU and it exceeds the GPU memory (I never tested how much GPU memory It will need if ran on GPU).

I guess my suggestion is to run on a CPU node and see if the issue will go away, in terms of CPU memory, I did a test for another issue brought up (https://github.com/frankligy/SNAF/issues/27) using netMHCpan. This can serve as a guidance for memory setup.

Let me know if the issue persists!

Best, Frank

haoqing12 commented 6 months ago

I am immensely grateful for your invaluable guidance. I've run through the whole pipeline, except for the B-cell antigen prediction.

Following your suggestion, I solved the memory problem by running the pipeline on a CPU node and adjusting the parameter t_min to 200.

However, I still have a few minor questions:

1, Interpretation of result/T_candidates/T_antigen_candidates_all.txt: I am curious about the 'in_db' column. Unfortunately, I couldn't find an explanation in the Troubleshooting section. Is it advisable to select only peptides where 'in_db' is True?

2, If I want to directly open a job that has already been run, what file do I need to load. when I ran visualizing(), I got -- NameError: name 'dict_exonCoords' is not defined. when I ran gtex_visual_combine_plotly(), I got -- NameError: name ' adata' is not defined

3, The detailed sequences about specific uid. I am eager to obtain novel amino acid/nucleotide sequences based on the UID for more detailed plotting and other predictions. Could you kindly direct me on the steps or files required for this task?

Thank you very, very much for your time and continuous support. I truly appreciate your dedication to assisting others.

Best regards, haoq

frankligy commented 6 months ago

Hi @haoqing12,

Glad you are able to get it to run and these are all good questions, I'll add the explanations to the doc as well.

[1] in_db means whether this splicing junction is present in one of the documented isoforms in the large database. We do not recommend filter by that, because a lot of isoforms documented so far are actually cancer or disease specific isoform. What we would suggest for those labelled as "in_db", is to go to the GTEx portal, look for exon/junction/isoform expression to see whether the transcript carrying this junction is actually expressed in normal tissue or not. But since SNAF has already filtered the junction based on their expression level in GTEx, most of the case, these are not highly detected in normal tissue.

[2] Sorry for not explaining that clearly, the issue is we still need to run the snaf.initialize() step to load in all the required global variables, so append the visualization code after, and because initiation takes some time, so I would recommend to write a list for all the Neojunction you'd like visualize and use a for loop, so you just need to initialize once and plot all:

# read in the splicing junction matrix
df = pd.read_csv('/user/ligk2e/altanalyze_output/ExpressionInput/counts.original.pruned.txt',index_col=0,sep='\t')

# database directory (where you extract the reference tarball file) and netMHCpan folder
db_dir = '/user/ligk2e/download'
netMHCpan_path = '/user/ligk2e/netMHCpan-4.1/netMHCpan'

# demonstrate how to add additional control database, see below note for more
tcga_ctrl_db = ad.read_h5ad(os.path.join(db_dir,'controls','tcga_matched_control_junction_count.h5ad'))
gtex_skin_ctrl_db = ad.read_h5ad(os.path.join(db_dir,'controls','gtex_skin_count.h5ad'))
add_control = {'tcga_control':tcga_ctrl_db,'gtex_skin':gtex_skin_ctrl_db}

# initiate
snaf.initialize(df=df,db_dir=db_dir,binding_method='netMHCpan',software_path=netMHCpan_path,add_control=add_control)

# visualize
jcmq = snaf.JunctionCountMatrixQuery.deserialize('result/after_prediction.p')
jcmq.visualize(uid='ENSG00000167291:E38.6-E39.1',sample='TCGA-DA-A1I1-06A-12R-A18U-07.bed',outdir='./result')

# interactive
snaf.gtex_visual_combine_plotly(uid=uid,outdir='result_new/common',norm=False,tumor=df)
# static
dff = snaf.gtex_visual_combine(uid=uid,outdir='Frank_inspection',norm=False,tumor=df,group_by_tissue=False)

[3] Let me know if what I am suggesting is not you were asking for, but if you are interesting in the flanking sequence, you can take a look at this issue (https://github.com/frankligy/SNAF/issues/31), if you are interested in obtaining the whole isoform sequence associated with Junction, you can refer to this issue (https://github.com/frankligy/SNAF/issues/22). Or if you have more customized demand, you can always take the chromosome coordinates reported in the result and query it in UCSC genome browser. If none of above is what you asked, let me know and I can further clarify.

Thank you, Frank

haoqing12 commented 6 months ago

Thank you again. You have given me two issues that are all I want. However, I am having some problems/concerns in the process:

截屏2024-03-04 21 29 43

for example "ENSG00000151632:E1.3-E5.1" From my perspective, the alternative splicing of AKR1C2 jumps directly from exon 1 unconventional splice site (E1.3) to exon 5. The junction locus on the chromosome is 5010653 (left) and 5017900 (right). I don't know if my understanding is correct.

issue#31, In your opinion, if the flanking sequence contains *, we will drop it.

截屏2024-03-04 21 31 59

But I found that the "FLHDVGRVL" presented in the result/T_candidates/T_antigen_candidates_all.txt belongs to the translation for phase = 2, and if I set MAX_PEP_LEN to 15, I find that this translation also contains *. MAX_PEP_LEN = 9 hides this codon.

issue#22, I got sr_ffl_str3_report_None_False.txt file. But I didn't find consistent UIDs in T_candidates/T_antigen_candidates_all.txt and sr_ffl_str3_report_None_False.txt. Is it because sr_ffl_str3_report_None_False can only get the B cell surface antigen's full length, not T-cell's?

Best regards, haoq

frankligy commented 6 months ago

Hi @haoqing12,

Both of your observations are correct and I can clarify:

[1] SNAF in silico translation is slightly different from the code I provided to (https://github.com/frankligy/SNAF/issues/31), that was meant to be a quick solution to check the flanking sequence. SNAF in silico translation rules are illustrated below using the specific example you brought up with:

Screenshot 2024-03-04 at 11 05 28 AM

That explains why the peptide FLHDVGRVL got predicted, as illustrated below, because E1.3 happens to not have stop codon:

Screenshot 2024-03-04 at 11 07 05 AM

But of course you will wonder, then how do you explain the preceding stop codon associated with this translation phase? My rationale (feel free to disagree) to include these in the output bases on the fact that a lot of tumor antigens are deriving from non-canonical short ORF with non-ATG start codon, I want to point you to a recently published paper (https://www.nature.com/articles/s41467-024-46240-9 ) and if you navigate to the supplemental figure 8, they showed a few concrete examples where such cryptic ORF exist from RiboSeq data. Similar evidence can be found in the nuORF paper (https://pubmed.ncbi.nlm.nih.gov/34663921/). Another evidence for that type of translation is this paper (https://pubmed.ncbi.nlm.nih.gov/18292810/), see figure6 below (this peptide is preceded by stop codon in that translation phase) and noting this peptide has gone to phase I clinical trial. (https://ascopubs.org/doi/10.1200/JCO.2024.42.4_suppl.435):

Screenshot 2024-03-04 at 11 12 50 AM

One thing I think we have to know is, an antigen doesn't need to be a functional stable protein to be able to presented by HLA, and the fact is it may be exact opposite, unstable degradable short peptides will rapidly be lysed by proteosome, which is exactly the mechanisms for presenting HLA-I presented peptides. So these peptides won't be detected using proteomes because they don't exist but will be detected in HLA-I pull-down immunopeptidome (https://aacrjournals.org/cancerimmunolres/article/8/8/1018/470266/Identification-of-the-Cryptic-HLA-I).

[2] For this issue, I think the reason is only subset of NeoJunction whose full-length isoform can be inferred from the present database and pass certain criteria (no NMD, transcript is annotated as protein-coding etc). For instance, the very example you showed, probably won't be able to have a precise full-length isoform prediction by SNAF because I can not see overlapping documented transcripts with that junction on UCSC genome browser. It doesn't necessarily mean no full-length isoform exist but just the current database is limited, full-length long-read sequencing may provide more rich dataset for us to better annotate such junctions.

Best, Frank

haoqing12 commented 6 months ago

Thank you very much for your explanation, especially the fact that stabilizing proteins are not necessary for antigen presentation, which has enlightened me. Regarding indel, I usually only chose NMD-escaped mutations as a source of neoantigens previously. Now I need to revisit this.

For SNAF, I have no more questions, and this issue can be closed.

Best regards, haoq