frankligy / SNAF

Splicing Neo Antigen Finder (SNAF) is an easy-to-use Python package to identify splicing-derived tumor neoantigens from RNA sequencing data, it further leverages both deep learning and hierarchical Bayesian models to prioritize certain candidates for experimental validation
MIT License
44 stars 8 forks source link

When jcmq.run, error in subexon_tran #33

Open LIH-WHU opened 8 months ago

LIH-WHU commented 8 months ago

Thank you for developing this excellent tool! I am also from Wuhan, China. Congratulatations on your remarkable achievements! I encountered the following issue while following the tutorial for 'Running the T antigen workflow' and sincerely invite your assistance. Running the following code poses no issues:

#read in the splicing junction matrix
df = pd.read_csv('../09.SNAF/altanalyze_output/ExpressionInput/counts.original.full.txt',index_col=0,sep='\t')

#database directory (where you extract the reference tarball file) and netMHCpan folder
netMHCpan_path='/data/person/lab/lih/.conda/envs/SNAF/lib/R/library/netmhcpan/netMHCpan-4.1/netmhcpan'
db_dir = '../09.SNAF/data/'

#initiate
snaf.initialize(df=df,db_dir=db_dir,binding_method='netMHCpan',software_path=netMHCpan_path)

#Running the T antigen workflow
jcmq = snaf.JunctionCountMatrixQuery(junction_count_matrix=df,cores=30,outdir='09.result')
sample_to_hla = pd.read_table('hla.csv',sep=',',index_col=0)['hla'].to_dict()
hlas = [hla_string.split(',') for hla_string in df.columns.map(sample_to_hla)]

Next, proceed with the following step:

jcmq.run(hlas=hlas,outdir='./09.result')

The error is as follows:

---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/data/person/lab/lih/.conda/envs/SNAF/lib/python3.7/site-packages/snaf/snaf.py", line 1333, in subexon_tran
    attrs = dict_exonCoords[EnsID][subexon]  # [chr,strand,start,end,suffer]
KeyError: 'I2.1_50949267'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/person/lab/lih/.conda/envs/SNAF/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/data/person/lab/lih/.conda/envs/SNAF/lib/python3.7/site-packages/snaf/snaf.py", line 360, in each_chunk_func
    nj.retrieve_junction_seq()
  File "/data/person/lab/lih/.conda/envs/SNAF/lib/python3.7/site-packages/snaf/snaf.py", line 1025, in retrieve_junction_seq
    seq1 = subexon_tran(subexon1,ensid,'site1',code)
  File "/data/person/lab/lih/.conda/envs/SNAF/lib/python3.7/site-packages/snaf/snaf.py", line 1383, in subexon_tran
    exon_seq = query_from_dict_fa(suffix,attrs[3],EnsID,attrs[1])
  File "/data/person/lab/lih/.conda/envs/SNAF/lib/python3.7/site-packages/snaf/snaf.py", line 1249, in query_from_dict_fa
    start = int(dict_fa[EnsID][1])
KeyError: 'ENSG00000000419'
"""

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_93207/198601986.py in <module>
----> 1 jcmq.run(hlas=hlas,outdir='./09.result')

/data/person/lab/lih/.conda/envs/SNAF/lib/python3.7/site-packages/snaf/snaf.py in run(self, hlas, strict, outdir, name)
    427             jcmq.run(hlas=hlas,outdir='result',name='after_prediction.p')
    428         '''
--> 429         self.parallelize_run(kind=1,strict=strict)
    430         print(self)
    431         self.parallelize_run(kind=3,hlas=hlas)

/data/person/lab/lih/.conda/envs/SNAF/lib/python3.7/site-packages/snaf/snaf.py in parallelize_run(self, kind, hlas, strict)
    486             results = []
    487             for collect in r:
--> 488                 result = collect.get()
    489                 results.extend(result)
    490             self.translated = results

/data/person/lab/lih/.conda/envs/SNAF/lib/python3.7/multiprocessing/pool.py in get(self, timeout)
    655             return self._value
    656         else:
--> 657             raise self._value
    658 
    659     def _set(self, i, obj):

KeyError: 'ENSG00000000419'

Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/data/person/lab/lih/.conda/envs/SNAF/lib/python3.7/site-packages/snaf/dash_app/app.py", line 20, in clear_assets
    imgs = os.listdir('assets')
FileNotFoundError: [Errno 2] No such file or directory: 'assets'

I have checked the counts.original.full.txt file. It look like this:

B1S2.bed    B1S1.bed
ENSG00000000003:E11.1-E13.1 1233    1960
ENSG00000000003:E13.1_100629632-E13.3_100629288 18  0
ENSG00000000003:E13.1_100629716-E13.3_100629164 0   15
ENSG00000000003:E13.1_100629717-E13.3_100629330 0   6
ENSG00000000003:E13.1_100629744-E13.3_100629306 0   15
ENSG00000000003:E13.1_100629833-E13.1_100629800 9   1
ENSG00000000003:E13.1_100629872-E13.1_100629794 11  11
ENSG00000000003:E13.1_100629895-E13.1_100629849 11  13
ENSG00000000003:E13.3_100629513-E13.3_100629133 6   0
ENSG00000000003:E13.4-I13.1 20  0
ENSG00000000003:E3.4-E4.1   431 701
ENSG00000000003:E4.1-E5.1   493 657
ENSG00000000003:E5.1-E6.1   546 538
ENSG00000000003:E6.1-E7.1   267 248
ENSG00000000003:E7.1_100633495-E13.3_100629239  7   0
ENSG00000000003:E7.1_100633525-E13.1_100629779  7   0
ENSG00000000003:E7.2-E11.1  0   5
ENSG00000000003:E7.2-E9.1   247 368
ENSG00000000003:E9.2-E11.1  375 562
......

If I remove the line ENSG00000000419:I2.1_50949267-I2.1_50948941 0 22 the code can continue to run. However, it will encounter a similar error and get stuck at the line ENSG00000000419:I5.1_50945529-I5.1_50944551 29 59. Approximately, every 200 lines, there is one line that causes the jcmq.runto fail. The following lines also lead to similar issues:

ENSG00000000419:I2.1_50949267-I2.1_50948941 0   22
ENSG00000000419:I5.1_50945529-I5.1_50944551 29  59
ENSG00000001167:E10.1_41097627-E10.2_41099783   4   33
ENSG00000001167:E10.1_41097681-E10.1_41097777   0   23
ENSG00000001167:E10.2_41098898-ENSG00000161912:I4.1_41102182    31  0
ENSG00000001497:E13.1-U15.1_65499159    4   57
ENSG00000001497:E14.1-U15.1_65499159    11  0
ENSG00000001497:E13.1_65518042-E15.1_65512891   0   23
......

I would like to ask for your advice: Why is this error occurring? Apart from ignoring these lines (there are too many of them to delete one by one), is there a better solution? Thank you very much!

frankligy commented 8 months ago

Hi @LIH-WHU,

Thanks for reaching out and nice to meet you!

I was trying to reproduce this error, so I just literally took the count matrix you shared here as the input test.full.txt, I attached below:

test.full.txt

And then I ran the code the same as you on my end:

#!/gpfs/data/yarmarkovichlab/Frank/test_snaf/test_snaf_env/bin/python3.7

import os,sys
import pandas as pd
import numpy as np
import snaf
import anndata as ad

# change the below two paths
db_dir = '/gpfs/data/yarmarkovichlab/Frank/SNAF_ecosystem/data'
netMHCpan_path = '/gpfs/data/yarmarkovichlab/Frank/SNAF_ecosystem/netMHCpan-4.1/netMHCpan' 

# initiate
df = pd.read_csv('test.full.txt',sep='\t',index_col=0)
snaf.initialize(df=df,db_dir=db_dir,binding_method='netMHCpan',software_path=netMHCpan_path)

# test something
jcmq = snaf.JunctionCountMatrixQuery(junction_count_matrix=df,cores=30,outdir='result_test')
hlas = [['HLA-A*02:01','HLA-A*02:01','HLA-B*39:10','HLA-B*15:01','HLA-C*03:03','HLA-C*12:03'],
        ['HLA-A*02:01','HLA-A*01:01','HLA-B*40:01','HLA-B*52:01','HLA-C*03:04','HLA-C*12:02']]
jcmq.run(hlas=hlas,outdir='result_test')
snaf.JunctionCountMatrixQuery.generate_results(path='./result_test/after_prediction.p',outdir='./result_test')

But I didn't encounter any errors and all the results are generated as expected:

Current loaded gtex cohort with shape (12, 2629)
2024-03-06 09:46:14 finishing initialization
reduce valid NeoJunction from 27 to 7 because they are present in GTEx
junction_count_matrix: (27, 2)
cores: 30
valid: 7
invalid: 20
cond_df: (27, 2)
subset: (7, 2)
translated: list of 7 nj objects
cond_subset_df: (7, 2)
results: list of length None
0it [00:00, ?it/s]                                        | 0/1 [00:00<?, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
100%|███████████████████████████████████████████| 1/1 [00:00<00:00, 1285.41it/s]

100%|███████████████████████████████████████████| 1/1 [00:00<00:00, 1292.15it/s]

100%|███████████████████████████████████████████| 1/1 [00:00<00:00, 1417.95it/s]
0it [00:00, ?it/s]

0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
100%|█████████████████████████████████████████████| 1/1 [00:02<00:00,  2.46s/it]
100%|█████████████████████████████████████████████| 1/1 [00:05<00:00,  5.47s/it]
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/gpfs/data/yarmarkovichlab/Frank/test_snaf/test_snaf_env/lib/python3.7/site-packages/snaf/dash_app/app.py", line 20, in clear_assets
    imgs = os.listdir('assets')
FileNotFoundError: [Errno 2] No such file or directory: 'assets'

That error itself also puzzles me a lot, because it basically says the dict_fa dictionary, which is keyed by ENSG ID and values are the sequence and other information, doesn't have the key "ENSG00000000419", but I just did a simple grep, and as you can see, it is here:

[lig08@bigpurple-ln3 yarmarkovichlab]$ cd Frank/SNAF_ecosystem/data/Alt91_db/
[lig08@bigpurple-ln3 Alt91_db]$ grep "ENSG00000000419" Hs_gene-seq-2000_flank.fa 
>ENSG00000000419|20|50934867|50958555

At this moment, what I can suggest is double check if the initialization step is complete or not, whether all the required database are properly downloaded and loaded into RAM. Or if it has anything to do with the environment like if you are using jupyter sometimes the global variable will be messed up if not careful. You can do the below check:

db_dir = '/gpfs/data/yarmarkovichlab/Frank/SNAF_ecosystem/data'
netMHCpan_path = '/gpfs/data/yarmarkovichlab/Frank/SNAF_ecosystem/netMHCpan-4.1/netMHCpan' 

# initiate
df = pd.read_csv('test.full.txt',sep='\t',index_col=0)
snaf.initialize(df=df,db_dir=db_dir,binding_method='netMHCpan',software_path=netMHCpan_path)
print(snaf.snaf.dict_fa['ENSG00000000419']). # should be able to print out something
# or below
seq = snaf.snaf.subexon_tran(subexon='I2.1_50949267',EnsID='ENSG00000000419',flag='site1',code=False)

Let me know if you can just use my script snippet to get it to run, and keep me posted, now I am very curious about what have might happened here because I can not explain it.

Best, Frank

LIH-WHU commented 8 months ago

Thank you very much for your patient response! This error seems to be caused by an incomplete database. Initially, the command grep ENSG00000000419" Hs_gene-seq-2000_flank.fa did not return anything. After I re-downloaded the required database and ensured it was complete, the command grep ENSG00000000419" Hs_gene-seq-2000_flank.fa return ENSG00000000419|20|50934867|50958555 as yours, and the program was able to run smoothly according to the tutorial. Thank you once again for developing such an excellent tool. Best wishes!

LIH-WHU commented 2 months ago

Sorry to bother you. Now I use SNAF in another dataset, this time I have confirmed that the required database is complete as the it runs well before. When running code jcmq.run(hlas=hlas,outdir='/home/lih/snaf/result')

The error is as follows:

2024-09-15 17:47:55 starting initialization
Current loaded gtex cohort with shape (140210, 2629)
2024-09-15 17:50:46 finishing initialization
reduce valid NeoJunction from 144011 to 34443 because they are present in GTEx
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/.local/lib/python3.7/site-packages/snaf/snaf.py", line 1327, in subexon_tran
    attrs = dict_exonCoords[EnsID][subexon]  # [chr,strand,start,end,suffer]
KeyError: 'U0.1_27641085'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/.local/lib/python3.7/site-packages/snaf/snaf.py", line 1363, in subexon_tran
    attrs = dict_exonCoords[EnsID][subexon]
KeyError: 'U0.1'

Why is this error occurring and are there any clues for solution? Thank you very much!!!

Best, Hao

frankligy commented 2 months ago

Hi @LIH-WHU,

Could you let me know the junction that caused this issue? So if you just grep "U0.1_27641085" for your junction count file, you should be able to get the full junction like "ENSG....:U0.1_27641085-something". Could you first send me this and then I'll think about what might be the issue here?

Thank you, Frank

LIH-WHU commented 2 months ago

Dear Frank,

Thank you for your prompt and kind response! I have run a grep command for U0.1_27641085 in the junction count file, and the output is as follows:

图片

Then, I performed a grep on ENSG00000000938 in the reference data, and the output is as follows:

图片

I would be extremely grateful if you could kindly assist us with this matter. We deeply appreciate your help!

Wishing you a happy Mid-Autumn Festival!

Best regards, Hao

frankligy commented 2 months ago

Hi @LIH-WHU,

If you do the following:

test.full.txt

# change the below two paths
db_dir = '/gpfs/data/yarmarkovichlab/Frank/SNAF_ecosystem/data'
netMHCpan_path = '/gpfs/data/yarmarkovichlab/Frank/SNAF_ecosystem/netMHCpan-4.1/netMHCpan' 

# initiate
tcga_ctrl_db = ad.read_h5ad(os.path.join(db_dir,'controls','tcga_matched_control_junction_count.h5ad'))
gtex_skin_ctrl_db = ad.read_h5ad(os.path.join(db_dir,'controls','gtex_skin_count.h5ad'))
add_control = {'tcga_control':tcga_ctrl_db,'gtex_skin':gtex_skin_ctrl_db}
df = pd.read_csv('test.full.txt',sep='\t',index_col=0)
snaf.initialize(df=df,db_dir=db_dir,binding_method='netMHCpan',software_path=netMHCpan_path,add_control=add_control)

# just test one junction
nj = snaf.NeoJunction(uid='ENSG00000000938:U0.1_27641085-E4.1',count=50,check_gtex=False)
nj.detect_type()
nj.retrieve_junction_seq()
print(nj.junction)

I got this, which is the junction sequence:

(base) [lig08@a100-4002 test_snaf]$ ./troubleshoot.py 
2024-09-17 11:44:23 starting initialization
Current loaded gtex cohort with shape (13, 2629)
Adding cohort tcga_control with shape (13, 705) to the database
now the shape of control db is (14, 3334)
Adding cohort gtex_skin with shape (9, 313) to the database
now the shape of control db is (14, 3647)
2024-09-17 11:44:42 finishing initialization
CTCACATTCCCTCTATATTTATTAGTGGGCATTTTCTTTCTTTCTTTCTTTTTCTTATAGTTCTGGAGCCTGGGAAGTCCAAGATCAAGGTGCTGGCAAG,GTCTCTGACCCCTCCCAAGGATCATGCCGCAGCCCCACTGACCCAGGAGTAGGGGCCTAAGGG

Could you be able to obtain the junction sequence for this junction? Not sure if it is the issue, but make sure you have internet connection as for alternative promoter, the program will request from UCSC genome browser to get the DNA sequence.

I also used the test.full.txt file (last row is the one I appended the this splicing junction) to successfully run through the program, so let me know what turns out on your end for the first test.

Best, Frank

frankligy commented 2 months ago

Also I wonder if the Traceback you pasted here is the full error? Any chance there were more traceback after what you have pasted here?