Open LIH-WHU opened 8 months ago
Hi @LIH-WHU,
Thanks for reaching out and nice to meet you!
I was trying to reproduce this error, so I just literally took the count matrix you shared here as the input test.full.txt
, I attached below:
And then I ran the code the same as you on my end:
#!/gpfs/data/yarmarkovichlab/Frank/test_snaf/test_snaf_env/bin/python3.7
import os,sys
import pandas as pd
import numpy as np
import snaf
import anndata as ad
# change the below two paths
db_dir = '/gpfs/data/yarmarkovichlab/Frank/SNAF_ecosystem/data'
netMHCpan_path = '/gpfs/data/yarmarkovichlab/Frank/SNAF_ecosystem/netMHCpan-4.1/netMHCpan'
# initiate
df = pd.read_csv('test.full.txt',sep='\t',index_col=0)
snaf.initialize(df=df,db_dir=db_dir,binding_method='netMHCpan',software_path=netMHCpan_path)
# test something
jcmq = snaf.JunctionCountMatrixQuery(junction_count_matrix=df,cores=30,outdir='result_test')
hlas = [['HLA-A*02:01','HLA-A*02:01','HLA-B*39:10','HLA-B*15:01','HLA-C*03:03','HLA-C*12:03'],
['HLA-A*02:01','HLA-A*01:01','HLA-B*40:01','HLA-B*52:01','HLA-C*03:04','HLA-C*12:02']]
jcmq.run(hlas=hlas,outdir='result_test')
snaf.JunctionCountMatrixQuery.generate_results(path='./result_test/after_prediction.p',outdir='./result_test')
But I didn't encounter any errors and all the results are generated as expected:
Current loaded gtex cohort with shape (12, 2629)
2024-03-06 09:46:14 finishing initialization
reduce valid NeoJunction from 27 to 7 because they are present in GTEx
junction_count_matrix: (27, 2)
cores: 30
valid: 7
invalid: 20
cond_df: (27, 2)
subset: (7, 2)
translated: list of 7 nj objects
cond_subset_df: (7, 2)
results: list of length None
0it [00:00, ?it/s] | 0/1 [00:00<?, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
100%|███████████████████████████████████████████| 1/1 [00:00<00:00, 1285.41it/s]
100%|███████████████████████████████████████████| 1/1 [00:00<00:00, 1292.15it/s]
100%|███████████████████████████████████████████| 1/1 [00:00<00:00, 1417.95it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
100%|█████████████████████████████████████████████| 1/1 [00:02<00:00, 2.46s/it]
100%|█████████████████████████████████████████████| 1/1 [00:05<00:00, 5.47s/it]
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/gpfs/data/yarmarkovichlab/Frank/test_snaf/test_snaf_env/lib/python3.7/site-packages/snaf/dash_app/app.py", line 20, in clear_assets
imgs = os.listdir('assets')
FileNotFoundError: [Errno 2] No such file or directory: 'assets'
That error itself also puzzles me a lot, because it basically says the dict_fa
dictionary, which is keyed by ENSG ID and values are the sequence and other information, doesn't have the key "ENSG00000000419", but I just did a simple grep, and as you can see, it is here:
[lig08@bigpurple-ln3 yarmarkovichlab]$ cd Frank/SNAF_ecosystem/data/Alt91_db/
[lig08@bigpurple-ln3 Alt91_db]$ grep "ENSG00000000419" Hs_gene-seq-2000_flank.fa
>ENSG00000000419|20|50934867|50958555
At this moment, what I can suggest is double check if the initialization step is complete or not, whether all the required database are properly downloaded and loaded into RAM. Or if it has anything to do with the environment like if you are using jupyter sometimes the global variable will be messed up if not careful. You can do the below check:
db_dir = '/gpfs/data/yarmarkovichlab/Frank/SNAF_ecosystem/data'
netMHCpan_path = '/gpfs/data/yarmarkovichlab/Frank/SNAF_ecosystem/netMHCpan-4.1/netMHCpan'
# initiate
df = pd.read_csv('test.full.txt',sep='\t',index_col=0)
snaf.initialize(df=df,db_dir=db_dir,binding_method='netMHCpan',software_path=netMHCpan_path)
print(snaf.snaf.dict_fa['ENSG00000000419']). # should be able to print out something
# or below
seq = snaf.snaf.subexon_tran(subexon='I2.1_50949267',EnsID='ENSG00000000419',flag='site1',code=False)
Let me know if you can just use my script snippet to get it to run, and keep me posted, now I am very curious about what have might happened here because I can not explain it.
Best, Frank
Thank you very much for your patient response!
This error seems to be caused by an incomplete database. Initially, the command grep ENSG00000000419" Hs_gene-seq-2000_flank.fa
did not return anything. After I re-downloaded the required database and ensured it was complete, the command grep ENSG00000000419" Hs_gene-seq-2000_flank.fa
return ENSG00000000419|20|50934867|50958555
as yours, and the program was able to run smoothly according to the tutorial.
Thank you once again for developing such an excellent tool.
Best wishes!
Sorry to bother you. Now I use SNAF in another dataset, this time I have confirmed that the required database is complete as the it runs well before. When running code
jcmq.run(hlas=hlas,outdir='/home/lih/snaf/result')
The error is as follows:
2024-09-15 17:47:55 starting initialization
Current loaded gtex cohort with shape (140210, 2629)
2024-09-15 17:50:46 finishing initialization
reduce valid NeoJunction from 144011 to 34443 because they are present in GTEx
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/.local/lib/python3.7/site-packages/snaf/snaf.py", line 1327, in subexon_tran
attrs = dict_exonCoords[EnsID][subexon] # [chr,strand,start,end,suffer]
KeyError: 'U0.1_27641085'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/.local/lib/python3.7/site-packages/snaf/snaf.py", line 1363, in subexon_tran
attrs = dict_exonCoords[EnsID][subexon]
KeyError: 'U0.1'
Why is this error occurring and are there any clues for solution? Thank you very much!!!
Best, Hao
Hi @LIH-WHU,
Could you let me know the junction that caused this issue? So if you just grep "U0.1_27641085" for your junction count file, you should be able to get the full junction like "ENSG....:U0.1_27641085-something". Could you first send me this and then I'll think about what might be the issue here?
Thank you, Frank
Dear Frank,
Thank you for your prompt and kind response! I have run a grep command for U0.1_27641085 in the junction count file, and the output is as follows:
Then, I performed a grep on ENSG00000000938 in the reference data, and the output is as follows:
I would be extremely grateful if you could kindly assist us with this matter. We deeply appreciate your help!
Wishing you a happy Mid-Autumn Festival!
Best regards, Hao
Hi @LIH-WHU,
If you do the following:
# change the below two paths
db_dir = '/gpfs/data/yarmarkovichlab/Frank/SNAF_ecosystem/data'
netMHCpan_path = '/gpfs/data/yarmarkovichlab/Frank/SNAF_ecosystem/netMHCpan-4.1/netMHCpan'
# initiate
tcga_ctrl_db = ad.read_h5ad(os.path.join(db_dir,'controls','tcga_matched_control_junction_count.h5ad'))
gtex_skin_ctrl_db = ad.read_h5ad(os.path.join(db_dir,'controls','gtex_skin_count.h5ad'))
add_control = {'tcga_control':tcga_ctrl_db,'gtex_skin':gtex_skin_ctrl_db}
df = pd.read_csv('test.full.txt',sep='\t',index_col=0)
snaf.initialize(df=df,db_dir=db_dir,binding_method='netMHCpan',software_path=netMHCpan_path,add_control=add_control)
# just test one junction
nj = snaf.NeoJunction(uid='ENSG00000000938:U0.1_27641085-E4.1',count=50,check_gtex=False)
nj.detect_type()
nj.retrieve_junction_seq()
print(nj.junction)
I got this, which is the junction sequence:
(base) [lig08@a100-4002 test_snaf]$ ./troubleshoot.py
2024-09-17 11:44:23 starting initialization
Current loaded gtex cohort with shape (13, 2629)
Adding cohort tcga_control with shape (13, 705) to the database
now the shape of control db is (14, 3334)
Adding cohort gtex_skin with shape (9, 313) to the database
now the shape of control db is (14, 3647)
2024-09-17 11:44:42 finishing initialization
CTCACATTCCCTCTATATTTATTAGTGGGCATTTTCTTTCTTTCTTTCTTTTTCTTATAGTTCTGGAGCCTGGGAAGTCCAAGATCAAGGTGCTGGCAAG,GTCTCTGACCCCTCCCAAGGATCATGCCGCAGCCCCACTGACCCAGGAGTAGGGGCCTAAGGG
Could you be able to obtain the junction sequence for this junction? Not sure if it is the issue, but make sure you have internet connection as for alternative promoter, the program will request from UCSC genome browser to get the DNA sequence.
I also used the test.full.txt
file (last row is the one I appended the this splicing junction) to successfully run through the program, so let me know what turns out on your end for the first test.
Best, Frank
Also I wonder if the Traceback you pasted here is the full error? Any chance there were more traceback after what you have pasted here?
Thank you for developing this excellent tool! I am also from Wuhan, China. Congratulatations on your remarkable achievements! I encountered the following issue while following the tutorial for 'Running the T antigen workflow' and sincerely invite your assistance. Running the following code poses no issues:
Next, proceed with the following step:
jcmq.run(hlas=hlas,outdir='./09.result')
The error is as follows:
I have checked the counts.original.full.txt file. It look like this:
If I remove the line
ENSG00000000419:I2.1_50949267-I2.1_50948941 0 22
the code can continue to run. However, it will encounter a similar error and get stuck at the lineENSG00000000419:I5.1_50945529-I5.1_50944551 29 59
. Approximately, every 200 lines, there is one line that causes thejcmq.run
to fail. The following lines also lead to similar issues:I would like to ask for your advice: Why is this error occurring? Apart from ignoring these lines (there are too many of them to delete one by one), is there a better solution? Thank you very much!