Issue #7: Knowledge graph and run_Speed_up.py Error. drob2727 closed this issue 2 years ago.
It seems the format of your files in the 'bacteria' folder is incorrect, so BLASTN cannot run with the command "blastn -query out/query.fa -db blast_db/bin -outfmt 6 -out blast_tab/bin.tab -num_threads 8".
You can check whether you can build the database with "makeblastdb -in bacteria/bin.fasta -dbtype nucl -parse_seqids -out blast_db/bin" and then run BLASTN against it with your virus file.
I see, so it's because my files end in .fa instead of .fasta. If I fix the file extension, can I resume from where I left off, or do I have to start over?
No, it is not because of your file extension; the program can read either automatically. I mean that you need to run the makeblastdb and blastn commands yourself to debug, because it seems BLAST cannot run correctly.
'bin.fasta' is just an example of an input prokaryotic genome, since I do not know your file names.
Building a new DB, current time: 01/16/2022 09:51:20
New DB name: /lustre/scratch/usr/fslcollab273/HostG/blast_db/bin
New DB title: bacteria/bin.1408.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 11 sequences in 0.048352 seconds
It looks like it can make the database, so should I do this for all of my bins?
How do I run the blastn command? Thank you so much for the help.
There are two dots (.) in your file name. Please use the same naming as the example files: accession.fa or accession.fasta. You can use an underscore (_) to replace the first dot.
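For example, a minimal Python sketch along these lines could do that renaming in bulk; the 'bacteria' folder name comes from this thread, but renaming in place is an assumption, so back up the folder first:

import os

# Replace extra dots in bacteria/ file names with underscores, keeping only the
# final extension, e.g. bin.1408.fasta -> bin_1408.fasta (sketch, not part of HostG).
for fn in os.listdir("bacteria"):
    if "." not in fn:
        continue
    stem, ext = fn.rsplit(".", 1)
    fixed = stem.replace(".", "_") + "." + ext
    if fixed != fn:
        os.rename(os.path.join("bacteria", fn), os.path.join("bacteria", fixed))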
Also make sure that your virus FASTA has at least one sequence longer than 8000 bp if you use Len=8000.
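If you want to verify that quickly, here is a small sketch (plain Python, no extra packages; the file name is only a placeholder for your virus FASTA):

# Print the length of the longest sequence in a FASTA file (sketch).
def longest_sequence_length(path):
    longest = current = 0
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                longest = max(longest, current)
                current = 0
            else:
                current += len(line.strip())
    return max(longest, current)

print(longest_sequence_length("my_virus_contigs.fa"))  # should be >= 8000 when Len=8000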
After solving the problems above, run the two commands below to check whether BLAST runs correctly.
makeblastdb -in bacteria/bin_1408.fasta -dbtype nucl -parse_seqids -out blast_db/bin_1408
blastn -query query.fa -db blast_db/bin_1408 -outfmt 6 -out blast_tab/bin_1408.tab -num_threads 8
query.fa is your input file.
If they are OK, then you can rerun HostG.
It worked!
(HostG) -bash-4.2$ cat bin_1408.tab
VF_VS_2402 k141_959580_length_146859_cov_1634.8330 93.846 130 7 1 8881 9009 1952 2081 9.83e-50 195
VF_VS_7541 k141_959580_length_146859_cov_1634.8330 100.000 52 0 0 200 251 71 20 1.72e-20 97.1
VF_VS_7852 k141_995934_length_49184_cov_1635.8327 81.558 385 63 7 329 711 48805 49183 5.55e-85 311
VF_VS_13436 k141_959580_length_146859_cov_1634.8330 100.000 141 0 0 1 141 141 1 4.60e-70 261
VF_VS_13436 k141_959580_length_146859_cov_1634.8330 96.296 135 5 0 2290 2424 154 20 2.17e-58 222
VF_VS_13847 k141_8453456_length_35143_cov_1469.2361 77.815 302 65 2 26 326 22545 22845 2.81e-47 185
VF_VS_14898 k141_959580_length_146859_cov_1634.8330 90.698 43 4 0 3552 3594 9502 9544 6.22e-09 58.4
VF_VS_20168 k141_959580_length_146859_cov_1634.8330 98.113 53 1 0 3451 3503 19 71 1.51e-19 93.5
VF_VS_34343 k141_995934_length_49184_cov_1635.8327 80.097 206 38 3 1 205 49097 48894 7.19e-37 150
That's what I got. Does that look right? Should I start the process all over by deleting the newly generated files?
Then, I think the program should be ok this time.
You do not need to remove them yourself; just rerunning it should be fine.
I'm having a hard time understanding the 'disk quota exceeded' OSError. I ran this with 1500 GB of RAM and I have terabytes of space left on my drive. Any ideas?
Traceback (most recent call last):
File "run_phage_phage.py", line 238, in
Ummmmmm.
If you are using an HPC, you can check it with the command "quota", since there are two kinds of 'disk quota exceeded': file count and size. You might be able to find something out.
However, if you are using your own computer, I am afraid that I do not know much about this.
A simple way to test is to use a smaller file. For example, you can split your input into a smaller subset, run it as an example, and check whether any bugs appear.
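For instance, a small sketch like the following (plain Python; the file names are placeholders) writes only the first few records of a FASTA into a test file:

# Copy the first n records of a FASTA into a smaller test file (sketch).
def take_first_records(src, dst, n=5):
    kept = 0
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            if line.startswith(">"):
                kept += 1
                if kept > n:
                    break
            fout.write(line)

take_first_records("my_virus_contigs.fa", "my_virus_contigs_small.fa", n=5)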
Creating blast database...
Running blastn...
Creating blast database...
Running blastn...
run_KnowledgeGraph.py:100: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
node_feature = np.array(node_feature)
folder GCN_data exist... cleaning dictionary
Traceback (most recent call last):
File "run_KnowledgeGraph.py", line 155, in
I'm having a slightly different knowledge graph error. Any ideas?
It seems you forgot to add the taxonomy labels of your prokaryotes to dataset/label.csv.
BTW, according to your previous description, you may not know the taxonomy labels of your bins. In that situation, you can directly use names like "bin_1408" as their labels. However, we cannot guarantee the model will work well: HostG is a multi-class classifier, and if the number of samples per class is very small, it might not be able to learn from them. But you can try.
Another solution is to use our newly developed method, Cherry (https://github.com/KennthShang/CHERRY). We tried to fix the aforementioned problem in that work. The usage is nearly the same, but with a new formulation. You can follow its guidelines and run it at the same time.
Best, Jiayu
I'll look through it, but I do have taxonomy in the label.csv file. Maybe it's better to just have the contigs match up with the bin names and then use my own taxonomy file, since it doesn't seem to be working. I'll go through it again and see if there are any discrepancies between my taxonomy file and what it is supposed to be. I'll send a screenshot.
Thank you so much
I see.
The program treats the file names in the 'bacteria/' folder (minus the '.fasta' extension) as the accession names.
For example, in the program
tmpL = label_df[label_df['accession'] == node.split('.')[0]][taxa].values[0]
node.split('.')[0] is bin_1408, so it will search label.csv and return its labels.
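To see which bins would fail this lookup, a quick sketch like the one below (assuming pandas is installed and the 'bacteria/' and 'dataset/label.csv' paths from this thread) mimics that line and lists any accession missing from label.csv:

import os
import pandas as pd

label_df = pd.read_csv("dataset/label.csv")
known = set(label_df["accession"])

for fn in os.listdir("bacteria"):
    acc = fn.split(".")[0]  # same as node.split('.')[0]: everything before the FIRST dot
    if acc not in known:
        print(f"{acc} has no row in label.csv")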
Sorry for the inconvenience. Maybe you can also try a small sample for testing first; it will save time when constructing the knowledge graph (which is an O(N^2) algorithm).
Best, Jiayu
Makes sense to me. If I'm following correctly, that's how it's set up right now: it says bin_1408 and then the taxonomy. Is that wrong?
Yes. What you need to add to label.csv looks like this (just an example):
accession,phylum,class,order,family,genus
bin_1408,Proteobacteria,Betaproteobacteria,Burkholderiales,Alcaligenaceae,Achromobacter
"accession,phylum,class,order,family,genus" is the header line; there is no need to add it again.
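If some bins have no known taxonomy, one option, following the earlier suggestion of using the bin name itself as its label, is a sketch like this (assuming pandas and the columns shown in the example above; adjust if your label.csv has different ranks):

import os
import pandas as pd

label_path = "dataset/label.csv"
label_df = pd.read_csv(label_path)
known = set(label_df["accession"])

# Append one placeholder row per bin that is not already in label.csv,
# using the bin name for every rank (sketch, not part of HostG).
rows = []
for fn in os.listdir("bacteria"):
    acc = fn.split(".")[0]
    if acc not in known:
        rows.append({"accession": acc, "phylum": acc, "class": acc,
                     "order": acc, "family": acc, "genus": acc})

if rows:
    pd.concat([label_df, pd.DataFrame(rows)], ignore_index=True).to_csv(label_path, index=False)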
Best, Jiayu
Ok, gotcha. I will recheck my file when I get done teaching today and verify it. I really appreciate all the help. We're predicting hosts for our viruses from Antarctica; we've predicted hosts through NCBI, but there's nothing better than using our own bacterial contigs. If we can get this working, I look forward to using this great tool in multiple papers.
Thanks a lot.
BTW, maybe you will also want to use our new method, Cherry, in the future (although it is still under review). In our latest experiments, it can return labels down to the species level with higher accuracy.
Best, Jiayu
Oh wow, I’ll take a look at it this afternoon
Everything looked great after I fixed the names, until this:
run_KnowledgeGraph.py:100: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
node_feature = np.array(node_feature)
folder GCN_data exist... cleaning dictionary
Traceback (most recent call last):
File "run_KnowledgeGraph.py", line 155, in
It seems the same problem still exists.
Could you send me some samples for testing? For example, could you share some of your bins (4 or 5 of them), if they are not private or too large? Then I can test them on my HPC.
If you want, you can send them (or the link) to my email: jyshang2-c@my.cityu.edu.hk
Best, Jiayu
Hi, I have checked your data, and it seems everything works well. The output for your examples looks like this: https://user-images.githubusercontent.com/22445402/152128790-05ae337f-7cd2-4fa7-8a8f-28a3cc965582.png
Please check whether you have added the taxa like below (in label.csv): https://user-images.githubusercontent.com/22445402/152128375-92cc83c6-60a9-413f-a495-09ff3349c0b8.png
Since I do not know the exact taxa, I used 'temp' instead. You can also run the example files you sent me to test whether the program works correctly on your PC/HPC.
Best, Jiayu
Thank you so much for looking at that. I think it has to be my label.csv file that's wrong. I'll send a picture.
This is what my label.csv file looks like. Is it the 'NA' and 'no support' values that are giving me trouble? I tried to fix a couple of things and resubmit, and it still had problems.
Traceback (most recent call last):
  File "run_phagehost.py", line 64, in
    = subprocess.check_call(blast_cmd, shell=True)
  File "/fslhome/fslcollab273/.conda/envs/Host/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'blastn -query out/query.fa -db blast_db/bin -outfmt 6 -out blast_tab/bin.tab -num_threads 8' returned non-zero exit status 2.
folder Cyber_data/ exist... cleaning dictionary
Cannot clean your folder... permission denied
cat: pred/: No such file or directory
folder input exist... cleaning dictionary
Dictionary cleaned
folder pred exist... cleaning dictionary
folder Split_files exist... cleaning dictionary
Dictionary cleaned
folder tmp_pred exist... cleaning dictionary
Knowledge Graph Error for file contig_0
Knowledge Graph Error for file contig_1
Knowledge Graph Error for file contig_2
Knowledge Graph Error for file contig_3
phage_host Error for file contig_4
Pre-trained CNN Error for file contig_5
Traceback (most recent call last):
File "run_Speed_up.py", line 157, in
out = subprocess.check_call(cmd, shell=True)
File "/fslhome/fslcollab273/.conda/envs/Host/lib/python3.7/subprocess.py", line 363, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'cat pred/ > final_prediction.csv' returned non-zero exit status 1.
We were so close, I think. Any ideas?